hhmx.de

Joseph Cox

Federation EN Wed 27.11.2024 00:29:42

New from 404 Media: Bluesky may have said it won't use user data to train generative AI, but someone else just published a dataset of 1 million Bluesky posts for "machine learning research". Already a very popular dataset; your data may be scraped. 404media.co/someone-made-a-dat

Stu

Federation EN Wed 27.11.2024 00:48:52

@josephcox I guess you could also pull this together from Mastodon, but Bluesky is going to make readily available data much faster.

Tom Walker

Federation EN Wed 27.11.2024 01:10:36

@tehstu @josephcox I'm pretty critical of Bluesky (see my timeline) but I don't see why this would be any harder or slower to do from Mastodon/fedi

Stu

Federation EN Wed 27.11.2024 02:11:57

@tomw @josephcox I meant in the sense that Bluesky has more users and so generates the content faster. I was trying to guess why there haven't been very public datasets to this effect from Mastodon.

Evan Prodromou

Federation EN Wed 27.11.2024 03:38:47

@tomw @tehstu @josephcox all Bluesky data is public. Many ActivityPub posts are private or followers-only.

amd

Federation EN Wed 27.11.2024 05:45:38

@evan Small clarification as I know you’ve avoided the Bluesky literature: Bluesky DMs are not public because they’re not part of ATProto. They’re a separate service.

Evan Prodromou

Federation EN Wed 27.11.2024 05:53:36

@amd thanks. Updated.

Robert Link

Federation EN Wed 27.11.2024 01:19:18

@josephcox Couldn't care less what happens there. I can't quite bring myself to block 404, but, dammit, I'm sick of hearing about that service. I just couldn't fucking care less.

Robert Thau

Federation EN Wed 27.11.2024 01:29:49

@josephcox Necessary consequence of having an open firehose (which they're committed to, as letting others start an interoperating peer network is part of their future-proofing strategy). Researchers on, e.g., abusive behavior love it; they have full access. But there are downsides.

(And scraping the local and network feeds of a few fairly central Mastodon servers would have a similar effect... not the whole fediverse, but with rebroadcasts from network feeds, quite a lot of it.)

JohnW

Federation EN Wed 27.11.2024 02:10:16

@josephcox

Sounds to me like a back-alley cash deal that can't be traced back to BS... so they can preserve their integrity, maybe even act like the victim.

matlag

Federation · Wed 27.11.2024 03:12:08

@josephcox Technically, most of the Fediverse's content is also in public access, so nothing could stop anyone from scraping and using it for machine learning either, right?

AlgoCompSynth by znmeb 🇺🇦

Federation EN Wed 27.11.2024 03:22:56

@josephcox I don't trust Bluesky with or without Jack Dorsey. To be blunt, TANSTAAFL!

Konomi Kitten

Federation EN Wed 27.11.2024 03:33:41

@josephcox ah yes the "trust us we're a good corporation" method of data protection.

I'm sure it won't end up like Twitter in a few years.