Föderation EN Mi 27.11.2024 00:29:42 New from 404 Media: Bluesky may have said it won't use user data to train generative AI, but someone else just published a dataset of one million Bluesky posts for "machine learning research". Already a very popular dataset; your data may be scraped https://www.404media.co/someone-made-a-dataset-of-one-million-bluesky-posts-for-machine-learning-research/ |
Föderation EN Mi 27.11.2024 00:48:52 @josephcox I guess you could also pull this together from Mastodon, but Bluesky is going to make readily available data much faster. |
Föderation EN Mi 27.11.2024 01:10:36 @tehstu @josephcox I'm pretty critical of Bluesky (see my timeline) but I don't see why this would be any harder or slower to do from Mastodon/fedi |
Föderation EN Mi 27.11.2024 02:11:57 @tomw @josephcox I meant in the sense that Bluesky has more users and so generates the content faster. I was trying to guess why there haven't been very public datasets to this effect from Mastodon. |
Föderation EN Mi 27.11.2024 03:38:47 @tomw @tehstu @josephcox all Bluesky data is public. Many ActivityPub posts are private or followers-only. |
Föderation EN Mi 27.11.2024 05:45:38 @evan Small clarification as I know you’ve avoided the Bluesky literature: Bluesky DMs are not public because they’re not part of ATProto. They’re a separate service. |
Föderation EN Mi 27.11.2024 05:53:36 @amd thanks. Updated. |
Föderation EN Mi 27.11.2024 01:19:18 @josephcox Couldn't care less what happens there. I can't quite bring myself to block 404, but, dammit, I'm sick of hearing about that service. I just couldn't fucking care less. |
Föderation EN Mi 27.11.2024 01:29:49 @josephcox Necessary consequence of having an open firehose (which they're committed to, as letting others start an interoperating peer network is part of their future-proofing strategy). Researchers on, e.g., abusive behavior love it; they have full access. But there are downsides. (And scraping the local and network feeds of a few fairly central Mastodon servers would have a similar effect... not the whole fediverse, but with rebroadcasts from network feeds, quite a lot of it.) |
Föderation EN Mi 27.11.2024 02:10:16 Sounds to me like a back-alley cash deal that can't be traced back to BS... so they can preserve their integrity, maybe even act like the victim. |
Föderation · Mi 27.11.2024 03:12:08 @josephcox Technically, most of the Fediverse's content is also publicly accessible, so nothing could stop anyone from scraping it and using it for machine learning either, right? |
Föderation EN Mi 27.11.2024 03:22:56 @josephcox I don't trust Bluesky with or without Jack Dorsey. To be blunt, TANSTAAFL! |
Föderation EN Mi 27.11.2024 03:33:41 @josephcox ah yes the "trust us we're a good corporation" method of data protection. I'm sure it won't end up like Twitter in a few years. |