Clive Thompson

Clive Thompson
@clive@saturation.social

Föderation EN Do 23.01.2025 19:06:27

A hacker developed an "infinite maze" to trap web-crawlers/scrapers from AI companies

basically, if the server code detects that a web crawler from an AI firm is trying to scrape the site ...

... the code begins spinning up an infinite, nesting warren of new sham pages, filled with random text

so the crawler gets stuck crawling and scraping endless and meaningless pages

fun @jasonkoebler piece at @404mediaco

https://www.404media.co/email/7a39d947-4a4a-42bc-bbcf-3379f112c999/?ref=daily-stories-newsletter

0x 59 1x
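
A minimal sketch of the "infinite, nesting warren" idea described above. It is not the code of the tool in the article. The word list, the `maze_page` name, and the link format are all made up for illustration. Each page is derived from a hash of its own URL path, so the server stores nothing and every link leads to yet another generated page.

```python
import hashlib
import random

# Hypothetical filler vocabulary; a real deployment would use something larger.
WORDS = ["amber", "lattice", "harbor", "quill", "meadow", "cipher", "lantern", "orbit"]


def maze_page(path: str, links: int = 5, words: int = 40) -> str:
    """Return a sham HTML page for `path`; the same path always yields the same page."""
    # Seed a private RNG from the path so no per-page state has to be stored.
    seed = int.from_bytes(hashlib.sha256(path.encode("utf-8")).digest()[:8], "big")
    rng = random.Random(seed)
    text = " ".join(rng.choice(WORDS) for _ in range(words))
    base = path.rstrip("/")
    # Every link points one level deeper, so the maze never runs out of pages.
    hrefs = [f"{base}/{rng.getrandbits(32):08x}" for _ in range(links)]
    anchors = "\n".join(f'<a href="{h}">{h.rsplit("/", 1)[-1]}</a>' for h in hrefs)
    return f"<html><body><p>{text}</p>\n{anchors}</body></html>"
```

Calling `maze_page("/maze")` twice returns identical HTML, which makes the sham pages look like stable content to a returning crawler.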

Valentino Gagliardi

Valentino Gagliardi
@gagliardi_vale@fosstodon.org

Föderation EN Do 23.01.2025 19:08:29

@clive what a waste from both sides

0x 2 0x

Clive Thompson

Clive Thompson
@clive@saturation.social

Föderation EN Do 23.01.2025 19:10:11

@gagliardi_vale

yep, I think that's basically the point of it

0x 1 0x

Bornach

Bornach
@bornach@fosstodon.org

Föderation EN Fr 24.01.2025 08:38:14

@clive @gagliardi_vale
Job creation for data annotators in India, Nigeria, Vietnam, etc who have been given the microtasks of removing any junk like this from the AI training data.

0x 1 0x

Clive Thompson

Clive Thompson
@clive@saturation.social

Föderation EN Fr 24.01.2025 17:48:48

@bornach @gagliardi_vale

yes indeed

0x 0 0x

potpie
@potpie@mastodon.social

Föderation EN Fr 24.01.2025 00:49:24

@gagliardi_vale @clive this is what we're doing, instead of scrambling to salvage our odds of surviving this century as a species

0x 1 0x

Valentino Gagliardi

Valentino Gagliardi
@gagliardi_vale@fosstodon.org

Föderation EN Fr 24.01.2025 07:19:10

@potpie @clive right?

0x 0 0x

quangobaud
@miguelpergamon@kolektiva.social

Föderation EN Do 23.01.2025 19:16:02

@clive @jasonkoebler @404mediaco
Can this be used as a method for creating blockdrain creeptocoin con? 🤔

0x 0 0x

Tom Bortels

Tom Bortels
@tbortels@infosec.exchange

Föderation EN Do 23.01.2025 19:19:40

@clive @jasonkoebler @404mediaco

Was just on a thread a week or so ago about what to do with aggressive AI web scrapers that won't self-limit or respect robots.txt.

This is evolution in action.

Nature is healing.

0x 2 0x

Clive Thompson

Clive Thompson
@clive@saturation.social

Föderation EN Do 23.01.2025 21:55:27

@tbortels @jasonkoebler @404mediaco

it's pretty wild

0x 1 0x

Chris Real
@_chris_real@kolektiva.social

Föderation EN Do 23.01.2025 22:47:01

@clive @tbortels @jasonkoebler @404mediaco

It's a practical application of "GIGO".

Ahh, there's a place for everything—and GIGO has finally found its place!

0x 0 0x

🐧DaveNull🐧 ☣️pResident Evil☣

🐧DaveNull🐧 ☣️pResident Evil☣
@devnull@mamot.fr

Föderation EN Do 23.01.2025 23:47:20

@tbortels Is there even such a thing as "non-aggressive AI web scrapers" that will self-limit and respect robots.txt?

At least Google's and micro$hit's ignore robots.txt. They downloaded photos from my gallery, up to 6000 requests a day, more than once… I bet not even 10 of them are legit users.

I've only 38 photos… stupid bots download the same photos over and over again…

I've blocked 4 IP ranges. That probably includes indexing bots' IPs, but I don't give an F.

@clive @jasonkoebler @404mediaco

0x 2 0x

Tom Bortels

Tom Bortels
@tbortels@infosec.exchange

Föderation EN Do 23.01.2025 23:50:25

@devnull @clive @jasonkoebler @404mediaco

I felt obligated to disclaim my fantasy well-behaved AI scrapers just in case. The actual headcount there may well be zero.

0x 2 0x

Bornach

Bornach
@bornach@fosstodon.org

Föderation EN Fr 24.01.2025 08:44:49

@tbortels @devnull @clive @jasonkoebler @404mediaco
There is such a thing as a non-aggressive, respectful AI scraper. It's called asking for permission from the copyright owner and obtaining an appropriate license if their AI system can generate derivative works using your content.
https://youtu.be/PeKZvUcr0-M

0x 1 0x

Tom Bortels

Tom Bortels
@tbortels@infosec.exchange

Föderation EN Fr 24.01.2025 09:23:41

@bornach @devnull @clive @jasonkoebler @404mediaco

Alas - those scrapers are out of scope because they're not the ones causing problems and driving this conversation. Indeed - if someone licensed content legitimately, the need to scrape the web would be absent - there are far more efficient ways to say "here are all of the new posts in the last N hours".

You can safely assume any automation ignoring your robots.txt is a pest to be ruthlessly crushed in whatever manner amuses you most.

0x 1 0x

Clive Thompson

Clive Thompson
@clive@saturation.social

Föderation EN Fr 24.01.2025 17:43:35

@tbortels @bornach @devnull @jasonkoebler @404mediaco

yep -- licensing would obviate the hassles of scraping

"here's our API, enjoy"

0x 0 0x

Clive Thompson

Clive Thompson
@clive@saturation.social

Föderation EN Fr 24.01.2025 17:42:38

@tbortels @devnull @jasonkoebler @404mediaco

yeah

0x 0 0x

Clive Thompson

Clive Thompson
@clive@saturation.social

Föderation EN Fr 24.01.2025 17:42:23

@devnull @tbortels @jasonkoebler @404mediaco

bleah, what a mess!

0x 0 0x

Kevin Freitas
@KevinFreitas@mastodon.social

Föderation EN Do 23.01.2025 19:59:38

@404mediaco @clive @jasonkoebler Love this! I built a simple #WordPress plugin that garbles your web content to serve them up garbage:

https://kevinfreitas.net/tools-experiments/

#AI #GPT #LLMs

0x 1 0x

Clive Thompson

Clive Thompson
@clive@saturation.social

Föderation EN Do 23.01.2025 21:55:49

@KevinFreitas @404mediaco @jasonkoebler

oh damn that is cool

0x 0 0x

Koos Looijesteijn
@koos@octodon.social

Föderation EN Do 23.01.2025 20:09:30

@clive if LLMs are going to be half as good as they're promising they should be already, then millions of websites will serve endless LLM-generated content like that. Creating a really expensive infinite loop. Way to spend $500B.

0x 2 0x

Clive Thompson

Clive Thompson
@clive@saturation.social

Föderation EN Do 23.01.2025 21:56:51

@koos

truly

0x 0 0x

Bornach

Bornach
@bornach@fosstodon.org

Föderation EN Fr 24.01.2025 08:52:46

@koos @clive
This is already happening, and has already happened
https://mastodon.social/@acegikmo/113763950485888985
AI has created its own tarpit. Without human annotators to filter the crap out, they risk becoming the yeast in a petri dish that slowly poisons itself with the very same alcohol it generates.

0x 1 0x

Clive Thompson

Clive Thompson
@clive@saturation.social

Föderation EN Fr 24.01.2025 17:48:36

@bornach @koos

yep yep

0x 0 0x

Lucas C. Wheeler
@lcwheeler@ecoevo.social

Föderation EN Do 23.01.2025 20:19:51

@clive @jasonkoebler @404mediaco I love that it's called Nepenthes. One of the coolest plant genera!

0x 2 0x

{Insert Pasta Pun}

{Insert Pasta Pun}
@risottobias@tech.lgbt

Föderation EN Do 23.01.2025 21:01:26

@lcwheeler @clive @jasonkoebler @404mediaco

https://starwars.fandom.com/wiki/Nepenth%C3%A9 - Nepenthé is programming fluid for robots

0x 3 0x

Lucas C. Wheeler
@lcwheeler@ecoevo.social

Föderation EN Do 23.01.2025 21:08:23

@risottobias @clive @jasonkoebler @404mediaco In the article it says "The program, called Nepenthes after the genus of carnivorous pitcher plants which trap and consume their prey, can be deployed by webpage owners to protect their own content from being scraped or can be deployed “offensively” as a honeypot trap to waste AI companies’ resources." It's a fitting name.

0x 1 0x

Clive Thompson

Clive Thompson
@clive@saturation.social

Föderation EN Do 23.01.2025 21:57:26

@lcwheeler @risottobias @jasonkoebler @404mediaco

yep yep

0x 0 0x

Clive Thompson

Clive Thompson
@clive@saturation.social

Föderation EN Do 23.01.2025 21:57:14

@risottobias @lcwheeler @jasonkoebler @404mediaco

did not know!

0x 0 0x

Aaron

Aaron
@aaron@chirp.zadzmo.org

Föderation EN Fr 24.01.2025 00:43:08

@risottobias I'm not much of a Starwars fan, and was unaware of that reference when I named it.

https://chirp.zadzmo.org/@aaron/statuses/01JHKHCW6KXVMGRMDRA4AR9TAJ

@lcwheeler @clive @jasonkoebler @404mediaco

0x 0 0x

Clive Thompson

Clive Thompson
@clive@saturation.social

Föderation EN Do 23.01.2025 21:57:01

@lcwheeler @jasonkoebler @404mediaco

yessss

0x 0 0x

2xfo

2xfo
@RnDanger@infosec.exchange

Föderation EN Do 23.01.2025 20:23:31

@clive @jasonkoebler @404mediaco

I've seen stories about people hosting sites that got hit by robots and they had to pay a bunch of money in data costs. I wonder how this works, if it can help in that regard when the whole point is to keep them pointed at your site.

I'm all for wasting their time, I just wonder how much it costs.

0x 3 0x

Lord Thomas Klopf of Bohemia

Lord Thomas Klopf of Bohemia
@thomas_klopf@dobbs.town

Föderation EN Do 23.01.2025 20:52:59

@RnDanger @clive @jasonkoebler @404mediaco yeah, you’d have to host this on a service that doesn’t charge by network traffic

0x 2 0x

OCTADE

OCTADE
@octade@soc.octade.net

Föderation · Do 23.01.2025 20:58:44

@RnDanger@infosec.exchange @clive@saturation.social @jasonkoebler@mastodon.social @404mediaco@mastodon.social

Employ bandwidth throttling at about 16K with a few hundred thousand link trees to follow. That will really teach them and save your bandwidth bill.

0x 0 0x
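
A rough sketch of the throttling suggestion above, assuming "16K" means roughly 16,000 bytes per second (the post does not say whether it means bytes or bits). The `throttled_chunks` name and the chunk size are illustrative. The `sleep` parameter is injectable so the pacing can be checked without actually waiting.

```python
import time
from typing import Callable, Iterator


def throttled_chunks(
    payload: bytes,
    rate_bytes_per_s: int = 16_000,   # assumed reading of "about 16K"
    chunk_size: int = 1_000,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[bytes]:
    """Yield `payload` in small chunks, pausing after each to cap throughput."""
    delay = chunk_size / rate_bytes_per_s
    for start in range(0, len(payload), chunk_size):
        yield payload[start:start + chunk_size]
        sleep(delay)  # the crawler waits; the server sends very little
```

A web framework's streaming response could iterate over this generator, so a 1 MB sham page would tie up the crawler's connection for about a minute while costing only 1 MB of bandwidth.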

Clive Thompson

Clive Thompson
@clive@saturation.social

Föderation EN Do 23.01.2025 21:57:52

@thomas_klopf @RnDanger @jasonkoebler @404mediaco

true

0x 0 0x

Clive Thompson

Clive Thompson
@clive@saturation.social

Föderation EN Do 23.01.2025 21:57:44

@RnDanger @jasonkoebler @404mediaco

yeah good question!

0x 0 0x

Trebach

Trebach
@trebach@functional.cafe

Föderation EN Do 23.01.2025 23:36:56

@RnDanger @clive @jasonkoebler @404mediaco This would need to be deployed to a server with a fixed cost, not one with extra costs for execution time or bandwidth

0x 0 0x

Einfach Nein
@nein@social.cologne

Föderation EN Do 23.01.2025 20:26:44

@clive

Finally the equivalent of the mail tar pit!

Hooray!

0x 0 0x

gdtrfb57

gdtrfb57
@gdtrfb57@mastodon.social

Föderation EN Do 23.01.2025 20:38:05

@clive @jasonkoebler @404mediaco Tip of the Cub cap to the hacker!

0x 0 0x

Meercat ✅
@meercat0@mastodon.social

Föderation EN Do 23.01.2025 20:39:21

@clive @jasonkoebler @404mediaco good idea👍

0x 0 0x

Kamikaze

Kamikaze
@Kamikaze@comics.town

Föderation EN Do 23.01.2025 20:51:32

@clive @jasonkoebler @404mediaco Are we really getting Barrier Mazes from Ghost In the Shell??

0x 1 0x

Clive Thompson

Clive Thompson
@clive@saturation.social

Föderation EN Do 23.01.2025 21:58:58

@Kamikaze @jasonkoebler @404mediaco

it would appear so

0x 0 0x

Michael Hartle

Michael Hartle
@mhartle@mastodon.online

Föderation EN Do 23.01.2025 20:54:03

@clive @jasonkoebler @404mediaco There are a number of "infinite maze" generators like #Nepenthes (https://zadzmo.org/code/nepenthes/) or #Iocaine (https://pages.madhouse-project.org/algernon/infrastructure.org/eru_services_iocaine) that help #poisonthewell for AI companies training their LLMs on your content, complete with guides on integration with #Caddy (https://pages.madhouse-project.org/algernon/infrastructure.org/common_services_caddy_snippets_poison_ai)

0x 1 0x

Clive Thompson

Clive Thompson
@clive@saturation.social

Föderation EN Do 23.01.2025 21:59:16

@mhartle @jasonkoebler @404mediaco

aha, damn interesting

0x 0 0x

Megan Lynch (she/her)

Megan Lynch (she/her)
@meganL@mas.to

Föderation EN Do 23.01.2025 20:55:08

@clive @jasonkoebler @404mediaco I think this stuff has been catching archiving sites too, though. So you try to archive a site for accountability purposes and it ends up not being able to be archived because it just churns forever.

0x 1 0x

Clive Thompson

Clive Thompson
@clive@saturation.social

Föderation EN Do 23.01.2025 21:59:31

@meganL @jasonkoebler @404mediaco

that danger leapt out at me too

0x 1 0x

Frank Heijkamp

Frank Heijkamp
@alterelefant@mastodontech.de

Föderation NL Do 23.01.2025 23:52:25

@clive @meganL @jasonkoebler @404mediaco It depends on the way the maze gets triggered. If the robots.txt explicitly excludes a certain URL that is not directly linked from anywhere, and that URL sees a GET request, you can be 100% sure you have trapped a bogus crawler. Definitely go to town with it. Most crawlers are usually not that stupid, so the triggers get slightly less reliable and there is a chance you trap a legitimate crawler, like an archiving site for instance.

0x 0 0x
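
The robots.txt trigger described above could look something like this. It is only a sketch: the trap path, the `TrapDetector` class, and the idea of keying on an opaque `client_id` are assumptions for illustration, not taken from any tool mentioned in the thread.

```python
# Hypothetical trap path: disallowed in robots.txt and never linked from any page.
TRAP_PATH = "/private-archive/"

ROBOTS_TXT = f"User-agent: *\nDisallow: {TRAP_PATH}\n"


class TrapDetector:
    """Flag any client that requests a path robots.txt told it to stay out of."""

    def __init__(self, trap_path: str = TRAP_PATH) -> None:
        self.trap_path = trap_path
        self.flagged = set()  # client identifiers that ignored robots.txt

    def observe(self, client_id: str, request_path: str) -> bool:
        """Record one request; return True if the client is (now) flagged."""
        if request_path.startswith(self.trap_path):
            self.flagged.add(client_id)
        return client_id in self.flagged
```

Because nothing links to the trap path, a well-behaved archiver that honors robots.txt never reaches it, which is what keeps the false-positive risk low for this particular trigger.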

Janneke

Janneke
@janneke@todon.nl

Föderation EN Do 23.01.2025 21:00:04

@clive @jasonkoebler @404mediaco
Cc: @corbet

0x 0 0x

Luna chan

Luna chan
@Luna@mastodon.world

Föderation EN Do 23.01.2025 21:03:02

@clive @jasonkoebler @404mediaco What a great idea.

0x 0 0x

lin11c
@lin11c@toad.social

Föderation EN Do 23.01.2025 21:03:04

@clive @jasonkoebler @404mediaco
Very cool! I like it.

0x 0 0x

Sue Briccay :verifiedace:

Sue Briccay :verifiedace:
@essjayjay@tech.lgbt

Föderation EN Do 23.01.2025 21:08:47

@clive

This is beautiful.
Good work by them!

@jasonkoebler @404mediaco

0x 0 0x

Skybrook
@skybrook@pone.social

Föderation EN Do 23.01.2025 21:20:01

@clive I once made a webpage that would continually slowly send random words and links to itself, never quite closing the connection. It's honestly not worth the trouble. It'd be nice if it interfered with AI training, though.

0x 1 0x

Frank Heijkamp

Frank Heijkamp
@alterelefant@mastodontech.de

Föderation NL Do 23.01.2025 23:57:19

@skybrook @clive The idea is that a human will have to review it and flip a switch that will exclude the entire site. This exclusion will keep the actual content on the site safe from being ingested.

0x 1 0x

Skybrook
@skybrook@pone.social

Föderation NL Fr 24.01.2025 00:50:26

@alterelefant That's tricky, since the crawlers have vast amounts of IP addresses. I just set traps to detect web spiders automatically, if traffic gets to be a problem.

0x 1 0x

Frank Heijkamp

Frank Heijkamp
@alterelefant@mastodontech.de

Föderation NL Fr 24.01.2025 07:20:17

@skybrook Don't filter by IP-address, but filter by behavior. I know, that's sometimes easier said than done.

The following one is straightforward. A GET request to a bogus link in the infinite labyrinth qualifies for a labyrinth response, whether the IP address is known or a new one.

With a labyrinth response I would throw in a random delay between 100 ms and 5 s, and a one in fifty chance of a 30 s delay before responding with an HTTP 503. That should usually be enough to slow down crawlers.

0x 1 0x
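
The delay policy above can be sketched as a small function. One assumption about the wording: the HTTP 503 is taken to apply only to the one-in-fifty 30 s case, with ordinary labyrinth responses returning a normal page after the shorter delay. The `labyrinth_plan` name is made up.

```python
import random
from typing import Optional, Tuple


def labyrinth_plan(rng: Optional[random.Random] = None) -> Tuple[float, int]:
    """Return (delay_seconds, http_status) for one labyrinth response."""
    if rng is None:
        rng = random.Random()
    if rng.randrange(50) == 0:          # one-in-fifty chance
        return 30.0, 503                # long stall, then "Service Unavailable"
    return rng.uniform(0.1, 5.0), 200   # sham page after 100 ms to 5 s
```

The request handler would wait for the returned delay and then answer with the returned status, regardless of which IP address the request came from.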

Skybrook
@skybrook@pone.social

Föderation NL Fr 24.01.2025 17:53:01

@alterelefant Well right, that's what I meant by "traps to detect." I didn't think of setting it so every URL for any detected IP address would become a labyrinth response... not a bad idea really.

0x 1 0x

Frank Heijkamp

Frank Heijkamp
@alterelefant@mastodontech.de

Föderation EN Fr 24.01.2025 20:43:22

@skybrook Crawlers that use multiple endpoints to distribute the crawl load will hand out URLs to be crawled to those endpoints. Their freshly acquired labyrinth links will make a new endpoint immediately identifiable.

0x 0 0x

Dr Power Nap, DDS ✅️

Dr Power Nap, DDS ✅️
@ThePowerNap@mefi.social

Föderation EN Do 23.01.2025 21:39:16

@clive @jasonkoebler @404mediaco

Might be nice to add something to poison the data, contradictory statements, things that break the tokenizer, maybe subtle statistical tricks to inject gnarly statements.

0x 1 0x

Alistair K

Alistair K
@libroraptor@mastodon.nz

Föderation EN Do 23.01.2025 22:29:04

@ThePowerNap @clive @jasonkoebler @404mediaco Like a little Markov chain text generator? Cheaper than an LLM yet maybe good enough to pass for 'real' text in a training set.

0x 2 0x
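
For anyone wondering what "a little Markov chain text generator" amounts to, here is a word-level sketch. The function names and the restart-on-dead-end behavior are illustrative choices, and the corpus would be whatever text you have lying around.

```python
import random
from collections import defaultdict


def build_chain(text: str) -> dict:
    """Map each word to the list of words that follow it in `text`."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def babble(chain: dict, length: int, rng: random.Random) -> str:
    """Generate `length` words by repeatedly sampling a follower of the last word."""
    word = rng.choice(sorted(chain))
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        # Dead end (word only appears at the very end of the corpus): restart.
        word = rng.choice(followers) if followers else rng.choice(sorted(chain))
        out.append(word)
    return " ".join(out)
```

This is just table lookups and random picks, so it costs next to nothing per page compared with running an LLM, while still producing locally plausible word sequences.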

Dr Power Nap, DDS ✅️

Dr Power Nap, DDS ✅️
@ThePowerNap@mefi.social

Föderation EN Do 23.01.2025 23:07:45

@libroraptor @clive @jasonkoebler @404mediaco

I like where your head is at

0x 1 0x

Alistair K

Alistair K
@libroraptor@mastodon.nz

Föderation EN Fr 24.01.2025 02:53:11

@ThePowerNap @clive @jasonkoebler @404mediaco my head's a lot like that pink labyrinth in the picture, to be honest, but with more dimensions and no indication of whether entrance or exit exist or make sense

I wish that I still had the brain capacity and physical energy to implement even half of the ideas that I come up with.

0x 1 0x

Clive Thompson

Clive Thompson
@clive@saturation.social

Föderation EN Fr 24.01.2025 03:31:28

@libroraptor @ThePowerNap @jasonkoebler @404mediaco

markov mazes

0x 0 0x

Bornach

Bornach
@bornach@fosstodon.org

Föderation EN Fr 24.01.2025 09:03:26

@libroraptor @ThePowerNap @clive @jasonkoebler @404mediaco

[f4mi] used wiki pages to which simplistic synonym substitution has been applied using a Python script
https://youtu.be/NEDFUjqA1s8
Confused the hell out of some of the AI scraper/summarisers.

0x 1 0x
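
The actual script from the video is not reproduced here. As a guess at what "simplistic synonym substitution" could look like, this sketch swaps whole words using a tiny hand-made table. The table contents and the `substitute` name are invented for illustration.

```python
import re

# Hypothetical substitution table; a real one would be far larger.
SYNONYMS = {"big": "large", "fast": "quick", "begin": "commence", "use": "utilize"}


def substitute(text: str, table: dict = SYNONYMS) -> str:
    """Replace whole words found in `table`, keeping a leading capital letter."""

    def swap(match):
        word = match.group(0)
        replacement = table.get(word.lower())
        if replacement is None:
            return word  # not in the table: leave untouched
        return replacement.capitalize() if word[0].isupper() else replacement

    # Match runs of letters so punctuation and spacing pass through unchanged.
    return re.sub(r"[A-Za-z]+", swap, text)
```

For example, `substitute("Big engines begin fast.")` returns `"Large engines commence quick."`, which is still readable but subtly off, the kind of degradation the post describes.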

Alistair K

Alistair K
@libroraptor@mastodon.nz

Föderation EN Fr 24.01.2025 09:56:33

@bornach @ThePowerNap @clive @jasonkoebler @404mediaco That's very funny! Turning classic SEO pervert techniques to greater good.

Also a good presenter. I rarely manage to listen to youtube talks – too much irrelevant babbling and metatalk, but this person has a clear narrative and stays on track.

0x 0 0x

chikl
@chikl@digitalcourage.social

Föderation EN Do 23.01.2025 21:42:35

@clive @jasonkoebler @404mediaco On one hand if you want to protect your art or something similar from being scraped because you sell it or just don't want your style to be stolen it's nice having such tools. But on the other hand if you value human rights and let your values influence your texts and pictures then you can influence AIs with your input.

I hate that AI has bad influence on the environment because it needs so much computing resources but we cannot stop AI so this can be a small influence from ourselves.

0x 1 0x

Aaron

Aaron
@aaron@chirp.zadzmo.org

Föderation EN Fr 24.01.2025 01:03:51

@chikl There is nothing stopping us, as a species, from deciding this isn't a good technological path and just unplugging it all. We made this thing and we can unmake it.

But the only people with power are "Number must go up" type billionaires who are doubling down on it.

You bring up human rights, too - from that angle, people should have a right to consent to be included in these things. I don't consent to it. So I grew myself some spikes and put Nepenthes out there for others to grow some as well.

@clive @jasonkoebler @404mediaco

0x 0 0x

★ blue-caller ☆
@bluecaller@urusai.social

Föderation EN Do 23.01.2025 21:52:39

@clive @jasonkoebler @404mediaco the house of leaves but you can't leave

0x 1 0x

Trebach

Trebach
@trebach@functional.cafe

Föderation EN Do 23.01.2025 23:38:21

@bluecaller @clive @jasonkoebler @404mediaco The Hotel California

0x 0 0x

seibelsays

seibelsays
@seibelsays@mstdn.party

Föderation EN Do 23.01.2025 22:06:16

@404mediaco @clive @jasonkoebler Love this for the bots and scrapers. Choke on it.

0x 0 0x

Damiano Gacík
@Damiano_Chech@techhub.social

Föderation EN Do 23.01.2025 22:13:09

@clive @jasonkoebler @404mediaco Brilliant idea! 😂 I can just imagine AI scrapers struggling to process an endless stream of random pages. It's like trolling on level 80 — mad respect to the hacker for the creativity!

0x 0 0x

shoop

shoop
@stevehooper@indieweb.social

Föderation EN Do 23.01.2025 22:14:22

@clive @jasonkoebler @404mediaco This makes me wonder if it would be possible to insert garbage into rendered HTML (to confuse bots) and something like Nightshade into the rendered page (to poison image downloading and screenshot OCR) both in ways that aren't distracting to human readers.

0x 0 0x

Matasoft

Matasoft
@matasoft@mastodon.world

Föderation HR Do 23.01.2025 22:28:05

@clive @jasonkoebler @404mediaco I have difficulty understanding why someone would want to protect their own website from scraping. Isn't the primary reason for having a website the desire to show it to the world and spread particular information? I don't understand the fuss.

0x 4 0x

Frank Heijkamp

Frank Heijkamp
@alterelefant@mastodontech.de

Föderation NL Fr 24.01.2025 00:03:23

@matasoft @clive @jasonkoebler @404mediaco If all LLM platforms always provided direct links, it would indeed bring people to your site. But the fact is that most LLMs just steal your content without giving any credit. That's what this is for.

0x 0 0x

tellyworth

tellyworth
@tellyworth@ioc.exchange

Föderation EN Fr 24.01.2025 00:37:20

@matasoft @clive @jasonkoebler @404mediaco there are masses of AI crawler bots that can easily overwhelm a web site with traffic. They don’t throttle or follow limits or respect robots.txt or other conventions. They’ll easily overwhelm available bandwidth and take down a small web site.

It’s 100% reasonable (and necessary) to consider them hostile.

0x 0 0x

kobajo

kobajo
@kobajo@kind.social

Föderation EN Fr 24.01.2025 07:49:36

@matasoft disrespectful scraping for LLMs removes attribution, is non-consensual, has no methods for rectification (misinfo), mixes your data with others' (sometimes criminally sourced) and often is implemented so badly that it causes load/stability/cost issues to the sites. And this is not "one shot", but happens all the time. No one wants this except the grifters.

0x 0 0x

econads

econads
@econads@mendeddrum.org

Föderation HR Fr 24.01.2025 08:55:38

@matasoft
Guess people want the world to know it's their work. Also there was a post from an admin here complaining that AI scrapers aren't as smart as normal e.g. search engine scrapers and come back every 10 minutes to scrape the exact same data, costing his small instance ridiculous money. And stuff like this here, no I don't need my random musings distributed elsewhere.

Assuming you were asking in good faith.

@clive @jasonkoebler @404mediaco

0x 1 0x

Clive Thompson

Clive Thompson
@clive@saturation.social

Föderation HR Fr 24.01.2025 17:47:37

@econads @matasoft @jasonkoebler @404mediaco

Yep

and some of the objection to mass-scraping-by-AI-firms is that the AI firms are not helping to provide new audiences for one's online writing

quite the opposite, possibly ...

... given that as more people start "chatting with an LLM AI" instead of searching the web ...

... they become happy with the good-enough answer they get from the LLM, and never bother to go to the sites the answers are based upon

0x 1 0x

Matasoft

Matasoft
@matasoft@mastodon.world

Föderation HR Fr 24.01.2025 17:58:39

@clive @econads @jasonkoebler @404mediaco well, Perplexity AI, for example, provides citation links by default.
So I am not sure that it is wise for a website owner to block it. More and more, LLMs will replace classical search engines.
On the other hand, if you have a secret to hide, why publish it on a website in the first place?

0x 0 0x

📄 Mehdi.doc

📄 Mehdi.doc
@mehdi_benadel@mastodon.balamb.fr

Föderation EN Do 23.01.2025 22:29:04

@clive @jasonkoebler @404mediaco not all heroes wear capes

0x 0 0x

David Grieve

David Grieve
@davidgrieve@mastodon.world

Föderation EN Do 23.01.2025 22:38:14

@clive @jasonkoebler @404mediaco
This is how we defeat Skynet.
If you are hearing this message, you are the resistance.

0x 0 0x

David B. Himself

David B. Himself
@DavidBHimself@firefish.city

Föderation EN Do 23.01.2025 22:41:19

@clive@saturation.social @jasonkoebler@mastodon.social @404mediaco@mastodon.social Yes, please.

0x 0 0x

2¢
@Qbitzerre@unbound.social

Föderation EN Do 23.01.2025 22:56:50

@clive @jasonkoebler @404mediaco there was a time when people wanted their pages to be scraped and indexed. Balkanization of the Web. The battle for hegemony of information. Now we're injecting poison into the process. It's like chemotherapy.

0x 1 0x

Frank Heijkamp

Frank Heijkamp
@alterelefant@mastodontech.de

Föderation NL Fr 24.01.2025 00:05:14

@Qbitzerre @clive @jasonkoebler @404mediaco Indeed a good analogy, to get rid of the cancer that LLM training sets are to copyright.

0x 0 0x

Whiskers
@ecoscore@aus.social

Föderation EN Do 23.01.2025 23:02:10

@clive @jasonkoebler @404mediaco Daisy Daisy give me your ans w e r d o o

0x 0 0x

piperef
@piperef@mastodon.social

Föderation EN Do 23.01.2025 23:28:17

@clive

Does anyone know if these web crawlers also scrape the content of Kindle e-books? Can they get into these kinds of products?
Is anything safe from this kind of scraping?
How can we protect internet content against it in general? A standard website will be scraped easily, won't it?

@jasonkoebler @404mediaco

0x 2 0x

Frank Heijkamp

Frank Heijkamp
@alterelefant@mastodontech.de

Föderation NL Fr 24.01.2025 00:06:41

@piperef @clive @jasonkoebler @404mediaco Bezos probably already sold everything to those AI houses.

0x 0 0x

@StarkRG@myside-yourside.net

Föderation EN Fr 24.01.2025 00:23:59

@piperef @clive @jasonkoebler @404mediaco If it can be read, it can be scraped. You can mitigate the issue (often by putting it behind an account wall), but not eliminate it entirely. The film industry has been desperately trying to stop piracy and I have yet to see a situation where a movie was released but wasn't available on piracy sites.

But also, yeah, if it's kindle, it's probably already part of Amazon's AI dataset.

0x 1 0x

Clive Thompson

Clive Thompson
@clive@saturation.social

Föderation EN Fr 24.01.2025 00:37:06

@StarkRG @piperef @jasonkoebler @404mediaco

yeah, I am sure this is true

I read recently, though I can't find the source (still looking, will update if I can find it) that US AI firms used corpuses of cracked western ebooks that circulate in Russia etc, for training

0x 1 0x

piperef
@piperef@mastodon.social

Föderation EN Fr 24.01.2025 09:57:05

@clive

This is so ironic. Definitely plausible, but also scary.

@StarkRG @jasonkoebler @404mediaco

0x 1 0x

Clive Thompson

Clive Thompson
@clive@saturation.social

Föderation EN Fr 24.01.2025 17:39:36

@piperef @StarkRG @jasonkoebler @404mediaco

yeah, alas

0x 0 0x

stux⚡
@stux@mstdn.social

Föderation EN Do 23.01.2025 23:28:59

@clive

love it

0x 0 0x

Coach Pāṇini ®

Coach Pāṇini ®
@paninid@mastodon.world

Föderation EN Do 23.01.2025 23:47:09

@clive @jasonkoebler @404mediaco

0x 0 0x

JustRosy

JustRosy
@JustRosy@universeodon.com

Föderation EN Do 23.01.2025 23:59:58

@clive @jasonkoebler @404mediaco @piperef Maybe cross this tech with The Mandelbrot Set to keep the AI web crawlers from detecting that they're being trapped. Is that possible, and would it work?

https://duckduckgo.com/?t=ffab&q=The+Mandelbrot+Set&iax=images&ia=images

Also, maybe what's needed is a way to make this technology more user friendly and safe for average website owners/developers to use.

What if there were a single "container" website used for this purpose, that the links would all point to, which would then have the AI trap hosted on that site? What if the way that site was programmed, it would then point the AI web crawlers to the sites owned by the websites that deployed them, to serve up infinite content from the sites the AI web crawlers were sent from in the first place?

0x 0 0x

CaD017

CaD017
@CaD017@mastodon.social

Föderation EN Fr 24.01.2025 00:09:33

@clive finally some good news

0x 0 0x

Bernd Herd
@herdsoft@gruene.social

Föderation EN Fr 24.01.2025 00:18:47

@clive @jasonkoebler @404mediaco And google?

Why do some people think that offering their server data to Google is OK, so they will be found, but competing search engines are evil?

0x 0 0x

peterfr

peterfr
@peterfr@mastodon.art

Föderation EN Fr 24.01.2025 00:22:59

@clive @jasonkoebler @404mediaco

A curated list of strategies, offensive methods, and tactics for (algorithmic) sabotage, disruption, and deliberate poisoning.

https://tldr.nettime.org/@asrg/113867412641585520

0x 1 0x

Clive Thompson

Clive Thompson
@clive@saturation.social

Föderation EN Fr 24.01.2025 00:24:04

@peterfr @jasonkoebler @404mediaco

damn, I hadn't seen that, super fascinating! thank you for pointing it out

0x 0 0x

Hybrid 🐘 Elephant

Hybrid 🐘 Elephant
@HybridElephant@musicians.today

Föderation EN Fr 24.01.2025 00:23:59

@clive @jasonkoebler @404mediaco

i recall a similar device i came across 30+ years ago, which was a device that made dynamic lists of fake email addresses on a page that had lots of internal links, designed to ensnare spam-harvesting bots. this looks very similar... 👍👍

https://spampoison.com/

0x 1 0x

Clive Thompson

Clive Thompson
@clive@saturation.social

Föderation EN Fr 24.01.2025 00:37:44

@HybridElephant @jasonkoebler @404mediaco

yes!!

Some folks mentioned this elsewhere in the thread

I'd not heard about this back in the day, but it makes sense someone did this

0x 0 0x

Different Drummer

Different Drummer
@DifferentDrummer@syzito.xyz

Föderation EN Fr 24.01.2025 00:28:12

@clive "so the crawler gets stuck crawling and scraping endless and meaningless pages" So, like if it landed on the Daily Mail site on any normal day...

0x 1 0x

Clive Thompson

Clive Thompson
@clive@saturation.social

Föderation EN Fr 24.01.2025 00:35:16

@DifferentDrummer

lol yes

or, really, Tiktok or Facebook or any algorithmically-juked social media feed

0x 0 0x

MigMit

MigMit
@migmit@mstdn.social

Föderation EN Fr 24.01.2025 00:48:32

@clive @jasonkoebler @404mediaco So, let me get this straight: he created a possible way to stop AIs, that would also stop search engines from indexing his site, even though being indexed is usually what a publicly accessible site wants. Somebody who, by their claim, is an AI CEO says that it'd be easy to avoid, and he counters that Google did not avoid it. So, basically, he succeeded in stopping search engines, but not AI bots. Nice work.

0x 0 0x

Maddad ☑️

Maddad ☑️
@maddad@mastodon.world

Föderation EN Fr 24.01.2025 00:55:39

@clive @jasonkoebler @404mediaco

That's a great idea 👍

0x 0 0x

Oliver Vanderb
@Ollivdb@nrw.social

Föderation EN Fr 24.01.2025 00:58:13

@clive @jasonkoebler @404mediaco

I just serve them a 20 GB textfile with some sort of Lorem Ipsum.

0x 0 0x

feld

feld
@feld@friedcheese.us

Föderation · Fr 24.01.2025 01:21:13

@clive @jasonkoebler @404mediaco what a waste of CPU and bandwidth and electricity

0x 0 0x

SpaceLifeForm

SpaceLifeForm
@SpaceLifeForm@infosec.exchange

Föderation EN Fr 24.01.2025 01:33:05

@clive @jasonkoebler @404mediaco

Feed the LLM scrapers only the finest bullshit.

Let them shine it.

#Nepenthes

0x 0 0x

JohnW

JohnW
@TheEffekt@indieweb.social

Föderation EN Fr 24.01.2025 01:37:39

@clive @jasonkoebler @404mediaco

I love this from the perspective of tanking AI, but hate it from the perspective of the massive amounts of water a process like this must use.

0x 0 0x

Maria Langer | 📝 🎬 ⚒️🛥️

Maria Langer | 📝 🎬 ⚒️🛥️
@mlanger@mastodon.world

Föderation EN Fr 24.01.2025 01:53:42

@clive @maximum_mew @jasonkoebler @404mediaco I'd like to be able to install this on every website I run.

0x 1 0x

Clive Thompson

Clive Thompson
@clive@saturation.social

Föderation EN Fr 24.01.2025 01:58:28

@mlanger @maximum_mew @jasonkoebler @404mediaco

it'd be pretty funny to see it offered as a one-click option on a hosting provider

0x 1 0x

Maria Langer | 📝 🎬 ⚒️🛥️

Maria Langer | 📝 🎬 ⚒️🛥️
@mlanger@mastodon.world

Föderation EN Fr 24.01.2025 03:07:37

@clive @maximum_mew @jasonkoebler @404mediaco Sure would. I'd pay to click.

0x 1 0x

Clive Thompson

Clive Thompson
@clive@saturation.social

Föderation EN Fr 24.01.2025 03:31:06

@mlanger @maximum_mew @jasonkoebler @404mediaco

right?

0x 0 0x

S38

S38
@sab38@infosec.exchange

Föderation EN Fr 24.01.2025 02:07:10

@clive

I did something like this 20+ years ago. Simple perl cgi to generate completely random text. It was just a tiny personal site but occasionally something would walk into it.

I'm recreating the site now and just have a small loop of pages. There is a period at the end of one sentence that links into a slightly different loop. Occasionally I see what looks like a real user going through the loop. Other times the hidden url will get hit likely from a web crawler.

I have slightly different loops for my default page versus my named websites. Most scanning seems to come via IP address. A tiny bit of traffic arrives using the name from a self-signed cert that was briefly used as the default cert.

@jasonkoebler @404mediaco

0x 1 0x
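For readers wondering what "generate random text and link into a loop" can look like in practice, here is a minimal Python sketch. It is not the tool from the article and not S38's Perl script; the names `render_page`, `extract_links`, and `WORDS` are hypothetical, and the design (seeding the random generator from a hash of the request path) is an assumption chosen so that the maze needs no storage: every URL always yields the same filler page, and every page links to deeper URLs.

```python
import hashlib
import random
import re

# Hypothetical word list; a real trap would likely use a much larger corpus.
WORDS = ["amber", "cable", "drift", "ember", "fable", "glyph",
         "harbor", "island", "lantern", "mirror", "nectar", "orbit"]


def render_page(path: str, n_links: int = 3, n_words: int = 40) -> str:
    """Return a filler HTML page for `path`, linking to deeper pages.

    The RNG is seeded from a hash of the path, so the same URL always
    produces the same page without anything being stored on disk.
    """
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))

    # Meaningless filler text.
    text = " ".join(rng.choice(WORDS) for _ in range(n_words))

    # Child links live "below" the current path, so the maze never ends.
    base = path.rstrip("/")
    links = [f"{base}/{rng.getrandbits(32):08x}" for _ in range(n_links)]
    anchors = "\n".join(f'<a href="{href}">{href}</a>' for href in links)

    return f"<html><body><p>{text}</p>\n{anchors}</body></html>"


def extract_links(html: str) -> list[str]:
    """Pull href targets out of a generated page (roughly what a crawler does)."""
    return re.findall(r'href="([^"]+)"', html)
```

A crawler that follows `extract_links(render_page("/maze"))` gets three new URLs, each of which renders another page with three more, and so on. Wiring this to a web server and deciding which visitors get routed into it are separate problems that this sketch does not address.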

Clive Thompson

Clive Thompson
@clive@saturation.social

Föderation EN Fr 24.01.2025 03:33:40

@sab38 @jasonkoebler @404mediaco

right on!

0x 0 0x

NoctisEqui 🇺🇦🇵🇸🇪🇹🏳️‍🌈

NoctisEqui 🇺🇦🇵🇸🇪🇹🏳️‍🌈
@NoctisEqui@mastodonapp.uk

Föderation EN Fr 24.01.2025 02:36:54

@clive @jasonkoebler @404mediaco

How do I get one? I love the mere concept!

0x 1 0x

Clive Thompson

Clive Thompson
@clive@saturation.social

Föderation EN Fr 24.01.2025 03:32:13

@NoctisEqui @jasonkoebler @404mediaco

I think it requires some knowledge of server-side coding to really implement these

but a hosting provider could, if it wanted to, make it one-click installable

0x 1 0x

NoctisEqui 🇺🇦🇵🇸🇪🇹🏳️‍🌈

NoctisEqui 🇺🇦🇵🇸🇪🇹🏳️‍🌈
@NoctisEqui@mastodonapp.uk

Föderation EN Mo 27.01.2025 23:23:06

@clive @jasonkoebler @404mediaco

I figured that. It is a wonderful notion tho! Here’s to our heroes, fighting on the Tech Front!

0x 0 0x

HTPC NZ
@htpcnz@mastodon.social

Föderation EN Fr 24.01.2025 03:23:45

@clive @jasonkoebler @404mediaco It won't be long before AI tech bros pay governments to pass laws making it illegal to stop them scraping, or for anyone to poison their data sets, except of course when we try scraping their platforms. Similar to how it is illegal for us to use copyrighted data, while for them it is a free pass even without any exemptions.

0x 1 0x

Clive Thompson

Clive Thompson
@clive@saturation.social

Föderation EN Fr 24.01.2025 03:30:57

@htpcnz @jasonkoebler @404mediaco

yeah, I can see that happening

0x 0 0x

HTPC NZ
@htpcnz@mastodon.social

Föderation EN Fr 24.01.2025 03:26:06

@clive @jasonkoebler @404mediaco It won't be long before AI tech bros pay governments to pass laws making it illegal to stop them scraping sites, or for anyone to poison their data sets, except of course when we try scraping their platforms. Similar to how it is illegal for us to use copyrighted data, while for them it is a free pass even without any exemptions.

0x 0 0x

Chris Bussard

Chris Bussard
@cwbussard@ioc.exchange

Föderation EN Fr 24.01.2025 04:35:18

@clive @jasonkoebler @404mediaco

So it's wpoison for AI training bots? Truly there's nothing new under the sun.

https://web.archive.org/web/20160821195248/http://www.monkeys.com:80/wpoison/

0x 1 0x

Clive Thompson

Clive Thompson
@clive@saturation.social

Föderation EN Fr 24.01.2025 17:51:45

@cwbussard @jasonkoebler @404mediaco

yep

0x 0 0x

iveyline

iveyline
@iveyline@mastodon.world

Föderation EN Fr 24.01.2025 05:20:10

@clive How cool is that!

0x 1 0x

Clive Thompson

Clive Thompson
@clive@saturation.social

Föderation EN Fr 24.01.2025 17:51:39

@iveyline

trippy, eh?

0x 0 0x

Mani and the Nonos

Mani and the Nonos
@maniandthenonos@mas.to

Föderation EN Fr 24.01.2025 06:00:25

@clive @jasonkoebler @404mediaco If I could love this post one million times, I would. It is pretty clear now that beyond the pathetic and socially awkward filthy rich human oligarchs you saw on TV this week, AI has become the real enemy. Overfeed it and confuse it. Sounds like we're going to have to recreate Jorge Luis Borges' Library of Babel. An infinite word labyrinth where the machine loses its mind, alone and vanquished at last. How poetic. And what a trip.

0x 1 0x

Clive Thompson

Clive Thompson
@clive@saturation.social

Föderation EN Fr 24.01.2025 17:51:10

@maniandthenonos @jasonkoebler @404mediaco

this really is poetry, right here

digital-age poetry, autogenerated, as a defense mechanism

Borges would be giving this stuff a double-shooty-fingers

0x 1 0x

Mani and the Nonos

Mani and the Nonos
@maniandthenonos@mas.to

Föderation EN Fr 24.01.2025 20:41:21

@clive @jasonkoebler @404mediaco I would love to see this headline, "Borges gives the double bird to Sam Altman from beyond". My grandfather fought against the Nazis. We'll be fighting against AI.

0x 0 0x

teledyn 𓂀