
Clive Thompson

A hacker developed an "infinite maze" to trap web-crawlers/scrapers from AI companies

basically, if the server code detects that a web crawler from an AI firm is trying to scrape the site ...

... the code begins spinning up an infinite, nesting warren of new sham pages, filled with random text

so the crawler gets stuck crawling and scraping endless and meaningless pages

fun @jasonkoebler piece at @404mediaco
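
A minimal sketch of the "infinite maze" idea described above, not the actual code of the tool from the article. It assumes each page is derived purely from its URL path, so the maze needs no storage and never runs out; `WORDS`, `maze_page` and the `/maze` prefix are hypothetical names chosen for illustration.

```python
import hashlib
import random

# Hypothetical filler vocabulary; a real deployment would use a much larger list.
WORDS = ["amber", "basin", "cinder", "delta", "ember", "fathom", "gully", "hollow"]


def maze_page(path: str, n_links: int = 5, n_words: int = 40) -> str:
    """Return an HTML page whose text and links depend only on the path."""
    # Seeding from the path keeps every page stable across visits
    # without saving anything on the server.
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))

    text = " ".join(rng.choice(WORDS) for _ in range(n_words))

    # Every link points one level deeper, so the crawler never reaches a leaf.
    base = path.rstrip("/")
    links = [f"{base}/{rng.choice(WORDS)}-{rng.randrange(10**6)}" for _ in range(n_links)]
    anchors = "\n".join(f'<a href="{href}">{href}</a>' for href in links)

    return f"<html><body><p>{text}</p>\n{anchors}\n</body></html>"
```

Calling `maze_page("/maze")` returns a page with five links below `/maze/`, and each of those paths yields another page of the same shape.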


Valentino Gagliardi

@clive what a waste from both sides

Clive Thompson

yep, I think that's basically the point of it


@clive @gagliardi_vale
Job creation for data annotators in India, Nigeria, Vietnam, etc who have been given the microtasks of removing any junk like this from the AI training data.


@gagliardi_vale @clive this is what we're doing, instead of scrambling to salvage our odds of surviving this century as a species


@clive @jasonkoebler @404mediaco
Can this be used as a method for creating blockdrain creeptocoin con? 🤔

Tom Bortels

@clive @jasonkoebler @404mediaco

Was just on a thread a week or so ago about what to do with aggressive AI web scrapers that won't self-limit or respect robots.txt.

This is evolution in action.

Nature is healing.

Chris Real

@clive @tbortels @jasonkoebler @404mediaco

It's a practical application of "GIGO".

Ahh, there's a place for everything—and GIGO has finally found its place!

🐧DaveNull🐧 ☣️pResident Evil☣

@tbortels Is there even such a thing as "non-aggressive AI web scrapers" that will self-limit and respect robots.txt?

At least Google's and micro$hit's ignore robots.txt. They downloaded photos from my gallery, up to 6000 requests a day, more than once… I bet not even 10 of them are legit users.

I've only 38 photos… stupid bots download the same photos over and over again…

I've blocked 4 IP ranges. That probably includes indexing bots' IPs but I don't give an F.

@clive @jasonkoebler @404mediaco

Tom Bortels

@devnull @clive @jasonkoebler @404mediaco

I felt obligated to disclaim my fantasy well-behaved AI scrapers just in case. The actual headcount there may well be zero.


@tbortels @devnull @clive @jasonkoebler @404mediaco
There is such a thing as a non-aggressive, respectful AI scraper. It's called asking for permission from the copyright owner and obtaining an appropriate license if their AI system can generate derivative works using your content.

Tom Bortels

@bornach @devnull @clive @jasonkoebler @404mediaco

Alas - those scrapers are out of scope because they're not the ones causing problems and driving this conversation. Indeed - if someone licensed content legitimately, the need to scrape the web would be absent - there are far more efficient ways to say "here are all of the new posts in the last N hours".

You can safely assume any automation ignoring your robots.txt is a pest to be ruthlessly crushed in whatever manner amuses you most.

Clive Thompson

@tbortels @bornach @devnull @jasonkoebler @404mediaco

yep -- licensing would obviate the hassles of scraping

"here's our API, enjoy"

Kevin Freitas

@404mediaco @clive @jasonkoebler Love this! I built a simple plugin that garbles your web content to serve them up garbage:


Koos Looijesteijn

@clive if LLMs are going to be half as good as they're promising they should be already, then millions of websites will serve endless LLM-generated content like that. Creating a really expensive infinite loop. Way to spend 500B $.


@koos @clive
This is already happening, and has already happened
AI has created its own tarpit. Without human annotators to filter the crap out, they risk becoming the yeast in a Petri dish that slowly poisons itself with the very same alcohol it generates.

Lucas C. Wheeler

@clive @jasonkoebler @404mediaco I love that it's called Nepenthes. One of the coolest plant genera!

Lucas C. Wheeler

@risottobias @clive @jasonkoebler @404mediaco In the article it says "The program, called Nepenthes after the genus of carnivorous pitcher plants which trap and consume their prey, can be deployed by webpage owners to protect their own content from being scraped or can be deployed “offensively” as a honeypot trap to waste AI companies’ resources." It's a fitting name.


@risottobias I'm not much of a Star Wars fan, and was unaware of that reference when I named it.


@lcwheeler @clive @jasonkoebler @404mediaco


@clive @jasonkoebler @404mediaco

I've seen stories about people hosting sites that got hit by robots and they had to pay a bunch of money in data costs. I wonder how this works, if it can help in that regard when the whole point is to keep them pointed at your site.

I'm all for wasting their time, I just wonder how much it costs.

Lord Thomas Klopf of Bohemia

Federation EN Thu 23.01.2025 20:52:59

@RnDanger @clive @jasonkoebler @404mediaco yeah, you’d have to host this on a service that doesn’t charge by network traffic


@RnDanger@infosec.exchange @clive@saturation.social @jasonkoebler@mastodon.social @404mediaco@mastodon.social

Employ bandwidth throttling at about 16K with a few hundred thousand link trees to follow. That will really teach them and save your bandwidth bill.
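
A rough sketch of the throttling suggestion above, assuming "16K" means roughly 16,000 bytes per second (the post does not say). `throttled_chunks` and its parameters are hypothetical; the `sleep` argument is injectable so the pacing can be checked without actually waiting.

```python
import time
from typing import Callable, Iterator


def throttled_chunks(
    payload: bytes,
    bytes_per_second: int = 16_000,  # assumed reading of "16K"
    chunk_size: int = 1_024,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[bytes]:
    """Yield the payload in small chunks, pausing after each one."""
    # Time one chunk should take at the target rate.
    pause = chunk_size / bytes_per_second
    for offset in range(0, len(payload), chunk_size):
        yield payload[offset:offset + chunk_size]
        sleep(pause)
```

A streaming response handler could iterate over `throttled_chunks(body)` and write each chunk as it arrives, which keeps the crawler's connection busy while capping the bytes sent per second.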


@RnDanger @clive @jasonkoebler @404mediaco This would need to be deployed to a server with a fixed cost, not one with extra costs for execution time or bandwidth

Einfach Nein :verified:

Finally the equivalent of the mail tar pit!



@clive @jasonkoebler @404mediaco Tip of the Cub cap to the hacker!


@clive @jasonkoebler @404mediaco Are we really getting Barrier Mazes from Ghost In the Shell??

Megan Lynch (she/her)

@clive @jasonkoebler @404mediaco I think this stuff has been catching archiving sites too, though. So you try to archive a site for accountability purposes and it ends up not being able to be archived because it just churns forever.

Clive Thompson

@meganL @jasonkoebler @404mediaco

that danger leapt out at me too

Frank Heijkamp

@clive @meganL @jasonkoebler @404mediaco It depends on how the maze gets triggered. If robots.txt explicitly excludes a certain URL that is not linked from anywhere, and that URL sees a GET request, you can be 100% sure you have trapped a bogus crawler. Definitely go to town with it. Most crawlers are not that stupid, though, so in practice the triggers get slightly less reliable, and there is a chance you trap a legitimate crawler, such as an archiving site.
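
The robots.txt trigger described here could be sketched roughly as follows, using Python's standard `urllib.robotparser`. The `/trap/` path, the `ROBOTS_TXT` contents and the `is_trap_hit` helper are assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: /trap/ is disallowed and linked from nowhere,
# so a well-behaved crawler has no reason to ever request it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""

_parser = RobotFileParser()
_parser.parse(ROBOTS_TXT.splitlines())


def is_trap_hit(path: str, user_agent: str = "*") -> bool:
    """True if the requested path is one that robots.txt told crawlers to avoid."""
    return not _parser.can_fetch(user_agent, path)
```

A request handler would call `is_trap_hit(request_path)` and switch to maze responses for that client when it returns `True`; ordinary pages such as `/gallery/photo-1` are unaffected.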

Luna chan

@clive @jasonkoebler @404mediaco What a great idea.


@clive @jasonkoebler @404mediaco
Very cool! I like it.

Sue Briccay  :verifiedace:

This is beautiful.
Good work by them!

@jasonkoebler @404mediaco


@clive I once made a webpage that would continually slowly send random words and links to itself, never quite closing the connection. It's honestly not worth the trouble. It'd be nice if it interfered with AI training, though.

Frank Heijkamp

@skybrook @clive The idea is that a human will have to review it and flip a switch that will exclude the entire site. This exclusion will keep the actual content on the site safe from being ingested.


@alterelefant That's tricky, since the crawlers have vast amounts of IP addresses. I just set traps to detect web spiders automatically, if traffic gets to be a problem.

Frank Heijkamp

@skybrook Don't filter by IP-address, but filter by behavior. I know, that's sometimes easier said than done.

The following one is straightforward. A GET request to a bogus link in the infinite labyrinth qualifies for a labyrinth response, whether the IP address is known or a new one.

With a labyrinth response I would throw in a random delay between 100 ms and 5 s, and a one in fifty chance of a 30 s delay before responding with an HTTP 503. That should usually be enough to slow down crawlers.
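
A small sketch of that delay policy. It assumes the 503 goes only with the rare 30 s stall and that ordinary labyrinth pages return 200, which is one possible reading of the post; `labyrinth_response` is a hypothetical name.

```python
import random
from typing import Tuple


def labyrinth_response(rng: random.Random) -> Tuple[float, int]:
    """Return (seconds to wait, HTTP status) for one labyrinth request."""
    # One request in fifty: stall for 30 s, then answer 503.
    if rng.random() < 1 / 50:
        return 30.0, 503
    # Otherwise: a random delay between 100 ms and 5 s, then a normal page.
    return rng.uniform(0.1, 5.0), 200
```

The handler would sleep for the returned delay and then send the returned status, passing in a shared `random.Random()` instance.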


@alterelefant Well right, that's what I meant by "traps to detect." I didn't think of setting it so every URL for any detected IP address would become a labyrinth response... not a bad idea really.

Frank Heijkamp

@skybrook Crawlers that use multiple endpoints to distribute the crawl load will hand out URLs to be crawled to those endpoints. Their freshly acquired labyrinth links will make a new endpoint immediately identifiable.
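
One way to make labyrinth links recognisable from any endpoint is to sign them, so the server can tell a link it issued from a guessed URL without keeping a list. This is only a sketch under that assumption; `SECRET_KEY`, the `/maze/` prefix and both helper names are hypothetical.

```python
import hashlib
import hmac

SECRET_KEY = b"change-me"  # hypothetical server-side secret
PREFIX = "/maze/"


def make_labyrinth_link(page_id: int) -> str:
    """Build a maze URL that carries a short HMAC of its own page id."""
    token = hmac.new(SECRET_KEY, str(page_id).encode("ascii"), hashlib.sha256).hexdigest()[:16]
    return f"{PREFIX}{page_id}-{token}"


def is_labyrinth_link(path: str) -> bool:
    """True only for URLs this server generated itself."""
    if not path.isascii() or not path.startswith(PREFIX):
        return False
    page_id, sep, token = path[len(PREFIX):].partition("-")
    if not sep or not page_id.isdigit():
        return False
    expected = make_labyrinth_link(int(page_id)).rsplit("-", 1)[1]
    return hmac.compare_digest(token, expected)
```

Any client, whatever its IP address, that requests a path for which `is_labyrinth_link` returns `True` can only have obtained it from an earlier labyrinth page.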

Dr Power Nap, DDS ✅️

@clive @jasonkoebler @404mediaco

Might be nice to add something to poison the data, contradictory statements, things that break the tokenizer, maybe subtle statistical tricks to inject gnarly statements.

Alistair K

@ThePowerNap @clive @jasonkoebler @404mediaco Like a little Markov chain text generator? Cheaper than an LLM yet maybe good enough to pass for 'real' text in a training set.
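
For reference, a word-level Markov chain text generator of the kind mentioned here fits in a few lines. This is a generic sketch, not code from any tool in the thread; `build_chain` and `babble` are hypothetical names.

```python
import random
from collections import defaultdict
from typing import Dict, List


def build_chain(text: str) -> Dict[str, List[str]]:
    """Map each word to the list of words that follow it in the source text."""
    chain = defaultdict(list)
    words = text.split()
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def babble(chain: Dict[str, List[str]], start: str, length: int, rng: random.Random) -> str:
    """Generate `length` words by repeatedly picking a random follower."""
    word = start
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if followers:
            word = rng.choice(followers)
        else:
            # Dead end (word only appeared last): restart from any known word.
            word = rng.choice(sorted(chain))
        out.append(word)
    return " ".join(out)
```

Trained on a few paragraphs of real prose, `babble(chain, start, 200, random.Random())` produces text that is locally plausible but globally meaningless, at a tiny fraction of the cost of running an LLM.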

Alistair K

@ThePowerNap @clive @jasonkoebler @404mediaco my head's a lot like that pink labyrinth in the picture, to be honest, but with more dimensions and no indication of whether entrance or exit exist or make sense

I wish that I still had the brain capacity and physical energy to implement even half of the ideas that I come up with.


@libroraptor @ThePowerNap @clive @jasonkoebler @404mediaco

[f4mi] used wiki pages to which simplistic synonym substitution has been applied using a Python script
Confused the hell out of some of the AI scraper/summarisers.
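
The script itself is not shown in the thread, so the following is only a guess at what "simplistic synonym substitution" might look like. The `SYNONYMS` table and the `substitute` function are invented for illustration.

```python
import re

# Hypothetical synonym table; a real script would load a far larger one.
SYNONYMS = {"big": "large", "fast": "quick", "begin": "commence"}


def substitute(text: str, synonyms: dict = SYNONYMS) -> str:
    """Replace every known word with its synonym, keeping initial capitals."""

    def swap(match: "re.Match") -> str:
        word = match.group(0)
        replacement = synonyms.get(word.lower())
        if replacement is None:
            return word
        return replacement.capitalize() if word[0].isupper() else replacement

    # Only runs of letters are touched; punctuation and spacing pass through.
    return re.sub(r"[A-Za-z]+", swap, text)
```

For example, `substitute("Big dogs run fast.")` returns `"Large dogs run quick."`.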

Alistair K

@bornach @ThePowerNap @clive @jasonkoebler @404mediaco That's very funny! Turning classic SEO pervert techniques to greater good.

Also a good presenter. I rarely manage to listen to youtube talks – too much irrelevant babbling and metatalk, but this person has a clear narrative and stays on track.


@clive @jasonkoebler @404mediaco On one hand, if you want to protect your art or something similar from being scraped, because you sell it or just don't want your style to be stolen, it's nice to have such tools. But on the other hand, if you value human rights and let your values influence your texts and pictures, then you can influence AIs with your input.

I hate that AI has a bad influence on the environment because it needs so much computing resources, but we cannot stop AI, so this can be a small influence of our own.


@chikl There is nothing stopping us, as a species, from deciding this isn't a good technological path and just unplugging it all. We made this thing and we can unmake it.

But the only people with power are "Number must go up" type billionaires who are doubling down on it.

You bring up human rights, too - from that angle, people should have a right to consent to be included in these things. I don't consent to it. So I grew myself some spikes and put Nepenthes out there for others to grow some as well.

@clive @jasonkoebler @404mediaco

★ blue-caller ☆

@clive @jasonkoebler @404mediaco the house of leaves but you can't leave


@404mediaco @clive @jasonkoebler Love this for the bots and scrapers. Choke on it.

Damiano Gacík

@clive @jasonkoebler @404mediaco Brilliant idea! 😂 I can just imagine AI scrapers struggling to process an endless stream of random pages. It's like trolling on level 80 — mad respect to the hacker for the creativity!


@clive @jasonkoebler @404mediaco This makes me wonder if it would be possible to insert garbage into rendered HTML (to confuse bots) and something like Nightshade into the rendered page (to poison image downloading and screenshot OCR) both in ways that aren't distracting to human readers.


@clive @jasonkoebler @404mediaco I have difficulty understanding why someone would want to protect their own web site from scraping. Isn't the primary reason for having a web site the desire to be shown to the world and to spread particular information? I don't understand the fuss.

Frank Heijkamp

@matasoft @clive @jasonkoebler @404mediaco If all LLM platforms always provided direct links, it would indeed bring people to your site. But the fact is that most LLMs just steal your content without giving any credit. That's what this is for.


@matasoft @clive @jasonkoebler @404mediaco there are masses of AI crawler bots that can easily overwhelm a web site with traffic. They don’t throttle or follow limits or respect robots.txt or other conventions. They’ll easily overwhelm available bandwidth and take down a small web site.

It’s 100% reasonable (and necessary) to consider them hostile.


@matasoft disrespectful scraping for LLMs removes attribution, is non-consensual, has no methods for rectification (misinfo), mixes your data with others' (sometimes criminally sourced) and is often implemented so badly that it causes load/stability/cost issues for the sites. And this is not "one shot", but happens all the time. No one wants this except the grifters.


Guess people want the world to know it's their work. Also there was a post from an admin here complaining that AI scrapers aren't as smart as normal e.g. search engine scrapers and come back every 10 minutes to scrape the exact same data, costing his small instance ridiculous money. And stuff like this here, no I don't need my random musings distributed elsewhere.

Assuming you were asking in good faith.

@clive @jasonkoebler @404mediaco

Clive Thompson

@econads @matasoft @jasonkoebler @404mediaco


and some of the objection to mass-scraping-by-AI-firms is that the AI firms are not helping to provide new audiences for one's online writing

quite the opposite, possibly ...

... given that as more people start "chatting with an LLM AI" instead of searching the web ...

... they become happy with the good-enough answer they get from the LLM, and never bother to go to the sites the answers are based upon


@clive @econads @jasonkoebler @404mediaco well, Perplexity AI, for example, provides citation links by default.
So, I am not sure that it is wise for a website owner to block it. More and more, LLMs will replace classical search engines.
On the other hand, if you have a secret to hide, why publish it on a web site in the first place?

David Grieve

@clive @jasonkoebler @404mediaco
This is how we defeat Skynet.
If you are hearing this message, you are the resistance.


@clive @jasonkoebler @404mediaco there was a time when people wanted their pages to be scraped and indexed. Balkanization of the Web. The battle for hegemony of information. Now we're injecting poison into the process. It's like chemotherapy.

Frank Heijkamp

@Qbitzerre @clive @jasonkoebler @404mediaco Indeed a good analogy, to get rid of the cancer that LLM training sets are to copyright.


@clive @jasonkoebler @404mediaco Daisy Daisy give me your ans w e r d o o


Does anyone know if these web crawlers also scrape the content of Kindle e-books? Can they get into these kinds of products?
Is anything safe from this kind of scraping?
How can we protect internet content against it in general? A standard website will be scraped easily, won't it?

@jasonkoebler @404mediaco

Frank Heijkamp

@piperef @clive @jasonkoebler @404mediaco Bezos probably already sold everything to those AI houses.

@piperef @clive @jasonkoebler @404mediaco If it can be read, it can be scraped. You can mitigate the issue (often by putting it behind an account wall), but not eliminate it entirely. The film industry has been desperately trying to stop piracy and I have yet to see a situation where a movie was released but wasn't available on piracy sites.

But also, yeah, if it's Kindle, it's probably already part of Amazon's AI dataset.

Clive Thompson

@StarkRG @piperef @jasonkoebler @404mediaco

yeah, I am sure this is true

I read recently, though I can't find the source (still looking, will update if I can find it) that US AI firms used corpuses of cracked western ebooks that circulate in Russia etc, for training


This is so ironic. Definitely plausible, but also scary.

@StarkRG @jasonkoebler @404mediaco


love it :ed_grin:


@clive @jasonkoebler @404mediaco @piperef Maybe cross this tech with The Mandelbrot Set to keep the AI web crawlers from detecting that they're being trapped. Is that possible, and would it work?


Also, maybe what's needed is a way to make this technology more user friendly and safe for average website owners/developers to use.

What if there were a single "container" website used for this purpose, that the links would all point to, which would then have the AI trap hosted on that site? What if the way that site was programmed, it would then point the AI web crawlers to the sites owned by the websites that deployed them, to serve up infinite content from the sites the AI web crawlers were sent from in the first place?


@clive finally some good news

Bernd Herd

@clive @jasonkoebler @404mediaco And google?

Why do some people think that offering their server data to Google is OK, so they will be found, but competing search engines are evil?


@clive @jasonkoebler @404mediaco

A curated list of strategies, offensive methods, and tactics for (algorithmic) sabotage, disruption, and deliberate poisoning.


Clive Thompson

@peterfr @jasonkoebler @404mediaco

damn, I hadn't seen that, super fascinating! thank you for pointing it out

Hybrid 🐘 Elephant

@clive @jasonkoebler @404mediaco

i recall a similar device i came across 30+ years ago, which was a device that made dynamic lists of fake email addresses on a page that had lots of internal links, designed to ensnare spam-harvesting bots. this looks very similar... 👍👍


Clive Thompson

@HybridElephant @jasonkoebler @404mediaco


Some folks mentioned this elsewhere in the thread

I'd not heard about this back in the day, but it makes sense someone did this

Different Drummer

@clive "so the crawler gets stuck crawling and scraping endless and meaningless pages" So, like if it landed on the Daily Mail site on any normal day...

Clive Thompson

lol yes

or, really, Tiktok or Facebook or any algorithmically-juked social media feed


@clive @jasonkoebler @404mediaco So, let me get this straight: he created a possible way to stop AIs, that would also stop search engines from indexing his site, even though being indexed is usually what a publicly accessible site wants. Somebody who, by their claim, is an AI CEO says that it'd be easy to avoid, and he counters that Google did not avoid it. So, basically, he succeeded in stopping search engines, but not AI bots. Nice work.

Maddad ☑️

@clive @jasonkoebler @404mediaco

That's a great idea 👍

Oliver Vanderb

@clive @jasonkoebler @404mediaco

I just serve them a 20 GB text file with some sort of Lorem Ipsum.


@clive @jasonkoebler @404mediaco what a waste of CPU and bandwidth and electricity


Federation EN Fri 24.01.2025 01:33:05

Feed the LLM scrapers only the finest bullshit.

Let them shine it.


@clive @jasonkoebler @404mediaco

I love this from the perspective of tanking AI, but hate it from the perspective of the massive amounts of water a process like this must use.

Maria Langer | 📝 🎬 ⚒️🛥️

@clive @maximum_mew @jasonkoebler @404mediaco I'd like to be able to install this on every website I run.

Clive Thompson

@mlanger @maximum_mew @jasonkoebler @404mediaco

it'd be pretty funny to see it offered as a one-click option on a hosting provider


I did something like this 20+ years ago. A simple Perl CGI script to generate completely random text. It was just a tiny personal site, but occasionally something would walk into it.

I'm recreating the site now and just have a small loop of pages. There is a period at the end of one sentence that links into a slightly different loop. Occasionally I see what looks like a real user going through the loop. Other times the hidden URL will get hit, likely by a web crawler.

I have slightly different loops for my default page versus my named web sites. Most scanning seems to come via IP address. A tiny bit of traffic comes using the name in a self-signed cert which was briefly used as the default cert.

@jasonkoebler @404mediaco

Clive Thompson

@NoctisEqui @jasonkoebler @404mediaco

I think it requires some knowledge of server-side coding to really implement these

but a hosting provider could, if it wanted to, make it one-click installable

NoctisEqui 🇺🇦🇵🇸🇪🇹🏳️‍🌈

@clive @jasonkoebler @404mediaco

I figured that. It is a wonderful notion tho! Here’s to our heroes, fighting on the Tech Front!


@clive @jasonkoebler @404mediaco won't be long before AI tech bros pay the governments to get laws passed to make it illegal to stop them scraping or anyone poisoning their data sets etc except if we try scraping their platforms. Similar to say how it is illegal for us to use copyrighted data, but for them it is a free pass even without any exemptions.

Clive Thompson

@htpcnz @jasonkoebler @404mediaco

yeah, I can see that happening


Chris Bussard

@clive @jasonkoebler @404mediaco

So it's wpoison for AI training bots? Truly there's nothing new under the sun.



@clive How cool is that!

Mani and the Nonos

@clive @jasonkoebler @404mediaco If I could love this post one million times, I would. It is pretty clear now that beyond the pathetic and socially awkward filthy rich human oligarchs you saw on TV this week, AI has become the real enemy. Overfeed it and confuse it. Sounds like we're going to have to recreate Jorge Luis Borges' Library of Babel. An infinite word labyrinth where the machine loses its mind, alone and vanquished at last. How poetic. And what a trip.

Clive Thompson

@maniandthenonos @jasonkoebler @404mediaco

this really is poetry, right here

digital-age poetry, autogenerated, as a defense mechanism

Borges would be giving this stuff a double-shooty-fingers

Mani and the Nonos

@clive @jasonkoebler @404mediaco I would love to see this headline, "Borges gives the double bird to Sam Altman from beyond". My grandfather fought against the Nazis. We'll be fighting against AI.

teledyn 𓂀

@clive @syncros @jasonkoebler @404mediaco

This is our future: road cones set on the hood of waymos

Toot Terrorist

@clive @jasonkoebler @404mediaco

I can't find the link right now, but there are tools that instead of that, generate garbage for AI scrapers, feeding them nonsense "human" text.

That way, they don't see anything wrong (it doesn't slow or block the spiders) but the data they get is bullshit.

Jeffrey Rogers 🏴󠁧󠁢󠁷󠁬󠁳󠁿

@clive @jasonkoebler @404mediaco Put “No unauthorised access” on the landing page and just sit back as they ignore it.