
Skybrook

Federation EN Thu 23.01.2025 21:20:01

@clive I once made a webpage that would slowly and continually send random words and links back to itself, never quite closing the connection. It's honestly not worth the trouble. It'd be nice if it interfered with AI training, though.
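
A minimal sketch of that slow-drip idea, using only Python's standard library (the handler name, word list, port, and timings are illustrative assumptions, not the original implementation):

    import random
    import time
    from http.server import BaseHTTPRequestHandler, HTTPServer

    WORDS = ["lorem", "ipsum", "quantum", "ferret", "umbrella", "static"]

    class TarpitHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # HTTP/1.0 framing with no Content-Length: the client has to read
            # until the server closes the connection, which it never does.
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            try:
                while True:
                    word = random.choice(WORDS)
                    link = f"/{word}-{random.randint(0, 9999)}"
                    self.wfile.write(f'<p>{word} <a href="{link}">{word}</a></p>\n'.encode())
                    self.wfile.flush()
                    time.sleep(random.uniform(2, 10))  # drip out slowly
            except (BrokenPipeError, ConnectionResetError):
                pass  # the client finally gave up

    HTTPServer(("", 8080), TarpitHandler).serve_forever()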

Frank Heijkamp

Federation NL Thu 23.01.2025 23:57:19

@skybrook @clive The idea is that a human will have to review it and flip a switch that excludes the entire site. That exclusion keeps the actual content on the site safe from being ingested.

Skybrook

Federation NL Fri 24.01.2025 00:50:26

@alterelefant That's tricky, since the crawlers have vast numbers of IP addresses. I just set traps to detect web spiders automatically, if traffic gets to be a problem.

Frank Heijkamp

Federation NL Fri 24.01.2025 07:20:17

@skybrook Don't filter by IP address; filter by behavior. I know, that's sometimes easier said than done.

The following one is straightforward: a GET request to a bogus link in the infinite labyrinth qualifies for a labyrinth response, whether the IP address is known or a new one.

With a labyrinth response I would throw in a random delay between 100 ms and 5 s, and a one-in-fifty chance of a 30 s delay before responding with an HTTP 503. That should usually be enough to slow down crawlers.
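
A sketch of that delay scheme, again with Python's standard library (the /maze/ path and link count are assumptions; a threaded server is used so the sleeps don't block other clients):

    import random
    import time
    from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

    def labyrinth_page():
        # Bogus links that only ever lead deeper into the labyrinth.
        links = " ".join(
            f'<a href="/maze/{random.randint(0, 999999)}">room</a>' for _ in range(5)
        )
        return f"<html><body>{links}</body></html>"

    class LabyrinthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if random.randint(1, 50) == 1:
                time.sleep(30)                  # one-in-fifty long stall...
                self.send_error(503)            # ...ending in an HTTP 503
                return
            time.sleep(random.uniform(0.1, 5))  # the usual 100 ms to 5 s delay
            body = labyrinth_page().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    ThreadingHTTPServer(("", 8080), LabyrinthHandler).serve_forever()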

Skybrook

Federation NL Fri 24.01.2025 17:53:01

@alterelefant Well right, that's what I meant by "traps to detect." I didn't think of setting it up so that every URL for a detected IP address becomes a labyrinth response... not a bad idea, really.
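
A sketch of that blanket rule, reusing labyrinth_page() from the sketch above and assuming the traps live under a /maze/ path; flagged_ips and serve_real_content are hypothetical stand-ins for the trap bookkeeping and the normal request path:

    flagged_ips = set()

    def handle_request(client_ip, path):
        # Requesting a bogus trap link flags the client, whether the
        # IP address is known or brand new.
        if path.startswith("/maze/"):
            flagged_ips.add(client_ip)
        # Once flagged, every URL for that IP gets a labyrinth response.
        if client_ip in flagged_ips:
            return labyrinth_page()
        return serve_real_content(path)

    def serve_real_content(path):
        return f"<html><body>real page for {path}</body></html>"

Because the flag is keyed on the path alone, a brand-new IP that requests a labyrinth URL is flagged on its very first request.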

Frank Heijkamp

Federation EN Fri 24.01.2025 20:43:22

@skybrook Crawlers that use multiple endpoints to distribute the crawl load will hand out URLs to be crawled to those endpoints. Their freshly acquired labyrinth links will make a new endpoint immediately identifiable.