
Frank Heijkamp

· Federation NL Thu 23.01.2025 23:57:19

@skybrook @clive The idea is that a human will have to review it and flip a switch that excludes the entire site. This exclusion keeps the actual content on the site safe from being ingested.

Skybrook

Federation NL Fri 24.01.2025 00:50:26

@alterelefant That's tricky, since the crawlers have vast numbers of IP addresses. I just set traps to detect web spiders automatically if traffic gets to be a problem.

Frank Heijkamp

Federation NL Fri 24.01.2025 07:20:17

@skybrook Don't filter by IP address; filter by behavior. I know, that's sometimes easier said than done.

The following one is straightforward: a GET request to a bogus link in the infinite labyrinth qualifies for a labyrinth response, whether the IP address is known or a new one.

With a labyrinth response I would throw in a random delay between 100 ms and 5 s, and a one-in-fifty chance of a 30 s delay before responding with an HTTP 503. That should usually be enough to slow down crawlers.
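A minimal sketch of that response logic, assuming a hypothetical /labyrinth/ path prefix for the bogus links and Python's standard-library HTTP server (not the poster's actual setup): every trap hit gets a random 100 ms to 5 s delay before another page of bogus links, and one hit in fifty gets a 30 s stall followed by an HTTP 503.

```python
import random
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

LABYRINTH_PREFIX = "/labyrinth/"  # assumed trap prefix; bogus links point here

class LabyrinthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if not self.path.startswith(LABYRINTH_PREFIX):
            self.send_error(404)
            return
        # One request in fifty: stall for 30 s, then answer HTTP 503.
        if random.randint(1, 50) == 1:
            time.sleep(30)
            self.send_error(503)
            return
        # Otherwise: random delay between 100 ms and 5 s, then serve
        # another page of bogus labyrinth links to keep the crawler busy.
        time.sleep(random.uniform(0.1, 5.0))
        links = "".join(
            '<a href="{}{:016x}">next</a>'.format(LABYRINTH_PREFIX, random.getrandbits(64))
            for _ in range(10)
        )
        body = "<html><body>{}</body></html>".format(links).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), LabyrinthHandler).serve_forever()
```

The delays cost the crawler connection time while costing the server almost nothing, which is the point of the slowdown described above.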

Skybrook

Federation NL Fri 24.01.2025 17:53:01

@alterelefant Well right, that's what I meant by "traps to detect." I didn't think of setting it up so that every URL for any detected IP address becomes a labyrinth response... not a bad idea, really.
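A sketch of that escalation, with assumed names (detected_ips, the /labyrinth/ prefix) rather than anyone's real configuration: once an address requests a known trap URL it is remembered, and every later request from it, whatever the path, gets the labyrinth treatment.

```python
# Escalation sketch: detection is by behavior (hitting a trap link),
# not by maintaining an external IP list.
detected_ips: set[str] = set()

def is_trap_hit(path: str) -> bool:
    # Only bogus labyrinth links use this prefix, so a hit is a strong signal.
    return path.startswith("/labyrinth/")

def should_serve_labyrinth(client_ip: str, path: str) -> bool:
    if is_trap_hit(path):
        detected_ips.add(client_ip)
        return True
    # Once detected, every URL for that address becomes a labyrinth response.
    return client_ip in detected_ips
```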

Frank Heijkamp

Federation EN Fri 24.01.2025 20:43:22

@skybrook Crawlers that use multiple endpoints to distribute the crawl load will hand out URLs to be crawled to those endpoints. Their freshly acquired labyrinth links will make a new endpoint immediately identifiable.