AI haters build tarpits to trap and trick AI scrapers that ignore robots.txt
-
[email protected] replied to [email protected] last edited by
There are no loops and no repeated links to avoid. Every link leads to a brand new, freshly generated page with another set of brand new, never before seen links. You can go deeper and deeper forever without any loops.
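To make that concrete, here's a minimal Python sketch of how a maze like that could be generated statelessly (the handler name, link count, and port are my own illustrative choices, not taken from any particular tarpit):

```python
import hashlib
from http.server import BaseHTTPRequestHandler, HTTPServer

class TarpitHandler(BaseHTTPRequestHandler):
    """Serve a freshly generated page for every path. Child links are
    hashes of the current path, so the maze is infinite and loop-free
    without the server storing any state at all."""

    def do_GET(self):
        links = []
        for i in range(5):
            # Each child URL is derived from the parent path, so every
            # page deep in the maze is brand new yet reproducible.
            token = hashlib.sha256(f"{self.path}/{i}".encode()).hexdigest()[:16]
            links.append(f'<a href="{self.path.rstrip("/")}/{token}">{token}</a>')
        body = ("<html><body>" + "<br>".join(links) + "</body></html>").encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("", 8000), TarpitHandler).serve_forever()
```

Every request returns five never-before-seen links, so a crawler that blindly follows them just keeps descending.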
-
[email protected] replied to [email protected] last edited by
Spiders already detect link bombs and recursion bombs, and they're capable of rendering the page out in memory to see what's truly visible.
It's a great idea but it's a really old trick and it's already been covered.
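One of the simpler guards looks something like this sketch (the cutoff number is an assumption, and real spiders combine several heuristics):

```python
from urllib.parse import urlparse

MAX_DEPTH = 10  # assumed cutoff; production spiders tune this per site

def should_follow(url: str) -> bool:
    """Reject links whose path depth looks like a recursion bomb.
    Naive generated mazes tend to grow one path segment deeper per
    page, so a flat depth cap stops the descent without the crawler
    remembering any URLs."""
    return urlparse(url).path.count("/") <= MAX_DEPTH
```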
-
[email protected] replied to [email protected] last edited by
Well the hits start coming and they don't start ending
-
[email protected] replied to [email protected] last edited by
Sure, if you have enough memory to store a list of all the GUIDs.
-
The internet being what it is, I'd be more surprised if there wasn't already a website set up somewhere with a malicious robots.txt file to screw over ANY crawler regardless of provenance.
-
[email protected] replied to [email protected] last edited by
It's not that we "hate them" - it's that they can entirely overwhelm a low-volume site and cause a DDoS.
I ran a few very low-traffic websites for local interests on a rural, residential line. It wasn't fast, but it was cheap, and as these sites made no money it was good enough. Before AI they'd get the odd badly behaved scraper that ignored robots.txt and, specifically, the rate limits.
But since? I've had to spend a lot of time trying to filter them out upstream. Like, hours and hours. Claudebot was the first - coming from hundreds of AWS IPs and dozens of countries, thousands of times an hour, repeatedly trying to download the same URLs - some that didn't exist. Since then it's happened a lot. Some of these tools are just so ridiculously stupid, far more so than a dumb script that cycles through a list. But because it's AI and they're desperate to satisfy the "need for it", they're quite happy to spend millions on AWS costs for negligible gain and screw things up for other people.
Eventually I gave up and redesigned the sites to be static; they're now on Cloudflare Pages. Arguably better, but a chunk of my life I'd rather not have lost.
-
[email protected] replied to [email protected] last edited by
You missed out the important bit.
You need to make sure you film yourself doing this and then post it on social media to an account linked to your real identity.
-
[email protected] replied to [email protected] last edited by
You can limit the visits to a domain. The honeypot doesn't register infinite new domains.
-
[email protected] replied to [email protected] last edited by
It doesn't have to memorize all possible GUIDs; it just has to limit visits to base URLs.
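A sketch of what that bookkeeping could look like in Python (the budget number is an assumption):

```python
from collections import Counter
from urllib.parse import urlparse

PER_HOST_BUDGET = 1000  # assumed cap on pages fetched from one host
fetched = Counter()

def may_fetch(url: str) -> bool:
    """Count fetches per host instead of remembering every URL.
    Memory stays at one integer per host, so an endless stream of
    never-repeating links costs the crawler almost nothing."""
    host = urlparse(url).netloc
    if fetched[host] >= PER_HOST_BUDGET:
        return False
    fetched[host] += 1
    return True
```

Once the tarpit's host hits the budget, the crawler simply moves on - no GUID list required.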
-
[email protected] replied to [email protected] last edited by
Notice how it's "AI haters" and not "people trying to protect their IP", as it would be if it were, say, China instead of AI companies stealing the IP.
-
[email protected] replied to [email protected] last edited by
What part of "they do not repeat" do you still not get? You can put them in a list, but you won't ever get a hit; it'd just be wasting memory.