Advice on how to deal with AI bots/scrapers?
-
In my experience, git forges are especially hit hard
Is that why my Forgejo instance has been hammered like crazy twice before...
Why can't we have nice things? Thank you!
-
Might be worth patching fail2ban to recognize the scrapers and block them in iptables.
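A sketch of what that could look like: a custom fail2ban filter matching known scraper user agents in an nginx access log, plus a jail that bans via iptables. The filter name, log path, and bot list are assumptions; adjust for your own setup.

```ini
# /etc/fail2ban/filter.d/ai-scrapers.conf  (hypothetical filter name)
[Definition]
# Match nginx "combined" log lines whose user-agent field contains
# a known AI scraper token. Extend the list as you spot new ones.
failregex = ^<HOST> .*"[^"]*(?:GPTBot|CCBot|ClaudeBot|Bytespider|Amazonbot)[^"]*"\s*$

# /etc/fail2ban/jail.d/ai-scrapers.local
[ai-scrapers]
enabled   = true
port      = http,https
filter    = ai-scrapers
logpath   = /var/log/nginx/access.log
maxretry  = 1
bantime   = 86400
banaction = iptables-multiport
```

With `maxretry = 1` a single matching request gets the IP banned for a day; loosen that if you worry about false positives.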
-
Yeah, Forgejo and Gitea. I think it's partly a problem of insufficient caching on the side of these git forges that makes it especially bad, but in the end that's victim blaming.
Mlmym seems to be the target because it is mostly JavaScript-free and therefore easier to scrape, I think. But the other Lemmy frontends aren't well protected either. Lemmy-ui doesn't even let you easily add a custom robots.txt; you have to override it manually in the reverse proxy.
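Overriding robots.txt in the reverse proxy looks roughly like this (nginx sketch; the file path is an assumption):

```nginx
# Serve our own robots.txt instead of proxying the request to lemmy-ui.
# /srv/www/custom-robots.txt is a placeholder path -- use wherever you
# keep static files.
location = /robots.txt {
    alias /srv/www/custom-robots.txt;
}
```

The `location = ` exact-match block takes priority over the usual `location /` proxy_pass block, so only this one path is intercepted.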
-
What are you hosting, and who are your users? Do you receive any legitimate traffic from AWS or other cloud provider IP addresses? There will always be edge cases, like people hosting VPN exit nodes on a VPS, but if it's a tiny portion of your legitimate traffic, I would consider blocking all incoming traffic from cloud providers and then whitelisting any that make sense, like search engine crawlers, if necessary.
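A minimal sketch of the membership check, assuming you've already fetched a provider's published ranges (AWS publishes theirs at https://ip-ranges.amazonaws.com/ip-ranges.json; the prefixes below are illustrative placeholders, not an authoritative list):

```python
# Check whether a client IP falls inside a cloud provider's published
# CIDR ranges. In practice, load the full list from the provider's
# published JSON; these two prefixes are examples only.
import ipaddress

CLOUD_PREFIXES = [
    ipaddress.ip_network("3.0.0.0/8"),     # illustrative AWS-style range
    ipaddress.ip_network("34.64.0.0/10"),  # illustrative GCP-style range
]

def is_cloud_ip(addr: str) -> bool:
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in CLOUD_PREFIXES)

print(is_cloud_ip("3.5.1.2"))      # inside 3.0.0.0/8 -> True
print(is_cloud_ip("203.0.113.9"))  # documentation range, not listed -> False
```

The same check can feed an ipset/nftables set instead of running per-request in the application.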
-
Build tar pits.
-
They want to reduce the bandwidth usage. Not increase it!
-
Cool, lots of information provided!
-
Too bad you can't post a usage notice that anything scraped to train an AI will be charged and will owe $some-huge-money, then pepper the site with bogus facts, occasionally ask various AIs about the bogus facts, and use that to prove scraping and invoice the AI's company.
-
Bots will blacklist your IP if you make it hostile to bots
This will save you bandwidth
-
A good tar pit will reduce your bandwidth. Tarpits aren't about shoving useless data at bots; they're about responding as slowly as possible to keep the bot connected for as long as possible while giving it nothing.
Endlessh accepts the connection and then... does nothing. It doesn't even get to the SSH handshake. It just very... slowly... sends... an endless banner, until the bot gives up.
As I write, my Internet-facing SSH tarpit currently has 27 clients trapped in it. A few of these have been connected for weeks. In one particular spike it had 1,378 clients trapped at once, lasting about 20 hours.
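The core trick can be sketched in a few lines. SSH allows the server to send banner text before its version string, as long as no line starts with "SSH-", so a tarpit can drip meaningless lines forever and the client keeps waiting. This is a toy single-threaded sketch, not Endlessh itself; port and delay are arbitrary:

```python
# Endlessh-style tarpit sketch: accept connections, then drip an
# endless junk banner one short line at a time. Lines must not begin
# with "SSH-", or the client would treat the banner phase as over.
import random
import socket
import time

def banner_line() -> bytes:
    # Random lowercase/digit junk, so it can never start with "SSH-".
    line = "".join(random.choice("abcdefghijklmnopqrstuvwxyz0123456789")
                   for _ in range(random.randint(3, 32)))
    return (line + "\r\n").encode()

def tarpit(host: str = "0.0.0.0", port: int = 2222, delay: float = 10.0):
    srv = socket.create_server((host, port))
    srv.settimeout(delay)
    conns = []
    while True:
        try:
            conn, _ = srv.accept()   # pick up any new victim
            conns.append(conn)
        except socket.timeout:
            pass
        for c in conns[:]:           # feed each trapped client one line
            try:
                c.send(banner_line())
            except OSError:
                conns.remove(c)      # client finally gave up
        time.sleep(delay)
```

Each trapped client costs the server one tiny write every `delay` seconds, which is why the bandwidth bill goes down, not up.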
-
there is also https://forge.hackers.town/hackers.town/nepenthes
-
Read your access logs and 403 the offending user agents or IPs.
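Finding the offenders can be as simple as tallying user agents from the access log. A sketch assuming nginx's default "combined" log format (log path, sample lines, and any threshold you'd apply are assumptions):

```python
# Tally user agents from nginx "combined" access-log lines so you can
# decide which ones to 403.
import re
from collections import Counter

# combined format: ip - user [time] "request" status bytes "referer" "ua"
LINE_RE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

def tally(lines):
    agents = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            agents[m.group(2)] += 1
    return agents

sample = [
    '203.0.113.7 - - [01/Jan/2025:00:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "GPTBot/1.0"',
    '203.0.113.8 - - [01/Jan/2025:00:00:01 +0000] "GET /a HTTP/1.1" 200 100 "-" "Mozilla/5.0"',
]
for agent, count in tally(sample).most_common():
    print(count, agent)
```

Anything with an absurd request count and a bot-looking agent string goes straight into the deny list.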
-
I guess sending tar bombs can be fun
-
Now I just want to host a web page and expose it with nepenthes...
First, because I'm a big fan of carnivorous plants.
Second, because it lets you poison LLMs and AI and fuck with their data.
Lastly, because I can do my part and say F#CK Y0U to those privacy data hungry a$$holes !
I don't even expose anything directly to the web (everything is accessible only through a tunnel like WireGuard) or have any important data to protect from AI or LLMs. But just taking the opportunity to fuck with them while they continuously harvest data from everyone is something I was already thinking of, but I didn't know how.
Thanks for the link !
-
Fair. But I haven't seen any anti-AI-scraper tarpits that do that. The ones I've seen mostly just pipe 10MB of /dev/urandom out there.
Also, I assume the programmers working at AI companies are not literally mentally deficient. They would certainly add
.timeout(10)
or whatever to their scrapers. They probably have something more dynamic than that.
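The scraper-side counter really is that simple. A sketch of why a plain client timeout defeats a stall-only tarpit (the local stand-in server and the 0.5s value are assumptions for the demo):

```python
# A scraper with a socket timeout drops a stalling tarpit instead of
# hanging on it. The "tarpit" here is a local stand-in: it accepts the
# connection and then sends nothing.
import socket
import threading
import time

srv = socket.create_server(("127.0.0.1", 0))
port = srv.getsockname()[1]

def tarpit():
    conn, _ = srv.accept()
    time.sleep(5)   # stall: never send a byte
    conn.close()

threading.Thread(target=tarpit, daemon=True).start()

client = socket.create_connection(("127.0.0.1", port), timeout=0.5)
timed_out = False
try:
    client.recv(1024)        # tarpit never responds...
except socket.timeout:
    timed_out = True         # ...so the scraper bails after 0.5s
print("scraper gave up:", timed_out)
```

Which is the commenter's point: a timeout caps the cost of any one host, so a tarpit mostly wastes the scraper's connection slots, not its time budget.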