Advice on how to deal with AI bots/scrapers?
-
How would I go about doing that? This seems to be the challenging part. You don't want false positives and you also want replayability.
-
The paid plans get you the "premium" blocklists, including one made specifically to block AI scrapers, but a free account still gets you the actual software, the community blocklist, plus up to three "basic" lists.
-
If you've already noticed the incoming traffic is weird, look for what distinguishes the sources you don't want. You write rules targeting behaviors like user agent, order of requests, IP ranges, etc., put them in your web server, and tell it to check whether the incoming request matches the rules as a session starts.
Unless you're a high value target for them, they won't put endless resources into making their systems mimic regular clients. They might keep changing IP ranges, but that usually happens ~weekly and you can just check the logs and ban new ranges within minutes. Changing client behavior to blend in is harder at scale: bots simply won't look for the same things as humans in the same ways. They're too consistent; even when they try to be random, they're too consistently random.
When enough rules match, you throw in either a redirect or an internal URL rewrite rule for that session to point them to something different.
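As a rough sketch of what that can look like, assuming nginx as the web server (the user agent substrings and the /decoy path are just placeholders, not a vetted blocklist):

```nginx
# Sketch only: goes inside the http {} block. The UA substrings and
# the /decoy path are placeholder examples, not a complete blocklist.
map $http_user_agent $is_scraper {
    default      0;
    ~*GPTBot     1;
    ~*ClaudeBot  1;
    ~*Bytespider 1;
}

server {
    listen 80;

    location / {
        # Internally rewrite matched sessions to the decoy.
        if ($is_scraper) {
            rewrite ^ /decoy last;
        }
        proxy_pass http://127.0.0.1:8080;
    }

    location = /decoy {
        # Only reachable via the internal rewrite above.
        internal;
        return 200 "nothing here\n";
    }
}
```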
-
Even if that were possible, I don't want to crash innocent people's browsers. My tar pits are deployed on live environments that normal users could end up navigating to, and it's overkill anyway: if you simply respond to 404 Not Found with 200 OK and serve 15 MB on the "error" page, bots will stop going to your site, because you're not important enough to deal with. It's a low bar, but your data isn't worth someone looking at your tactics and even thinking about circumventing them. They just stop attacking you.
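A minimal sketch of that trick, again assuming nginx (the decoy path is an assumption, and the ~15 MB file has to be generated separately however you like):

```nginx
# Minimal sketch of the 404-as-200 trick (nginx assumed;
# /var/www/decoy.html is a pre-generated ~15 MB file, not
# something this config creates).
server {
    listen 80;
    root /var/www;

    # Turn every 404 into a 200 OK serving the huge "error" page.
    error_page 404 =200 /decoy.html;

    location = /decoy.html {
        # Not directly addressable, only served via error_page.
        internal;
    }
}
```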
-
There's more than one style of tar pit. In this case you obviously wouldn't want to use an endless maze style.
What you want to do in this case is send them through an HAProxy instance that routes based on user agent: whenever they come in as Claude, you send them over to a box behind a WANem process shaped down to modem speeds.
They'll immediately realize they've got a hug of death going on and give up.
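Roughly like this on the HAProxy side (the addresses and UA substrings are placeholders, and the WANem box is assumed to be shaping its link down to modem speeds):

```haproxy
# Rough sketch: route AI crawlers to a WANem-shaped backend.
frontend web
    bind *:80
    # Case-insensitive substring match on the User-Agent header.
    acl is_ai_bot hdr_sub(User-Agent) -i claudebot gptbot
    use_backend slow_lane if is_ai_bot
    default_backend app

backend app
    server app1 127.0.0.1:8080

backend slow_lane
    # The placeholder WANem host; everything here crawls at ~56k.
    server wanem 10.0.0.5:80
```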
-
And the community blocklists are updated when more than a couple (I think the number is something like 10-50) instances of CrowdSec block an IP in some short timeframe.
The AI blocklist adds an IP when even one instance finds an AI trying to scrape, recognizable right from the user agent.
So even if the community blocklist has fewer AI IPs, it does eventually include them.
-
Which CrowdSec blocklists are you using?
-
I'm using the default list alongside the FireHOL BotScout list and the FireHOL CyberCrime Tracker list, both set to ban.
I'm also using the FireHOL cruzit.com list set to captcha, just in case it's not actually a bot.
On top of that I run the cs-firewall-bouncer and a custom bouncer from CrowdSec's tutorials that detects privilege escalation, in case anybody actually manages to get inside.
Alongside that I'm using a lot of scenario collections for specific software I run, like Nextcloud, Grafana, SSH, ... which helps a lot with attacks aimed directly at a service rather than just general scraping or bot path traversal.
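For anyone wanting to replicate it, the collections are one cscli call each (the two names below exist on the CrowdSec Hub; check the list for your other services):

```sh
# Install per-service scenario collections from the CrowdSec Hub.
cscli hub update
cscli collections install crowdsecurity/sshd crowdsecurity/nginx
# Browse what else is available (Nextcloud, Grafana, ...):
cscli collections list -a
systemctl reload crowdsec
```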
All free, and I've been using it for a year. The only complaint I have is that I had to make a cronjob to restart the CrowdSec service every day, because it would stop working after a couple of days from the amount of requests it has to process.
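The workaround is just a root crontab entry, something like this (the time of day is arbitrary):

```sh
# Daily restart as a workaround; add via `crontab -e` as root.
0 4 * * * systemctl restart crowdsec
```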
-
I have tried this with some success:
https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker
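If I remember the README right, after running its install script the hookup is a couple of includes per server block, roughly:

```nginx
# Sketch from memory of the project's README; paths may differ
# depending on how the installer was run.
server {
    listen 80;
    server_name example.com;

    include /etc/nginx/bots.d/ddos.conf;
    include /etc/nginx/bots.d/blockbots.conf;

    # ... rest of the site config
}
```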
-
CrowdSec has default scenarios and lists that might block a lot of it, and you can pretty easily write a custom scenario to block IPs that cause large spikes of traffic to your applications if needed.
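A hypothetical scenario along those lines (the name, filter, and thresholds here are invented for illustration, so tune them to your logs):

```yaml
# Hypothetical leaky-bucket scenario: ban IPs that fire ~100+
# HTTP requests in about 10 seconds. All values are examples.
type: leaky
name: custom/http-request-spike
description: "Ban IPs causing large spikes of HTTP traffic"
filter: "evt.Meta.service == 'http'"
groupby: evt.Meta.source_ip
capacity: 100
leakspeed: 100ms
blackhole: 5m
labels:
  service: http
  type: scraping
  remediation: true
```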
-
You said:
"I'm only really running a caddy reverse proxy on the VPS which forwards my home server's services through Tailscale."
It seems then that you are using a Tailscale Funnel to expose your services to the public web. Is that the case? I ask because the basic premise of Tailscale is that you have to be logged into your Tailscale network to access the services; if you're not logged in, the site you try to access won't even appear to exist, unless it's set up via a Funnel.
Assuming you did set up a Funnel, you are now 100% exposed to the WWW. AI bots, and bots in general, crawl the WWW daily, and eventually your site will be found. You have a few choices here: rely on a Web Application Firewall (WAF) such as BunkerWeb, which would replace Caddy but provide a decent firewall of sorts. Or you can use something like ConfigServer Firewall, though I'm not sure if they have AI bot protection; the last time I used it was before AI was a thing.