Advice on how to deal with AI bots/scrapers?
-
If you had read the OP, they don’t want the scrapers using up all their traffic.
-
It’s government reporting data. If you find a better source, I say go for it. But I’ve used that data successfully for salary negotiations in the past.
I’m not talking about take-home pay. I’m talking about total annual compensation, including things like RSU payouts etc.
-
This is the most realistic solution. Adding a 0.5–1 s proof-of-work challenge to hosted services isn't gonna be a big deal for the end user, but it offers a tiny bit of protection against bots, especially if the work factor is variable and escalates.
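Roughly the kind of thing I mean, as a sketch only (hashcash-style, all names and numbers made up): the server hands out a challenge plus a difficulty in leading zero bits, and the difficulty knob is what you escalate for abusive clients.

```python
# Hashcash-style PoW sketch -- illustrative only, not anyone's actual setup.
# The server issues (challenge, difficulty); the client has to find a counter
# such that sha256(challenge + counter) starts with `difficulty` zero bits.
import hashlib
import secrets

def leading_zero_bits(digest: bytes) -> int:
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()   # zero bits in the first non-zero byte
        break
    return bits

def solve(challenge: bytes, difficulty: int) -> int:
    counter = 0
    while True:
        digest = hashlib.sha256(challenge + counter.to_bytes(8, "big")).digest()
        if leading_zero_bits(digest) >= difficulty:
            return counter
        counter += 1

def verify(challenge: bytes, difficulty: int, counter: int) -> bool:
    digest = hashlib.sha256(challenge + counter.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= difficulty

if __name__ == "__main__":
    challenge = secrets.token_bytes(16)   # issued per request/session
    difficulty = 18                       # roughly sub-second in pure Python; escalate for noisy clients
    counter = solve(challenge, difficulty)
    assert verify(challenge, difficulty, counter)
```

The verify side is a single hash, so the server barely notices it; the cost lands entirely on whoever is making lots of requests.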
-
Yes, I did read the OP.
-
I don't quite understand how this is deployed. Hosting it behind a dedicated subdomain or path kind of defeats the purpose, since the bots can still access the actual website no problem.
-
It's also still practical for bots; it just forces people to not abuse resources.
-
There are a lot of cryptocurrencies that use an increasing-work-factor PoW to combat spam. Nano is one of them, so it's pretty proven technology, too.
-
I'm not putting crypto on my website. However, I think it would be feasible to use Argon2.
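Something like this could work, purely as a sketch; it assumes the argon2-cffi package and all the parameters are made up. The point of Argon2 over plain SHA-256 is that every attempt is memory-hard, which scraper farms feel a lot more than an ordinary visitor does.

```python
# Hypothetical Argon2-based PoW sketch (assumes the argon2-cffi package;
# parameters are illustrative only).
import secrets
from argon2.low_level import Type, hash_secret_raw

def argon2_digest(challenge: bytes, counter: int) -> bytes:
    # Memory-hard hash: each attempt costs ~64 MiB of RAM.
    return hash_secret_raw(
        secret=counter.to_bytes(8, "big"),
        salt=challenge,
        time_cost=1,
        memory_cost=64 * 1024,   # in KiB -> 64 MiB
        parallelism=1,
        hash_len=32,
        type=Type.ID,
    )

def solve(challenge: bytes, difficulty_bits: int) -> int:
    counter = 0
    target = 2 ** (256 - difficulty_bits)
    while int.from_bytes(argon2_digest(challenge, counter), "big") >= target:
        counter += 1
    return counter

def verify(challenge: bytes, difficulty_bits: int, counter: int) -> bool:
    target = 2 ** (256 - difficulty_bits)
    return int.from_bytes(argon2_digest(challenge, counter), "big") < target

if __name__ == "__main__":
    challenge = secrets.token_bytes(16)
    # Keep the bit difficulty low: each attempt is already slow and memory-heavy.
    counter = solve(challenge, difficulty_bits=4)
    assert verify(challenge, 4, counter)
```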
-
and what happens then? land on a blacklist? for some that's the best outcome
-
You don't have to pay to use it
-
The trick is distinguishing them by behavior and switching what you serve them
-
This might not be what you meant, but the word "tar" made me think of tar.gz. Don't most sites compress the HTTP response body with gzip? What's to stop you from sending a zip bomb over the network?
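For what it's worth, the idea would look something like this (untested sketch, sizes made up):

```python
# Rough sketch of the gzip-bomb idea -- build the payload once and hand it back
# with Content-Encoding: gzip. Zeros compress at roughly 1000:1, so ~1 MiB on
# the wire inflates to ~1 GiB for any client that blindly decompresses it.
import gzip
import io

def build_gzip_bomb(decompressed_gib: int = 1) -> bytes:
    buf = io.BytesIO()
    chunk = b"\0" * (1024 * 1024)                      # 1 MiB of zeros per write
    with gzip.GzipFile(fileobj=buf, mode="wb", compresslevel=9) as gz:
        for _ in range(decompressed_gib * 1024):
            gz.write(chunk)
    return buf.getvalue()

# Serve the result with headers like:
#   Content-Encoding: gzip
#   Content-Type: text/html
# Only put it on paths real users never hit -- a browser that decompresses it
# will suffer just like a naive scraper does.
```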
-
How would I go about doing that? This seems to be the challenging part. You don't want false positives and you also want replayability.
-
The paid plans get you the "premium" blocklists, which include one made specifically to block AI scrapers, but a free account will still get you the actual software, the community blocklist, plus up to three "basic" lists.
-
If you've already noticed the incoming traffic is weird, you look for what distinguishes the sources you don't want. You write rules based on behavior like user agent, order of requests, IP ranges, etc., put them in your web server config, and tell it to check whether incoming requests match the rules as a session starts.
Unless you're a high-value target for them, they won't put endless resources into making their systems mimic regular clients. They might keep changing IP ranges, but that usually happens about weekly, and you can just check the logs and ban new ranges within minutes. Changing client behavior to blend in is harder at scale: bots simply won't look for the same things as humans in the same ways. They're too consistent; even when they try to be random, they're too consistently random.
When enough rules match, you throw in either a redirect or an internal URL rewrite rule for that session to point them to something different.
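As a toy sketch of what such rules could look like (every name, range, and threshold here is invented):

```python
# Toy behavioral classifier -- illustrative only, not a real ruleset.
# Score a request on a few signals; once a session crosses the threshold,
# rewrite it internally to a decoy path instead of the real content.
import ipaddress
import time
from collections import defaultdict

SUSPECT_AGENTS = ("gptbot", "claudebot", "ccbot", "bytespider", "python-requests")
SUSPECT_RANGES = [ipaddress.ip_network("203.0.113.0/24")]   # example range; pull real ones from your logs
request_times: dict[str, list[float]] = defaultdict(list)

def score_request(ip: str, user_agent: str) -> int:
    score = 0
    ua = user_agent.lower()
    if any(bot in ua for bot in SUSPECT_AGENTS):
        score += 2
    if any(ipaddress.ip_address(ip) in net for net in SUSPECT_RANGES):
        score += 2
    # Humans don't fetch 30 pages in 10 seconds.
    now = time.monotonic()
    recent = [t for t in request_times[ip] if now - t < 10]
    recent.append(now)
    request_times[ip] = recent
    if len(recent) > 30:
        score += 3
    return score

def route(path: str, ip: str, user_agent: str) -> str:
    # Internal rewrite: suspected bots get the decoy tree, everyone else gets the real page.
    return "/decoy" + path if score_request(ip, user_agent) >= 3 else path
```

In practice you'd express the same checks as web server rules (request maps, ACLs, rate limits) rather than application code, but the scoring idea is the same.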
-
Even if that were possible, I don't want to crash innocent people's browsers. My tar pits are deployed on live environments that normal users could find themselves navigating to, and it's overkill anyway: if you simply respond to 404 Not Found with 200 OK and serve 15 MB on the "error" page, bots will stop going to your site because you're not important enough to deal with. It's a low bar, but your data isn't worth someone looking at your tactics and even thinking about circumventing them. They just stop attacking you.
-
There's more than one style of tar pit. In this case you obviously wouldn't want to use an endless-maze style.
What you want to do here is send them through HAProxy and route on user agent: whenever they come in as Claude, you send them over to a box sitting behind a WANem instance throttled to modem speeds.
They'll immediately realize they've got a hug of death going on and give up.
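If you don't want to stand up HAProxy and WANem just to see the effect, here's the same trick as a throwaway Python sketch (the bot UA substrings, port, and speed are all made up):

```python
# Not HAProxy/WANem -- just a minimal stand-in showing the same idea:
# spot the bot user agent and trickle the response out at roughly modem speed.
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

BOT_MARKERS = ("claudebot", "gptbot", "ccbot")   # illustrative UA substrings
BODY = b"<html><body>" + b"nothing to see here " * 5000 + b"</body></html>"

class ThrottlingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        ua = self.headers.get("User-Agent", "").lower()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        if any(marker in ua for marker in BOT_MARKERS):
            # ~7 KB/s: every request ties up the scraper's connection for ages.
            for i in range(0, len(BODY), 7000):
                self.wfile.write(BODY[i:i + 7000])
                self.wfile.flush()
                time.sleep(1)
        else:
            self.wfile.write(BODY)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), ThrottlingHandler).serve_forever()
```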