Advice on how to deal with AI bots/scrapers?
-
Read access logs and 403 user agents or IPs
-
I guess sending tar bombs can be fun
-
Now I just want to host a web page and expose it with nepenthes...
First, because I'm a big fan of carnivorous plants.
Second, because it lets you poison LLMs, AI and fuck with their data.
Lastly, because I can do my part and say F#CK Y0U to those privacy data hungry a$$holes!
I don't even expose anything directly to the web (everything is accessible through a tunnel like WireGuard) or have any important data to protect from AI or LLMs. But just taking the opportunity to mess with them while they continuously harvest data from everyone is something I was already thinking of but didn't know how to do.
Thanks for the link!
-
Fair. But I haven’t seen any anti-ai-scraper tarpits that do that. The ones I’ve seen mostly just pipe 10MB of /dev/urandom out there.
Also, I assume that the programmers working at AI companies are not literally mentally deficient. They would certainly add
.timeout(10)
or whatever to their scrapers. They probably have something more dynamic than that.
-
There's one I saw that gave the bot a long circular form to fill out or something, I can't exactly remember
-
Yeah, that's a good one.
-
Ah, that's where tuning comes in. Look at the logs, take the average timeout, and tune the tarpit to return a minimal payload: a tiny HTML page containing a single, slightly different URL leading back into the tarpit. Or, better yet, JavaScript that loads a single page of tarpit URLs very slowly. Bots have to be able to run JS, or else they're missing half the content on the web. I'm sure someone has created a JS forkbomb.
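A minimal sketch of that kind of tarpit, using only Python's standard library (the `/page/` path scheme and the delay value are illustrative, not taken from nepenthes):

```python
import http.server
import secrets
import time

def tarpit_page() -> str:
    """Build a tiny HTML page whose only link is a fresh, random
    URL that leads straight back into the tarpit."""
    token = secrets.token_hex(8)  # slightly different URL each time
    return f'<html><body><a href="/page/{token}">more</a></body></html>'

class TarpitHandler(http.server.BaseHTTPRequestHandler):
    # Hypothetical handler: delay each response just under the
    # scraper's assumed timeout, then hand the bot one more breadcrumb.
    DELAY_SECONDS = 5  # tune against timeouts observed in the logs

    def do_GET(self):
        time.sleep(self.DELAY_SECONDS)
        body = tarpit_page().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To run: http.server.HTTPServer(("", 8080), TarpitHandler).serve_forever()
```

Since every response is unique, a naive crawler that deduplicates by URL will never realize it's walking in circles.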
Variety is the spice of life. AI botnet blacklists are probably the better solution for web content; you can run ssh on a different port and run a tarpit on the standard port, and it will barely affect you. But for the web, if you're running a web server you probably want visitors, and tarpits would be harder to set up to catch only bots.
-
Honestly we need some sort of proof of work (PoW)
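A hashcash-style sketch of what that could look like, assuming the server issues a random challenge and the client must find a nonce whose SHA-256 hash starts with a set number of zero bits (the difficulty value here is arbitrary):

```python
import hashlib
import itertools

DIFFICULTY = 12  # leading zero bits required; illustrative value

def solve(challenge: str) -> int:
    """Client side: brute-force a nonce such that
    sha256(challenge + nonce) starts with DIFFICULTY zero bits."""
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") >> (256 - DIFFICULTY) == 0:
            return nonce
    raise AssertionError("unreachable")

def verify(challenge: str, nonce: int) -> bool:
    """Server side: a single hash to check the client's work."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - DIFFICULTY) == 0
```

The asymmetry is the point: verification is one hash, while solving takes ~2^DIFFICULTY attempts on average, which is negligible for one human visitor but adds up fast for a bot requesting millions of pages.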
-
I see your point, but I think you underestimate the skill of coders. You make sure your timeout is inclusive of JavaScript run times. Maybe set a memory limit too. Imagine you wanted to scrape the internet: you could solve all these tarpits. Any capable coder could. Now imagine a team of 20 of the best coders money can buy, each paid 500.000€. They can certainly do the same.
I see the appeal of running a tarpit. But I don't see how they can "trap" anyone but script kiddies.
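The scraper-side defenses described above are simple to sketch. Here's a hypothetical Python wrapper that assumes a separate `render_page.py` worker script that fetches and renders one URL; the script name, timeout, and memory cap are all invented for illustration:

```python
import subprocess
import sys
import resource

FETCH_TIMEOUT = 10          # seconds, wall clock, includes JS rendering time
MEMORY_LIMIT = 512 * 2**20  # bytes allowed per fetch worker

def limited_fetch(url: str):
    """Run the (hypothetical) render_page.py worker in a subprocess so a
    tarpit can cost at most FETCH_TIMEOUT seconds and MEMORY_LIMIT of RAM."""
    def cap_memory():
        # Applied in the child before exec: hard cap on address space.
        resource.setrlimit(resource.RLIMIT_AS, (MEMORY_LIMIT, MEMORY_LIMIT))

    try:
        result = subprocess.run(
            [sys.executable, "render_page.py", url],  # worker script assumed
            capture_output=True,
            timeout=FETCH_TIMEOUT,
            preexec_fn=cap_memory,
            check=True,
        )
        return result.stdout.decode()
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return None  # give up and move on; the tarpit wasted at most 10 s
```

With this, the worst any single tarpit page can do is burn the timeout once before the scheduler moves on.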
-
Nobody is paying software developers 500.000€. It might cost the company that much, but no developers are making that much. The highest software engineer salaries are still in the US, and the average is $120k. High-end salaries are $160k; you might creep up a little more than that, but that's also location specific. Silicon Valley salaries might be higher, but then, it costs far more to live in that area.
In any case, the question is ROI. If you have to spend $500,000 to address some sites that are being clever about wasting your scrapers' time, is that data worth it? Are you going to make your $500k back? And you have to keep spending it, because people keep changing tactics and putting in new mechanisms to ruin your business model. Really, the only time this sort of investment makes sense is when you're breaking into a bank and are going to get a big pay-out in ransomware or outright theft. Getting the contents of my blog is never going to be worth the investment.
Your assumption is that slowly served content is considered not worth scraping. If that's the case, then it's easy enough for people to prevent their content from being scraped: put in sufficient delays. This is an actual method for addressing spam: add a delay to each interaction. Even relatively small delays add up and cost spammers money, especially if you run a large email service and do it at scale.
Make the web a little slower. Add a few seconds to each request, on every web site. Humans might notice, but probably not enough to be a big bother, but the impact on data harvesters will be huge.
If you think this isn't the defense, consider how almost every Cloudflare interaction - and an increasingly large number of other sites - includes a time-wasting front page. They usually say something like "making sure you're human" with a spinning disk, but really all they need to be doing is adding 10 seconds to each request. If a scraper is trying to index only a million pages a day, and each page adds a 10s delay, that's over 2,700 hours of wasted scraper compute time per day. And they're trying to scrape far more than a million pages a day; it's estimated (the actual number isn't revealed) that Google indexes billions of pages every day.
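The back-of-the-envelope math checks out:

```python
# One million pages per day, each delayed by 10 seconds.
pages_per_day = 1_000_000
delay_seconds = 10

wasted_hours = pages_per_day * delay_seconds / 3600
print(round(wasted_hours))  # -> 2778 hours of scraper time, every day
```

At a billion pages a day, the same delay costs nearly 2.8 million machine-hours daily.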
This is good, though; I'm going to go change the rate limit on my web server; maybe those genius software developers will set a timeout such that they move on before they get any content from my site.
-
When I worked in the U.S. I was well above $160k.
When you look at leaks you can see $500k or more for principal engineers. Look at Valve's lawsuit information: https://www.theverge.com/2024/7/13/24197477/valve-employs-few-hundred-people-payroll-redacted
Meta is paying $400k BASE for AI research engineers, with stock options on top, which in my experience adds an additional 300% - 600%, vesting over 2 to 4 years. This is for H1B workers, who are traditionally paid less.
https://h1bdata.info/index.php?em=meta+platforms+inc&job=&city=&year=all+years
ROI does not matter when companies are telling investors that they might be first to AGI. Investors lose their minds over this. At least they will until the AI bubble pops.
I support people resisting if they want by setting up tar pits. But it’s a hobby and isn’t really doing much.
The sheer amount of resources going into this is beyond what people think.
-
you likely couldn't solve tarpits completely. they may hold up the scrapers for less time, but they will still hold them for the duration of the timeout
-
Maybe not with just if statements. But with a heuristic system I bet any site that runs a tar pit will be caught out very quickly.
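Something like that heuristic system is easy to imagine: a toy classifier that flags a host as a probable tarpit when its pages are consistently slow, tiny, and nearly identical. All thresholds here are invented for illustration:

```python
import statistics

def looks_like_tarpit(response_times, page_sizes, min_samples=5):
    """Hypothetical heuristic: given recent response times (seconds) and
    page sizes (bytes) for one host, guess whether it's a tarpit."""
    if len(response_times) < min_samples:
        return False  # not enough data to judge
    slow = statistics.median(response_times) > 5.0    # consistently slow
    tiny = statistics.median(page_sizes) < 1024       # near-empty pages
    uniform = statistics.pstdev(page_sizes) < 64      # pages all look alike
    return slow and tiny and uniform
```

Once a host trips the detector, the scraper just drops it from the crawl queue, which is why a tarpit mostly buys delay rather than a permanent trap.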
-
In the Verge article, are you talking about the table with the "presumably" qualifier in the column headers? If so, not only is it clear they don't know what, exactly, is attributable to the costs, but also that they mention "gross pay", which is AKA "compensation." When a company refers to compensation, they include all benefits: 401k contributions, the value of health insurance, vacation time, social security, bonuses, and any other benefits. When I was running development organizations, a developer who cost me $180k was probably only taking $90k of that home. The rest went to benefits. The rule of thumb was: for every dollar of salary negotiated, I had to budget 1.5-2x that amount. The numbers in the "Presumably: Gross pay" column are very likely cost-to-company, not take-home pay.
I have some serious questions about the data from "h1bdata.info". It claims one software engineer has a salary of $25,304,885? There are some pretty outlandish salaries in there; a program manager in NY making $2,400,000? I'm skeptical about the source of the data on that website. The vast majority of the salaries for engineers, even in that table, are in the range of $100k - $180k, largely dependent on location, and a far cry from a take-home salary of 500.000€.