Advice on how to deal with AI bots/scrapers?
-
When I worked in the U.S. I was well above $160k.
When you look at leaks you can see $500k or more for principal engineers. Look at the information from Valve's lawsuit. https://www.theverge.com/2024/7/13/24197477/valve-employs-few-hundred-people-payroll-redacted
Meta is paying $400k BASE for AI research engineers, with stock options on top, which in my experience adds another 300% - 600%, vesting over 2 to 4 years. And this is for H1B workers, who traditionally are paid less.
https://h1bdata.info/index.php?em=meta+platforms+inc&job=&city=&year=all+years
ROI does not matter when companies are telling investors that they might be first to AGI. Investors lose their minds over this. At least they will until the AI bubble pops.
I support people resisting if they want by setting up tar pits. But it’s a hobby and isn’t really doing much.
The sheer amount of resources going into this is beyond what people think.
-
you couldn't solve tarpits completely that way. they may hold up the scrapers for less time, but they'll still be stuck for the length of the timeout
-
Maybe not with just if statements. But with a heuristic system I bet any site that runs a tar pit will be caught out very quickly.
-
In the Verge article, are you talking about the table with the "presumably" qualifier in the column headers? If so, not only is it clear they don't know what, exactly, is attributable to the costs, but they also mention "gross pay", which is AKA "compensation." When a company refers to compensation, they include all benefits: 401k contributions, the value of health insurance, vacation time, social security, bonuses, and any other benefits. When I was running development organizations, a developer who cost me $180k was probably only taking $90k of that home; the rest went to benefits. The rule of thumb was that for every dollar of salary negotiated, I had to budget 1.5-2x that amount. The numbers in the "Presumably: Gross pay" column are very likely cost-to-company, not take-home pay.
I have some serious questions about the data from "h1bdata.info". It claims one software engineer has a salary of $25,304,885? They've got some pretty outlandish salaries in there; a program manager in NY making $2,400,000? I'm sceptical about the source of the data on that website. The vast majority of the salaries for engineers, even in that table, are in the range of $100k - 180k, largely dependent on location, and a far cry from a take-home salary of 500,000€.
-
That would be extremely tedious. There are hundreds of thousands of scrapers out there.
-
If you’re looking to stop them from wasting your traffic, do not use a tarpit. The whole point of it is that it makes the scraper get stuck on your server forever. That means you pay for the traffic the scraper uses, and it will continually rack up those charges until the people running it wise up and ban your server. The question you gotta ask yourself is, who has more money, you or the massive AI corp?
Tarpits are the dumbest bit of anti-AI tech to come out yet.
-
If you had read the OP, they don’t want the scrapers using up all their traffic.
-
It’s government reporting data. If you find a better source I say go for it. But I used that data for salary negotiations in the past successfully.
I’m not talking about take home. I’m talking about total annual compensation including things like RSU payouts etc.
-
This is the most realistic solution. Adding a 0.5/1s PoW to hosted services isn't gonna be a big deal for the end user, but offers a tiny bit of protection against bots, especially if the work factor is variable and escalates.
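Something like this hashcash-style sketch is roughly what I mean (the construction, names, and numbers here are just illustrative, not any particular product's scheme):

    import hashlib
    import secrets

    def leading_zero_bits(digest: bytes) -> int:
        # Count leading zero bits of the hash output.
        bits = 0
        for byte in digest:
            if byte == 0:
                bits += 8
                continue
            bits += 8 - byte.bit_length()
            break
        return bits

    def issue_challenge() -> str:
        # Server side: a random challenge tied to the client/session.
        return secrets.token_hex(16)

    def solve(challenge: str, difficulty: int) -> int:
        # Client side: brute-force a nonce; this is where the 0.5-1s goes.
        nonce = 0
        while True:
            digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
            if leading_zero_bits(digest) >= difficulty:
                return nonce
            nonce += 1

    def verify(challenge: str, nonce: int, difficulty: int) -> bool:
        # Server side: verification is a single hash, so it stays cheap.
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        return leading_zero_bits(digest) >= difficulty

    challenge = issue_challenge()
    nonce = solve(challenge, difficulty=16)
    assert verify(challenge, nonce, difficulty=16)

The escalation part is just bumping the difficulty per client as their request rate climbs, so normal visitors pay a fraction of a second while an aggressive scraper pays much more.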
-
yes i did read OP.
-
I don't quite understand how this is deployed. Hosting this behind a dedicated subdomain or path kind of defeats the purpose, as the bots are still able to access the actual website no problem.
-
It's also practical for dealing with bots: it forces people not to abuse resources.
-
There are a lot of cryptocurrencies that increase the PoW work factor to combat spam. Nano is one of them, so it's pretty proven technology, too.
-
I'm not putting crypto on my website. However, I think it would be feasible to do Argon2.
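For what it's worth, swapping the hash in a hashcash-style challenge for Argon2 could look something like this, assuming the argon2-cffi package (parameters are just illustrative); memory-hardness makes cheap GPU shortcuts less attractive:

    from argon2.low_level import hash_secret_raw, Type

    def argon2_digest(challenge: str, nonce: int) -> bytes:
        # One attempt is deliberately slow and memory-hungry, so far fewer
        # attempts are needed to hit the same wall-clock cost as SHA-256 PoW.
        # The challenge is assumed to be at least 16 bytes long here.
        return hash_secret_raw(
            secret=f"{challenge}:{nonce}".encode(),
            salt=challenge.encode()[:16],
            time_cost=1,
            memory_cost=8 * 1024,   # in KiB, i.e. 8 MiB per attempt
            parallelism=1,
            hash_len=32,
            type=Type.ID,
        )

The solve/verify logic stays the same; you just check the leading zero bits of this digest instead of a plain SHA-256.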
-
and what happens then? land on a blacklist? for some that's the best outcome
-
You don't have to pay to use it
-
The trick is distinguishing them by behavior and switching what you serve them
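As a rough sketch of what I mean (the thresholds, honeypot paths, and helper names are all made up for illustration, not from any framework):

    import time
    from collections import defaultdict, deque

    WINDOW = 60           # seconds to look back
    MAX_REQUESTS = 120    # humans rarely exceed this per minute
    HONEYPOT_PATHS = {"/trap/"}   # links only bots follow

    history = defaultdict(deque)  # client_ip -> recent request timestamps
    flagged = set()

    def classify(client_ip: str, path: str) -> str:
        """Return 'bot' or 'human-ish' based on crude behavioral signals."""
        now = time.time()
        q = history[client_ip]
        q.append(now)
        while q and now - q[0] > WINDOW:
            q.popleft()

        if path in HONEYPOT_PATHS or len(q) > MAX_REQUESTS:
            flagged.add(client_ip)

        return "bot" if client_ip in flagged else "human-ish"

Your request handler then branches on the classification and only serves the decoy/tarpit content to flagged clients, so regular visitors never see it.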
-
this might not be what you meant, but the word "tar" made me think of tar.gz. Don't most sites compress the HTTP response body with gzip? What's to stop you from sending a zip bomb over the network?
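A rough sketch of the idea (sizes are illustrative; whether you actually want to serve this is another question, and well-behaved clients cap how much they'll decompress):

    import gzip
    import io

    def make_gzip_bomb(decompressed_gib: int = 1, chunk_mib: int = 64) -> bytes:
        # Zeros compress at roughly 1000:1 with gzip, so 1 GiB of payload
        # goes over the wire as about 1 MB.
        buf = io.BytesIO()
        chunk = b"\0" * (chunk_mib * 1024 * 1024)
        with gzip.GzipFile(fileobj=buf, mode="wb", compresslevel=9) as gz:
            for _ in range(decompressed_gib * 1024 // chunk_mib):
                gz.write(chunk)
        return buf.getvalue()  # serve with Content-Encoding: gzip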
-
How would I go about doing that? This seems to be the challenging part. You don't want false positives and you also want replayability.
-
The paid plans get you the "premium" blocklists, which include one specially made to prevent AI scrapers, but a free account will still get you the actual software, the community blocklist, plus up to three "basic" lists.