lads
-
knowing the small number of companies that have access to the computing power to actually train on that data
the 70,717 AI startups worldwide
https://edgedelta.com/company/blog/ai-startup-statistics
Not every company will be training a model as big as the big names, but combined that's a hell of a lot.
Most of those companies are what's called "GPT wrappers". They don't train anything; they just wrap an existing model or service in their software. AI is a trendy word that attracts quick funding, so many companies will say they are AI-related even if they are just making an API call to ChatGPT.
For the few that will attempt to train something, there is already a wide variety of datasets for AI training, or they may try to get data on a very specific topic. But to be scraping the bottom of the pan so hard that you need to scrape some little website, you need to be talking about a model with a massive number of parameters, something that only about 5 companies in the world would actually need to improve their models. The rest of the people trying to train a model are not going to try to scrape the whole internet, because they have no way to process and train on all of that.
Also, if some company is willing to waste a ton of energy training on some data, doing some PoW to obtain that data would be an inconvenience, but I don't think it will stop them. They are literally building nuclear plants for training; a little crypto challenge is nothing in comparison. But it can be quite intrusive for legitimate users. For starters, it prevents browsing with JS disabled.
-
Hail Anubis-chan.
I did not know the FSF is complaining about Anubis. Doesn't a bunch of FSF-allied organizations use it?
-
I don't think it's millions. Take into account that a DDoS attacker is not going to execute JavaScript code, at least not a competent one, so they are not going to run the PoW.
In fact, the unsolicited and unannounced PoW does not provide more protection against DDoS than a captcha.
The mitigation comes from the server's response being smaller and cheaper, so the number of requests needed to saturate the service must increase. How much? That depends on how demanding the "real" website is in comparison.
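As a rough back-of-the-envelope sketch (the numbers below are made up, just to show the shape of the calculation):

```python
# Purely illustrative numbers, not measurements from any real deployment.
page_cost_ms = 50.0       # hypothetical CPU cost of rendering a "real" page
challenge_cost_ms = 0.5   # hypothetical cost of serving the small challenge page

# An attacker that never solves the challenge only ever receives the cheap
# response, so it needs this many times more requests to cause the same load.
amplification = page_cost_ms / challenge_cost_ms
print(f"Attacker needs ~{amplification:.0f}x more requests")  # ~100x here
```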
I doubt the answer is millions. And they would achieve the exact same result with a captcha, without running literal malware on the clients.
That depends on how demanding the "real" website is in comparison. I doubt the answer is millions.
The one service I regularly see using something like this is Invidious. I can totally get how even a bit of bot traffic would make the host's life really hard.
It's true a captcha would achieve something similar, if we assume a captcha-solving AI has a certain minimum cost. That means typical users will have to do a lot more work, though, which is why creepy things like Cloudflare have become popular, and I'm not sure what the advantages are.
-
I mean, the number of pirates correlates with global temperature. That doesn't mean causation.
The rest of the indicators would also match any archiving bot, or any bot in search of big data. We must remember that big data is used for much more than AI. At the end of the day, scraping is cheap, but very few companies in the world have access to the processing power to train on that amount of data. That's why it seems so illogical to me.
How many LLMs that are the result of a full training run are we seeing per year? Ten? Twenty? Even if they update and retrain often, it's not compatible with the volume of requests people are attributing to AI scraping, a volume that would put services at DoS risk. Especially since I would think no AI company would try to scrape the same data twice.
I have also experienced an increase in bot requests on my host. But I think that's just a result of the internet getting bigger: more people using the internet with more diverse intentions, some ill, some not. I've also experienced a big increase in probing and attack attempts in general, and I don't think it's OpenAI trying some outdated Apache vulnerability on my server. The internet is just a bigger sea with more fish in it.
I just looked at my log for this morning. 23% of my total requests were from the user agent GoogleOther. Other visitors include GPTBot, SemanticScholarBot, and Turnitin. Those are the crawlers that are still trying after I've had Anubis on the site for over a month. It was much, much worse before, when they could crawl the site instead of being blocked.
That doesn't include the bots that lie about being bots. Looking back at an older screenshot of my monitoring---I don't have the logs themselves anymore---I seriously doubt I had 43,000 unique visitors using Windows per day in March.
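(For anyone curious how to get numbers like that, here's a minimal sketch of tallying user agents from an access log. It assumes the common combined log format, where the user agent is the last quoted field, and the log path is just an example.)

```python
# Tally user agents from an access log (combined log format assumed).
from collections import Counter
import re

UA_RE = re.compile(r'"([^"]*)"\s*$')  # last quoted field = user agent

counts = Counter()
with open("/var/log/nginx/access.log") as log:  # example path
    for line in log:
        match = UA_RE.search(line)
        if match:
            counts[match.group(1)] += 1

total = sum(counts.values())
for ua, n in counts.most_common(10):
    print(f"{100 * n / total:5.1f}%  {ua}")
```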
-
That depends on how demanding the "real" website is in comparison. I doubt the answer is millions.
The one service I regularly see using something like this is Invidious. I can totally get how even a bit of bot traffic would make the host's life really hard.
It's true a captcha would achieve something similar, if we assume a captcha-solving AI has a certain minimum cost. That means typical users will have to do a lot more work, though, which is why creepy things like Cloudflare have become popular, and I'm not sure what the advantages are.
Cloudflare has a clear advantage in the sense that it can move the front door away from the host and redistribute attacks across thousands of servers. It's also able to analyze attacks from its position of seeing half the internet, so it can develop and deploy very efficient blocklists.
I'm the first one who is not a fan of Cloudflare, though, so I use CrowdSec, which builds community blocklists based on user statistics.
PoW as bot detection is not new. It has been around for ages, but it has never been popular because there have always been better ways to achieve the same or better results. A captcha may be more intrusive for users, but it can actually deflect bots completely (even the best AI may be unable to solve a well-made captcha), while PoW only introduces an energy penalty in the hope of acting as a deterrent.
My bet is that Invidious is under constant attack from Google, for obvious reasons. It's a hard situation to be in overall. It's true that they are a very particular use case: a lot of users and bots interested in their content, very resource-heavy content, and being the target of one of the biggest corporations in the world. I suppose Anubis could act as mitigation there, at the cost of being less user friendly. And if YouTube goes and does the same, it would really make for a shitty experience.
-
I just looked at my log for this morning. 23% of my total requests were from the user agent GoogleOther. Other visitors include GPTBot, SemanticScholarBot, and Turnitin. Those are the crawlers that are still trying after I've had Anubis on the site for over a month. It was much, much worse before, when they could crawl the site instead of being blocked.
That doesn't include the bots that lie about being bots. Looking back at an older screenshot of my monitoring---I don't have the logs themselves anymore---I seriously doubt I had 43,000 unique visitors using Windows per day in March.
Why would they request the same data so many times a day if the objective was AI model training? It makes zero sense.
Also, Google bots obey robots.txt, so they are easy to manage.
There may be tons of reasons Google is crawling your website, from ad research to any kind of research. The only AI-related use I can think of is RAG. But that would take some user requests away, because if the user got the info through the AI response on Google, they would not visit the website. I suppose that would suck for the website owner, but it won't drastically increase the number of requests.
But for training I don't see it; there's no need at all to keep constantly scraping the same website for model training.
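(On the robots.txt point: honoring it is trivial for a well-behaved crawler. A minimal sketch using only Python's standard library, with an example URL:)

```python
# How a polite crawler checks robots.txt before fetching a page.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # example site
rp.read()

# GPTBot and GoogleOther are the user agent strings mentioned above.
for agent in ("GPTBot", "GoogleOther"):
    print(agent, "allowed:", rp.can_fetch(agent, "https://example.com/some/page"))
```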
-
Why would they request the same data so many times a day if the objective was AI model training? It makes zero sense.
Also, Google bots obey robots.txt, so they are easy to manage.
There may be tons of reasons Google is crawling your website, from ad research to any kind of research. The only AI-related use I can think of is RAG. But that would take some user requests away, because if the user got the info through the AI response on Google, they would not visit the website. I suppose that would suck for the website owner, but it won't drastically increase the number of requests.
But for training I don't see it; there's no need at all to keep constantly scraping the same website for model training.
Like I said, [edit: at one point] Facebook requested my robots.txt multiple times a second. You've not convinced me that bot writers care about efficiency.
[edit: they've since stopped, possibly because now I give a 404 to anything claiming to be from facebook]
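(If anyone wants to do the same, here's a minimal sketch of the idea, assuming a Python WSGI app; the actual stack isn't stated here, and the user agent substrings are just examples of things that claim to be from Facebook/Meta.)

```python
# Return 404 to anything that claims to be a Facebook/Meta crawler.
from wsgiref.simple_server import make_server

BLOCKED_UA_SUBSTRINGS = ("facebookexternalhit", "meta-externalagent")  # example list

def app(environ, start_response):
    ua = environ.get("HTTP_USER_AGENT", "").lower()
    if any(s in ua for s in BLOCKED_UA_SUBSTRINGS):
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Not Found"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello, probably-human visitor"]

if __name__ == "__main__":
    make_server("", 8000, app).serve_forever()
```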
-
There are a small number of AI companies training full LLM models, and they usually do a few training runs per year. What most people see as "AI bots" is not actually that.
The influence of AI over the net is another topic. But Anubis is not doing anything about that either: it just makes the AI bots waste more energy getting the data, or at most keeps data under "Anubis protection" out of the training dataset. The AI will still be there.
Am I on the list of "good bots"? Sometimes I scrape websites for price tracking or change tracking. If I see a website running malware on my end, I would most likely just block that site: one legitimate user less.
That's outdated info. Yes, not a lot of scraping is really necessary for training. But LLMs are currently often coupled with web search to improve results.
So for example, if you ask ChatGPT to find a specific product for you, the result doesn't come from the model. Instead, it does a web search, loads the results, summarizes them, and returns you the summary plus the links. This is a time-critical operation, since the user is waiting for the results. It's also a bad operation for the site being scraped in many situations (mostly when looking for info, not for products), since the user might be satisfied with the summary and won't click through to the source.
So if you can delay scraping like that by a few seconds, that's quite significant.
-
Like I said, [edit: at one point] Facebook requested my robots.txt multiple times a second. You've not convinced me that bot writers care about efficiency.
[edit: they've since stopped, possibly because now I give a 404 to anything claiming to be from facebook]
You've not convinced me that bot writers care about efficiency.
And why should bot writers care about efficiency when what they really care about is time? They'll burn all your resources without regard simply because they're not the ones paying.
-
You've not convinced me that bot writers care about efficiency.
And why should bot writers care about efficiency when what they really care about is time? They'll burn all your resources without regard simply because they're not the ones paying.
Yep, they'll just burn taxpayer resources (me and my poor servers) because it's not like they pay taxes anyway (assuming they are either a corporation or not based in the same locality as I am).
There's only one of me and if I'm working on keeping the servers bare minimum functional today I'm not working on making something more awesome for tomorrow. "Linux sysadmin" is only supposed to be up to 30% of my job.
-
Yep, they'll just burn taxpayer resources (me and my poor servers) because it's not like they pay taxes anyway (assuming they are either a corporation or not based in the same locality as I am).
There's only one of me and if I'm working on keeping the servers bare minimum functional today I'm not working on making something more awesome for tomorrow. "Linux sysadmin" is only supposed to be up to 30% of my job.
I mean, I enjoy Linux sysadmining, but fighting bots takes time, experimentation, and research, and there's other stuff I should be doing. For example, accessibility updates to our websites. But accessibility doesn't matter a lick if you can't access the website anyway due to timeouts.
-
There's heavy, and then there's heavy. I don't have any experience dealing with threats like this myself, so I can't comment on what's most common, but we're talking about potentially millions of times more resources for the attacker than the defender here.
There is a lot of AI hype and AI anti-hype right now, that's true.
I do. I have a client with a limited budget whose websites I'm considering putting behind Anubis because they're getting hammered by AI scrapers.
It comes in waves, too, so the website may randomly go down or slow down significantly, which is really annoying because it's unpredictable.
-
It's very intrusive in the sense that it runs a PoW challenge on the client, unsolicited. That's literally like having a cryptominer running on your computer for each challenge.
Everyone can do what they want with their server, of course. But I'm very fond of scraping. For instance, I have FreshRSS running on my server, and the way it works is that when the target website doesn't provide an RSS feed, it scrapes the site to get the articles. I also have another service that scrapes pages to track changes.
I think part of the beauty of the internet is being able to automate processes; software like Anubis puts a globally significant energy tax on these automations.
Once again, everyone can do whatever they want with their server. But the thing I like the least is that the software is being promoted, with some great PR, as part of some great anti-AI crusade; I don't know if that's the devs themselves or some other party. I don't like this mostly because I think it's disinformation, and it's manipulative towards people who are perhaps easy to manipulate if you say the right words. I also think it's a discourse that pushes certain topics toward radicalization, and I'm a firm believer that right now we need to reduce radicalization overall, not increase it.
A proof-of-work challenge is infinitely better than the alternative of "fuck you, you're accessing this through a VPN and the IP is banned for being owned by Amazon (or literally any data center)".
-
Hail Anubis-chan.
any "bot stopper" ends up stopping me somehow. Including anubis. I'm pretty sure ive been cursed by the rng gods because even at 40 KH/s, I get stuck on the pages for like 2 minutes before it tells me success.
Similar things like hcaptcha or cloudflare turnstile either never load or never succeed. Recaptcha gaslights me into thinking I was wrong.
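(For what it's worth, the "cursed by the RNG" feeling is real: with a hash-based challenge the solve time is geometrically distributed, so unlucky runs several times longer than the average are expected now and then. Rough arithmetic below, assuming a leading-zero-bits SHA-256 scheme with an assumed difficulty; the actual settings depend on the site.)

```python
import math

hash_rate = 40_000        # 40 KH/s, as reported above
difficulty_bits = 20      # assumed difficulty; expected work is 2**bits hashes

expected_hashes = 2 ** difficulty_bits
print(f"average solve time ~ {expected_hashes / hash_rate:.0f} s")  # ~26 s

# Solve time is roughly geometric/exponential, so bad luck is common:
# the chance of needing more than 4x the average work is about e^-4.
print(f"P(worse than 4x average) ~ {math.exp(-4):.1%}")  # ~1.8%
```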
-
Correct. Anubis' goal is to decrease the web traffic that hits the server, not to prevent scraping altogether. I should also clarify that this works because it costs the scrapers time with each request, not because it bogs down the CPU.
Why not then just make it a setTimeout or something so that it doesn't nuke the CPU of old devices?
-
This post did not contain any content.
The block underneath AI is Python.
-
Why not then just make it a setTimeout or something so that it doesn't nuke the CPU of old devices?
Crawlers don't have to follow conventions or specifications. If one has a setTimeout implementation that doesn't wait the specified amount of time and simply executes the callback immediately, it defeats the system. Proof-of-work is meant to ensure that it's impossible to get around the time factor, because of computational inefficiency.
Anubis is an emergency solution against the flood of scrapers deployed by massive AI companies. Everybody wishes it wasn't necessary.
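(A minimal sketch of why the work can't be faked the way a timer can; this is illustrative only, not Anubis's actual protocol. The server hands out a random challenge and a difficulty, the client has to brute-force a counter whose hash has enough leading zero bits, and the server verifies the answer with a single hash.)

```python
# Illustrative proof-of-work sketch, not Anubis's real scheme.
import hashlib
import os

def leading_zero_bits(digest: bytes) -> int:
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def solve(challenge: bytes, difficulty: int) -> int:
    """Client side: brute-force a counter. There is no shortcut."""
    counter = 0
    while True:
        digest = hashlib.sha256(challenge + counter.to_bytes(8, "big")).digest()
        if leading_zero_bits(digest) >= difficulty:
            return counter
        counter += 1

def verify(challenge: bytes, difficulty: int, counter: int) -> bool:
    """Server side: a single hash is enough to check the client's work."""
    digest = hashlib.sha256(challenge + counter.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= difficulty

challenge = os.urandom(16)            # issued per visitor by the server
answer = solve(challenge, 16)         # the expensive part, done by the client
print(verify(challenge, 16, answer))  # True, checked cheaply by the server
```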
-
any "bot stopper" ends up stopping me somehow. Including anubis. I'm pretty sure ive been cursed by the rng gods because even at 40 KH/s, I get stuck on the pages for like 2 minutes before it tells me success.
Similar things like hcaptcha or cloudflare turnstile either never load or never succeed. Recaptcha gaslights me into thinking I was wrong.
Have you ever had a Blade Runner moment?