lads
-
knowing the small number of companies that have access to the computing power to actually train on that data
the 70,717 AI startups worldwide
https://edgedelta.com/company/blog/ai-startup-statistics
Not every company will be training a model as big as the big names, but combined that's a hell of a lot.
Most of those companies are what's called "GPT wrappers". They don't train anything; they just wrap an existing model or service in their software. AI is a trendy word that attracts quick funding, so many companies will say they are AI-related even if they are just making an API call to ChatGPT.
For the few that will attempt to train something, there is already a wide variety of datasets for AI training, or they may try to get data on a very specific topic. But to be scraping the bottom of the pan so hard that you need to scrape some little website, you need to be talking about a model with a massive number of parameters, something that only about 5 companies in the world would actually need to improve their models. The rest of the people trying to train a model are not going to try to scrape the whole internet, because they have no way to process and train on all of that.
Also, if some company is willing to waste a ton of energy training on some data, doing some PoW to obtain that data would be an inconvenience, but I don't think it will stop them. They are literally building nuclear plants for training; a little crypto challenge is nothing in comparison. But it can be quite intrusive for legitimate users. For starters, it prevents browsing with JS disabled.
-
Hail Anubis-chan.
I did not know the FSF is complaining about Anubis. Doesn't a bunch of FSF-allied organizations use it?
-
I don't think it's millions. Take into account that a DDoS attacker is not going to execute JavaScript code, at least not a competent one, so they are not going to run the PoW.
In fact, the unsolicited and unannounced PoW does not provide more protection against DDoS than a captcha.
The mitigation comes from the server's response being smaller and cheaper, so the number of requests needed to saturate the service must increase. How much? That depends on how demanding the "real" website is in comparison.
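As a rough back-of-the-envelope sketch (the numbers below are made up, just to show the shape of the calculation):

```python
# Purely illustrative numbers, not measurements from any real deployment.
page_cost_ms = 50.0       # hypothetical CPU cost of rendering a "real" page
challenge_cost_ms = 0.5   # hypothetical cost of serving the small challenge page

# An attacker that never solves the challenge only ever receives the cheap
# response, so it needs this many times more requests to cause the same load.
amplification = page_cost_ms / challenge_cost_ms
print(f"Attacker needs ~{amplification:.0f}x more requests")  # ~100x here
```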
I doubt the answer is millions. And they would achieve the exact same result with a captcha, without running literal malware on the clients.
That depends on how demanding the "real" website is in comparison. I doubt the answer is millions.
The one service I regularly see using something like this is Invidious. I can totally get how even a bit of bot traffic would make the host's life really hard.
It's true a captcha would achieve something similar, if we assume a captcha-solving AI has a certain minimum cost. That means typical users will have to do a lot more work, though, which is why creepy things like Cloudflare have become popular, and I'm not sure what the advantages are.
-
I mean, the number of pirates correlates with global temperature. That doesn't mean causation.
The rest of the indicators would also match any archiving bot, or any bot in search of big data. We must remember that big data is used for much more than AI. At the end of the day, scraping is cheap, but very few companies in the world have access to the processing power to train on that amount of data. That's why it seems so illogical to me.
How many LLMs that are the result of a full training run are we seeing per year? Ten? Twenty? Even if they update and retrain often, it's not compatible with the volume of requests people are attributing to AI scraping, a volume that would put services at DoS risk. Especially since I would think no AI company would try to scrape the same data twice.
I have also experienced an increase in bot requests on my host. But I think that's just a result of the internet getting bigger: more people using the internet with more diverse intentions, some ill, some not. I've also experienced a big increase in probing and attack attempts in general, and I don't think it's OpenAI trying some outdated Apache vulnerability on my server. The internet is just a bigger sea with more fish in it.
I just looked at my log for this morning. 23% of my total requests were from the user agent GoogleOther. Other visitors include GPTBot, SemanticScholarBot, and Turnitin. Those are the crawlers that are still trying after I've had Anubis on the site for over a month. It was much, much worse before, when they could crawl the site instead of being blocked.
That doesn't include the bots that lie about being bots. Looking back at an older screenshot of my monitoring---I don't have the logs themselves anymore---I seriously doubt I had 43,000 unique visitors using Windows per day in March.
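(For anyone curious how to get numbers like that, here's a minimal sketch of tallying user agents from an access log. It assumes the common combined log format, where the user agent is the last quoted field, and the log path is just an example.)

```python
# Tally user agents from an access log (combined log format assumed).
from collections import Counter
import re

UA_RE = re.compile(r'"([^"]*)"\s*$')  # last quoted field = user agent

counts = Counter()
with open("/var/log/nginx/access.log") as log:  # example path
    for line in log:
        match = UA_RE.search(line)
        if match:
            counts[match.group(1)] += 1

total = sum(counts.values())
for ua, n in counts.most_common(10):
    print(f"{100 * n / total:5.1f}%  {ua}")
```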
-
That depends on how demanding the "real" website is in comparison. I doubt the answer is millions.
The one service I regularly see using something like this is Invidious. I can totally get how even a bit of bot traffic would make the host's life really hard.
It's true a captcha would achieve something similar, if we assume a captcha-solving AI has a certain minimum cost. That means typical users will have to do a lot more work, though, which is why creepy things like Cloudflare have become popular, and I'm not sure what the advantages are.
Cloudflare has a clear advantage in the sense that it can move the front door away from the host and redistribute attacks across thousands of servers. It's also able to analyze attacks from its position of seeing half the internet, so it can develop and deploy very efficient blocklists.
I'm the first one who is not a fan of Cloudflare, though, so I use CrowdSec, which builds community blocklists based on user statistics.
PoW as bot detection is not new. It has been around for ages, but it has never been popular because there have always been better ways to achieve the same or better results. A captcha may be more intrusive for users, but it can actually deflect bots completely (even the best AI may be unable to solve a well-made captcha), while PoW only introduces an energy penalty in the hope of acting as a deterrent.
My bet is that Invidious is under constant attack from Google, for obvious reasons. It's a hard situation to be in overall. It's true that they are a very particular use case: a lot of users and bots interested in their content, very resource-heavy content, and being the target of one of the biggest corporations in the world. I suppose Anubis could act as mitigation there, at the cost of being less user friendly. And if YouTube goes and does the same, it would really make for a shitty experience.
-
I just looked at my log for this morning. 23% of my total requests were from the user agent GoogleOther. Other visitors include GPTBot, SemanticScholarBot, and Turnitin. Those are the crawlers that are still trying after I've had Anubis on the site for over a month. It was much, much worse before, when they could crawl the site instead of being blocked.
That doesn't include the bots that lie about being bots. Looking back at an older screenshot of my monitoring---I don't have the logs themselves anymore---I seriously doubt I had 43,000 unique visitors using Windows per day in March.
Why would they request the same data so many times a day if the objective was AI model training? It makes zero sense.
Also, Google bots obey robots.txt, so they are easy to manage.
There may be tons of reasons Google is crawling your website, from ad research to any kind of research. The only AI-related use I can think of is RAG. But that would take some user requests away, because if the user got the info through the AI response on Google, they would not visit the website. I suppose that would suck for the website owner, but it won't drastically increase the number of requests.
But for training I don't see it; there's no need at all to keep constantly scraping the same website for model training.
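(On the robots.txt point: honoring it is trivial for a well-behaved crawler. A minimal sketch using only Python's standard library, with an example URL:)

```python
# How a polite crawler checks robots.txt before fetching a page.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # example site
rp.read()

# GPTBot and GoogleOther are the user agent strings mentioned above.
for agent in ("GPTBot", "GoogleOther"):
    print(agent, "allowed:", rp.can_fetch(agent, "https://example.com/some/page"))
```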
-
Why would they request the same data so many times a day if the objective was AI model training? It makes zero sense.
Also, Google bots obey robots.txt, so they are easy to manage.
There may be tons of reasons Google is crawling your website, from ad research to any kind of research. The only AI-related use I can think of is RAG. But that would take some user requests away, because if the user got the info through the AI response on Google, they would not visit the website. I suppose that would suck for the website owner, but it won't drastically increase the number of requests.
But for training I don't see it; there's no need at all to keep constantly scraping the same website for model training.
Like I said, [edit: at one point] Facebook requested my robots.txt multiple times a second. You've not convinced me that bot writers care about efficiency.
[edit: they've since stopped, possibly because now I give a 404 to anything claiming to be from facebook]
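(If anyone wants to do the same, here's a minimal sketch of the idea, assuming a Python WSGI app; the actual stack isn't stated here, and the user agent substrings are just examples of things that claim to be from Facebook/Meta.)

```python
# Return 404 to anything that claims to be a Facebook/Meta crawler.
from wsgiref.simple_server import make_server

BLOCKED_UA_SUBSTRINGS = ("facebookexternalhit", "meta-externalagent")  # example list

def app(environ, start_response):
    ua = environ.get("HTTP_USER_AGENT", "").lower()
    if any(s in ua for s in BLOCKED_UA_SUBSTRINGS):
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Not Found"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello, probably-human visitor"]

if __name__ == "__main__":
    make_server("", 8000, app).serve_forever()
```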
-
There are a small number of AI companies training full LLM models, and they usually do a few training runs per year. What most people see as "AI bots" is not actually that.
The influence of AI over the net is another topic. But Anubis is not doing anything about that either: it just makes the AI bots waste more energy getting the data, or at most keeps data under "Anubis protection" out of the training dataset. The AI will still be there.
Am I on the list of "good bots"? Sometimes I scrape websites for price tracking or change tracking. If I see a website running malware on my end, I would most likely just block that site: one legitimate user less.
That's outdated info. Yes, not a lot of scraping is really necessary for training. But LLMs are currently often coupled with web search to improve results.
So for example, if you ask ChatGPT to find a specific product for you, the result doesn't come from the model. Instead, it does a web search, loads the results, summarizes them, and returns you the summary plus the links. This is a time-critical operation, since the user is waiting for the results. It's also a bad operation for the site being scraped in many situations (mostly when looking for info, not for products), since the user might be satisfied with the summary and won't click through to the source.
So if you can delay scraping like that by a few seconds, that's quite significant.
-
Like I said, [edit: at one point] Facebook requested my robots.txt multiple times a second. You've not convinced me that bot writers care about efficiency.
[edit: they've since stopped, possibly because now I give a 404 to anything claiming to be from facebook]
You've not convinced me that bot writers care about efficiency.
And why should bot writers care about efficiency when what they really care about is time? They'll burn all your resources without regard simply because they're not the ones paying.
-
You've not convinced me that bot writers care about efficiency.
And why should bot writers care about efficiency when what they really care about is time? They'll burn all your resources without regard simply because they're not the ones paying.
Yep, they'll just burn taxpayer resources (me and my poor servers) because it's not like they pay taxes anyway (assuming they are either a corporation or not based in the same locality as I am).
There's only one of me and if I'm working on keeping the servers bare minimum functional today I'm not working on making something more awesome for tomorrow. "Linux sysadmin" is only supposed to be up to 30% of my job.
-
Yep, they'll just burn taxpayer resources (me and my poor servers) because it's not like they pay taxes anyway (assuming they are either a corporation or not based in the same locality as I am).
There's only one of me and if I'm working on keeping the servers bare minimum functional today I'm not working on making something more awesome for tomorrow. "Linux sysadmin" is only supposed to be up to 30% of my job.
I mean, I enjoy Linux sysadmining, but fighting bots takes time, experimentation, and research, and there's other stuff I should be doing. For example, accessibility updates to our websites. But accessibility doesn't matter a lick if you can't access the website anyway due to timeouts.
-
There's heavy, and then there's heavy. I don't have any experience dealing with threats like this myself, so I can't comment on what's most common, but we're talking about potentially millions of times more resources for the attacker than the defender here.
There is a lot of AI hype and AI anti-hype right now, that's true.
I do. I have a client with a limited budget whose websites I'm considering putting behind Anubis because they're getting hammered by AI scrapers.
It comes in waves, too, so the website may randomly go down or slow down significantly, which is really annoying because it's unpredictable.
-
It's very intrusive in the sense that it runs a PoW challenge on the client, unsolicited. That's literally like having a cryptominer running on your computer for each challenge.
Everyone can do what they want with their server, of course. But I'm very fond of scraping. For instance, I have FreshRSS running on my server, and the way it works is that when the target website doesn't provide an RSS feed, it scrapes the site to get the articles. I also have another service that scrapes pages to track changes.
I think part of the beauty of the internet is being able to automate processes; software like Anubis puts a globally significant energy tax on these automations.
Once again, everyone can do whatever they want with their server. But the thing I like the least is that the software is being promoted, with some great PR, as part of some great anti-AI crusade; I don't know if that's the devs themselves or some other party. I don't like this mostly because I think it's disinformation, and it's manipulative towards people who are perhaps easy to manipulate if you say the right words. I also think it's a discourse that pushes certain topics toward radicalization, and I'm a firm believer that right now we need to reduce radicalization overall, not increase it.
A proof-of-work challenge is infinitely better than the alternative of "fuck you, you're accessing this through a VPN and the IP is banned for being owned by Amazon (or literally any data center)".
-
Hail Anubis-chan.
any "bot stopper" ends up stopping me somehow. Including anubis. I'm pretty sure ive been cursed by the rng gods because even at 40 KH/s, I get stuck on the pages for like 2 minutes before it tells me success.
Similar things like hcaptcha or cloudflare turnstile either never load or never succeed. Recaptcha gaslights me into thinking I was wrong.
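(For what it's worth, the "cursed by the RNG" feeling is real: with a hash-based challenge the solve time is geometrically distributed, so unlucky runs several times longer than the average are expected now and then. Rough arithmetic below, assuming a leading-zero-bits SHA-256 scheme with an assumed difficulty; the actual settings depend on the site.)

```python
import math

hash_rate = 40_000        # 40 KH/s, as reported above
difficulty_bits = 20      # assumed difficulty; expected work is 2**bits hashes

expected_hashes = 2 ** difficulty_bits
print(f"average solve time ~ {expected_hashes / hash_rate:.0f} s")  # ~26 s

# Solve time is roughly geometric/exponential, so bad luck is common:
# the chance of needing more than 4x the average work is about e^-4.
print(f"P(worse than 4x average) ~ {math.exp(-4):.1%}")  # ~1.8%
```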
-
Correct. Anubis' goal is to decrease the web traffic that hits the server, not to prevent scraping altogether. I should also clarify that this works because it costs the scrapers time with each request, not because it bogs down the CPU.
Why not then just make it a setTimeout or something so that it doesn't nuke the CPU of old devices?
-
This post did not contain any content.
The block underneath AI is Python.
-
Why not then just make it a setTimeout or something so that it doesn't nuke the CPU of old devices?
Crawlers don't have to follow conventions or specifications. If one has a setTimeout implementation that doesn't wait the specified amount of time and simply executes the callback immediately, it defeats the system. Proof-of-work is meant to ensure that it's impossible to get around the time factor, because of computational inefficiency.
Anubis is an emergency solution against the flood of scrapers deployed by massive AI companies. Everybody wishes it wasn't necessary.
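(A minimal sketch of why the work can't be faked the way a timer can; this is illustrative only, not Anubis's actual protocol. The server hands out a random challenge and a difficulty, the client has to brute-force a counter whose hash has enough leading zero bits, and the server verifies the answer with a single hash.)

```python
# Illustrative proof-of-work sketch, not Anubis's real scheme.
import hashlib
import os

def leading_zero_bits(digest: bytes) -> int:
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def solve(challenge: bytes, difficulty: int) -> int:
    """Client side: brute-force a counter. There is no shortcut."""
    counter = 0
    while True:
        digest = hashlib.sha256(challenge + counter.to_bytes(8, "big")).digest()
        if leading_zero_bits(digest) >= difficulty:
            return counter
        counter += 1

def verify(challenge: bytes, difficulty: int, counter: int) -> bool:
    """Server side: a single hash is enough to check the client's work."""
    digest = hashlib.sha256(challenge + counter.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= difficulty

challenge = os.urandom(16)            # issued per visitor by the server
answer = solve(challenge, 16)         # the expensive part, done by the client
print(verify(challenge, 16, answer))  # True, checked cheaply by the server
```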
-
any "bot stopper" ends up stopping me somehow. Including anubis. I'm pretty sure ive been cursed by the rng gods because even at 40 KH/s, I get stuck on the pages for like 2 minutes before it tells me success.
Similar things like hcaptcha or cloudflare turnstile either never load or never succeed. Recaptcha gaslights me into thinking I was wrong.
Have you ever had a Blade Runner moment?