The Open-Source Software Saving the Internet From AI Bot Scrapers

[email protected]

TL;DR: You should have both due to the explicit breaking of the robots.txt contract by AI companies.

AI generally doesn't obey robots.txt. That file is just notifying scrapers what they shouldn't scrape, but relies on good faith of the scrapers. Many AI companies have explicitly chosen not no to comply with robots.txt, thus breaking the contract, so this is a system that causes those scrapers that are not willing to comply to get stuck in a black hole of junk and waste their time. This is a countermeasure, but not a solution. It's just way less complex than other options that just block these connections, but then make you get pounded with retries. This way the scraper bot gets stuck for a while and doesn't waste as many of your resources blocking them over and over again.

[email protected]

How will Anubis attack if browsers start acting like manual scrapers used by AI companies to collect information?

Because OpenAI is planning to release an AI-powered browser, what happens if it ends up using it as another way to collect information?

Blocking all Chromium browsers, I don't think, is a good idea.

[email protected]

Everytime I see anubis I get happy because I know the website has some quality information.

[email protected]

At best these browsers are going to have some efficient CPU implementation.

Means absolutely nothing in context to what I said, or any information contained in this article. Does not relate to anything I originally replied to.

Scrapers can send these challenges off to dedicated GPU farms or even FPGAs, which are an order of magnitude faster and more efficient.

Not what's happening here. Be Serious.

I would prefer if you would just engage with my arguments.

I did, your arguments are bad and you're being intellectually disingenuous.

This is very condescending.

Yeah, that's the point. Very Astute

[email protected]

"Yes", for any bits the user sees. The frontend UI can be behind Anubis without issues. The API, including both user and federation, cannot. We expect "bots" to use an API, so you can't put human verification in front of it. These "bots* also include applications that aren't aware of Anubis, or unable to pass it, like all third party Lemmy apps.

That does stop almost all generic AI scraping, though it does not prevent targeted abuse.

[email protected]

The usage of the phone's CPU is usually around 1w, but could jump to 5-6w when boosting to solve a nasty challenge. At 20s per challenge, that's 0.03 watt hours. You need to see a thousand of these challenges to use up 0.03 kwh

My last power bill was around 300 kwh or 10,000 more than what your phone would use on those thousand challenges. Or a million times more than what this 20s challenge would use.

[email protected]

Someone making an argument like that clearly does not understand the situation. Just 4 years ago, a robots.txt was enough to keep most bots away, and hosting personal git on the web required very little resources. With AI companies actively profiting off stealing everything, a robots.txt doesn't mean anything. Now, even a relatively small git web host takes an insane amount of resources. I'd know - I host a Forgejo instance. Caching doesn't matter, because diffs berween two random commits are likely unique. Ratelimiting doesn't matter, they will use different IP (ranges) and user agents. It would also heavily impact actual users "because the site is busy".

A proof-of-work solution like Anubis is the best we have currently. The least possible impact to end users, while keeping most (if not all) AI scrapers off the site.

[email protected]

Scrapers can send these challenges off to dedicated GPU farms or even FPGAs, which are an order of magnitude faster and more efficient.

Lets assume for the sake of argument, an AI scraper company actually attempted this. They don't, but lets assume it anyway.

The next Anubis release could include (for example), SHA256 instead of SHA1. This would be a simple, and basically transparent update for admins and end users. The AI company that invested into offloading the PoW to somewhere more efficient now has to spend significantly more resources changing their implementation than what it took for the devs and users of Anubis.

Yes, it technically remains a game of "cat and mouse", but heavily stacked against the cat. One step for Anubis is 2000 steps for a company reimplementing its client in more efficient hardware. Most of the Anubis changes can even be done without impacting the end users at all. That's a game AI companies aren't willing to play, because they've basically already lost. It doesn't really matter how "efficient" the implementation is, if it can be rendered unusable by a small Anubis update.

[email protected]

I don't understand how/why this got so popular out of nowhere... the same solution has already existed for years in the form of haproxy-protection and a couple others... but nobody seems to care about those.

[email protected]

Probably because the creator had a blog post that got shared around at a point in time where this exact problem was resonating with users.

It's not always about being first but about marketing.

[email protected]

Youre more than welcome to try and implement something better.

[email protected]

"You criticize society yet you participate in it. Curious."

[email protected]

I've seen this pop up on websites a lot lately. Usually it takes a few seconds to load the website but there have been occasions where it seemed to hang as it was stuck on that screen for minutes and I ended up closing my browser tab because the website just wouldn't load.

Is this a (known) issue or is it intended to be like this?

[email protected]

If you're deliberately belittling me I won't engage. Goodbye.

[email protected]

anubis is basically a bitcoin miner, with the difficulty turned way down (and obviously not resulting in any coins), so it's inherently random. if it takes minutes it does seem like something is wrong though. maybe a network error?

[email protected]

then have them pay for it.

[email protected]

Open source is also the AI scraper bots AND the internet itself, it is every character in the story.

[email protected]

Yes, it would make lemmy as unsearchable as discord. Instead of unsearchable as pinterest.

[email protected]

Support, pay, and get it

[email protected]

All this could be avoided by making submit photo id to login into a account.

agnos.is Forums

The Open-Source Software Saving the Internet From AI Bot Scrapers