The Open-Source Software Saving the Internet From AI Bot Scrapers
-
To be clear, I am not minimizing the problems of scrapers. I am merely pointing out that this strategy of proof-of-work has nasty side effects and we need something better.
These issues are not short term. PoW means entering an arms race against an adversary with bottomless pockets, and it inherently requires a ton of useless computation in the browser.
Moving towards something based on heuristics, which is what the developer was talking about there, is much better. But that is basically what many others are already doing (like the "I am not a robot" checkmark), and it is fundamentally different from the PoW that I argue against.
Go do heuristics, not PoW.
You're more than welcome to try and implement something better.
-
You're more than welcome to try and implement something better.
"You criticize society yet you participate in it. Curious."
-
This post did not contain any content.
I've seen this pop up on websites a lot lately. Usually it takes a few seconds to load the website, but there have been occasions where it seemed to hang, stuck on that screen for minutes, and I ended up closing my browser tab because the website just wouldn't load.
Is this a (known) issue or is it intended to be like this?
-
At best these browsers are going to have some efficient CPU implementation.
That means absolutely nothing in the context of what I said, or of any information contained in this article. It does not relate to anything I originally replied to.
Scrapers can send these challenges off to dedicated GPU farms or even FPGAs, which are an order of magnitude faster and more efficient.
That's not what's happening here. Be serious.
I would prefer if you would just engage with my arguments.
I did, your arguments are bad and you're being intellectually disingenuous.
This is very condescending.
Yeah, that's the point. Very astute.
If you're deliberately belittling me I won't engage. Goodbye.
-
I've seen this pop up on websites a lot lately. Usually it takes a few seconds to load the website, but there have been occasions where it seemed to hang, stuck on that screen for minutes, and I ended up closing my browser tab because the website just wouldn't load.
Is this a (known) issue or is it intended to be like this?
Anubis is basically a Bitcoin miner with the difficulty turned way down (and obviously not resulting in any coins), so it's inherently random. If it takes minutes, it does seem like something is wrong though. Maybe a network error?
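For a concrete picture of why the wait is random, here's a minimal hashcash-style sketch in Go; this is not Anubis's actual code, and the challenge string and leading-zero-bits rule are just assumptions to illustrate the idea:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"math/bits"
)

// countLeadingZeroBits returns how many leading bits of the hash are zero.
func countLeadingZeroBits(h [32]byte) int {
	total := 0
	for _, b := range h {
		if b == 0 {
			total += 8
			continue
		}
		total += bits.LeadingZeros8(b)
		break
	}
	return total
}

// solve brute-forces a nonce so that SHA-256(challenge || nonce) has at
// least `difficulty` leading zero bits. The number of attempts is random:
// about 2^difficulty on average, but any single run can be much shorter
// or much longer.
func solve(challenge string, difficulty int) (uint64, int) {
	var nonce uint64
	attempts := 0
	buf := make([]byte, 8)
	for {
		attempts++
		binary.BigEndian.PutUint64(buf, nonce)
		h := sha256.Sum256(append([]byte(challenge), buf...))
		if countLeadingZeroBits(h) >= difficulty {
			return nonce, attempts
		}
		nonce++
	}
}

func main() {
	nonce, attempts := solve("example-challenge-token", 16)
	fmt.Printf("found nonce %d after %d attempts\n", nonce, attempts)
}
```

The server only has to verify one hash while the client grinds through thousands on average; and because each attempt is a lottery, one page load can get lucky after a handful of hashes while another grinds far longer, which is why the splash screen sometimes flashes by and sometimes lingers.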
-
It's just not my style, ok, is all I'm saying, and it's nothing I'd be able to get past all my superiors as a recommendation of software to use.
Then have them pay for it.
-
This post did not contain any content.
Open source is also the AI scraper bots AND the internet itself; it is every character in the story.
-
Ooh can this work with Lemmy without affecting federation?
Yes, it would make Lemmy as unsearchable as Discord, instead of as unsearchable as Pinterest.
-
It's just not my style, ok, is all I'm saying, and it's nothing I'd be able to get past all my superiors as a recommendation of software to use.
Support, pay, and get it
-
This is fantastic and I appreciate that it scales well on the server side.
AI scraping is a scourge, and I would love to know the collective amount of power wasted on countermeasures like this, added to the total wasted by AI itself.
All this could be avoided by making people submit photo ID to log in to an account.
-
Someone making an argument like that clearly does not understand the situation. Just 4 years ago, a robots.txt was enough to keep most bots away, and hosting personal git on the web required very little resources. With AI companies actively profiting off stealing everything, a robots.txt doesn't mean anything. Now, even a relatively small git web host takes an insane amount of resources. I'd know - I host a Forgejo instance. Caching doesn't matter, because diffs between two random commits are likely unique. Ratelimiting doesn't matter, they will use different IP (ranges) and user agents. It would also heavily impact actual users "because the site is busy".
A proof-of-work solution like Anubis is the best we have currently. The least possible impact to end users, while keeping most (if not all) AI scrapers off the site.
This would not be a problem if one bot scraped once and the result was then mirrored to all on Big Tech's dime (Cloudflare, Tailscale), but since they are all competing now, they think their edge is going to be their own, better scraper setup, and they won't share.
Maybe there should just be a web-to-torrent bridge, so the data is pushed out once by the server and the swarm does the heavy lifting as a cache.
-
This would not be a problem if one bot scraped once and the result was then mirrored to all on Big Tech's dime (Cloudflare, Tailscale), but since they are all competing now, they think their edge is going to be their own, better scraper setup, and they won't share.
Maybe there should just be a web-to-torrent bridge, so the data is pushed out once by the server and the swarm does the heavy lifting as a cache.
No, it'd still be a problem; every diff between commits is expensive to render to the web, even if "only one company" is scraping it, "only one time". Many of these applications are designed for humans, not scrapers.
-
A JavaScript-less check was released recently; I just read about it. It uses a meta refresh HTML tag and a delay. It's not the default though, since it's new.
The source I assume: challenges/metarefresh.
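I haven't read the metarefresh challenge's source, so treat this as a guess at the general shape rather than a description of how it actually works: serve an interstitial page whose meta refresh tag sends the browser back after a delay with a signed token, and only let through clients that actually waited. A rough Go sketch, where the /verify endpoint, cookie name, delay, and HMAC scheme are all made up for illustration:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/http"
	"strconv"
	"time"
)

var secret = []byte("replace-me") // hypothetical server-side secret

// sign produces an HMAC over the issue timestamp so clients can't forge it.
func sign(ts int64) string {
	m := hmac.New(sha256.New, secret)
	fmt.Fprintf(m, "%d", ts)
	return hex.EncodeToString(m.Sum(nil))
}

// challenge serves an interstitial page that refreshes to /verify after a
// short delay, carrying the issue timestamp and its signature.
func challenge(w http.ResponseWriter, r *http.Request) {
	ts := time.Now().Unix()
	fmt.Fprintf(w, `<html><head>
<meta http-equiv="refresh" content="2; url=/verify?ts=%d&sig=%s">
</head><body>Checking your browser...</body></html>`, ts, sign(ts))
}

// verify only passes clients that waited out the delay with a valid token.
func verify(w http.ResponseWriter, r *http.Request) {
	ts, err := strconv.ParseInt(r.URL.Query().Get("ts"), 10, 64)
	sig := r.URL.Query().Get("sig")
	if err != nil || !hmac.Equal([]byte(sig), []byte(sign(ts))) || time.Now().Unix()-ts < 2 {
		http.Error(w, "challenge failed", http.StatusForbidden)
		return
	}
	http.SetCookie(w, &http.Cookie{Name: "challenge-passed", Value: sig})
	fmt.Fprintln(w, "ok, welcome")
}

func main() {
	http.HandleFunc("/", challenge)
	http.HandleFunc("/verify", verify)
	http.ListenAndServe(":8080", nil)
}
```

The delay does the heavy lifting here: a dumb scraper either ignores the meta refresh entirely or follows it immediately, and both fail the timestamp check, all without running any JavaScript on the client.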
-
Probably because the creator had a blog post that got shared around at a point in time where this exact problem was resonating with users.
It's not always about being first but about marketing.
It's not always about being first but about marketing.
And one has a cute catgirl mascot, the other a website that looks like a blockchain techbro startup.
I'm even willing to bet the number of people that set up Anubis just to get the cute splash screen isn't insignificant.
-
Anubis is basically a Bitcoin miner with the difficulty turned way down (and obviously not resulting in any coins), so it's inherently random. If it takes minutes, it does seem like something is wrong though. Maybe a network error?
Adding to this, some sites set the difficulty way higher than others. nerdvpn's Invidious and Redlib instances take about 5 seconds and some ~20k hashes, while privacyredirect's instances are almost instant with fewer than 50 hashes each time.
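For a rough sense of how those hash counts map to difficulty, assuming difficulty means "the hash must start with d zero bits" (which may not be exactly how these instances configure it): each attempt succeeds with probability 2^-d, so you need about 2^d hashes on average. A quick back-of-the-envelope in Go:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Expected attempts for a d-leading-zero-bit target is about 2^d.
	for _, d := range []int{5, 6, 14, 15} {
		fmt.Printf("difficulty %2d bits -> ~%6.0f hashes on average\n",
			d, math.Pow(2, float64(d)))
	}
	// Prints roughly 32, 64, 16384, and 32768 - which lines up with the
	// "fewer than 50" and "~20k" figures above.
}
```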
-
Exactly. It's called proof-of-work and was originally invented to reduce email spam, but was later used by Bitcoin to control how quickly new blocks are mined.
It wasn't made for Bitcoin originally? Didn't know that!
-
Yes, it would make Lemmy as unsearchable as Discord, instead of as unsearchable as Pinterest.
That's not true; search indexer bots should be allowed through, from what I read here.
-
Ooh can this work with Lemmy without affecting federation?
To be honest, I need to ask my admin about that!
-
All this could be avoided by making people submit photo ID to log in to an account.
That's awful; it means I would get my photo ID stolen hundreds of times per day, or there's also thisfacedoesntexists... and it won't work, for many reasons. Not all websites require an account. And even those that do, when they ask for "personal verification" (like dating apps), have a hard time implementing just that. Most "serious" cases use human review of the photo plus a video that has your face as you move in and out of an oval shape...
-
That's awful; it means I would get my photo ID stolen hundreds of times per day, or there's also thisfacedoesntexists... and it won't work, for many reasons. Not all websites require an account. And even those that do, when they ask for "personal verification" (like dating apps), have a hard time implementing just that. Most "serious" cases use human review of the photo plus a video that has your face as you move in and out of an oval shape...
Also you must drink a verification can!