The Open-Source Software Saving the Internet From AI Bot Scrapers
-
This would not be a problem if one bot scraped once and the result was then mirrored to everyone on Big Tech's dime (Cloudflare, Tailscale). But since they are all competing now, each thinks its edge is going to be its own, better scraper setup, and they won't share.
Maybe there should just be a web-to-torrent bridge, so the data is pushed out once by the server and the swarm does the heavy lifting as a cache.
No, it'd still be a problem; every diff between commits is expensive to render to web, even if "only one company" is scraping it, "only one time". Many of these applications are designed for humans, not scrapers.
-
A JavaScript-less check was released recently; I just read about it. It uses a meta refresh HTML tag and a delay. It's not the default though, since it's new.
The source I assume: challenges/metarefresh.
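The idea behind a meta-refresh challenge is that a plain `<meta http-equiv="refresh">` tag plus a signed token forces the client to actually wait and follow a redirect, with no JavaScript required. A minimal sketch of how such a challenge page could be generated (the secret, token format, and delay are illustrative assumptions, not Anubis's actual implementation):

```python
import hashlib
import hmac
import time

SECRET = b"demo-secret"  # hypothetical; a real deployment would use a random per-instance key

def make_token(client_ip: str, now: float) -> str:
    # Bind the token to the client and an issue time so it can't be replayed later.
    msg = f"{client_ip}:{int(now)}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()

def challenge_page(client_ip: str, delay: int = 3) -> str:
    now = time.time()
    token = make_token(client_ip, now)
    # No JavaScript needed: the browser waits `delay` seconds, then follows the refresh URL,
    # which carries the token back to the server for verification.
    return (
        "<!doctype html><html><head>"
        f'<meta http-equiv="refresh" content="{delay};url=/?t={token}&ts={int(now)}">'
        "</head><body>Checking your browser...</body></html>"
    )
```

Simple bots that fetch a page once and never follow timed refreshes fail this check, while any real browser passes it without running a single line of script.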
-
Probably because the creator had a blog post that got shared around at a point in time where this exact problem was resonating with users.
It's not always about being first but about marketing.
It’s not always about being first but about marketing.
And one has a cute catgirl mascot, the other a website that looks like a blockchain techbro startup.
I'm even willing to bet the number of people who set up Anubis just to get the cute splash screen isn't insignificant.
-
Anubis is basically a Bitcoin miner with the difficulty turned way down (and obviously not producing any coins), so it's inherently random. If it takes minutes, something does seem wrong though. Maybe a network error?
Adding to this, some sites set the difficulty way higher than others. nerdvpn's Invidious and Redlib instances take about 5 seconds and some ~20k hashes, while privacyredirect's instances are almost instant with fewer than 50 hashes each time.
-
Exactly. It's called proof-of-work and was originally invented to reduce spam email, but was later used by Bitcoin to control the rate at which new blocks are created.
it wasn't made for bitcoin originally? didn't know that!
-
Yes, it would make Lemmy as unsearchable as Discord, instead of as unsearchable as Pinterest.
That's not true, search indexer bots should be allowed through from what I read here.
-
Ooh can this work with Lemmy without affecting federation?
To be honest, I need to ask my admin about that!
-
All this could be avoided by requiring a photo ID to log into an account.
That's awful: it means I would get my photo ID stolen hundreds of times per day. There's also thisfacedoesntexists... It won't work, for many reasons. Not all websites require an account, and even those that do have a hard time implementing just that kind of "personal verification" (like dating apps do). Most "serious" cases use human review of the photo plus a video where you show your face and move in and out of an oval shape...
-
That's awful: it means I would get my photo ID stolen hundreds of times per day. There's also thisfacedoesntexists... It won't work, for many reasons. Not all websites require an account, and even those that do have a hard time implementing just that kind of "personal verification" (like dating apps do). Most "serious" cases use human review of the photo plus a video where you show your face and move in and out of an oval shape...
Also you must drink a verification can !
-
That's not true, search indexer bots should be allowed through from what I read here.
If you allow my SearXNG search scraper through, then an AI scraper is indistinguishable from it.
If you mean "Google and DuckDuckGo are whitelisted", then Lemmy will only be searchable there, on those specific whitelisted hosts. And Google's search index is also an AI scraper bot.
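User-agent strings are trivially forged, which is exactly the problem raised here. One way gatekeepers distinguish a real search indexer from an AI scraper claiming to be one is the reverse-then-forward DNS check that Google documents for Googlebot. A sketch with the resolvers injected as functions so it can be illustrated offline (the function name and structure are my own, not from any particular tool):

```python
from typing import Callable, List

def is_verified_googlebot(
    ip: str,
    reverse_dns: Callable[[str], str],
    forward_dns: Callable[[str], List[str]],
) -> bool:
    """Verify a claimed Googlebot the way Google documents it:
    1. Reverse-resolve the IP and require a googlebot.com / google.com hostname.
    2. Forward-resolve that hostname and require it to map back to the same IP.
    Anyone can put "Googlebot" in a User-Agent header, but only Google
    controls the PTR records for its crawler IP ranges.
    """
    host = reverse_dns(ip)
    if not (host.endswith(".googlebot.com") or host.endswith(".google.com")):
        return False
    return ip in forward_dns(host)
```

In a real deployment the two resolver arguments would wrap actual PTR and A-record lookups; the same pattern works for Bingbot and other crawlers that publish verification domains. A self-hosted SearXNG instance has no such verifiable identity, which is why allowlists end up naming specific big players.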
-
No, it'd still be a problem; every diff between commits is expensive to render to web, even if "only one company" is scraping it, "only one time". Many of these applications are designed for humans, not scrapers.
If rendering data for scrapers were really the problem, then the solution is simple: just offer downloadable dumps of the publicly available information.
That would be extremely efficient and cost fractions of pennies in monthly bandwidth, and the data would be far more usable for whatever they are using it for.
The problem is trying to have freely available data while the host maintains the ability to leverage that data later.
I don't think we can have both of these.
-
it wasn't made for bitcoin originally? didn't know that!
Originally called hashcash: http://hashcash.org/
-
Originally called hashcash: http://hashcash.org/
You know it's old when it doesn't have SSL.
-
Support, pay, and get it
Ah, so it is possible to change it.
-
"You criticize society yet you participate in it. Curious."
You can't freely download and edit society. You can download and edit this piece of software, because it's FOSS. You could download it now and change it, or improve it however you'd like. But you won't, because you're just pretending to be concerned about issues that are made up. Or, being generous from what I can read here, issues that only you have encountered.
-
Non paywalled link https://archive.is/VcoE1
It basically boils down to making the browser do some CPU-heavy calculations before allowing access. This is no problem for a single user, but for a bot farm it would increase the amount of compute power they need by 100x or more.
It also inherently blocks a lot of the simpler bots by requiring JavaScript.
-
Adding to this, some sites set the difficulty way higher than others. nerdvpn's Invidious and Redlib instances take about 5 seconds and some ~20k hashes, while privacyredirect's instances are almost instant with fewer than 50 hashes each time.
So they make the internet worse for poor people? I could get through 20k in a second, but someone with just an old laptop would take a few minutes, no?
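The hardware gap can be put in rough numbers. Assuming the difficulty counts leading zero hex nibbles of a hash (a common scheme for challenges like this; the exact parameters vary per site), the expected work and wait time fall out of simple arithmetic:

```python
def expected_hashes(leading_zero_nibbles: int) -> int:
    # Each hex nibble has a 1/16 chance of being zero, so on average
    # you try 16**n hashes before one starts with n zero nibbles.
    return 16 ** leading_zero_nibbles

def expected_seconds(leading_zero_nibbles: int, hashes_per_second: float) -> float:
    # Average wait for a client that can compute `hashes_per_second` hashes.
    return expected_hashes(leading_zero_nibbles) / hashes_per_second

# Illustrative hash rates (assumed, not measured): at difficulty 4,
# expected_hashes(4) == 65536, so a fast desktop at ~1M H/s averages
# well under a tenth of a second, while an old laptop at ~50k H/s
# averages a bit over a second -- slower, but not minutes.
```

The ~20k-hash solves mentioned above are in the right ballpark for a difficulty of 4 zero nibbles, since individual solves scatter widely around the 65536-hash average.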
-
To be honest, I need to ask my admin about that!
We don't use Anubis, but we use iocaine (?); see /0 for the announcement post.
-
So they make the internet worse for poor people? I could get through 20k in a second, but someone with just an old laptop would take a few minutes, no?
Well, it's the scrapers that are causing the problem.
-
It’s not always about being first but about marketing.
And one has a cute catgirl mascot, the other a website that looks like a blockchain techbro startup.
I'm even willing to bet the number of people who set up Anubis just to get the cute splash screen isn't insignificant.
Compare and contrast.
High-performance traffic management and next-gen security with multi-cloud management and observability. Built for the enterprise — open source at heart.
Sounds like some overpriced, vacuous, do-everything solution. Looks and sounds like every other tech website. Looks like it is meant to appeal to the people who still say "cyber". Looks and sounds like fauxpen source.
Weigh the soul of incoming HTTP requests to protect your website!
Cute. Adorable. Baby girl. Protect my website. Looks fun. Has one clear goal.