AI haters build tarpits to trap and trick AI scrapers that ignore robots.txt
-
The internet being what it is, I'd be more surprised if there wasn't already a website set up somewhere with a malicious robots.txt file to screw over ANY crawler regardless of provenance.
-
[email protected] replied to [email protected]
It's not that we "hate them" - it's that they can entirely overwhelm a low-volume site and cause a DDoS.
I ran a few very-low-traffic websites for local interests on a rural, residential line. It wasn't fast, but it was cheap, and as these sites made no money it was good enough. Before AI they'd get the odd badly behaved scraper that ignored robots.txt, and specifically its rate limits.
But since? I've had to spend a lot of time trying to filter them out upstream. Like, hours and hours. ClaudeBot was the first - coming from hundreds of AWS IPs and dozens of countries, hitting the sites thousands of times an hour, repeatedly trying to download the same URLs - some that didn't even exist. Since then it's happened a lot. Some of these tools are just so ridiculously stupid, far more so than a dumb script that cycles through a list. But because it's AI and they're desperate to satisfy the "need" for it, they're quite happy to spend millions on AWS costs for negligible gain and screw up other people.
Eventually I gave up and redesigned the sites to be static and they're now on cloudflare pages. Arguably better, but a chunk of my life I'd rather not have lost.
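A minimal sketch of that kind of upstream filtering, assuming a Flask app and matching on the User-Agent strings these crawlers send (ClaudeBot, GPTBot, CCBot); the blocklist here is illustrative rather than exhaustive, and it only helps against bots that identify themselves:

```python
# Minimal sketch of upstream filtering by User-Agent, assuming a Flask app.
# The substrings below are illustrative, not a complete or vetted blocklist.
from flask import Flask, request, abort

app = Flask(__name__)

# Heavy crawlers that self-identify in the User-Agent header.
BLOCKED_UA_SUBSTRINGS = ("ClaudeBot", "GPTBot", "CCBot")

@app.before_request
def reject_known_scrapers():
    ua = request.headers.get("User-Agent", "")
    if any(s in ua for s in BLOCKED_UA_SUBSTRINGS):
        abort(403)  # refuse before doing any real work
```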
-
[email protected] replied to [email protected]
You missed out the important bit.
You need to make sure you film yourself doing this and then post it on social media to an account linked to your real identity.
-
[email protected] replied to [email protected]
You can limit the visits to a domain. The honeypot doesn't register infinite new domains.
-
[email protected] replied to [email protected]
It doesn't have to memorize all possible GUIDs; it just has to limit visits per base URL.
-
[email protected] replied to [email protected]
Notice how it's "AI haters" and not "people trying to protect their IP", as it would be if it were, say, China instead of AI companies stealing the IP.
-
[email protected] replied to [email protected]
What part of "they do not repeat" do you still not get? You can put them in a list, but you won't ever get a hit, so it'd just be wasting memory.
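To make the "they do not repeat" point concrete, here is a minimal sketch of how a tarpit can mint links, assuming a Flask app; the /maze path and token scheme are illustrative. Every page links only to freshly generated random paths, so a crawler's visited-URL list never gets a hit:

```python
# Minimal tarpit sketch, assuming Flask; the /maze path and token scheme
# are illustrative. Every page links only to brand-new random paths,
# so a crawler's visited-URL list never gets a hit.
import secrets
from flask import Flask

app = Flask(__name__)

def fresh_link() -> str:
    # 128 bits of randomness: collisions are practically impossible,
    # which is why remembering old URLs buys the crawler nothing.
    return f"/maze/{secrets.token_hex(16)}"

@app.route("/maze/<token>")
def maze(token: str):
    links = " ".join(f'<a href="{fresh_link()}">more</a>' for _ in range(10))
    return f"<html><body><p>nothing to see here</p>{links}</body></html>"
```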
-
[email protected] replied to [email protected]
"You need to make sure you film yourself doing this and then post it on ~~social media~~ Truth Social to an account linked to your real identity."
Only room-temperature nutjobs like those would act like you described.
-
[email protected] replied to [email protected]
So instead of the AI wasting your resources and money by ignoring your robots.txt, you're going to waste your own resources and money by inviting them to increase their load on your server, but make it permanent and nonstop. Brilliant. Hey, even better, you should host your site on something that charges you based on usage; that'll really show the AI makers who's boss.
-
[email protected] replied to [email protected]
There are different kinds of AI scraper defenses.
This one is an active strategy. No shit, people know that this costs them resources. The point is that they want to punish the owners of badly behaved scrapers.
There is also another kind, which just blocks anything that tries to follow an invisible link to a resource forbidden by robots.txt.
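A minimal sketch of that second, passive kind, assuming a Flask app; the /trap path and the in-memory ban set are illustrative. robots.txt forbids a path no human would ever reach, an invisible link points at it, and any client that follows it gets its IP blocked from then on:

```python
# Passive honeypot sketch, assuming Flask; the /trap path is illustrative.
from flask import Flask, request, abort

app = Flask(__name__)
BANNED_IPS = set()  # in-memory for the sketch; persist this in real use

@app.route("/robots.txt")
def robots():
    # Well-behaved crawlers read this and never touch /trap/.
    return "User-agent: *\nDisallow: /trap/\n", 200, {"Content-Type": "text/plain"}

@app.before_request
def block_banned():
    if request.remote_addr in BANNED_IPS:
        abort(403)

@app.route("/")
def index():
    # Invisible to humans, tempting to scrapers that ignore robots.txt.
    return '<html><body>Hello! <a href="/trap/secret" style="display:none">.</a></body></html>'

@app.route("/trap/<path:rest>")
def trap(rest):
    BANNED_IPS.add(request.remote_addr)  # only rule-breakers ever land here
    abort(403)
```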
-
[email protected] replied to [email protected]
The point is that they are being punished too and will hopefully stop ignoring robots.txt.
-
[email protected] replied to [email protected]
Piping ChatGPT output into an AI scraper hitting your site uses very few server resources.
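A rough sketch of that idea, assuming a Flask app; to sidestep the objection raised a few replies down (actually calling ChatGPT costs money), this version streams locally generated filler instead, which makes the same point: a trickled response costs the server almost nothing.

```python
# Sketch: stream cheap, locally generated filler to a scraper, assuming Flask.
# Swapping in a real ChatGPT call would work the same way, but (as a reply
# below points out) you'd then be paying for every token you serve.
import random
import time
from flask import Flask, Response

app = Flask(__name__)
WORDS = "lorem ipsum dolor sit amet consectetur adipiscing elit".split()

def filler():
    while True:
        yield " ".join(random.choices(WORDS, k=12)) + ".\n"
        time.sleep(1)  # trickle it out: negligible CPU, but it ties up the client

@app.route("/feed")
def feed():
    return Response(filler(), mimetype="text/plain")
```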
-
[email protected] replied to [email protected]
They follow robots.txt
-
[email protected] replied to [email protected]
The poisoned images work very well. We just haven't hit the problem yet, because a) not many people are poisoning their images yet and b) training data sets were cut off in 2021, before poison pills were created.
But the easy way to get around this is to respect web standards, like robots.txt.
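For what it's worth, respecting robots.txt takes only a few lines on the crawler side; Python ships a parser for it in the standard library. A minimal sketch, with a placeholder bot name and URLs:

```python
# Sketch: check robots.txt before fetching, using only the standard library.
# "ExampleBot" and the URLs are placeholders.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

url = "https://example.com/some/page"
if rp.can_fetch("ExampleBot", url):
    print("allowed to fetch", url)
    delay = rp.crawl_delay("ExampleBot")  # honor Crawl-delay if present
    print("crawl delay:", delay)
else:
    print("robots.txt forbids", url)
```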
-
[email protected] replied to [email protected]
You don’t want to NOT show up in search queries, right?
At this point?
I am fully ok NOT being in search engines for any of my sites. Organic traffic has always been much more valuable than inorganic traffic.
-
[email protected] replied to [email protected]
If you're piping ChatGPT into AI scrapers, you're paying ChatGPT for the privilege. So to defeat the AI... you're joining the AI. It all sounds like the plot of a bad sci-fi movie.
-
[email protected] replied to [email protected]
If so, it appears to the search engine crawler that you have lots of content to be indexed, so it probably would move your page ranking up.
-
[email protected] replied to [email protected]
Any good web crawler has limits.
Yeah. Like, literally just:
- Keep track of which URLs you've been to
- Avoid going back to the same URL
- Set a soft limit; once you've hit it, start comparing the contents of each page with the previous ones (to avoid things like dynamic URLs taking you to the same content)
- Set a hard limit; once you hit it, leave the domain altogether
What kind of lazy-ass crawler doesn't even do that? (A rough sketch of those limits is below.)
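A minimal sketch of the limits listed above; the names and thresholds are illustrative, and a real crawler would persist this state across workers, but the logic is just a visited set, a per-domain counter, and a content-hash check past the soft limit:

```python
# Minimal sketch of the limits listed above; names and thresholds are illustrative.
import hashlib
from urllib.parse import urlparse

SOFT_LIMIT = 500    # past this, start checking for duplicate content
HARD_LIMIT = 5000   # past this, leave the domain altogether

visited_urls = set()
seen_hashes = set()
pages_per_domain = {}  # domain -> number of pages fetched

def should_fetch(url: str) -> bool:
    """Decide whether to fetch a URL at all."""
    if url in visited_urls:
        return False  # never go back to the same URL
    domain = urlparse(url).netloc
    if pages_per_domain.get(domain, 0) >= HARD_LIMIT:
        return False  # hard limit hit: abandon the domain
    return True

def record_page(url: str, body: bytes) -> bool:
    """Record a fetched page; return False if its links should not be followed
    (duplicate content seen after the soft limit, e.g. dynamic URLs)."""
    domain = urlparse(url).netloc
    visited_urls.add(url)
    count = pages_per_domain.get(domain, 0) + 1
    pages_per_domain[domain] = count
    digest = hashlib.sha256(body).hexdigest()
    if count >= SOFT_LIMIT and digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True
```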
-
[email protected] replied to [email protected]
One or two people using this isn't going to punish anything, or make enough of a difference to poison the AI. That's the same phrase all these anti-AI projects for sites and images use, and they forget that, like a vaccine, you have to have the majority of sites using your method in order for it to be effective. And the majority of sysadmins are not going to install what's basically ICE from Cyberpunk on a production server.
Once again, it's lofty claims from the anti-AI crowd, and once again it's much ado about nothing. But I'm sure that won't stop people from believing that they're making a difference by costing themselves money out of spite.
-
[email protected] replied to [email protected]
One or two sysadmins using this isn't going to be noticeable, and even if it were, the solution would be an inline edit to add a depth limit to links. It wouldn't even take thirty seconds to edit your algorithm and completely defeat this.
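For illustration, a depth limit really is only a few lines if the crawl frontier carries a depth value alongside each URL; the names here are hypothetical, not from any real crawler:

```python
# Sketch: depth-limited crawl frontier (names are illustrative).
from collections import deque
from typing import List

MAX_DEPTH = 10

def fetch_and_extract_links(url: str) -> List[str]:
    # Placeholder for a real fetch-and-parse step.
    return []

frontier = deque([("https://example.com/", 0)])  # (url, depth) pairs

while frontier:
    url, depth = frontier.popleft()
    for link in fetch_and_extract_links(url):
        if depth + 1 <= MAX_DEPTH:  # the "inline edit": cap how deep links are followed
            frontier.append((link, depth + 1))
```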
Not to mention, OpenAI or whatever company that got caught in one of these could sue the site. They might not win, but how many people running hobby sites who are stupid enough to do this are going to have thousands of dollars on hand to fight a lawsuit from a company worth billions with a whole team of lawyers? You gonna start a GoFundMe for them or something?