AI haters build tarpits to trap and trick AI scrapers that ignore robots.txt
-
It doesn't have to memorize all possible GUIDs; it just has to limit visits to base URLs.
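A minimal sketch of that idea, assuming a simple per-domain budget (the names and the limit are illustrative, not from any real crawler):

```python
# Cap visits per base URL instead of remembering every GUID-bearing link.
from collections import Counter
from urllib.parse import urlsplit

visits = Counter()
MAX_PER_SITE = 5_000  # assumption: an illustrative per-domain budget

def allow(url: str) -> bool:
    base = urlsplit(url).netloc  # e.g. "example.com"
    if visits[base] >= MAX_PER_SITE:
        return False  # budget spent; stop crawling this site
    visits[base] += 1
    return True
```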
-
Notice how it's "AI haters" and not "people trying to protect their IP" as it would be if it were say...China instead of AI companies stealing the IP.
-
What part of "they do not repeat" do you still not get? You can put them in a list, but you won't ever get a hit; it'd just be wasting memory.
-
You need to make sure you film yourself doing this and then post it on social media (Truth Social) to an account linked to your real identity. Only room-temperature nutjobs like those would act like you described.
-
So instead of the AI wasting your resources and money by ignoring your robots.txt, you're going to waste your own resources and money by inviting them to increase their load on your server, but make it permanent and nonstop. Brilliant. Hey, even better, you should host your site on something that charges you based on usage, that'll really show the AI makers who is boss.
-
There are different kinds of AI scraper defenses.
This one is an active strategy. No shit people know that this costs them resources; the point is that they want to punish the owners of badly behaved scrapers.
There is also another kind, which just blocks anything that tries to follow an invisible link to a resource forbidden by robots.txt.
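A minimal sketch of that second kind, assuming a Flask site where robots.txt forbids a trap path and the page hides a link to it (the paths and the in-memory ban set are illustrative):

```python
from flask import Flask, abort, request

app = Flask(__name__)
banned_ips = set()

@app.before_request
def check_ban():
    # Banned clients get 403 on every request from now on.
    if request.remote_addr in banned_ips:
        abort(403)

@app.route("/robots.txt")
def robots():
    return "User-agent: *\nDisallow: /trap/\n", 200, {"Content-Type": "text/plain"}

@app.route("/")
def index():
    # Invisible to people, but present in the raw HTML scrapers parse.
    return '<a href="/trap/honeypot" style="display:none"></a>Welcome!'

@app.route("/trap/<path:rest>")
def trap(rest):
    banned_ips.add(request.remote_addr)  # only robots.txt-ignorers get here
    abort(403)
```

A well-behaved bot reads robots.txt and never touches /trap/, so only clients that ignore it end up banned.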
-
It's that they are being punished too and will hopefully stop ignoring robots.txt.
-
Serving a pipe from ChatGPT into an AI scraping your site uses minimal server resources.
-
They follow robots.txt
-
The poisoned images work very well. We just haven't hit the problem yet, because a) not many people are poisoning their images yet and b) training data sets were cut off in 2021, before poison pills were created.
But the easy way to get around this is to respect web standards, like robots.txt.
-
You don’t want to NOT show up in search queries, right?
At this point?
I am fully ok NOT being in search engines for any of my sites. Organic traffic has always been much more valuable than inorganic traffic.
-
If you're piping ChatGPT into AI scrapers, you're paying ChatGPT for the privilege. So to defeat the AI... you're joining the AI. It all sounds like the plot of a bad sci-fi movie.
-
If so, it appears to the search engine crawler that you have lots of content to be indexed, so it probably would move your page ranking up.
-
Any good web crawler has limits.
Yeah. Like, literally just:
- Keep track of which URLs you've been to
- Avoid going back to the same URL
- Set a soft limit; once you've hit it, start comparing the contents of the page with the previous one (to avoid things like dynamic URLs taking you to the same content)
- Set a hard limit; once you hit it, leave the domain altogether
What kind of lazy-ass crawler doesn't even do that? Something like the sketch below covers it.
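A minimal sketch of those four rules, with illustrative limits (nothing here is from any real crawler):

```python
import hashlib

class DomainLimits:
    SOFT_LIMIT = 1_000   # past this, start content-comparing pages
    HARD_LIMIT = 10_000  # past this, leave the domain altogether

    def __init__(self):
        self.visited = set()         # URLs already fetched (rules 1 and 2)
        self.content_hashes = set()  # bodies already seen (rule 3)
        self.fetched = 0

    def should_fetch(self, url: str) -> bool:
        # Never revisit a URL, and stop entirely at the hard limit.
        return url not in self.visited and self.fetched < self.HARD_LIMIT

    def record(self, url: str, body: bytes) -> bool:
        """Track a fetch; return False once pages start repeating."""
        self.visited.add(url)
        self.fetched += 1
        if self.fetched < self.SOFT_LIMIT:
            return True
        digest = hashlib.sha256(body).hexdigest()
        if digest in self.content_hashes:
            return False  # a dynamic URL recycling the same content
        self.content_hashes.add(digest)
        return True
```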
-
One or two people using this isn't going to punish anything, or make enough of a difference to poison the AI. That's the same phrase all these anti-AI projects for sites and images use, and they forget that, like a vaccine, you have to have the majority of sites using your method in order for it to be effective. And the majority of sysadmins are not going to install what's basically ICE from Cyberpunk on a production server.
Once again, it's lofty claims from the anti-AI crowd, and once again it's much ado about nothing. But I'm sure that won't stop people from believing that they're making a difference by costing themselves money out of spite.
-
One or two sysadmins using this isn't going to be noticeable, and even if it were, the solution would be an inline edit to add a depth limit to links. The fix wouldn't even take thirty seconds, and it would completely defeat this.
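A hedged sketch of how small that edit is, assuming a typical queue-based crawler (the cutoff and the caller-supplied fetch/extract_links helpers are illustrative):

```python
from collections import deque

MAX_DEPTH = 10  # illustrative cutoff

def crawl(seed_url, fetch, extract_links):
    # Carry a depth counter with each queued URL; that's the whole edit.
    queue = deque([(seed_url, 0)])
    seen = {seed_url}
    while queue:
        url, depth = queue.popleft()
        body = fetch(url)
        if depth >= MAX_DEPTH:
            continue  # deep enough; stop following links from this page
        for link in extract_links(body):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
```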
Not to mention, OpenAI or whatever company that got caught in one of these could sue the site. They might not win, but how many people running hobby sites who are stupid enough to do this are going to have thousands of dollars on hand to fight a lawsuit from a company worth billions with a whole team of lawyers? You gonna start a GoFundMe for them or something?
-
Nah, you just scrape ChatGPT.
I don't pay right now for their chat app, so I'd just integrate with that.
Not very hard to do, tbh, with curl or a library like libcurl.
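A hedged sketch of that pipe, assuming a hypothetical local text generator at GENERATOR_URL (the free chat app has no stable public interface, so treat this as illustrative only):

```python
from flask import Flask, Response
import requests

app = Flask(__name__)
GENERATOR_URL = "http://localhost:8080/generate"  # hypothetical endpoint

@app.route("/<path:anything>")
def tarpit(anything):
    # Stream the generator's output straight through to the scraper.
    upstream = requests.get(GENERATOR_URL, stream=True, timeout=30)
    return Response(upstream.iter_content(chunk_size=1024),
                    content_type="text/html")
```

Because the response is passed through chunk by chunk, almost nothing is buffered on the tarpit's side.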
-
Your definition of organic traffic is nonstandard. When people say organic, they generally mean non-paid, including results from web search.
The VAST majority of the web would have almost no traffic without web searches. It's not like people flock to sites from talking about them around the water cooler.
-
"Web Scrapers: Many web scrapers and bots do not respect robots.txt at all, as they are often designed to extract data regardless of the site's crawling policies. This can include malicious bots or those used for data mining."
-
It's already permanent and nonstop. They're known to ignore robots.txt, and to change their user agent when detected.
And the goal is not only to prevent resource abuse, but to break a predatory model.
But feel free to continue gracefully doing nothing while others take action; it's bound to help eventually.