AI haters build tarpits to trap and trick AI scrapers that ignore robots.txt
-
[email protected]replied to [email protected] last edited by
You need to make sure you film yourself doing this and then post it on ~~social media~~ Truth Social to an account linked to your real identity. Only room-temperature nutjobs like those would act like you described.
-
[email protected]replied to [email protected] last edited by
So instead of the AI wasting your resources and money by ignoring your robots.txt, you're going to waste your own resources and money by inviting them to increase their load on your server, but make it permanent and nonstop. Brilliant. Hey, even better, you should host your site on something that charges you based on usage, that'll really show the AI makers who is boss.
-
[email protected]replied to [email protected] last edited by
There are different kinds of AI scraper defenses.
This one is an active strategy. No shit people know that this costs them resources. The point is that they want to punish the owners of badly behaved scrapers.
There is also another kind that just blocks anything that tries to follow an invisible link leading to a resource forbidden by robots.txt.
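That passive defense fits in a few lines. Here's a rough sketch; the `/trap/` path, the ban set, and the status codes are all illustrative, not any particular tool's implementation:

```python
# Honeypot sketch: robots.txt disallows /trap/, and the page hides a link to
# it (e.g. via CSS), so only a scraper that ignores both robots.txt and
# visibility ever requests it. Anything that does gets banned.
banned_ips = set()

def handle_request(ip: str, path: str) -> int:
    if ip in banned_ips:
        return 403                  # already caught following the trap link
    if path.startswith("/trap/"):
        banned_ips.add(ip)          # fell for the invisible link: ban it
        return 403
    return 200                      # normal visitor
```

Unlike a tarpit, this costs the server almost nothing: one string comparison per request, and the offending bot is gone after its first misstep.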
-
[email protected]replied to [email protected] last edited by
The point is that they are being punished too and will hopefully stop ignoring robots.txt.
-
[email protected]replied to [email protected] last edited by
Serving a pipe from ChatGPT into an AI scraping your site uses very few server resources.
-
[email protected]replied to [email protected] last edited by
They follow robots.txt
-
[email protected]replied to [email protected] last edited by
The poisoned images work very well. We just haven't hit the problem yet, because a) not many people are poisoning their images yet and b) training data sets were cut off at 2021, before poison pills were created.
But the easy way to get around this is to respect web standards, like robots.txt.
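Honoring robots.txt takes only a few lines with Python's standard library; the `/tarpit/` path and bot name below are just for illustration:

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that fences off a hypothetical tarpit directory.
robots_txt = """\
User-agent: *
Disallow: /tarpit/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A well-behaved crawler checks before every fetch:
rp.can_fetch("MyBot", "https://example.com/tarpit/page1")  # disallowed
rp.can_fetch("MyBot", "https://example.com/blog/post")     # allowed
```

Any crawler that makes this one check never enters a tarpit that's declared in robots.txt in the first place.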
-
[email protected]replied to [email protected] last edited by
You don’t want to NOT show up in search queries right?
At this point?
I am fully ok NOT being in search engines for any of my sites. Organic traffic has always been much more valuable than inorganic traffic.
-
[email protected]replied to [email protected] last edited by
If you're piping ChatGPT into AI scrapers, you're paying ChatGPT for the privilege. So to defeat the AI... you're joining the AI. It all sounds like the plot of a bad sci-fi movie.
-
[email protected]replied to [email protected] last edited by
If so, it appears to the search engine crawler that you have lots of content to be indexed, so it probably would move your page ranking up.
-
[email protected]replied to [email protected] last edited by
Any good web crawler has limits.
Yeah. Like, literally just:
- Keep track of which URLs you've been to
- Avoid going back to the same URL
- Set a soft limit, once you've hit it, start comparing the contents of the page with the previous one (to avoid things like dynamic URLs taking you to the same content)
- Set a hard limit, once you hit it, leave the domain altogether
What kind of lazy ass crawler doesn't even do that?
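The four rules above can be sketched roughly like this. It's a toy single-domain crawler; the function names, limits, and fingerprinting choice (hashing whole pages) are made up for illustration:

```python
import hashlib
from urllib.parse import urlparse

def crawl(start_url, fetch, extract_links, soft_limit=100, hard_limit=1000):
    # Toy single-domain crawler implementing the four safeguards above.
    domain = urlparse(start_url).netloc
    visited = set()               # 1. remember every URL we've been to
    seen_content = set()          # content fingerprints for duplicate detection
    queue = [start_url]
    pages = []
    while queue and len(visited) < hard_limit:   # 4. hard limit: leave the domain
        url = queue.pop(0)
        if url in visited:                       # 2. never revisit a URL
            continue
        visited.add(url)
        body = fetch(url)
        digest = hashlib.sha256(body.encode()).hexdigest()
        if len(visited) > soft_limit and digest in seen_content:
            continue                             # 3. soft limit: dynamic URL, same content
        seen_content.add(digest)
        pages.append((url, body))
        for link in extract_links(body):
            if urlparse(link).netloc == domain:  # stay on this domain
                queue.append(link)
    return pages
```

With the hard limit in place, even a tarpit that generates fresh links forever only ever costs the crawler a bounded number of fetches.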
-
[email protected]replied to [email protected] last edited by
One or two people using this isn't going to punish anything, or make enough of a difference to poison the AI. That's the same phrase all these anti-AI projects for sites and images use, and they forget that, like a vaccine, you have to have the majority of sites using your method in order for it to be effective. And the majority of sysadmins are not going to install what's basically ICE from Cyberpunk on a production server.
Once again, it's lofty claims from the anti-AI crowd, and once again it's much ado about nothing. But I'm sure that won't stop people from believing that they're making a difference by costing themselves money out of spite.
-
[email protected]replied to [email protected] last edited by
One or two sysadmins using this isn't going to be noticeable, and even if it were, the solution would be an inline edit adding a depth limit to link-following. It wouldn't even take thirty seconds to edit your algorithm and completely defeat this.
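That inline edit could look something like this sketch of a breadth-first link-follower with a depth cap (the names and the default cap are illustrative):

```python
from collections import deque

def crawl_with_depth(start_url, fetch, extract_links, max_depth=5):
    # Refuse to follow links more than max_depth hops from the start page.
    # A tarpit that only generates ever-deeper links is cut off automatically.
    visited = set()
    pages = []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        pages.append(url)
        for link in extract_links(fetch(url)):
            queue.append((link, depth + 1))
    return pages
```
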
Not to mention, OpenAI or whatever company that got caught in one of these could sue the site. They might not win, but how many people running hobby sites who are stupid enough to do this are going to have thousands of dollars on hand to fight a lawsuit from a company worth billions with a whole team of lawyers? You gonna start a GoFundMe for them or something?
-
[email protected]replied to [email protected] last edited by
Nah, you just scrape chatgpt.
I don't pay right now for their chat app, so I'd just integrate with that.
Not very hard to do, tbh, with curl or a library like libcurl.
-
[email protected]replied to [email protected] last edited by
Your definition of organic traffic is off-standard. When people say organic, they generally mean non-paid, including returns on web search.
The VAST majority of the web would have almost no traffic without web searches. It's not like people flock to sites from talking about it around the water cooler.
-
[email protected]replied to [email protected] last edited by
"Web Scrapers: Many web scrapers and bots do not respect robots.txt at all, as they are often designed to extract data regardless of the site's crawling policies. This can include malicious bots or those used for data mining."
-
[email protected]replied to [email protected] last edited by
It's already permanent and nonstop. They're known to ignore robots.txt and to drop their user agent string when detected.
And the goal is not only to prevent resource abuse, but break a predatory model.
But feel free to continue gracefully doing nothing while others take action; it's bound to help eventually.
-
[email protected]replied to [email protected] last edited by
room-temperature
You need to add “IQ”
-
[email protected]replied to [email protected] last edited by
The way I understand it, the hard limit to leave the domain is actually the only one of these rules that would trigger on Nepenthes. The tar pit keeps generating new linked pages full of trash.
-
[email protected]replied to [email protected] last edited by
Your definition of organic traffic is off-standard.
Fair.
The VAST majority of the web would have almost no traffic without web searches. It’s not like people flock to sites from talking about it around the water cooler.
Which is a shame, tbh. We had far better content when people had to work to create good content that others wanted and that got passed around.
ie, in school, before search engines, we all knew about Whitehouse.com... We all knew the sites that had the info we wanted/needed at the time.
In fact, I'd argue the downfall of the web as an actual useful tool came about once search engines automatically started indexing, rather than submitting site maps to a page like OpenDirectory to have your site cataloged, indexed, and sorted into appropriate categories by a human.
Because once people started working on "gaming algos" rather than "Making super good content", the internet just became the new "Malls" where you weren't expected to learn, you were just expected to buy.