AI haters build tarpits to trap and trick AI scrapers that ignore robots.txt
-
[email protected]replied to [email protected] last edited by
If you're piping ChatGPT into AI scrapers, you're paying ChatGPT for the privilege. So to defeat the AI... you're joining the AI. It all sounds like the plot of a bad sci-fi movie.
-
[email protected]replied to [email protected] last edited by
If so, it appears to the search engine crawler that you have lots of content to be indexed, so it probably would move your page ranking up.
-
[email protected]replied to [email protected] last edited by
Any good web crawler has limits.
Yeah. Like, literally just:
- Keep track of which URLs you've been to
- Avoid going back to the same URL
- Set a soft limit, once you've hit it, start comparing the contents of the page with the previous one (to avoid things like dynamic URLs taking you to the same content)
- Set a hard limit, once you hit it, leave the domain altogether
What kind of lazy ass crawler doesn't even do that?
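For what it's worth, here's a rough sketch of those four rules in Python. The `fetch` callable, the limit values, and all the names are made up for illustration (it's not any real crawler's code), and it compares content hashes against every earlier page rather than only the previous one:

```python
import hashlib
from collections import deque
from urllib.parse import urlparse

def crawl(start_url, fetch, soft_limit=50, hard_limit=200):
    """fetch(url) -> (body_text, links) is a hypothetical callable you supply."""
    visited = set()       # rule 1: URLs already seen
    seen_hashes = set()   # content fingerprints for the soft-limit check
    per_domain = {}       # pages requested per domain
    abandoned = set()     # domains we gave up on
    queue = deque([start_url])
    kept = []
    while queue:
        url = queue.popleft()
        domain = urlparse(url).netloc
        if url in visited or domain in abandoned:
            continue      # rule 2: never go back to the same URL
        visited.add(url)
        count = per_domain.get(domain, 0) + 1
        per_domain[domain] = count
        if count > hard_limit:
            abandoned.add(domain)   # rule 4: leave the domain altogether
            continue
        body, links = fetch(url)
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if count > soft_limit and digest in seen_hashes:
            continue      # rule 3: same content under a new URL, don't follow it
        seen_hashes.add(digest)
        kept.append(url)
        queue.extend(links)
    return kept
```

Passing `fetch` in keeps the sketch free of network code, so you can try it against a fake site before pointing it at anything real.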
-
[email protected]replied to [email protected] last edited by
One or two people using this isn't going to punish anything, or make enough of a difference to poison the AI. That's the same phrase all these anti-AI projects for sites and images use, and they forget that, like a vaccine, you have to have the majority of sites using your method in order for it to be effective. And the majority of sysadmins are not going to install what's basically ICE from Cyberpunk on a production server.
Once again, it's lofty claims from the anti-AI crowd, and once again it's much ado about nothing. But I'm sure that won't stop people from believing that they're making a difference by costing themselves money out of spite.
-
[email protected]replied to [email protected] last edited by
One or two sysadmins using this isn't going to be noticeable, and even if it was, the solution would be an inline edit to add a depth limit to links. It wouldn't even take thirty seconds to edit your algorithm to completely defeat this.
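A minimal sketch of that kind of depth limit, assuming a breadth-first crawler; `get_links` and `max_depth` are hypothetical names, not taken from any real scraper:

```python
from collections import deque

def crawl_with_depth_limit(start_url, get_links, max_depth=5):
    """get_links(url) -> list of URLs is a hypothetical callable you supply."""
    visited = {start_url}
    queue = deque([(start_url, 0)])   # (url, number of hops from the start page)
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue                  # at the limit: record the page, follow nothing
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order
```

Note that this only caps how long a chain of links can get. If every page links to several new pages, the total number of pages under the limit still grows quickly with `max_depth`.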
Not to mention, OpenAI or whatever company that got caught in one of these could sue the site. They might not win, but how many people running hobby sites who are stupid enough to do this are going to have thousands of dollars on hand to fight a lawsuit from a company worth billions with a whole team of lawyers? You gonna start a GoFundMe for them or something?
-
[email protected]replied to [email protected] last edited by
Nah, you just scrape chatgpt.
I don't pay right now for their chat app, so I'd just integrate with that.
Not very hard to do, tbh, with curl or a library like libcurl.
-
[email protected]replied to [email protected] last edited by
Your definition of organic traffic is off-standard. When people say organic, they generally mean non-paid, including returns on web search.
The VAST majority of the web would have almost no traffic without web searches. It's not like people flock to sites from talking about it around the water cooler.
-
[email protected]replied to [email protected] last edited by
"Web Scrapers: Many web scrapers and bots do not respect robots.txt at all, as they are often designed to extract data regardless of the site's crawling policies. This can include malicious bots or those used for data mining."
-
[email protected]replied to [email protected] last edited by
It's already permanent and nonstop. They're known to ignore robots.txt, and to remove the user agent on detection.
And the goal is not only to prevent resource abuse, but to break a predatory model.
But feel free to continue gracefully doing nothing while others take action, it's bound to help eventually.
-
[email protected]replied to [email protected] last edited by
room-temperature
You need to add “IQ”
-
[email protected]replied to [email protected] last edited by
The way I understand it, the hard limit to leave the domain is actually the only one of these rules that would trigger on Nepenthes. The tar pit keeps generating new linked pages full of trash.
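Here's a toy simulation of that point. `tarpit_page` is a made-up stand-in that returns fresh junk and fresh links for every URL (it is not Nepenthes' actual code), and `probe` counts how often each crawler rule would fire:

```python
import hashlib
import random

def tarpit_page(url, n_links=3):
    """Toy stand-in for a tarpit: junk text plus new links, derived from the URL."""
    rng = random.Random(url)                       # seeded so runs are repeatable
    body = " ".join(str(rng.random()) for _ in range(20))
    links = [f"{url}/{i}" for i in range(n_links)]  # every link is a new URL
    return body, links

def probe(start_url, hard_limit=100):
    visited, hashes = set(), set()
    stack = [start_url]
    revisits = duplicates = 0
    while stack and len(visited) < hard_limit:
        url = stack.pop()
        if url in visited:
            revisits += 1          # the "already been here" rule would fire
            continue
        visited.add(url)
        body, links = tarpit_page(url)
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if digest in hashes:
            duplicates += 1        # the "same content" rule would fire
        hashes.add(digest)
        stack.extend(links)
    return {
        "fetched": len(visited),
        "revisits": revisits,
        "duplicates": duplicates,
        "hit_hard_limit": len(visited) >= hard_limit,
    }
```

Against this stand-in the revisit and duplicate counters should stay at zero, and the loop only ends because of `hard_limit`.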
-
[email protected]replied to [email protected] last edited by
Your definition of organic traffic is off-standard.
Fair.
The VAST majority of the web would have almost no traffic without web searches. It’s not like people flock to sites from talking about it around the water cooler.
Which is a shame, tbh. We had far better content, when people had to work to create good content, that others wanted, and got passed around.
ie, in school, before search engines, we all knew about Whitehouse.com... We all knew the sites that had the info we wanted/needed at the time.
In fact, I'd argue the downfall of the web as an actual useful tool came about once search engines automatically started indexing, rather than submitting site maps to a page like OpenDirectory to have your site cataloged, indexed, and sorted into appropriate categories by a human.
Because once people started working on "gaming algos" rather than "Making super good content", the internet just became the new "Malls" where you weren't expected to learn, you were just expected to buy.
-
[email protected]replied to [email protected] last edited by
Clearly more than one or two admins are interested in these options. I don’t know why you are assuming that’s the whole list of interested people. Not everyone is as eager as you to roll over and take it without protest.
-
[email protected]replied to [email protected] last edited by
Hey, you keep fighting the good fight, you’ve got them on the ropes! You and all your many, many friends!
-
[email protected]replied to [email protected] last edited by
Hey, you don’t need to convince me, you’ve clearly already committed to bravely sacrificing your own time and money in this valiant fight. Go get ‘em, tiger! I look forward to the articles about AI being stopped coming out any day now.
-
[email protected]replied to [email protected] last edited by
manual and builds are here: https://zadzmo.org/code/nepenthes/
-
[email protected]replied to [email protected] last edited by
what is your deal?
-
[email protected]replied to [email protected] last edited by
I liked it back when link aggregators were the go-to for discovery. You could have sites that were real gems that were just tucked away.
I think the indexing started out ok. Counting backlinks and using that as a ranking was pretty genius, right up until people realized they could game the system, then Google realized that artificially screwing with their own system was worth money, then they used ads to modify ranking.
Ads to modify discoverability: the death of the free internet.
-
[email protected]replied to [email protected] last edited by
The only AI company that responded to Ars' request to comment was OpenAI, whose spokesperson confirmed that OpenAI is already working on a way to fight tarpitting.
Ah yes. It's extremely common for one of the top companies in an industry to spitefully expend resources fighting the efforts of...
One or two people
Please, continue to grace us with your unbiased wisdom. Clearly you've read the article and aren't just trying to simp for AI or start flame wars like a petulant child.
-
[email protected]replied to [email protected] last edited by
Not like you can load balance requests for the malicious subdirectories to non-prod hardware. It can even be decommissioned hardware.