Cloudflare announces AI Labyrinth, which uses AI-generated content to confuse and waste the resources of AI Crawlers and bots that ignore “no crawl” directives.
-
Oh good.
Now I can add digital jihad by hallucinating AI to the list of my existential terrors.
Thank your relative for me.
Not if we go Butlerian Jihad on them first.
-
while allowing legitimate users and verified crawlers to browse normally.
What is a "verified crawler" though? What I worry about is, is it only big companies like Google that are allowed to have them now?
Cloudflare isn't the best at blocking things. As long as your crawler isn't horribly misconfigured, you shouldn't have many issues.
-
This post did not contain any content.
Joke's on them. I'm going to use AI to estimate the value of content, so now I'll get the kind of content I want, fake though it is, and they'll have to generate it.
-
That's what I do too, with less accuracy and knowledge. I don't get why I have to hate this. Feels like a bunch of cavemen telling me to hate fire because it might burn the food.
Because we have better methods that are easier, cheaper, and less damaging to the environment. They are solving nothing and wasting a fuckton of resources to do so.
It's like telling cavemen they don't need fire because you can mount an expedition to the nearest volcano to cook the food without the need for fuel, then bring it back to them.
The best-case scenario is that the LLM tells you information that is already available on the internet, but 50% of the time it just makes shit up.
-
I'm glad we're burning the forests even faster in the name of identity politics.
Well, that was a swing and a miss. Back to the dugout with you, dumbass.
-
I have no idea why the makers of LLM crawlers think it's a good idea to ignore bot rules. The rules are there for a reason and the reasons are often more complex than "well, we just don't want you to do that". They're usually more like "why would you even do that?"
Ultimately you have to trust what the site owners say. The reason why, say, your favourite search engine returns the relevant Wikipedia pages and not a bazillion random old page revisions from ages ago is that Wikipedia said "please crawl the most recent versions using canonical page names, and do not follow the links to the technical pages (including history)". Again: Why would anyone index those?
Because it takes work to obey the rules, and you get less data for it. A theoretical competitor could get more by ignoring them and gain some vague advantage from it.
I'd not be surprised if the crawlers they used were bare-basic utilities set up to just grab everything without worrying about rules and the like.
-
This post did not contain any content.
This is getting ridiculous. Can someone please ban AI? Or at least regulate it somehow?
-
Surprised at the level of negativity here. Having had my sites repeatedly DDoSed offline by Claudebot and others scraping the same damned thing over and over again, thousands of times a second, I welcome any measures to help.
thousands of times a second
Modify your Nginx (or whatever web server you use) config to rate limit requests to dynamic pages, and cache them. For Nginx, you'd use either fastcgi_cache or proxy_cache depending on how the site is configured. Even if the pages change a lot, a cache with a short TTL (say 1 minute) can still help reduce load quite a bit while not letting them get too outdated.
Static content (and cached content) shouldn't cause issues even if requested thousands of times per second. Following best practices like pre-compressing content using gzip, Brotli, and zstd helps a lot, too.
Of course, this advice is just for "unintentional" DDoS attacks, not intentionally malicious ones. Those are often much larger and need different protection - often some protection on the network or load balancer before it even hits the server.
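A minimal Nginx sketch of that kind of setup (the directives are real; the zone names, sizes, rates, and upstream address are placeholders to adapt):

```nginx
# Shared zones for per-IP rate limiting and a short-TTL cache (sizes/rates illustrative).
limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=dyncache:10m max_size=1g inactive=10m;

server {
    listen 80;
    server_name example.com;                      # placeholder

    location / {
        limit_req zone=perip burst=20 nodelay;    # throttle bursts from a single client
        proxy_cache dyncache;
        proxy_cache_valid 200 1m;                 # 1-minute TTL keeps dynamic pages fresh enough
        proxy_cache_use_stale updating error timeout;
        gzip on;                                  # on-the-fly compression; pre-compressed files would use gzip_static
        proxy_pass http://127.0.0.1:8080;         # placeholder upstream app server
    }
}
```

The same idea maps to fastcgi_cache and fastcgi_pass if the app sits behind PHP-FPM or similar instead of a proxied upstream.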
-
I have no idea why the makers of LLM crawlers think it's a good idea to ignore bot rules. The rules are there for a reason and the reasons are often more complex than "well, we just don't want you to do that". They're usually more like "why would you even do that?"
Ultimately you have to trust what the site owners say. The reason why, say, your favourite search engine returns the relevant Wikipedia pages and not a bazillion random old page revisions from ages ago is that Wikipedia said "please crawl the most recent versions using canonical page names, and do not follow the links to the technical pages (including history)". Again: Why would anyone index those?
They want everything. Does it exist, but it's not in their dataset? Then they want it.
They want their AI to answer any question you could possibly ask it. Filtering out what is and isn't useful doesn't achieve that.
-
Because we have better methods that are easier, cheaper, and less damaging to the environment. They are solving nothing and wasting a fuckton of resources to do so.
It's like telling cavemen they don't need fire because you can mount an expedition to the nearest volcano to cook the food without the need for fuel, then bring it back to them.
The best-case scenario is that the LLM tells you information that is already available on the internet, but 50% of the time it just makes shit up.
Wasteful?
Energy production is an issue. Using that energy isn't. LLMs are a better use of energy than most of the useless shit we produce every day.
-
Not if we go Butlerian Jihad on them first.
lol, I was gonna say a reverse Butlerian Jihad but I didn't think many people would get the reference.
-
We can't even handle humans going psycho. The last thing I want is an AI losing its shit from being overworked producing goblin tentacle porn and going full Skynet judgement day.
That is simply not how "AI" models today are structured, and that is entirely a fabrication based on science fiction related media.
An LLM is a series of matrix multiplication problems that the tokens from a query are run through. It does not have the capability to be overworked, to know if it's been used before (outside of its context window, which is itself just previously stored tokens added to the math problem), to change itself, or to arbitrarily access any system resources.
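A toy sketch of that point (made-up numpy model with arbitrary sizes, not any real LLM): the forward pass is fixed, stateless matrix math, and the only "memory" is whatever tokens the caller feeds back in.

```python
import numpy as np

# Fixed weights: nothing here changes or persists between calls.
rng = np.random.default_rng(0)
vocab, d_model = 100, 16
embed = rng.normal(size=(vocab, d_model))
w1 = rng.normal(size=(d_model, d_model))
w_out = rng.normal(size=(d_model, vocab))

def next_token_logits(context_tokens):
    """Run the context through fixed weights and score the next token."""
    x = embed[context_tokens]        # look up token vectors
    h = np.tanh(x @ w1)              # plain matrix math; the model cannot modify itself here
    return h.mean(axis=0) @ w_out    # pool and project back to vocabulary scores

prompt = [3, 14, 15]                 # arbitrary token ids (the "context window")
logits = next_token_logits(prompt)
print(int(np.argmax(logits)))        # generation = pick a token, append it to the context, repeat
```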
You must be fun at parties.
-
This is getting ridiculous. Can someone please ban AI? Or at least regulate it somehow?
The problem is, how? I can set it up on my own computer using open source models and some of my own code. It’s really rough to regulate that.
-
don't worry, information is still shared. but with people. not with capitalist pigs
Capitalist pigs are paying media to generate AI hatred to help them convince you people to get behind laws that will limit info sharing under the guise of IP and copyright.
-
This is getting ridiculous. Can someone please ban AI? Or at least regulate it somehow?
-
This is the great filter.
Why isn't there detectable life out there? They all do the same thing we're doing. Undone by greed.
-
The same way they justify cutting benefits for the disabled to balance budgets instead of putting taxes on the rich or just not giving them bailouts, they will justify cutting power to you before a data centre that's 10 corporate AIs all fighting each other, unless we as a people stand up and actually demand change.
In Texas 80% of our water usage is corporate. But when the lakes are low during a drought, they tell homeowners to cut back on watering the grass. Nobody tells the corporations to throw away less water.
AI will be allowed to use as much energy as it wants. It will even remind people to turn off the lights in a room not being occupied while wasting energy to monitor everyone’s power usage.
-
Vote Blue No Matter Who
Any Democrat is Better than Any Republican
-
This is why we need a centrist political party. Solutions shouldn't be a false dichotomy.
And we shouldn’t downvote people into oblivion. Take my charitable upvote.
That will require reform of campaign finance laws and progressive reform for elections, both of which are highly partisan issues.
-
Burning 29 acres of rainforest a day to do nothing
It certainly sounds like they generate the fake content once and serve it from cache every time: "Rather than creating this content on-demand (which could impact performance), we implemented a pre-generation pipeline that sanitizes the content to prevent any XSS vulnerabilities, and stores it in R2 for faster retrieval."