agnos.is Forums


Cloudflare announces AI Labyrinth, which uses AI-generated content to confuse and waste the resources of AI Crawlers and bots that ignore “no crawl” directives.

217 Posts 129 Posters 1.3k Views
  • L [email protected]

    while allowing legitimate users and verified crawlers to browse normally.

    What is a "verified crawler" though? What I worry about is, is it only big companies like Google that are allowed to have them now?

    F This user is from outside of this forum
    [email protected]
    wrote on last edited by
    #181

    Cloudflare isn't the best at blocking things. As long as your crawler isn't horribly misconfigured, you shouldn't have many issues.

    • tea@programming.devT [email protected]
      This post did not contain any content.
      X This user is from outside of this forum
      [email protected]
      wrote on last edited by
      #182

      Joke's on them. I'm going to use AI to estimate the value of content, and now I'll get the kind of content I want, though fake, that they will have to generate.

      • M [email protected]

        That's what I do too with less accuracy and knowledge. I don't get why I have to hate this. Feels like a bunch of cavemen telling me to hate fire because it might burn the food

        cilethesane@lemmy.caC This user is from outside of this forum
        [email protected]
        wrote on last edited by
        #183

        Because we have better methods that are easier, cheaper, and less damaging to the environment. They are solving nothing and wasting a fuckton of resources to do so.

        It's like telling cavemen they don't need fire because you can mount an expedition to the nearest volcano to cook food without the need for fuel, then bring it back to them.

        The best case scenario is the LLM tells you information that is already available on the internet, but 50% of the time it just makes shit up.

        • R [email protected]

          I'm glad we're burning the forests even faster in the name of identity politics.

          dumbass@leminal.spaceD This user is from outside of this forum
          [email protected]
          wrote on last edited by
          #184

          Well that was a swing and a miss, back to the dugout with you dumbass.

          • U [email protected]

            I have no idea why the makers of LLM crawlers think it's a good idea to ignore bot rules. The rules are there for a reason and the reasons are often more complex than "well, we just don't want you to do that". They're usually more like "why would you even do that?"

            Ultimately you have to trust what the site owners say. The reason why, say, your favourite search engine returns the relevant Wikipedia pages and not a bazillion random old page revisions from ages ago is that Wikipedia said "please crawl the most recent versions using canonical page names, and do not follow the links to the technical pages (including history)". Again: Why would anyone index those?

            T This user is from outside of this forum
            [email protected]
            wrote on last edited by
            #185

            Because it takes work to obey the rules, and you get less data for it. A theoretical competitor could get more data by ignoring the rules and gain some vague advantage from it.

            I wouldn't be surprised if the crawlers they used were bare-bones utilities set up to just grab everything without worrying about rules and the like.
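
For reference, obeying the rules doesn't have to be much work. Here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent, the `example.org` URLs, and the helper names `build_parser` and `may_fetch` are all made up for illustration; they are not Wikipedia's actual rules or any real crawler's code.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: everything is crawlable except the history pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /history/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that the crawler has already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True only if the site's rules allow this agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for url in (
        "https://example.org/wiki/Page",
        "https://example.org/history/Page?rev=1",
    ):
        verdict = "fetch" if may_fetch(rules, "ExampleBot", url) else "skip"
        print(verdict, url)
```

A crawler that runs this check before every request would skip the old revisions entirely, which also means less junk in its own dataset.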

            • tea@programming.devT [email protected]
              This post did not contain any content.
              M This user is from outside of this forum
              [email protected]
              wrote on last edited by
              #186

              This is getting ridiculous. Can someone please ban AI? Or at least regulate it somehow?

              • D [email protected]

                Surprised at the level of negativity here. Having had my sites repeatedly DDOSed offline by Claudebot and others scraping the same damned thing over and over again, thousands of times a second, I welcome any measures to help.

                dan@upvote.auD This user is from outside of this forum
                [email protected]
                wrote on last edited by
                #187

                thousands of times a second

                Modify your Nginx (or whatever web server you use) config to rate limit requests to dynamic pages, and cache them. For Nginx, you'd use either fastcgi_cache or proxy_cache depending on how the site is configured. Even if the pages change a lot, a cache with a short TTL (say 1 minute) can still help reduce load quite a bit while not letting them get too outdated.

                Static content (and cached content) shouldn't cause issues even if requested thousands of times per second. Following best practices like pre-compressing content using gzip, Brotli, and zstd helps a lot, too 🙂

                Of course, this advice is just for "unintentional" DDoS attacks, not intentionally malicious ones. Those are often much larger and need different protection - often some protection on the network or load balancer before it even hits the server.
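
To show the two ideas in isolation, here is a rough Python sketch of a short-TTL page cache and a per-client token-bucket rate limiter. This is not Nginx configuration and not how Nginx implements it internally (Nginx does rate limiting with its `limit_req` module); the class names `TtlCache` and `TokenBucket` and all parameter values are assumptions made up for the example.

```python
import time
from typing import Callable, Dict, Tuple


class TtlCache:
    """Cache rendered pages for a short, fixed time-to-live."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries: Dict[str, Tuple[float, str]] = {}  # key -> (stored_at, body)

    def get_or_render(self, key: str, render: Callable[[], str]) -> str:
        now = self.clock()
        hit = self.entries.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # still fresh: skip the expensive render
        body = render()
        self.entries[key] = (now, body)
        return body


class TokenBucket:
    """Allow `rate` requests per second per client, with bursts up to `burst`."""

    def __init__(self, rate: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.state: Dict[str, Tuple[float, float]] = {}  # client -> (tokens, last_seen)

    def allow(self, client: str) -> bool:
        now = self.clock()
        tokens, last = self.state.get(client, (float(self.burst), now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)
        if tokens < 1:
            self.state[client] = (tokens, now)
            return False  # over the limit: reject or delay this request
        self.state[client] = (tokens - 1, now)
        return True
```

With a 60-second TTL, a page that gets hammered thousands of times per second is still rendered only about once a minute, and the bucket keeps any single client from monopolising the dynamic pages that do miss the cache.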

                • U [email protected]

                  I have no idea why the makers of LLM crawlers think it's a good idea to ignore bot rules. The rules are there for a reason and the reasons are often more complex than "well, we just don't want you to do that". They're usually more like "why would you even do that?"

                  Ultimately you have to trust what the site owners say. The reason why, say, your favourite search engine returns the relevant Wikipedia pages and not a bazillion random old page revisions from ages ago is that Wikipedia said "please crawl the most recent versions using canonical page names, and do not follow the links to the technical pages (including history)". Again: Why would anyone index those?

                  E This user is from outside of this forum
                  [email protected]
                  wrote on last edited by
                  #188

                  They want everything. Does it exist but isn't in their dataset? Then they want it.

                  They want their AI to answer any question you could possibly ask it. Filtering out what is and isn't useful doesn't achieve that.

                  • cilethesane@lemmy.caC [email protected]

                    Because we have better methods that are easier, cheaper, and less damaging to the environment. They are solving nothing and wasting a fuckton of resources to do so.

                    It's like telling cavemen they don't need fire because you can mount an expedition to the nearest volcano to cook food without the need for fuel, then bring it back to them.

                    The best case scenario is the LLM tells you information that is already available on the internet, but 50% of the time it just makes shit up.

                    M This user is from outside of this forum
                    [email protected]
                    wrote on last edited by
                    #189

                    Wasteful?

                    Energy production is an issue. Using that energy isn't. LLMs are a better use of energy than most of the useless shit we produce every day.

                    • S [email protected]

                      Not if we go butlerian jihad on them first

                      a_random_idiot@lemmy.worldA This user is from outside of this forum
                      [email protected]
                      wrote on last edited by
                      #190

                      lol, I was gonna say a reverse butlerian jihad but I didn't think many people would get the reference 😛

                      • archrecord@lemm.eeA [email protected]

                        We can't even handle humans going psycho. Last thing I want is an AI losing its shit from being overworked producing goblin tentacle porn and going full skynet judgement day.

                        That is simply not how "AI" models today are structured, and that is entirely a fabrication based on science fiction related media.

                        An LLM is a series of matrix multiplications that the tokens from a query are run through. It does not have the capability to be overworked, to know if it's been used before (outside of its context window, which itself is just previously stored tokens added to the math problem), to change itself, or to arbitrarily access any system resources.

                        a_random_idiot@lemmy.worldA This user is from outside of this forum
                        [email protected]
                        wrote on last edited by
                        #191

                        You must be fun at parties.

                        • M [email protected]

                          This is getting ridiculous. Can someone please ban AI? Or at least regulate it somehow?

                          S This user is from outside of this forum
                          [email protected]
                          wrote on last edited by
                          #192

                          The problem is, how? I can set it up on my own computer using open source models and some of my own code. It’s really rough to regulate that.

                          • W [email protected]

                            don't worry, information is still shared. but with people. not with capitalist pigs

                            M This user is from outside of this forum
                            [email protected]
                            wrote on last edited by
                            #193

                            Capitalist pigs are paying media to generate AI hatred to help them convince you people to get behind laws that all limit info sharing under the guise of IP and copyright

                            • M [email protected]

                              This is getting ridiculous. Can someone please ban AI? Or at least regulate it somehow?

                              ? Offline
                              Guest
                              wrote on last edited by
                              #194

                              Once a technology or even an idea is there, you can't really make it go away. AI is here to stay. Generative LLMs are just a small part of it.

                              • F [email protected]

                                This is the great filter.

                                Why isn't there detectable life out there? They all do the same thing we're doing. Undone by greed.

                                ? Offline
                                Guest
                                wrote on last edited by
                                #195

                                I haven’t heard anyone refer to the great filter of intelligent life in a while. Good post.

                                • L [email protected]

                                  The same way they justify cutting benefits for the disabled to balance budgets instead of putting taxes on the rich or just not giving them bailouts, they will justify cutting power to you before a data centre that's 10 corporate AIs all fighting each other, unless we as a people stand up and actually demand change.

                                  ? Offline
                                  Guest
                                  wrote on last edited by
                                  #196

                                  In Texas 80% of our water usage is corporate. But when the lakes are low during a drought, they tell homeowners to water the grass less. Nobody tells the corporations to throw away less water.

                                  AI will be allowed to use as much energy as it wants. It will even remind people to turn off the lights in a room not being occupied while wasting energy to monitor everyone’s power usage.

                                  • F [email protected]

                                    Vote Blue No Matter Who

                                    Any Democrat is Better than Any Republican

                                    ? Offline
                                    Guest
                                    wrote on last edited by
                                    #197

                                    This is why we need a centrist political party. Solutions shouldn’t be a false dichotomy.

                                    And we shouldn’t downvote people into oblivion. Take my charitable upvote.

                                    • ? Guest

                                      This is why we need a centrist political party. Solutions shouldn’t be a false dichotomy.

                                      And we shouldn’t downvote people into oblivion. Take my charitable upvote.

                                      F This user is from outside of this forum
                                      [email protected]
                                      wrote on last edited by
                                      #198

                                      That will require reform of campaign finance laws and progressive reform for elections, both of which are highly partisan issues.

                                      • K [email protected]

                                        Burning 29 acres of rainforest a day to do nothing

                                        ? Offline
                                        Guest
                                        wrote on last edited by
                                        #199

                                        It certainly sounds like they generate the fake content once and serve it from cache every time: "Rather than creating this content on-demand (which could impact performance), we implemented a pre-generation pipeline that sanitizes the content to prevent any XSS vulnerabilities, and stores it in R2 for faster retrieval."
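
As a rough illustration of "generate once, serve from storage", here is a small Python sketch. It is not Cloudflare's pipeline: a plain dict stands in for R2, `html.escape` stands in for whatever sanitisation they actually do, and the names `pregenerate`, `serve`, and the `generate` callback are hypothetical.

```python
import html
from typing import Callable, Dict, Iterable


def pregenerate(topics: Iterable[str],
                generate: Callable[[str], str],
                store: Dict[str, str]) -> None:
    """Run the expensive generator once per topic, ahead of time."""
    for topic in topics:
        raw = generate(topic)             # the costly step (stand-in for a model call)
        store[topic] = html.escape(raw)   # neutralise markup before storing


def serve(topic: str, store: Dict[str, str]) -> str:
    """Answer a request from storage; no generation happens here."""
    return store[topic]
```

Under this scheme the generation cost is paid once per stored page, no matter how many times a crawler requests it.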

                                        • 4 [email protected]

                                          Imagine how much power is wasted on this unfortunate necessity.

                                          Now imagine how much power will be wasted circumventing it.

                                          Fucking clown world we live in

                                          ? Offline
                                          Guest
                                          wrote on last edited by
                                          #200

                                          From the article, it seems like they don't generate a new labyrinth every single time: "Rather than creating this content on-demand (which could impact performance), we implemented a pre-generation pipeline that sanitizes the content to prevent any XSS vulnerabilities, and stores it in R2 for faster retrieval."
