agnos.is Forums

Cloudflare blocking AI crawlers

34 Posts 24 Posters 0 Views
  • 3dcadmin@lemmy.relayeasy.com3 [email protected]

    Cloudflare trying to stop AI crawling somehow!

    https://arstechnica.com/tech-policy/2025/07/pay-up-or-stop-scraping-cloudflare-program-charges-bots-for-each-crawl/

    L This user is from outside of this forum
    [email protected]
    wrote on last edited by
    #4

    Is that why I can no longer go from a web search (e.g., DDG, Ecosia) or a forum link to StackOverflow without going through three CF captchas? If AI had not killed SO for me before, this does.

    G irmadlad@lemmy.worldI 2 Replies Last reply
    26
      (in reply to #4)

      G This user is from outside of this forum
      [email protected]
      wrote on last edited by
      #5

      Yeah, it's only anecdotal, but I feel like hobbyists like us, who do slightly unusual things without nefarious intent, are the ones who get hit with these sorts of issues the most. For example, I've noticed that some websites start throwing captchas at me or even just straight-up refuse to load with 403 Forbidden errors because I have my router set up to load-balance across two Internet connections. (At least, that's my guess as to why it's happening.)

      buelldozer@lemmy.todayB S 2 Replies Last reply
      18
        (in reply to #5)

        buelldozer@lemmy.todayB This user is from outside of this forum
        [email protected]
        wrote on last edited by
        #6

        For example, I’ve noticed that some websites start throwing captchas at me or even just straight-up refuse to load with 403 Forbidden errors because I have my router set up to load-balance across two Internet connections. (At least, that’s my guess as to why it’s happening.)

        I maintain several multi-WAN commercial setups and they don't have this problem. I obviously don't know what your setup is, but I'd guess something is wrong with how it's handling flows / connections. Once a connection is established between your edge and an internet resource, that flow should remain "stuck" to whatever WAN port it started with, and it sounds like that isn't happening.
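
To illustrate the "stuck" flow idea, here is a minimal sketch in Python. It is not how OpenWrt or any particular router implements it. The names `Flow`, `StickyBalancer`, `pick_wan`, and the link labels `wan_a` / `wan_b` are hypothetical. New flows are spread round-robin, and every later packet of a known flow is sent out the link recorded for it.

```python
from typing import Dict, List, NamedTuple


class Flow(NamedTuple):
    """Hypothetical 5-tuple identifying one connection."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    proto: str


class StickyBalancer:
    """Pins each flow to the WAN link chosen for its first packet."""

    def __init__(self, wan_links: List[str]) -> None:
        self.wan_links = list(wan_links)
        self.flow_table: Dict[Flow, str] = {}  # flow -> assigned WAN link
        self._next = 0  # round-robin cursor, used only for new flows

    def pick_wan(self, flow: Flow) -> str:
        # A known flow keeps its link, so the remote site keeps seeing
        # the same source address for the whole connection.
        if flow not in self.flow_table:
            link = self.wan_links[self._next % len(self.wan_links)]
            self.flow_table[flow] = link
            self._next += 1
        return self.flow_table[flow]


lb = StickyBalancer(["wan_a", "wan_b"])
first = Flow("192.168.1.10", 50000, "203.0.113.5", 443, "tcp")
second = Flow("192.168.1.10", 50001, "203.0.113.5", 443, "tcp")
print(lb.pick_wan(first), lb.pick_wan(second), lb.pick_wan(first))
```

Note that the two flows above go to the same site over different links. Keying the table on `(src_ip, dst_ip)` instead of the full tuple would keep every connection to one site on one link. Whether that is what triggers the captchas in this case is only a guess.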

        G 1 Reply Last reply
        8
          (in reply to #6)

          G This user is from outside of this forum
          [email protected]
          wrote on last edited by
          #7

          Could very well be. I'm using OpenWrt and basically did the bare minimum to get it to work.

          1 Reply Last reply
          1
            (in reply to the Ars Technica link posted by 3dcadmin@lemmy.relayeasy.com)

            tuxenthusiast@sopuli.xyzT This user is from outside of this forum
            [email protected]
            wrote on last edited by
            #8

            Anubis!
            https://github.com/TecharoHQ/anubis

            C 1 Reply Last reply
            9
              (in reply to the Ars Technica link posted by 3dcadmin@lemmy.relayeasy.com)

              3dcadmin@lemmy.relayeasy.com3 This user is from outside of this forum
              [email protected]
              wrote on last edited by
              #9

              I've seen plenty of people who think this is a bad thing. Do they just want everything to be crawled? I mean, I don't think this is the saviour, but it has got to be better than wholesale theft.

              G 1 Reply Last reply
              3
                (in reply to #4)

                irmadlad@lemmy.worldI This user is from outside of this forum
                [email protected]
                wrote on last edited by
                #10

                I've seen captchas for years, since before the recent influx of AI. It's because of the way I go about obfuscating my network activity that site security cannot determine whether I am a bot or not. There is a Captcha Buster extension for Firefox. If the captcha is 'Pick the three buses from this blurry, pixelated set of pictures' then I can solve those easily. It's when the captcha is a full-page picture of a motorcycle and you have to check all the relevant pieces, then move on to the next full picture, that it chaps me. So you click Captcha Buster and it 'listens' to the audio portion of the captcha, then solves it. It's not 100% on all types of captchas, but 90% of the time it works every time. It's interesting to me that after a while, you start to notice patterns in the captcha images. For instance, if the directions are 'Pick the fire hydrants', there will be at least 5 you have to pick. Crosswalks are the same way too.

                I'd much rather have to do captchas than have my jimmy out in the ether traffic. Anecdotal, but Stack Overflow doesn't trigger a captcha for me. All I get is the cookie popup.

                andres4ny@social.ridetrans.itA 1 Reply Last reply
                1
                  (in reply to #9)

                  G This user is from outside of this forum
                  [email protected]
                  wrote on last edited by
                  #11

                  do they just want everything to be crawled

                  Yes. Web crawling has been a normal and vital part of the web from day 1. We'd have no search engines without crawlers.

                  The web is user-centric by design. I'm sick of tech companies trying to flip the script and hoard information, most of which is not theirs to begin with (e.g. Google, Reddit, Twitter, Facebook, etc.).

                  P 1 Reply Last reply
                  1
                    (in reply to #10)

                    andres4ny@social.ridetrans.itA This user is from outside of this forum
                    [email protected]
                    wrote on last edited by
                    #12

                    @irmadlad @lambalicious I just manually do the audio captcha. Every time. Because the picture captchas often don't work correctly for me.

                    It does bug me a little that I don't know what the audio captcha is being used for - am I helping an Amazon Echo transcribe whatever it is surreptitiously listening to?

                    irmadlad@lemmy.worldI 1 Reply Last reply
                    2
                      (in reply to #12)

                      irmadlad@lemmy.worldI This user is from outside of this forum
                      [email protected]
                      wrote on last edited by
                      #13

                      am I helping an Amazon Echo transcribe whatever it is surreptitiously listening to?

                      I've always wondered where the hell they scrape all that audio from. I mean, it's random shit.

                      L 1 Reply Last reply
                      1
                        (in reply to the Ars Technica link posted by 3dcadmin@lemmy.relayeasy.com)

                        D This user is from outside of this forum
                        [email protected]
                        wrote on last edited by
                        #14

                        How does it differentiate an "AI crawler", from any other crawler?
                        Search engine crawler?
                        Someone monitoring data to offer statistics?
                        Archiving?

                        This is not good. They are most likely doing the crawling themselves and then selling the data to the highest bidder. That bidder could obviously be OpenAI for all we know.

                        They just know that once they introduce the phrase "this is anti-AI", a lot of people are not going to question anything.

                        _cryptagion@lemmy.dbzer0.com_ 1 Reply Last reply
                        7
                          (in reply to the Ars Technica link posted by 3dcadmin@lemmy.relayeasy.com)

                          E This user is from outside of this forum
                          [email protected]
                          wrote on last edited by
                          #15

                          All this discussion about captchas raises a question for me: if fingerprinting is so accurate and easy that uBlock, no cookies and a VPN don't help... then why the fuck do I have to keep doing captchas?

                          D I 2 Replies Last reply
                          28
                            (in reply to the Ars Technica link posted by 3dcadmin@lemmy.relayeasy.com)

                            deebster@infosec.pubD This user is from outside of this forum
                            [email protected]
                            wrote on last edited by
                            #16

                            FYI, you've added a link where the label is the URL and the actual link is empty. You can fix this by removing the [ and ]() around the link. If the link is there as plain text, it gets a hyperlink automatically: https://arstechnica.com/tech-policy/2025/07/pay-up-or-stop-scraping-cloudflare-program-charges-bots-for-each-crawl/
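
For anyone cleaning this up in bulk, the fix described above can be sketched as a small Python helper. It is a rough illustration only. The function name `unwrap_empty_links` and the regular expression are assumptions of this sketch, not part of Lemmy or any Markdown library. It replaces `[label]()` with the bare label and leaves links that have a target alone.

```python
import re

# Matches a Markdown link whose target is empty, e.g. "[https://example.com]()".
EMPTY_LINK = re.compile(r"\[([^\]]+)\]\(\s*\)")


def unwrap_empty_links(markdown: str) -> str:
    """Replace each empty-target link with its label text."""
    return EMPTY_LINK.sub(r"\1", markdown)


print(unwrap_empty_links("See [https://example.com/article]() for details."))
```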

                            3dcadmin@lemmy.relayeasy.com3 1 Reply Last reply
                            2
                              (in reply to #15)

                              D This user is from outside of this forum
                              [email protected]
                              wrote on last edited by [email protected]
                              #17

                              Because it never was about security. You're training LLMs for free.

                              I'm pretty sure some auto drive company is getting the advantage, since a lot of captchas are about spotting crosswalks, traffic lights, stairs, buses, mountains, motorcycles, etc. Wonder if it's fucking Tesla.

                              irmadlad@lemmy.worldI 1 Reply Last reply
                              38
                                (in reply to #11)

                                P This user is from outside of this forum
                                [email protected]
                                wrote on last edited by
                                #18

                                I don’t think this blocks crawlers. About 1 in 5 websites use Cloudflare; the significant thing here is that AI scraping, NOT crawling, is now blocked by default on most of those sites.

                                1 Reply Last reply
                                3
                                  (in reply to #17)

                                  irmadlad@lemmy.worldI This user is from outside of this forum
                                  [email protected]
                                  wrote on last edited by
                                  #19

                                  I’m pretty sure some auto drive company is getting the advantage

                                  I'd reckon that a lot of that is spliced from pictures captured by Google Maps vehicles.

                                  W 1 Reply Last reply
                                  6
                                    (in reply to #14)

                                    _cryptagion@lemmy.dbzer0.com_ This user is from outside of this forum
                                    [email protected]
                                    wrote on last edited by
                                    #20

                                    Well, they have access to logs showing who connects to 24 million websites, how they use those websites, and for how long. So if there’s anyone who knows what traffic is crawlers, and which crawlers are AI, it’s Cloudflare. There’s no way they wouldn’t know, they have all the data they would ever need to figure it out. In fact, there’s nobody on the internet who is better positioned to be able to identify AI crawlers than Cloudflare.
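
As a very rough illustration of the simplest signal available, here is a Python sketch that sorts self-declared crawlers by User-Agent token. It is not Cloudflare's method, which presumably draws on far more than this. The token lists are a small illustrative sample, the grouping (for instance counting `CCBot` as an AI crawler) is an assumption, and `classify_user_agent` is a hypothetical name. Anything that lies about its User-Agent slips straight through.

```python
# Illustrative, incomplete samples of tokens that crawlers announce themselves with.
AI_CRAWLER_TOKENS = ("GPTBot", "ClaudeBot", "CCBot", "PerplexityBot", "Bytespider")
SEARCH_CRAWLER_TOKENS = ("Googlebot", "bingbot", "DuckDuckBot")
ARCHIVE_CRAWLER_TOKENS = ("archive.org_bot", "ia_archiver")

CATEGORIES = (
    ("ai", AI_CRAWLER_TOKENS),
    ("search", SEARCH_CRAWLER_TOKENS),
    ("archive", ARCHIVE_CRAWLER_TOKENS),
)


def classify_user_agent(user_agent: str) -> str:
    """Return 'ai', 'search', 'archive', or 'unknown' for a User-Agent string."""
    ua = user_agent.lower()
    for label, tokens in CATEGORIES:
        # Case-insensitive substring match on the declared bot name.
        if any(token.lower() in ua for token in tokens):
            return label
    return "unknown"


print(classify_user_agent("Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"))
print(classify_user_agent("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
```

Telling apart crawlers that do not identify themselves would need the kind of cross-site traffic data the post describes, which a single site operator does not have.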

                                    x00z@lemmy.worldX 1 Reply Last reply
                                    4
                                    • teft@lemmy.worldT [email protected]

                                      Seeing as how they can't reliably detect whether I'm human or not, I don't have much confidence in this.

                                      F This user is from outside of this forum
                                      [email protected]
                                      wrote on last edited by
                                      #21

                                      Yeah. Me choosing to use a VPN and a privacy-respecting browser has earnt me a constant captcha.

                                      tywele@lemmy.dbzer0.comT 1 Reply Last reply
                                      4
                                        (in reply to #13)

                                        L This user is from outside of this forum
                                        [email protected]
                                        wrote on last edited by
                                        #22

                                        Gotta be physicists or fanfic writers. I cannot imagine any better options.

                                        irmadlad@lemmy.worldI 1 Reply Last reply
                                        1
                                          (in reply to #19)

                                          W This user is from outside of this forum
                                          [email protected]
                                          wrote on last edited by [email protected]
                                          #23

                                          Both you and @[email protected] are correct. Google bought reCAPTCHA in 2009.

                                          Here’s an article about it from 2018.

                                          (╯°□°)╯︵ ┻━┻

                                          Captcha if you can: how you’ve been training AI for years without realising it

                                          And another from 2019! Captchas got harder for us because the AI had learned from our training.

                                          Why CAPTCHAs have gotten so difficult

                                          D irmadlad@lemmy.worldI 2 Replies Last reply
                                          9