agnos.is Forums


The Open-Source Software Saving the Internet From AI Bot Scrapers

102 Posts, 65 Posters
  • U [email protected]

    Non paywalled link https://archive.is/VcoE1

    It basically boils down to making the browser do some CPU-heavy calculations before allowing access. This is no problem for a single user, but for a bot farm this would increase the amount of compute power they need by 100x or more.

    [email protected]
    wrote last edited by
    #36

    Exactly. It's called proof-of-work, and was originally invented to reduce email spam but was later used by Bitcoin to control its growth rate.
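    A minimal sketch of the idea in Python, assuming a SHA-256 challenge where the client has to find a nonce whose hash has a given number of leading zero bits (Anubis's actual parameters and algorithm may differ):

    ```python
    import hashlib
    import itertools

    def solve_challenge(challenge: str, difficulty_bits: int) -> int:
        """Brute-force a nonce so that sha256(challenge + nonce) has `difficulty_bits` leading zero bits."""
        target = 1 << (256 - difficulty_bits)  # any digest value below this passes
        for nonce in itertools.count():
            digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
            if int.from_bytes(digest, "big") < target:
                return nonce  # cheap once per visitor, expensive millions of times per bot farm

    def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
        """The server only needs a single hash to check the answer."""
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
        return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))
    ```

    Each extra bit of difficulty roughly doubles the expected work for the solver, while verification stays a single hash; that asymmetry is what the whole scheme relies on.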

    • F [email protected]

      I’d like to use Anubis but the strange hentai character as a mascot is not too professional

      [email protected]
      wrote last edited by
      #37

      BRB gonna add a kill the corpo to the kill the boer

      • fattyfoods@feddit.nlF [email protected]
        This post did not contain any content.
        [email protected]
        wrote last edited by
        #38

        It won't protect more than one subdomain, I think.

        • B [email protected]

          It is basically instantaneous on my 12-year-old Kepler GPU Linux box. It is substantially less impactful on the environment than AI tar pits and other deterrents. The cryptography happening is something almost all browsers from the last 10 years can do natively that scrapers have to be individually programmed to do, making it several orders of magnitude beyond impractical for every single corporate bot to be repurposed for. Only to then be rendered moot, because it's an open-source project that someone will just update the cryptographic algorithm for. These posts contain links to articles; if you read them you might answer some of your own questions and have more to contribute to the conversation.

          [email protected]
          wrote last edited by
          #39

          It is basically instantaneous on my 12-year-old Kepler GPU Linux box.

          It depends on what the website admin sets, but I've had checks take more than 20 seconds on my reasonably modern phone. And as scrapers get more ruthless, that difficulty setting will have to go up.

          The cryptography happening is something almost all browsers from the last 10 years can do natively that scrapers have to be individually programmed to do, making it several orders of magnitude beyond impractical for every single corporate bot to be repurposed for.

          At best these browsers are going to have some efficient CPU implementation. Scrapers can send these challenges off to dedicated GPU farms or even FPGAs, which are an order of magnitude faster and more efficient. This is also not complex; a team of engineers could set this up in a few days.

          Only to then be rendered moot, because it's an open-source project that someone will just update the cryptographic algorithm for.

          There might be something in changing to a better, GPU-resistant algorithm like argon2, but browsers don't support those natively, so you would rely on an even less efficient implementation in JS or WASM. Quickly changing details of the algorithm in a game of whack-a-mole could work to an extent, but that would turn this into an arms race. And the scrapers can afford far more development time than the maintainers of Anubis.

          These posts contain links to articles; if you read them you might answer some of your own questions and have more to contribute to the conversation.

          This is very condescending. I would prefer if you would just engage with my arguments.
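          To illustrate the argon2 point above: a memory-hard function can be dropped into the same kind of challenge loop. Here is a rough sketch using hashlib.scrypt from the Python standard library as a stand-in for a memory-hard KDF like argon2; this is not what Anubis currently does, just the trade-off being described.

          ```python
          import hashlib
          import itertools

          def solve_memory_hard(challenge: bytes, difficulty_bits: int) -> int:
              """Same nonce search as a SHA-based proof-of-work, but every attempt needs
              roughly 16 MiB of RAM (128 * n * r bytes with n=2**14, r=8), which is what
              makes cheap GPU/FPGA offloading far less attractive."""
              target = 1 << (256 - difficulty_bits)
              for nonce in itertools.count():
                  digest = hashlib.scrypt(str(nonce).encode(), salt=challenge,
                                          n=2**14, r=8, p=1, dklen=32)
                  if int.from_bytes(digest, "big") < target:
                      return nonce
          ```

          The catch, as noted above, is that browsers don't expose such functions natively, so the client side would have to run it in JS or WASM and give back some of that advantage.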

          • M [email protected]

            <Stupidquestion>

            What advantage does this software provide over simply banning bots via robots.txt?

            </Stupidquestion>

            [email protected]
            wrote last edited by
            #40

            The difference is:

            • robots.txt is a promise without a door
            • Anubis is a physical closed door that opens up after some time
            • D [email protected]

              It takes like half a second on my Fairphone 3, and the CPU in this thing is absolute dogshit. I also doubt that the power consumption is particularly significant compared to the overhead of parsing, executing and JIT-compiling the 14MiB of JavaScript frameworks on the actual website.

              [email protected]
              wrote last edited by
              #41

              It depends on the website's setting. I have the same phone and there was one website where it took more than 20 seconds.

              The power consumption is significant, because it needs to be. That is the entire point of this design. If it doesn't take a significant number of CPU cycles, scrapers will just power through the challenges. This may not be significant for an individual user, but it does add up when this reaches widespread adoption and everyone's devices have to solve those challenges.

              • U [email protected]

                Non paywalled link https://archive.is/VcoE1


                [email protected]
                wrote last edited by
                #42

                Thank you for the link. Good read

                • lukecooperatus@lemmy.mlL [email protected]

                  she’s working on a non cryptographic challenge so it taxes users’ CPUs less, and also thinking about a version that doesn’t require JavaScript

                  Sounds like the developer of Anubis is aware and working on these shortcomings.

                  Still, IMO these are minor short term issues compared to the scope of the AI problem it's addressing.

                  [email protected]
                  wrote last edited by [email protected]
                  #43

                  To be clear, I am not minimizing the problems of scrapers. I am merely pointing out that this strategy of proof-of-work has nasty side effects and we need something better.

                  These issues are not short term. PoW means you are entering into an arms race against an adversary with bottomless pockets that inherently requires a ton of useless computations in the browser.

                  When it comes to moving towards something based on heuristics, which is what the developer was talking about there, that is much better. But that is basically what many others are already doing (like the "I am not a robot" checkmark) and fundamentally different from the PoW that I argue against.

                  Go do heuristics, not PoW.

                  • chickenandrice@sh.itjust.worksC [email protected]

                    Oh no why can't the web be even more boring and professional

                    [email protected]
                    wrote last edited by
                    #44

                     It’s just not my style, OK, is all I’m saying, and it’s nothing I’d be able to get past all my superiors as a recommendation of software to use.

                    • M [email protected]


                      [email protected]
                      wrote last edited by
                      #45

                      TL;DR: You should have both due to the explicit breaking of the robots.txt contract by AI companies.

                       AI generally doesn't obey robots.txt. That file just notifies scrapers of what they shouldn't scrape, but relies on the good faith of the scrapers. Many AI companies have explicitly chosen not to comply with robots.txt, thus breaking the contract, so this is a system that causes those scrapers that are not willing to comply to get stuck in a black hole of junk and waste their time. This is a countermeasure, but not a solution. It's just way less complex than other options that just block these connections, but then make you get pounded with retries. This way the scraper bot gets stuck for a while and doesn't waste as many of your resources blocking them over and over again.
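                       To see why robots.txt is purely advisory, here is a small Python example using the standard library's urllib.robotparser: the crawler has to choose to consult the file, and a scraper that skips this check loses nothing (the URL and user agent are placeholders).

                       ```python
                       from urllib import robotparser

                       # A polite crawler volunteers to check the rules; nothing enforces them.
                       rp = robotparser.RobotFileParser()
                       rp.set_url("https://example.org/robots.txt")  # placeholder site
                       rp.read()

                       if rp.can_fetch("ExampleBot/1.0", "https://example.org/some/page"):
                           print("robots.txt allows this fetch")
                       else:
                           print("robots.txt disallows it -- but only our own code honours that")
                       ```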

                      • K [email protected]


                        [email protected]
                        wrote last edited by
                        #46

                        How will Anubis respond if browsers start acting like the manual scrapers AI companies use to collect information?

                        OpenAI is planning to release an AI-powered browser; what happens if it ends up using that as another way to collect information?

                        I don't think blocking all Chromium browsers is a good idea.

                        • fattyfoods@feddit.nlF [email protected]
                          This post did not contain any content.
                          [email protected]
                          wrote last edited by
                          #47

                          Every time I see Anubis I get happy, because I know the website has some quality information.

                          • K [email protected]


                            [email protected]
                            wrote last edited by [email protected]
                            #48

                            At best these browsers are going to have some efficient CPU implementation.

                            That means absolutely nothing in the context of what I said, or of any information contained in this article. It does not relate to anything I originally replied to.

                            Scrapers can send these challenges off to dedicated GPU farms or even FPGAs, which are an order of magnitude faster and more efficient.

                            Not what's happening here. Be serious.

                            I would prefer if you would just engage with my arguments.

                            I did, your arguments are bad and you're being intellectually disingenuous.

                            This is very condescending.

                            Yeah, that's the point. Very astute.

                            • bdonvr@thelemmy.clubB [email protected]

                              Ooh can this work with Lemmy without affecting federation?

                              [email protected]
                              wrote last edited by
                              #49

                              "Yes", for any bits the user sees. The frontend UI can be behind Anubis without issues. The API, including both user and federation, cannot. We expect "bots" to use an API, so you can't put human verification in front of it. These "bots* also include applications that aren't aware of Anubis, or unable to pass it, like all third party Lemmy apps.

                              That does stop almost all generic AI scraping, though it does not prevent targeted abuse.
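                              As a rough sketch of that split (the path prefixes are assumptions for illustration, not Lemmy's exact routes): the proof-of-work interstitial is applied to the human-facing UI and skipped for API and federation endpoints.

                              ```python
                              # Hypothetical routing rule for whatever reverse proxy or middleware sits in
                              # front of the Lemmy frontend; adjust the prefixes to the instance's real layout.
                              CHALLENGE_EXEMPT_PREFIXES = (
                                  "/api/",          # user and third-party app clients
                                  "/inbox",         # ActivityPub federation traffic
                                  "/.well-known/",  # webfinger / nodeinfo discovery
                              )

                              def needs_anubis_challenge(path: str) -> bool:
                                  """Only browser-facing pages get the challenge; everything a bot or app
                                  legitimately needs goes straight through."""
                                  return not path.startswith(CHALLENGE_EXEMPT_PREFIXES)
                              ```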

                              • K [email protected]


                                [email protected]
                                wrote last edited by
                                #50

                                The usage of the phone's CPU is usually around 1 W, but could jump to 5-6 W when boosting to solve a nasty challenge. At 20 s per challenge, that's 0.03 watt-hours. You need to see a thousand of these challenges to use up 0.03 kWh.

                                My last power bill was around 300 kWh, or about 10,000 times more than what your phone would use on those thousand challenges, and roughly ten million times more than what a single 20 s challenge would use.
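                                Spelling out that arithmetic (using the 6 W and 20 s figures assumed above):

                                ```python
                                power_w = 6                       # assumed peak draw while solving a hard challenge
                                seconds = 20
                                wh_per_challenge = power_w * seconds / 3600        # ~0.033 Wh per challenge

                                kwh_per_thousand = 1000 * wh_per_challenge / 1000  # ~0.033 kWh for a thousand challenges
                                bill_kwh = 300                                     # the monthly bill mentioned above

                                print(bill_kwh / kwh_per_thousand)                 # ~9,000x a thousand challenges
                                print(bill_kwh / (wh_per_challenge / 1000))        # ~9,000,000x a single challenge
                                ```

                                The rounder 0.03 Wh figure used above gives the 10,000x and ten-million-x numbers; the exact values land a little below that.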

                                • K [email protected]

                                  Just recently there was a guy on the NANOG list ranting about Anubis being the wrong approach: people should just cache properly, then their servers would handle thousands of users and the bots wouldn't matter; anyone who puts git online has no one to blame but themselves; e-commerce should just be made cacheable; and so on. It seemed a bit idealistic, a bit detached from the current reality.

                                  Ah found it, here

                                  [email protected]
                                  wrote last edited by
                                  #51

                                  Someone making an argument like that clearly does not understand the situation. Just 4 years ago, a robots.txt was enough to keep most bots away, and hosting personal git on the web required very little resources. With AI companies actively profiting off stealing everything, a robots.txt doesn't mean anything. Now, even a relatively small git web host takes an insane amount of resources. I'd know - I host a Forgejo instance. Caching doesn't matter, because diffs between two random commits are likely unique. Rate limiting doesn't matter, because they will use different IP ranges and user agents. It would also heavily impact actual users, "because the site is busy".

                                  A proof-of-work solution like Anubis is the best we have currently. The least possible impact to end users, while keeping most (if not all) AI scrapers off the site.

                                  • K [email protected]


                                    [email protected]
                                    wrote last edited by
                                    #52

                                    Scrapers can send these challenges off to dedicated GPU farms or even FPGAs, which are an order of magnitude faster and more efficient.

                                    Let's assume, for the sake of argument, that an AI scraper company actually attempted this. They don't, but let's assume it anyway.

                                    The next Anubis release could include, for example, SHA-256 instead of SHA-1. This would be a simple and basically transparent update for admins and end users. The AI company that invested in offloading the PoW to somewhere more efficient now has to spend significantly more resources changing its implementation than what it took for the devs and users of Anubis.

                                    Yes, it technically remains a game of "cat and mouse", but heavily stacked against the cat. One step for Anubis is 2000 steps for a company reimplementing its client in more efficient hardware. Most of the Anubis changes can even be done without impacting the end users at all. That's a game AI companies aren't willing to play, because they've basically already lost. It doesn't really matter how "efficient" the implementation is, if it can be rendered unusable by a small Anubis update.
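                                    A sketch of why that kind of swap is cheap for the defender: if the client-side solver is parameterized by hash function, switching it is a one-line change (this is illustrative; the real Anubis update mechanism may look different).

                                    ```python
                                    import hashlib
                                    import itertools

                                    def solve(challenge: str, difficulty_bits: int, algo: str = "sha256") -> int:
                                        """Flipping `algo` to "sha1", "sha512" or "sha3_256" is trivial for Anubis and for
                                        browsers, but obsoletes any fixed-function hardware pipeline a scraper has built."""
                                        bits = hashlib.new(algo).digest_size * 8
                                        target = 1 << (bits - difficulty_bits)
                                        for nonce in itertools.count():
                                            digest = hashlib.new(algo, f"{challenge}{nonce}".encode()).digest()
                                            if int.from_bytes(digest, "big") < target:
                                                return nonce
                                    ```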

                                    • fattyfoods@feddit.nlF [email protected]
                                      This post did not contain any content.
                                      [email protected]
                                      wrote last edited by [email protected]
                                      #53

                                      I don't understand how/why this got so popular out of nowhere... the same solution has already existed for years in the form of haproxy-protection and a couple others... but nobody seems to care about those.

                                      • R [email protected]


                                        [email protected]
                                        wrote last edited by
                                        #54

                                        Probably because the creator had a blog post that got shared around at a point in time where this exact problem was resonating with users.

                                        It's not always about being first but about marketing.

                                        • K [email protected]


                                          [email protected]
                                          wrote last edited by
                                          #55

                                          You're more than welcome to try and implement something better.
