agnos.is Forums

The Open-Source Software Saving the Internet From AI Bot Scrapers

opensource
102 Posts 65 Posters 1 Views
  • fattyfoods@feddit.nlF [email protected]
    This post did not contain any content.
    bdonvr@thelemmy.clubB This user is from outside of this forum
    [email protected]
    wrote last edited by
    #22

    Ooh can this work with Lemmy without affecting federation?

    I S B D I 6 Replies Last reply
    22
    • M [email protected]

      <Stupidquestion>

      What advantage does this software provide over simply banning bots via robots.txt?

      </Stupidquestion>

      O This user is from outside of this forum
      [email protected]
      wrote last edited by
      #23

      I mean, you could have read the article before asking, it's literally in there...

      1 Reply Last reply
      1
      • bdonvr@thelemmy.clubB [email protected]

        Ooh can this work with Lemmy without affecting federation?

        I This user is from outside of this forum
        [email protected]
        wrote last edited by
        #24

        Yeah, it's already deployed on slrpnk.net. I see it momentarily every time I load the site.

        1 Reply Last reply
        3
        • K [email protected]

          I get that website admins are desperate for a solution, but Anubis is fundamentally flawed.

          It is hostile to the user, because it is very slow on older hardware and forces you to use JavaScript.

          It is bad for the environment, because it wastes energy on useless computations similar to mining crypto. If more websites start using this, that really adds up.

          But most importantly, it won't work in the end. These scraping tech companies have much deeper pockets and can use specialized hardware that is much more efficient at solving these challenges than a normal web browser.

          D This user is from outside of this forum
          [email protected]
          wrote last edited by
          #25

          It takes like half a second on my Fairphone 3, and the CPU in this thing is absolute dogshit. I also doubt that the power consumption is particularly significant compared to the overhead of parsing, executing and JIT-compiling the 14MiB of JavaScript frameworks on the actual website.

          K 1 Reply Last reply
          8
          • fattyfoods@feddit.nlF [email protected]
            This post did not contain any content.
            K This user is from outside of this forum
            [email protected]
            wrote last edited by
            #26

            Just recently there was a guy on the NANOG list ranting that Anubis is the wrong approach and that people should just cache properly; then their servers would handle thousands of users and the bots wouldn't matter. Anyone who puts git online has no one to blame but themselves, e-commerce should just be made cacheable, etc. It seemed a bit idealistic, a bit detached from the current reality.

            Ah found it, here

            D 1 Reply Last reply
            8
            • bdonvr@thelemmy.clubB [email protected]

              Ooh can this work with Lemmy without affecting federation?

              S This user is from outside of this forum
              [email protected]
              wrote last edited by
              #27

              As long as it's not configured improperly. When the Forgejo devs added it, it broke downloading images with Kubernetes for a moment. Basically, you would need to make sure the user agent header used for federation is allowed.

              1 Reply Last reply
              2
              • K [email protected]

                I get that website admins are desperate for a solution, but Anubis is fundamentally flawed.

                It is hostile to the user, because it is very slow on older hardware and forces you to use JavaScript.

                It is bad for the environment, because it wastes energy on useless computations similar to mining crypto. If more websites start using this, that really adds up.

                But most importantly, it won't work in the end. These scraping tech companies have much deeper pockets and can use specialized hardware that is much more efficient at solving these challenges than a normal web browser.

                S This user is from outside of this forum
                [email protected]
                wrote last edited by
                #28

                A JavaScript-free check was released recently; I just read about it. It uses an HTML refresh tag and a delay. It's not the default though, since it's new.
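
For anyone wondering what a "refresh tag and a delay" check could look like, here is a rough Python sketch. It is not Anubis's code: the signed timestamp, the `/pass` path, the `SECRET` value and the 5 second delay are all assumptions made up for illustration. The idea is that the server hands out a page that redirects after a delay, and only accepts the follow-up request if the delay really elapsed.

```python
import hashlib
import hmac
from html import escape

SECRET = b"replace-me"  # hypothetical server-side secret
DELAY_SECONDS = 5       # hypothetical wait before the browser is redirected


def sign(issued_at: int) -> str:
    """Sign the issue time so the client cannot forge an older timestamp."""
    return hmac.new(SECRET, str(issued_at).encode(), hashlib.sha256).hexdigest()


def challenge_page(issued_at: int, path: str = "/pass") -> str:
    """Build an HTML page that redirects via meta refresh, with no JavaScript."""
    url = escape(f"{path}?t={issued_at}&sig={sign(issued_at)}", quote=True)
    return (
        "<!doctype html><html><head>"
        f'<meta http-equiv="refresh" content="{DELAY_SECONDS}; url={url}">'
        "</head><body>Checking your browser...</body></html>"
    )


def accept(issued_at: int, sig: str, now: int) -> bool:
    """Accept the follow-up request only if the signature is valid and the delay passed."""
    valid = hmac.compare_digest(sig, sign(issued_at))
    return valid and now - issued_at >= DELAY_SECONDS
```

A client that follows the redirect immediately, or invents its own timestamp, fails `accept`; a browser that simply waits passes without running any script.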

                phase@lemmy.8th.worldP 1 Reply Last reply
                1
                • F [email protected]

                  I’d like to use Anubis but the strange hentai character as a mascot is not too professional

                  chickenandrice@sh.itjust.worksC This user is from outside of this forum
                  [email protected]
                  wrote last edited by
                  #29

                  Oh no why can't the web be even more boring and professional

                  F 1 Reply Last reply
                  10
                  • bdonvr@thelemmy.clubB [email protected]

                    Ooh can this work with Lemmy without affecting federation?

                    B This user is from outside of this forum
                    [email protected]
                    wrote last edited by
                    #30

                    Yes.

                    Source: I use it on my instance and federation works fine

                    bdonvr@thelemmy.clubB 1 Reply Last reply
                    21
                    • B [email protected]

                      Yes.

                      Source: I use it on my instance and federation works fine

                      bdonvr@thelemmy.clubB This user is from outside of this forum
                      [email protected]
                      wrote last edited by
                      #31

                      Thanks. Anything special configuring it?

                      B 1 Reply Last reply
                      10
                      • F [email protected]

                        I’d like to use Anubis but the strange hentai character as a mascot is not too professional

                        B This user is from outside of this forum
                        [email protected]
                        wrote last edited by
                        #32

                        hentai character

                        anime != hentai

                        I smile whenever I encounter the Anubis character in the wild. She's holding up the free software internet on her shoulders after all.

                        1 Reply Last reply
                        22
                        • fattyfoods@feddit.nlF [email protected]
                          This post did not contain any content.
                          grysbok@lemmy.sdf.orgG This user is from outside of this forum
                          [email protected]
                          wrote last edited by
                          #33

                          My archive's server uses Anubis and after initial configuration it's been pain-free. Also, I'm no longer getting multiple automated emails a day about how the server's timing out. It's great.

                          We went from about 3000 unique "pinky swear I'm not a bot" visitors per (iirc) half a day to 20 such visitors. Twenty is much more in line with expectations.

                          1 Reply Last reply
                          37
                          • fattyfoods@feddit.nlF [email protected]
                            This post did not contain any content.
                            U This user is from outside of this forum
                            [email protected]
                            wrote last edited by [email protected]
                            #34

                            Non paywalled link https://archive.is/VcoE1

                            It basically boils down to making the browser do some CPU-heavy calculations before allowing access. This is no problem for a single user, but for a bot farm this would increase the amount of compute power they need 100x or more.
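
To make the "CPU-heavy calculation" concrete, here is a minimal hashcash-style sketch in Python. It is a generic illustration, not Anubis's actual algorithm: the SHA-256 choice, the "leading zero hex digits" rule and the names `solve` and `verify` are assumptions. The visitor has to search for a nonce; the server checks the answer with a single hash.

```python
import hashlib
import itertools


def solve(challenge: str, difficulty: int) -> int:
    """Client side: try nonces until the hash starts with `difficulty` zero hex digits."""
    target = "0" * difficulty
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: one hash is enough to check the submitted nonce."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


if __name__ == "__main__":
    challenge = "example-challenge"  # hypothetical random string issued by the server
    nonce = solve(challenge, 4)
    print(nonce, verify(challenge, nonce, 4))
```

The asymmetry is the point: solving takes many hashes on average, verifying takes one, so the cost lands on whoever makes the request.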

                            mubelotix@jlai.luM lazynooblet@lazysoci.alL exu@feditown.comE 3 Replies Last reply
                            109
                            • bdonvr@thelemmy.clubB [email protected]

                              Thanks. Anything special configuring it?

                              B This user is from outside of this forum
                              [email protected]
                              wrote last edited by [email protected]
                              #35

                              I keep my server config in a public git repo, but I don't think you have to do anything really special to make it work with lemmy. Since I use Traefik I followed the guide for setting up Anubis with Traefik.

                              I don't expect to run into issues, as Anubis specifically looks for user-agent strings that appear like human users (i.e. they contain the word "Mozilla", as most graphical web browsers do). Any request clearly coming from a bot that identifies itself is left alone, and Lemmy identifies itself as "Lemmy/{version} +{hostname}" in requests.
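
As a rough Python sketch of that rule (based only on the description in this post, not on Anubis's real rule engine; the function name and the example version and hostname are made up):

```python
def needs_challenge(user_agent: str) -> bool:
    """Challenge requests that claim to be a graphical browser; let self-identified bots through."""
    return "Mozilla" in user_agent


# A typical browser user agent contains "Mozilla" and would be challenged.
browser = "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0"

# Federation traffic in the "Lemmy/{version} +{hostname}" shape would not be.
# The version and hostname here are placeholders.
federation = "Lemmy/0.19.3 +example.org"

print(needs_challenge(browser), needs_challenge(federation))
```

Scrapers that pretend to be browsers land in the challenged group, while honest bots and federation requests keep working.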

                              1 Reply Last reply
                              12
                              • U [email protected]

                                Non paywalled link https://archive.is/VcoE1

                                It basically boils down to making the browser do some CPU-heavy calculations before allowing access. This is no problem for a single user, but for a bot farm this would increase the amount of compute power they need 100x or more.

                                mubelotix@jlai.luM This user is from outside of this forum
                                [email protected]
                                wrote last edited by
                                #36

                                Exactly. It's called proof-of-work and was originally invented to reduce spam emails but was later used by Bitcoin to control its growth speed

                                R jackbydev@programming.devJ 2 Replies Last reply
                                60
                                • F [email protected]

                                  I’d like to use Anubis but the strange hentai character as a mascot is not too professional

                                  S This user is from outside of this forum
                                  [email protected]
                                  wrote last edited by
                                  #37

                                  BRB gonna add a kill the corpo to the kill the boer

                                  1 Reply Last reply
                                  0
                                  • fattyfoods@feddit.nlF [email protected]
                                    This post did not contain any content.
                                    drunkanroot@sh.itjust.worksD This user is from outside of this forum
                                    [email protected]
                                    wrote last edited by
                                    #38

                                    It won't protect more than one subdomain, I think.

                                    1 Reply Last reply
                                    0
                                    • B [email protected]

                                      It is basically instantaneous on my 12 year old Kepler GPU Linux Box. It is substantially less impactful on the environment than AI tar pits and other deterrents. The Cryptography happening is something almost all browsers from the last 10 years can do natively that Scrapers have to be individually programmed to do. Making it several orders of magnitude beyond impractical for every single corporate bot to be repurposed for. Only to then be rendered moot, because it's an open-source project that someone will just update the cryptographic algorithm for. These posts contain links to articles, if you read them you might answer some of your own questions and have more to contribute to the conversation.

                                      K This user is from outside of this forum
                                      [email protected]
                                      wrote last edited by
                                      #39

                                      It is basically instantaneous on my 12 year old Kepler GPU Linux Box.

                                      It depends on what the website admin sets, but I've had checks take more than 20 seconds on my reasonably modern phone. And as scrapers get more ruthless, that difficulty setting will have to go up.
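
A quick back-of-the-envelope sketch in Python of why the difficulty setting matters so much. It assumes a hashcash-style challenge where the hash must start with `difficulty` zero hex digits, which may not match how any particular site is configured; the hash rate is a parameter you would have to measure, not a number from this thread.

```python
def expected_attempts(difficulty: int) -> int:
    """Average number of hashes needed when each hex digit has a 1-in-16 chance of being zero."""
    return 16 ** difficulty


def expected_seconds(difficulty: int, hashes_per_second: float) -> float:
    """Average solve time for a device with the given (measured or assumed) hash rate."""
    return expected_attempts(difficulty) / hashes_per_second


if __name__ == "__main__":
    rate = 100_000.0  # hypothetical hashes per second for some device
    for difficulty in range(3, 7):
        print(difficulty, expected_attempts(difficulty), expected_seconds(difficulty, rate))
```

Under that assumption each extra digit multiplies the average work by 16, so a one-step bump can turn a sub-second check into one that takes many seconds on the same device.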

                                      The Cryptography happening is something almost all browsers from the last 10 years can do natively that Scrapers have to be individually programmed to do. Making it several orders of magnitude beyond impractical for every single corporate bot to be repurposed for.

                                      At best these browsers are going to have some efficient CPU implementation. Scrapers can send these challenges off to dedicated GPU farms or even FPGAs, which are an order of magnitude faster and more efficient. This is also not complex, a team of engineers could set this up in a few days.

                                      Only to then be rendered moot, because it's an open-source project that someone will just update the cryptographic algorithm for.

                                      There might be something in changing to a better, GPU resistant algorithm like argon2, but browsers don't support those natively so you would rely on an even less efficient implementation in js or wasm. Quickly changing details of the algorithm in a game of whack-a-mole could work to an extent, but that would turn this into an arms race. And the scrapers can afford far more development time than the maintainers of Anubis.

                                      These posts contain links to articles, if you read them you might answer some of your own questions and have more to contribute to the conversation.

                                      This is very condescending. I would prefer if you would just engage with my arguments.

                                      mcasq_qsacj_234@lemmy.zipM B D 3 Replies Last reply
                                      3
                                      • M [email protected]

                                        <Stupidquestion>

                                        What advantage does this software provide over simply banning bots via robots.txt?

                                        </Stupidquestion>

                                        thingsiplay@beehaw.orgT This user is from outside of this forum
                                        [email protected]
                                        wrote last edited by
                                        #40

                                        The difference is:

                                        • robots.txt is a promise without a door
                                        • Anubis is a physical closed door, that opens up after some time
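
To show the "promise without a door" part, here is a small Python sketch using the standard library's `urllib.robotparser`. The robots.txt content, bot name and URLs are made-up examples. The check only happens if the crawler chooses to run it.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """What a well-behaved crawler does: ask robots.txt before fetching."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


BLOCK_ALL = "User-agent: *\nDisallow: /\n"  # example robots.txt that bans every bot

if __name__ == "__main__":
    # A polite bot sees False here and goes away.
    print(allowed_by_robots(BLOCK_ALL, "ExampleBot", "https://example.org/page"))
    # An impolite bot never calls this function and just requests the page;
    # nothing on the server side enforces the answer.
```

That is the gap a challenge like Anubis is meant to close: the server itself refuses to serve the page until the check is passed.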
                                        1 Reply Last reply
                                        9
                                        • D [email protected]

                                          It takes like half a second on my Fairphone 3, and the CPU in this thing is absolute dogshit. I also doubt that the power consumption is particularly significant compared to the overhead of parsing, executing and JIT-compiling the 14MiB of JavaScript frameworks on the actual website.

                                          K This user is from outside of this forum
                                          [email protected]
                                          wrote last edited by
                                          #41

                                          It depends on the website's setting. I have the same phone and there was one website where it took more than 20 seconds.

                                          The power consumption is significant, because it needs to be. That is the entire point of this design. If it doesn't take a significant number of CPU cycles, scrapers will just power through them. This may not be significant for an individual user, but it does add up when this reaches widespread adoption and everyone's devices have to solve those challenges.

                                          I 1 Reply Last reply
                                          3