agnos.is Forums

The Open-Source Software Saving the Internet From AI Bot Scrapers

Open Source
opensource
102 Posts 65 Posters 1 Views
  • F [email protected]

    It’s just not my style, OK? That’s all I’m saying, and it’s nothing I’d be able to get past all my superiors as a software recommendation.

    lime@feddit.nuL This user is from outside of this forum
    [email protected]
    wrote last edited by
    #60

    then have them pay for it.

    1 Reply Last reply
    6
    • fattyfoods@feddit.nlF [email protected]
      This post did not contain any content.
      I This user is from outside of this forum
      [email protected]
      wrote last edited by
      #61

      Open source is also the AI scraper bots AND the internet itself; it is every character in the story.

      1 Reply Last reply
      7
      • bdonvr@thelemmy.clubB [email protected]

        Ooh can this work with Lemmy without affecting federation?

        I This user is from outside of this forum
        [email protected]
        wrote last edited by
        #62

        Yes, it would make Lemmy as unsearchable as Discord, instead of as unsearchable as Pinterest.

        bdonvr@thelemmy.clubB 1 Reply Last reply
        3
        • F [email protected]

          It’s just not my style, OK? That’s all I’m saying, and it’s nothing I’d be able to get past all my superiors as a software recommendation.

          phase@lemmy.8th.worldP This user is from outside of this forum
          [email protected]
          wrote last edited by
          #63

          Support, pay, and get it 🙂

          F 1 Reply Last reply
          2
          • F [email protected]

            This is fantastic and I appreciate that it scales well on the server side.

            AI scraping is a scourge. I would love to know the collective amount of power wasted on necessary countermeasures like this, so it could be added to the total wasted by AI.

            I This user is from outside of this forum
            [email protected]
            wrote last edited by
            #64

            All this could be avoided by making users submit a photo ID to log in to an account.

            anzo@programming.devA H 2 Replies Last reply
            4
            • D [email protected]

              Someone making an argument like that clearly does not understand the situation. Just 4 years ago, a robots.txt was enough to keep most bots away, and hosting personal git on the web required very few resources. With AI companies actively profiting off stealing everything, a robots.txt doesn't mean anything. Now, even a relatively small git web host takes an insane amount of resources. I'd know - I host a Forgejo instance. Caching doesn't matter, because diffs between two random commits are likely unique. Rate limiting doesn't matter, because they will use different IPs (or IP ranges) and user agents. It would also heavily impact actual users "because the site is busy".

              A proof-of-work solution like Anubis is the best we have currently. The least possible impact to end users, while keeping most (if not all) AI scrapers off the site.

              I This user is from outside of this forum
              [email protected]
              wrote last edited by
              #65

              This would not be a problem if one bot scraped once and the result was then mirrored to everyone on Big Tech's dime (Cloudflare, Tailscale). But since they are all competing now, they each think their edge is going to be their own, better scraper setup, and they won't share.

              Maybe there should just be a web-to-torrent bridge, so the data is pushed out once by the server and the swarm does the heavy lifting as a cache.

              D 1 Reply Last reply
              1
              • I [email protected]

                This would not be a problem if one bot scraped once and the result was then mirrored to everyone on Big Tech's dime (Cloudflare, Tailscale). But since they are all competing now, they each think their edge is going to be their own, better scraper setup, and they won't share.

                Maybe there should just be a web-to-torrent bridge, so the data is pushed out once by the server and the swarm does the heavy lifting as a cache.

                D This user is from outside of this forum
                [email protected]
                wrote last edited by
                #66

                No, it'd still be a problem; every diff between commits is expensive to render to web, even if "only one company" is scraping it, "only one time". Many of these applications are designed for humans, not scrapers.
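
To put rough numbers on the caching point, here is a minimal sketch. It assumes the forge offers a compare view for any pair of commits (an assumption for illustration), and `distinct_diff_pages` is a hypothetical helper, not part of Forgejo. The count of distinct diff pages grows quadratically with the number of commits, so a crawler walking them rarely requests the same page twice.

```python
from math import comb


def distinct_diff_pages(commit_count: int) -> int:
    """Number of unordered commit pairs, i.e. possible compare views.

    Hypothetical helper: assumes every pair of commits can be diffed.
    """
    return comb(commit_count, 2)


for commits in (100, 1_000, 10_000):
    pages = distinct_diff_pages(commits)
    print(f"{commits} commits -> {pages} possible diffs")
```

At 10,000 commits that is already close to 50 million distinct pages, each rendered on demand.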

                I 1 Reply Last reply
                0
                • S [email protected]

                  A JavaScript-free check was released recently; I just read about it. It uses an HTML refresh tag and a delay. It's not the default, though, since it's new.

                  phase@lemmy.8th.worldP This user is from outside of this forum
                  [email protected]
                  wrote last edited by
                  #67

                  The source, I assume: challenges/metarefresh.

                  1 Reply Last reply
                  0
                  • F [email protected]

                    Probably because the creator had a blog post that got shared around at a point in time where this exact problem was resonating with users.

                    It's not always about being first but about marketing.

                    johnedwa@sopuli.xyzJ This user is from outside of this forum
                    [email protected]
                    wrote last edited by [email protected]
                    #68

                    It’s not always about being first but about marketing.

                    And one has a cute catgirl mascot, the other a website that looks like a blockchain techbro startup.
                    I'm even willing to bet the number of people who set up Anubis just to get the cute splash screen isn't insignificant.

                    jackbydev@programming.devJ 1 Reply Last reply
                    22
                    • lime@feddit.nuL [email protected]

                      anubis is basically a bitcoin miner, with the difficulty turned way down (and obviously not resulting in any coins), so it's inherently random. if it takes minutes it does seem like something is wrong though. maybe a network error?

                      isolatedscotch@discuss.tchncs.deI This user is from outside of this forum
                      [email protected]
                      wrote last edited by [email protected]
                      #69

                      Adding to this, some sites set the difficulty way higher than others: nerdvpn's Invidious and Redlib instances take about 5 seconds and ~20k hashes, while privacyredirect's instances are almost instant, with fewer than 50 hashes each time.
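
The spread in hash counts is what a hashcash-style search would predict. Below is a minimal sketch, not Anubis's actual challenge format. It assumes difficulty means "number of leading zero hex digits in a SHA-256 digest", and `solve`, `verify`, and the challenge string are hypothetical. Each extra digit multiplies the expected work by 16, and the attempt count for any single run is random.

```python
import hashlib
from itertools import count
from typing import Tuple


def solve(challenge: str, difficulty: int) -> Tuple[int, int]:
    """Find a nonce whose SHA-256 digest starts with `difficulty` zero hex digits.

    Returns (nonce, attempts). Assumed scheme, for illustration only.
    """
    prefix = "0" * difficulty
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce, nonce + 1


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Checking a solution costs one hash, however long the search took."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


for difficulty in (1, 2, 3, 4):
    nonce, attempts = solve("example-challenge", difficulty)
    expected = 16 ** difficulty
    print(f"difficulty {difficulty}: {attempts} hashes (expected about {expected})")
```

The asymmetry is the point: the visitor pays for the search, while the server verifies with a single hash.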

                      repletelocum@lemmy.blahaj.zoneR 1 Reply Last reply
                      15
                      • mubelotix@jlai.luM [email protected]

                        Exactly. It's called proof-of-work and was originally invented to reduce spam emails but was later used by Bitcoin to control its growth speed

                        R This user is from outside of this forum
                        [email protected]
                        wrote last edited by
                        #70

                        it wasn't made for bitcoin originally? didn't know that!

                        0 1 Reply Last reply
                        2
                        • I [email protected]

                          Yes, it would make Lemmy as unsearchable as Discord, instead of as unsearchable as Pinterest.

                          bdonvr@thelemmy.clubB This user is from outside of this forum
                          [email protected]
                          wrote last edited by
                          #71

                          That's not true, search indexer bots should be allowed through from what I read here.

                          I 1 Reply Last reply
                          1
                          • bdonvr@thelemmy.clubB [email protected]

                            Ooh can this work with Lemmy without affecting federation?

                            R This user is from outside of this forum
                            [email protected]
                            wrote last edited by
                            #72

                            To be honest, I need to ask my admin about that!

                            fxomt@lemmy.dbzer0.comF 1 Reply Last reply
                            2
                            • I [email protected]

                              All this could be avoided by making users submit a photo ID to log in to an account.

                              anzo@programming.devA This user is from outside of this forum
                              [email protected]
                              wrote last edited by
                              #73

                              That's awful. It means I would get my photo ID stolen hundreds of times per day. There's also thispersondoesnotexist..., so it won't work, for many reasons. Not all websites require an account. And even those that do, when they ask for "personal verification" (like dating apps), have a hard time implementing just that. Most "serious" cases use human review of the photo and a video in which you move your face in and out of an oval shape...

                              I 1 Reply Last reply
                              12
                              • anzo@programming.devA [email protected]

                                That's awful. It means I would get my photo ID stolen hundreds of times per day. There's also thispersondoesnotexist..., so it won't work, for many reasons. Not all websites require an account. And even those that do, when they ask for "personal verification" (like dating apps), have a hard time implementing just that. Most "serious" cases use human review of the photo and a video in which you move your face in and out of an oval shape...

                                I This user is from outside of this forum
                                [email protected]
                                wrote last edited by
                                #74

                                Also, you must drink a verification can!

                                1 Reply Last reply
                                4
                                • bdonvr@thelemmy.clubB [email protected]

                                  That's not true, search indexer bots should be allowed through from what I read here.

                                  I This user is from outside of this forum
                                  [email protected]
                                  wrote last edited by
                                  #75

                                  If you allow my SearXNG search scraper, then an AI scraper is indistinguishable from it.

                                  If you mean "Google and DuckDuckGo are whitelisted", then Lemmy will only be searchable there, on those specific whitelisted hosts. And the Google search indexer is also an AI scraper bot.

                                  1 Reply Last reply
                                  4
                                  • D [email protected]

                                    No, it'd still be a problem; every diff between commits is expensive to render to web, even if "only one company" is scraping it, "only one time". Many of these applications are designed for humans, not scrapers.

                                    I This user is from outside of this forum
                                    [email protected]
                                    wrote last edited by
                                    #76

                                    If rendering data for scrapers was really the problem,
                                    then the solution is simple: just have downloadable dumps of the publicly available information.
                                    That would be extremely efficient and cost fractions of pennies in monthly bandwidth.
                                    Plus, the data would be far more usable for whatever they are using it for.

                                    The problem is trying to have freely available data, but for the host to maintain the ability to leverage this data later.

                                    I don't think we can have both of these.

                                    1 Reply Last reply
                                    0
                                    • R [email protected]

                                      it wasn't made for bitcoin originally? didn't know that!

                                      0 This user is from outside of this forum
                                      [email protected]
                                      wrote last edited by
                                      #77

                                      Originally called hashcash: http://hashcash.org/
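
For the curious, a minimal sketch of minting and checking a hashcash-style stamp. It follows the version 1 layout as commonly described (`ver:bits:date:resource:ext:rand:counter`, with the SHA-1 of the whole stamp needing `bits` leading zero bits). This is an illustration rather than a conforming implementation: the counter is written in hex here for simplicity, expiry and double-spend checks are left out, and `mint`, `check`, and `leading_zero_bits` are hypothetical names.

```python
import hashlib
import secrets
from datetime import date
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()


def mint(resource: str, bits: int = 12) -> str:
    """Search for a counter that gives the stamp `bits` leading zero bits.

    Assumes `resource` contains no ':' (e.g. an email address).
    """
    today = date.today().strftime("%y%m%d")
    rand = secrets.token_urlsafe(8)  # random salt so stamps are not reusable
    for counter in count():
        stamp = f"1:{bits}:{today}:{resource}::{rand}:{counter:x}"
        if leading_zero_bits(hashlib.sha1(stamp.encode()).digest()) >= bits:
            return stamp


def check(stamp: str, resource: str, min_bits: int = 12) -> bool:
    """Verify the version, the claimed resource, and the partial hash collision."""
    fields = stamp.split(":")
    if len(fields) != 7 or fields[0] != "1" or fields[3] != resource:
        return False
    if not (fields[1].isascii() and fields[1].isdigit()):
        return False
    bits = int(fields[1])
    if bits < min_bits:
        return False
    return leading_zero_bits(hashlib.sha1(stamp.encode()).digest()) >= bits
```

For example, `check(mint("someone@example.org"), "someone@example.org")` should return `True`, while the same stamp presented for a different address is rejected.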

                                      R 1 Reply Last reply
                                      10
                                      • 0 [email protected]

                                        Originally called hashcash: http://hashcash.org/

                                        R This user is from outside of this forum
                                        [email protected]
                                        wrote last edited by
                                        #78

                                        you know it's old when it doesn't have ssl

                                        1 Reply Last reply
                                        8
                                        • phase@lemmy.8th.worldP [email protected]

                                          Support, pay, and get it 🙂

                                          F This user is from outside of this forum
                                          [email protected]
                                          wrote last edited by
                                          #79

                                          Ah so it is possible to change it

                                          1 Reply Last reply
                                          1