agnos.is Forums

lads

Programmer Humor
  • H [email protected]

    Websites were under a constant noise of malicious requests even before AI, but now AI scraping of Lemmy instances usually triples traffic. While some sites can cope with this, this means a three-fold increase in hosting costs in order to essentially fuel investment portfolios.

AI scrapers will already use as much energy as available, so making them use more per site means fewer sites being scraped, not more total energy used.

    And this is not DDoS, the objective of scrapers is to get the data, not bring the site down, so while the server must reply to all requests, the clients can't get the data out without doing more work than the server.

    [email protected]
    wrote last edited by
    #18

    AI does not triple traffic. It's a completely irrational statement to make.

There's a very limited number of companies training big LLM models, and these companies do train a model a few times per year. I would bet that the number of requests per year for a resource by an AI scrapper is in the dozens at most.

Using as much energy as available per scrapping doesn't even make physical sense. What does that sentence even mean?

    • R [email protected]

It assumes that websites are under constant ddos

      It is literally happening. https://www.youtube.com/watch?v=cQk2mPcAAWo https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/

      It assumes that anubis is effective against ddos attacks

      It's being used by some little-known entities like the LKML, FreeBSD, SourceHut, UNESCO, and the fucking UN, so I'm assuming it probably works well enough. https://policytoolbox.iiep.unesco.org/ https://xeiaso.net/notes/2025/anubis-works/

      anti-AI wave

      Oh, you're one of those people. Enough said. (edit) By the way, Anubis' author seems to be a big fan of machine learning and AI.

      (edit 2 just because I'm extra cross that you don't seem to understand this part)

      Do you know what a web crawler does when a process finishes grabbing the response from the web server? Do you think it takes a little break to conserve energy and let all the other remaining processes do their thing? No, it spawns another bloody process to scrape the next hyperlink.
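For illustration, a minimal sketch of that crawl loop (hypothetical `fetch` and `extract_links` callables; real crawlers run many of these concurrently and are far more aggressive):

```python
# Minimal sketch of the behaviour described above: a naive crawler never idles.
# As soon as one page is fetched, every link found goes straight into the queue.
# `fetch` and `extract_links` are hypothetical callables supplied by the caller.
from collections import deque
from urllib.parse import urljoin

def crawl(start_url, fetch, extract_links, limit=10_000):
    seen = {start_url}
    frontier = deque([start_url])
    while frontier and len(seen) < limit:
        url = frontier.popleft()
        html = fetch(url)                      # no pause, no politeness delay
        for link in extract_links(html):
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)      # immediately queued for scraping
    return seen
```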

      [email protected]
      wrote last edited by [email protected]
      #19

Some websites being under DDoS attack =/= all sites being under constant DDoS attack, nor does it mean they cannot exist without it.

First, there's a logical fallacy in there. Being used does not mean it's useful. Many companies use AI for some task; does that make AI useful? No.

The logic still stands: all Anubis can do against DDoS is raise the barrier a little before the site goes down. That's called mitigation, not protection. If you are targeted by a DDoS, that mitigation is not going to do much, and your site is going down regardless.

      • K [email protected]

I don't think even a Raspberry Pi 2 would go down over a web scrap

That absolutely depends on what software the server is running and whether there's proper caching involved. If running some PoW is involved to scrape one page, it shouldn't be too much of an issue, as opposed to just blindly following and ingesting every link.

Additionally, you can choose to allow "good bots" like the Internet Archive, and they're currently working on a list of "good bots" (see the sketch below):

        https://github.com/TecharoHQ/anubis/blob/main/docs/docs/admin/policies.mdx

AI companies ingesting data nonstop to train their models doesn't make for an open and free internet, and will likely lead to the opposite, where users no longer even browse the web but trust in AI responses that may be hallucinated.
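The "good bots" idea above boils down to an allowlist check before the challenge is served. A rough sketch of that idea only; this is not Anubis's actual policy format (that lives in the linked policies.mdx), and the user-agent strings are assumptions for illustration:

```python
# Hypothetical allowlist check, sketched for illustration only --
# not Anubis's actual policy format.
GOOD_BOT_USER_AGENTS = {
    "archive.org_bot",   # Internet Archive (assumed identifier)
    "ia_archiver",
}

def decide(user_agent: str, solved_challenge: bool) -> str:
    """Return what to do with a request: serve it, or demand proof of work."""
    if any(token in user_agent for token in GOOD_BOT_USER_AGENTS):
        return "allow"                     # known-good crawlers skip the challenge
    return "allow" if solved_challenge else "challenge"

print(decide("Mozilla/5.0 (compatible; archive.org_bot)", False))  # allow
print(decide("python-requests/2.32", False))                       # challenge
```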

        [email protected]
        wrote last edited by [email protected]
        #20

There are a small number of AI companies training full LLM models, and they usually do a few training runs per year. What most people see as "AI bots" is not actually that.

The influence of AI over the net is another topic. But Anubis is also not doing anything about that, as it just makes the AI bots waste more energy getting the data, or at most keeps data under "Anubis protection" out of the training dataset. The AI will still be there.

Am I on the list of "good bots"? Sometimes I scrap websites for price tracking or change tracking. If I see a website running malware on my end, I would most likely just block that site: one legitimate user less.

        • D [email protected]

There are a small number of AI companies training full LLM models, and they usually do a few training runs per year. What most people see as "AI bots" is not actually that.

The influence of AI over the net is another topic. But Anubis is also not doing anything about that, as it just makes the AI bots waste more energy getting the data, or at most keeps data under "Anubis protection" out of the training dataset. The AI will still be there.

Am I on the list of "good bots"? Sometimes I scrap websites for price tracking or change tracking. If I see a website running malware on my end, I would most likely just block that site: one legitimate user less.

          [email protected]
          wrote last edited by
          #21

I (and A LOT of lemmings) have already had enough of AI. We DON'T need AI-everything. So we block it or make it harder for AI to be trained. We didn't say "hey, please train your LLM on our data" anyways.

          • S [email protected]

I (and A LOT of lemmings) have already had enough of AI. We DON'T need AI-everything. So we block it or make it harder for AI to be trained. We didn't say "hey, please train your LLM on our data" anyways.

            [email protected]
            wrote last edited by [email protected]
            #22

            That's legitimate.

            But it's not "open", nor "free".

Also, it's a bit of a placebo. For instance, Lemmy is not an Anubis use case, as Lemmy can be legitimately scrapped by any agent through the federation system. And I don't really know how Anubis would even work with the openness of the Lemmy API.

            • D [email protected]

AI does not triple traffic. It's a completely irrational statement to make.

There's a very limited number of companies training big LLM models, and these companies do train a model a few times per year. I would bet that the number of requests per year for a resource by an AI scrapper is in the dozens at most.

Using as much energy as available per scrapping doesn't even make physical sense. What does that sentence even mean?

              [email protected]
              wrote last edited by
              #23

              AI does not triple traffic. It’s a completely irrational statement to make.

              Multiple testimonials from people who host sites say they do. Multiple Lemmy instances also supported this claim.

I would bet that the number of requests per year for a resource by an AI scrapper is in the dozens at most.

              You obviously don't know much about hosting a public server. Try dozens per second.

              There is a booming startup industry all over the world training AI, and scraping data to sell to companies training AI. It's not just Microsoft, Facebook and Twitter doing it, but also Chinese companies trying to compete. Also companies not developing public models, but models for internal use. They all use public cloud IPs, so the traffic is coming from all over incessantly.

Using as much energy as available per scrapping doesn't even make physical sense. What does that sentence even mean?

              It means that Microsoft buys a server for scraping, they are going to be running it 24/7, with the CPU/network maxed out, maximum power use, to get as much data as they can. If the server can scrape 100 sites per minute, it will scrape 100 sites. If it can scrape 1000, it will scrape 1000, and if it can do 10, it will do 10.

              It will not stop scraping ever, as it is the equivalent of shutting down a production line. Everyone always uses their scrapers as much as they can. Ironically, increasing the cost of scraping would result in less energy consumed in total, since it would force companies to work more "smart" and less "hard" at scraping and training AI.
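A back-of-envelope illustration of that point, with made-up numbers (the per-page costs are assumptions, not measurements): at a fixed compute budget, a higher cost per page means fewer pages covered, not more total energy used.

```python
# Fixed compute budget: the scraper runs flat out either way, so raising the
# per-page cost shrinks coverage instead of raising total energy use.
# All numbers below are assumptions for illustration.
BUDGET_CPU_SECONDS_PER_DAY = 86_400     # one core, 24/7
COST_PLAIN_FETCH = 0.05                 # CPU-seconds per page, no challenge
COST_WITH_POW = 0.05 + 2.0              # plus ~2 s of proof-of-work per page

pages_plain = BUDGET_CPU_SECONDS_PER_DAY / COST_PLAIN_FETCH
pages_pow = BUDGET_CPU_SECONDS_PER_DAY / COST_WITH_POW

print(f"pages/day without PoW: {pages_plain:,.0f}")   # ~1,728,000
print(f"pages/day with PoW:    {pages_pow:,.0f}")     # ~42,000
```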

              Oh, and it's S-C-R-A-P-I-N-G, not scrapping. It comes from the word "scrape", meaning to remove the surface from an object using a sharp instrument, not "scrap", which means to take something apart for its components.

              • D [email protected]

Some websites being under DDoS attack =/= all sites being under constant DDoS attack, nor does it mean they cannot exist without it.

First, there's a logical fallacy in there. Being used does not mean it's useful. Many companies use AI for some task; does that make AI useful? No.

The logic still stands: all Anubis can do against DDoS is raise the barrier a little before the site goes down. That's called mitigation, not protection. If you are targeted by a DDoS, that mitigation is not going to do much, and your site is going down regardless.

                [email protected]
                wrote last edited by [email protected]
                #24

                If a request is taking a full minute of user CPU time, it's one hell of a mitigation, and anybody who's not a major corporation or government isn't going to shrug it off.

                • D [email protected]

AI does not triple traffic. It's a completely irrational statement to make.

There's a very limited number of companies training big LLM models, and these companies do train a model a few times per year. I would bet that the number of requests per year for a resource by an AI scrapper is in the dozens at most.

Using as much energy as available per scrapping doesn't even make physical sense. What does that sentence even mean?

                  [email protected]
                  wrote last edited by
                  #25

                  You're right. AI didn't just triple the traffic to my tiny archive's site. It way more than tripled it. After implementing Anubis, we went from 3000 'unique' visitors down to 20 in a half-day. Twenty is a much more expected number for a small college archive in the summer. That's before I did any fine-tuning to Anubis, just the default settings.

                  I was getting constant outage reports. Now I'm not.

                  For us, it's not about protecting our IP. We want folks to get to find out information. That's why we write finding aids, scan it, accession it. But, allowing bots to siphon it all up inefficiently was denying everyone access to it.

                  And if you think bots aren't inefficient, explain why Facebook requests my robots.txt 10 times a second.

                  • H [email protected]

                    AI does not triple traffic. It’s a completely irrational statement to make.

                    Multiple testimonials from people who host sites say they do. Multiple Lemmy instances also supported this claim.

I would bet that the number of requests per year for a resource by an AI scrapper is in the dozens at most.

                    You obviously don't know much about hosting a public server. Try dozens per second.

                    There is a booming startup industry all over the world training AI, and scraping data to sell to companies training AI. It's not just Microsoft, Facebook and Twitter doing it, but also Chinese companies trying to compete. Also companies not developing public models, but models for internal use. They all use public cloud IPs, so the traffic is coming from all over incessantly.

Using as much energy as available per scrapping doesn't even make physical sense. What does that sentence even mean?

                    It means that Microsoft buys a server for scraping, they are going to be running it 24/7, with the CPU/network maxed out, maximum power use, to get as much data as they can. If the server can scrape 100 sites per minute, it will scrape 100 sites. If it can scrape 1000, it will scrape 1000, and if it can do 10, it will do 10.

                    It will not stop scraping ever, as it is the equivalent of shutting down a production line. Everyone always uses their scrapers as much as they can. Ironically, increasing the cost of scraping would result in less energy consumed in total, since it would force companies to work more "smart" and less "hard" at scraping and training AI.

                    Oh, and it's S-C-R-A-P-I-N-G, not scrapping. It comes from the word "scrape", meaning to remove the surface from an object using a sharp instrument, not "scrap", which means to take something apart for its components.

                    [email protected]
                    wrote last edited by
                    #26

I'm not a native English speaker, so I apologize if there's bad English in my response, and I'd thank you for any corrections.

That being said, I do host public services, from before and after AI was a thing. And I have asked many of these people who claim "we are under AI bot attacks" how they are able to differentiate when a request is from an AI scrapper or just any other scrapper, and there was no satisfying answer.

                    • grysbok@lemmy.sdf.orgG [email protected]

                      You're right. AI didn't just triple the traffic to my tiny archive's site. It way more than tripled it. After implementing Anubis, we went from 3000 'unique' visitors down to 20 in a half-day. Twenty is a much more expected number for a small college archive in the summer. That's before I did any fine-tuning to Anubis, just the default settings.

                      I was getting constant outage reports. Now I'm not.

                      For us, it's not about protecting our IP. We want folks to get to find out information. That's why we write finding aids, scan it, accession it. But, allowing bots to siphon it all up inefficiently was denying everyone access to it.

                      And if you think bots aren't inefficient, explain why Facebook requests my robots.txt 10 times a second.

                      [email protected]
                      wrote last edited by [email protected]
                      #27

How do you know those reduced requests were from AI companies and not for some other purpose?

                      • R [email protected]

                        Anubis is a simple anti-scraper defense that weighs a web client's soul by giving it a tiny proof-of-work workload (some calculation that doesn't have an efficient solution, like cryptography) before letting it pass through to the actual website. The workload is insignificant for human users, but very taxing for high-volume scrapers. The calculations are done on the client's side using Javascript code.

                        (edit) For clarification: this works because the computation workload takes a relatively long time, not because it bogs down the CPU. Halting each request at the gate for only a few seconds adds up very quickly.

                        Recently, the FSF published an article that likened Anubis to malware because it's basically arbitrary code that the user has no choice but to execute:

                        [...] The problem is that Anubis makes the website send out a free JavaScript program that acts like malware. A website using Anubis will respond to a request for a webpage with a free JavaScript program and not the page that was requested. If you run the JavaScript program sent through Anubis, it will do some useless computations on random numbers and keep one CPU entirely busy. It could take less than a second or over a minute. When it is done, it sends the computation results back to the website. The website will verify that the useless computation was done by looking at the results and only then give access to the originally requested page.

                        Here's the article, and here's aussie linux man talking about it.
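Mechanically, the gate described above amounts to something like the following. A simplified sketch, not Anubis's actual implementation; the difficulty value and the solve/verify shape are assumptions:

```python
# Simplified proof-of-work gate in the spirit of the description above.
# Not Anubis's actual code; DIFFICULTY_BITS and the flow are assumed.
import hashlib
import os

DIFFICULTY_BITS = 16   # higher = more client work per request

def issue_challenge() -> bytes:
    return os.urandom(16)                          # sent to the client with the JS

def client_solve(challenge: bytes) -> int:
    """What the browser-side JavaScript does: brute-force a nonce."""
    nonce = 0
    while True:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0:
            return nonce                           # ~2**DIFFICULTY_BITS hashes on average
        nonce += 1

def server_verify(challenge: bytes, nonce: int) -> bool:
    """One cheap hash for the server, then the real page is served."""
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0

c = issue_challenge()
print(server_verify(c, client_solve(c)))           # True; the client paid the cost
```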

                        [email protected]
                        wrote last edited by
                        #28

                        But they can still scrape it, it just costs them computation?

                        • C [email protected]

                          If a request is taking a full minute of user CPU time, it's one hell of a mitigation, and anybody who's not a major corporation or government isn't going to shrug it off.

                          [email protected]
                          wrote last edited by [email protected]
                          #29

That's precisely my point. It fits a very small risk profile: people who are going to be DDoSed, but not by a big agent.

It's not the most common risk profile. Usually DDoS attacks are very heavy or don't happen at all. These "half gas" DDoS attacks are not really common.

I think that's why, when I read about Anubis, it's never in the context of DDoS protection. It's always in the context of "let's fuck AI", like this precise line of comments.

                          • D [email protected]

How do you know those reduced requests were from AI companies and not for some other purpose?

                            [email protected]
                            wrote last edited by
                            #30

                            Does it matter what the purpose was? It was still causing them issues hosting their site.

                            • D [email protected]

That's precisely my point. It fits a very small risk profile: people who are going to be DDoSed, but not by a big agent.

It's not the most common risk profile. Usually DDoS attacks are very heavy or don't happen at all. These "half gas" DDoS attacks are not really common.

I think that's why, when I read about Anubis, it's never in the context of DDoS protection. It's always in the context of "let's fuck AI", like this precise line of comments.

                              [email protected]
                              wrote last edited by [email protected]
                              #31

                              There's heavy, and then there's heavy. I don't have any experience dealing with threats like this myself, so I can't comment on what's most common, but we're talking about potentially millions of times more resources for the attacker than the defender here.

                              There is a lot of AI hype and AI anti-hype right now, that's true.

                              • D [email protected]

How do you know those reduced requests were from AI companies and not for some other purpose?

                                [email protected]
                                wrote last edited by [email protected]
                                #32

Timing and request patterns. The increase in traffic coincided with the increase in AI in the marketplace. Before, we'd get hit by bots in waves and we'd just suck it up for a day. Now it's constant. The request patterns are deep, deep Solr requests, with far more filters than any human would ever use. These are expensive requests, and the results aren't any more informative than just scooping up the nicely formatted EAD/XML finding aids we provide.

                                And, TBH, I don't care if it's AI. I care that it's rude. If the bots respected robots.txt then I'd be fine with them. They don't and they break stuff for actual researchers.
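For what it's worth, that kind of request pattern is easy to surface from access logs. A hypothetical sketch (the log format and threshold are assumptions): flag clients whose queries carry far more filter parameters than a person would ever click together.

```python
# Hypothetical sketch: flag clients whose average number of query parameters
# is far beyond what a human search form would produce. Format and threshold
# are assumptions for illustration.
from collections import defaultdict
from urllib.parse import urlparse, parse_qsl

MAX_HUMAN_FILTERS = 6   # assumed threshold

def flag_suspicious(requests):
    """requests: iterable of (client_ip, request_url) pairs."""
    per_client = defaultdict(list)
    for ip, url in requests:
        per_client[ip].append(len(parse_qsl(urlparse(url).query)))
    return {ip: sum(n) / len(n) for ip, n in per_client.items()
            if sum(n) / len(n) > MAX_HUMAN_FILTERS}

sample = [
    ("10.0.0.1", "/search?q=letters"),                                        # human-ish
    ("10.0.0.2", "/select?q=*&fq=a&fq=b&fq=c&fq=d&fq=e&fq=f&fq=g&rows=500"),  # bot-ish
]
print(flag_suspicious(sample))   # only 10.0.0.2 gets flagged
```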

                                • C [email protected]

                                  There's heavy, and then there's heavy. I don't have any experience dealing with threats like this myself, so I can't comment on what's most common, but we're talking about potentially millions of times more resources for the attacker than the defender here.

                                  There is a lot of AI hype and AI anti-hype right now, that's true.

                                  [email protected]
                                  wrote last edited by
                                  #33

I don't think it's millions. Take into account that a DDoS attacker is not going to execute JavaScript code, at least not a competent one, so they are not going to run the PoW.

In fact, the unsolicited and unannounced PoW does not provide more protection against DDoS than a captcha.

The mitigation comes from the server's responses being smaller and cheaper, so the number of requests needed to saturate the service must increase. How much? That depends on how demanding the "real" website is in comparison.
I doubt the answer is millions. And they would achieve the exact same result with a captcha, without running literal malware on the clients.
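Rough numbers for that mitigation factor (all values assumed for illustration): the attacker has to multiply their request rate by roughly the ratio between the cost of a real page and the cost of a challenge response.

```python
# Rough arithmetic for the mitigation factor, with assumed costs.
REAL_PAGE_COST_MS = 50                # server CPU per real page render (assumed)
CHALLENGE_COST_MS = 1                 # server CPU per tiny challenge response (assumed)
SERVER_BUDGET_MS_PER_SEC = 8 * 1000   # e.g. 8 cores' worth of CPU time per second

rps_to_saturate_plain = SERVER_BUDGET_MS_PER_SEC / REAL_PAGE_COST_MS
rps_to_saturate_gated = SERVER_BUDGET_MS_PER_SEC / CHALLENGE_COST_MS

print(rps_to_saturate_plain)                          # 160 req/s brings the site down
print(rps_to_saturate_gated)                          # 8000 req/s needed once gated
print(rps_to_saturate_gated / rps_to_saturate_plain)  # a factor of ~50, not millions
```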

                                  • I [email protected]

                                    But they can still scrape it, it just costs them computation?

                                    [email protected]
                                    wrote last edited by [email protected]
                                    #34

                                    Correct. Anubis' goal is to decrease the web traffic that hits the server, not to prevent scraping altogether. I should also clarify that this works because it costs the scrapers time with each request, not because it bogs down the CPU.

                                    • R [email protected]

                                      Anubis is a simple anti-scraper defense that weighs a web client's soul by giving it a tiny proof-of-work workload (some calculation that doesn't have an efficient solution, like cryptography) before letting it pass through to the actual website. The workload is insignificant for human users, but very taxing for high-volume scrapers. The calculations are done on the client's side using Javascript code.

                                      (edit) For clarification: this works because the computation workload takes a relatively long time, not because it bogs down the CPU. Halting each request at the gate for only a few seconds adds up very quickly.

                                      Recently, the FSF published an article that likened Anubis to malware because it's basically arbitrary code that the user has no choice but to execute:

                                      [...] The problem is that Anubis makes the website send out a free JavaScript program that acts like malware. A website using Anubis will respond to a request for a webpage with a free JavaScript program and not the page that was requested. If you run the JavaScript program sent through Anubis, it will do some useless computations on random numbers and keep one CPU entirely busy. It could take less than a second or over a minute. When it is done, it sends the computation results back to the website. The website will verify that the useless computation was done by looking at the results and only then give access to the originally requested page.

                                      Here's the article, and here's aussie linux man talking about it.

                                      [email protected]
                                      wrote last edited by
                                      #35

                                      Well, that's a typically abstract, to-the-letter take on the definition of software freedom from them. I think the practical necessity of doing something like this, especially for services like Invidious that are at risk, and the fact it's a harmless nonsense calculation really deserves an exception.

                                      • D [email protected]

I'm not a native English speaker, so I apologize if there's bad English in my response, and I'd thank you for any corrections.

That being said, I do host public services, from before and after AI was a thing. And I have asked many of these people who claim "we are under AI bot attacks" how they are able to differentiate when a request is from an AI scrapper or just any other scrapper, and there was no satisfying answer.

                                        [email protected]
                                        wrote last edited by
                                        #36

                                        Yeah but it doesn't matter what the objective of the scraper is, the only thing that matters is that it's an automated client that is going to send mass requests to you. If it wasn't, Anubis would not be a problem for it.

                                        The effect is the same, increased hosting costs and less access for legitimate clients. And sites want to defend against it.

                                        That said, it is not mandatory, you can avoid using Anubis as a host. Nobody is forcing you to use it. And as someone who regularly gets locked out of services because I use a VPN, Anubis is one of the least intrusive protection methods out there.

                                        • T [email protected]

                                          Does it matter what the purpose was? It was still causing them issues hosting their site.

                                          [email protected]
                                          wrote last edited by
                                          #37

Not really. I only ask because people always say it's for LLM training, which seems a little illogical to me, knowing the small number of companies that have access to the compute power to actually do a training run with that data. And big companies are not going to scrape the same resource hundreds of times for a piece of information they already have.

But I think people should be more critical in trying to understand who is making the request and for what purpose. Then people could make a better-informed decision about whether they need that system (which is very intrusive for the clients) or not.
