agnos.is Forums

Cheapskate's Guide: Nuking web-scraping bots

Selfhosted · 32 Posts · 18 Posters · 108 Views
[email protected]
    #1

    Lemmy newb here, not sure if this is right for this /c.

    An article I found from someone who hosts their own website and micro-social network, and their experience with web-scraping robots who refuse to respect robots.txt, and how they deal with them.
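
For anyone unfamiliar, "respecting robots.txt" just means a crawler fetches the site's robots.txt and skips anything it is told not to crawl. A minimal sketch of what a well-behaved bot is supposed to do, using Python's standard-library robotparser (the user-agent string and URLs below are placeholders, not anything from the article):

```python
# Sketch of a polite crawler checking robots.txt before fetching a page.
# "ExampleBot" and example.org are placeholders.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://example.org/robots.txt")
robots.read()  # fetch and parse the site's robots.txt

url = "https://example.org/some/page.html"
if robots.can_fetch("ExampleBot", url):
    print("allowed to crawl", url)
else:
    print("robots.txt disallows", url)  # the bots in the article simply ignore this
```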

[email protected]
      #2

Interesting approach, but it looks like this ultimately ends up:

      • being a lot of babysitting / manual work
      • blocking a lot of humans
      • not being robust against scrapers

      Anubis seems like a much better option, for those wanting to block bots without relying on Cloudflare:

      https://anubis.techaro.lol/

[email protected]
        #3

I have plenty of spare bandwidth and babysitting resources, so my approach is largely to waste their time. If they poke my honeypot, they get poked back and have to escape a tarpit specifically designed to waste their bandwidth above all. It costs me nothing because of my circumstances, but I know it costs them because their connections are metered. I also know it works, because they largely stop crawling the domains I employ this on. I am essentially making my domains appear hostile.

It does mean that my residential IP ends up on various blocklists, but I'm just at a point in my life where I don't give an unwiped asshole about it. I can't access your site? I'm not going to your site, then. Fuck you. I'm not even gonna email you about the false positive.

It is also fun to keep a log of which of the IPs that have poked the honeypot have open ports, and to automate a process of siphoning information out of those ports. I've been finding a lot of hacked NVRs recently that I think are part of some IoT botnet scraping the internet.
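
For the curious, a tarpit along these lines can be as simple as a little server that drips bytes at the client very slowly and never finishes the response. A rough sketch of the general idea (not the exact setup described above; the port, junk bytes and timings are arbitrary), which you would only ever route honeypot URLs to:

```python
# Illustrative tarpit: accept connections on a honeypot port and drip bytes at
# the client forever, one every few seconds, so a misbehaving crawler ties up
# its own connection for as long as it is willing to wait.
import random
import socket
import threading
import time

def drip(conn: socket.socket, addr) -> None:
    try:
        conn.recv(1024)  # read and discard the request
        # A plausible header with no Content-Length, so the client keeps
        # waiting for a body that never finishes.
        conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n")
        while True:
            conn.send(bytes([random.choice(b"<a hrefjunk >/")]))
            time.sleep(5)  # one junk byte every few seconds
    except OSError:
        pass  # client gave up, timed out, or reset the connection
    finally:
        conn.close()

def tarpit(host: str = "0.0.0.0", port: int = 8081) -> None:
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(64)
    while True:
        conn, addr = srv.accept()
        print("tarpitting", addr[0])
        threading.Thread(target=drip, args=(conn, addr), daemon=True).start()

if __name__ == "__main__":
    tarpit()
```

Because the honeypot URL is disallowed in robots.txt and invisible to human visitors, only bots that ignore robots.txt ever end up in it.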

[email protected]
          #4

          That last bit looks like something you should send off to a place like 404 media.

[email protected]
            #5

            Are there any guides to using it with reverse proxies like traefik? I've been wanting to try it out but haven't had time to do the research yet.

[email protected]
              #6

I found a very large botnet, mainly in Brazil but also in several other countries. And abuseipdb.com is not marking those IPs as a threat. We need a better solution.

I think a honeypot is a good way. Another way is to use proof of work on the client side. Or we need a better place to share all these stupid web-scraping bot IPs.
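
The client-side proof-of-work idea (roughly what Anubis does) is simple enough to sketch: the server hands out a random challenge, the client has to burn CPU finding a nonce whose hash has enough leading zero bits, and the server verifies the answer with a single hash. A toy illustration of the concept, not any project's actual implementation (the difficulty value is an arbitrary assumption):

```python
# Toy proof-of-work: sha256(challenge + nonce) must start with DIFFICULTY zero
# bits. Cheap for the server to verify, costly for a scraper to mass-produce.
import hashlib
import secrets

DIFFICULTY = 20  # leading zero bits required; tune the cost for your site

def leading_zero_bits(digest: bytes) -> int:
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def issue_challenge() -> str:
    return secrets.token_hex(16)  # server side: random per-visitor challenge

def solve(challenge: str) -> int:
    nonce = 0  # client side: brute-force a nonce (this is the "work")
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= DIFFICULTY:
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int) -> bool:
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= DIFFICULTY  # server side: one hash

if __name__ == "__main__":
    c = issue_challenge()
    n = solve(c)
    print("solved", c, "with nonce", n, "valid:", verify(c, n))
```

Verification is a single hash, so legitimate visitors pay the cost once, while a scraper hammering thousands of pages pays it over and over.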

[email protected]
                #7

I love the idea of abuseipdb and I even contributed to it briefly. Unfortunately, even as a contributor, I don't get enough API resources to actually use it for my own purposes without having to pay. I think the problem is simply that if you created a good enough database of abusive IPs, you'd be overwhelmed by the traffic of everyone trying to pull that data out.
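
For reference, "using it" is basically one API call per IP you want to check, which is why a per-key quota runs out so fast on a busy server. A rough sketch against AbuseIPDB's documented v2 /check endpoint, using only the standard library (the API key and the blocking threshold are placeholders):

```python
# Look up one IP's abuse confidence score via AbuseIPDB's v2 /check endpoint.
# Every lookup like this counts against the daily API quota discussed above.
import json
import urllib.parse
import urllib.request

API_KEY = "YOUR_ABUSEIPDB_KEY"  # placeholder

def abuse_score(ip: str, max_age_days: int = 90) -> int:
    query = urllib.parse.urlencode({"ipAddress": ip, "maxAgeInDays": max_age_days})
    req = urllib.request.Request(
        f"https://api.abuseipdb.com/api/v2/check?{query}",
        headers={"Key": API_KEY, "Accept": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)["data"]
    return data["abuseConfidenceScore"]

if __name__ == "__main__":
    ip = "203.0.113.7"  # placeholder TEST-NET address
    score = abuse_score(ip)
    print(ip, "abuse confidence:", score)
    if score >= 75:  # arbitrary threshold
        print("would add", ip, "to the firewall blocklist")
```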

[email protected]
                  #8

                  I wouldn't even know where to begin, but I also don't think that what I'm doing is anything special. These NVR IPs are hurling abuse at the whole internet. Anyone listening will have seen them, and anyone paying attention would've seen the pattern.

The NVRs I get the most traffic from have been a known hacked IoT device for a decade, and there's even a GitHub page explaining how to bypass their authentication and pull out arbitrary files like passwd.

[email protected]
                    #9

                    They block VPN exit nodes. Why bother hosting a web site if you don't want anyone to read your content?

                    Fuck that noise. My privacy is more important to me than your blog.

[email protected]
                      #10

                      https://github.com/TecharoHQ/anubis/issues/92

[email protected]
                        #11

Thanks, great site! 😊

[email protected]
                          #12

and filtering malicious traffic is more important to me than you visiting my services, so I guess that makes us even 🙂

[email protected]
                            #13

                            This is signal detection theory combined with an arms race that keeps the problem hard. You cannot block scrapers without blocking people, and you cannot inconvenience bots without also inconveniencing readers. You might figure something clever out temporarily, but eventually this truism will resurface. Excuse me while I solve a few more captchas.

[email protected]
                              #14

                              They block VPN exit nodes. Why bother hosting a web site if you don’t want anyone to read your content?

                              Fuck that noise. My privacy is more important to me than your blog.

It's a minimalist private blog that sets no 3rd party cookies and loads no 3rd party resources. I presume that alleviates your concerns? 😜

[email protected]
                                #15

                                The admin could use a CDN and not worry about it, if it's just static content.

[email protected]
                                  #16

                                  That's not what I'm complaining about. I'm unable to access the site because they're blocking anyone coming through a VPN. I would need to lower my security and turn off my VPN to read their blog. That's my issue.

[email protected]
                                    #17

You know how popular VPNs are, right? And how they improve privacy and security for the people who use them? And you're blocking anyone who's exercising a basic privacy right?

                                    It's not an ethically sound position.

[email protected]
                                      #18

I've found that many of these solutions/hacks block legitimate users who are using the Tor Browser, as well as Internet Archive scrapers, which may be a dealbreaker for some but acceptable for most users and website owners.

[email protected]
                                        #19

The internet as we know it is dead; we just need a few more years to realise it. And I'm afraid telecommunications will go the same way, when no one can trust that anyone is who they say they are anymore.

[email protected]
                                          #20

                                          Time to start hosting Trojans on your website
