agnos.is Forums

Cheapskate's Guide: Nuking web-scraping bots

selfhosted
32 Posts 18 Posters 108 Views
  • klu9@lemmy.caK [email protected]

    Lemmy newb here, not sure if this is right for this /c.

    An article I found from someone who hosts their own website and micro-social network, and their experience with web-scraping robots who refuse to respect robots.txt, and how they deal with them.

    J This user is from outside of this forum
    [email protected]
    wrote on last edited by
    #13

    This is signal detection theory combined with an arms race that keeps the problem hard. You cannot block scrapers without blocking people, and you cannot inconvenience bots without also inconveniencing readers. You might figure something clever out temporarily, but eventually this truism will resurface. Excuse me while I solve a few more captchas.
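The trade-off described here can be made concrete with a small sketch. It assumes a blocker that gives each visitor a score (here, hypothetical requests per minute) and blocks anyone at or above a threshold. The sample numbers are invented for illustration and are not measurements from the article or this thread.

```python
def error_rates(human_scores, bot_scores, threshold):
    """Return (share of humans blocked, share of bots let through)
    when every visitor scoring >= threshold is blocked."""
    humans_blocked = sum(s >= threshold for s in human_scores) / len(human_scores)
    bots_missed = sum(s < threshold for s in bot_scores) / len(bot_scores)
    return humans_blocked, bots_missed


# Hypothetical requests-per-minute samples; the two groups overlap on purpose.
HUMANS = [2, 3, 5, 8, 12, 20]
BOTS = [6, 15, 25, 40, 60, 90]

if __name__ == "__main__":
    for threshold in (5, 10, 30):
        blocked, missed = error_rates(HUMANS, BOTS, threshold)
        print(f"threshold={threshold:>2}  humans blocked={blocked:.0%}  bots missed={missed:.0%}")
```

Lowering the threshold catches more bots but blocks more readers, and raising it does the reverse. As long as the two groups overlap, no threshold brings both error rates to zero.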

    • sxan@midwest.socialS [email protected]

      They block VPN exit nodes. Why bother hosting a web site if you don't want anyone to read your content?

      Fuck that noise. My privacy is more important to me than your blog.

      S This user is from outside of this forum
      [email protected]
      wrote on last edited by
      #14

      They block VPN exit nodes. Why bother hosting a web site if you don't want anyone to read your content?

      Fuck that noise. My privacy is more important to me than your blog.

      It's a minimalist private blog that sets no 3rd party cookies and loads no 3rd party resources. I presume that alleviates your concerns? 😜

      • S [email protected]

        They block VPN exit nodes. Why bother hosting a web site if you don't want anyone to read your content?

        Fuck that noise. My privacy is more important to me than your blog.

        It's a minimalist private blog that sets no 3rd party cookies and loads no 3rd party resources. I presume that alleviates your concerns? 😜

        S This user is from outside of this forum
        [email protected]
        wrote on last edited by
        #15

        The admin could use a CDN and not worry about it, if it's just static content.

        • S [email protected]

          They block VPN exit nodes. Why bother hosting a web site if you don't want anyone to read your content?

          Fuck that noise. My privacy is more important to me than your blog.

          It's a minimalist private blog that sets no 3rd party cookies and loads no 3rd party resources. I presume that alleviates your concerns? 😜

          sxan@midwest.socialS This user is from outside of this forum
          [email protected]
          wrote on last edited by
          #16

          That's not what I'm complaining about. I'm unable to access the site because they're blocking anyone coming through a VPN. I would need to lower my security and turn off my VPN to read their blog. That's my issue.

          • T [email protected]

            and filtering malicious traffic is more important to me than you visiting my services, so I guess that makes us even 🙂

            sxan@midwest.socialS This user is from outside of this forum
            [email protected]
            wrote on last edited by
            #17

            You know how popular VPNs are, right? And how they improve privacy and security for people who use them? And you're blocking anyone who's exercising a basic privacy right?

            It's not an ethically sound position.

            • klu9@lemmy.caK [email protected]

              Lemmy newb here, not sure if this is right for this /c.

              An article I found from someone who hosts their own website and micro-social network, and their experience with web-scraping robots who refuse to respect robots.txt, and how they deal with them.

              S This user is from outside of this forum
              [email protected]
              wrote on last edited by
              #18

              I've found that many of these solutions/hacks block legitimate users of the Tor Browser, as well as Internet Archive scrapers, which may be a dealbreaker for some but may be acceptable for most users and website owners.

              • J [email protected]

                This is signal detection theory combined with an arms race that keeps the problem hard. You cannot block scrapers without blocking people, and you cannot inconvenience bots without also inconveniencing readers. You might figure something clever out temporarily, but eventually this truism will resurface. Excuse me while I solve a few more captchas.

                T This user is from outside of this forum
                [email protected]
                wrote on last edited by
                #19

                The internet as we know it is dead, we just need a few more years to realise it. And I'm afraid that telecommunications will be going the same way, when no-one can trust that anyone is who they say they are anymore.

                • J [email protected]

                  This is signal detection theory combined with an arms race that keeps the problem hard. You cannot block scrapers without blocking people, and you cannot inconvenience bots without also inconveniencing readers. You might figure something clever out temporarily, but eventually this truism will resurface. Excuse me while I solve a few more captchas.

                  nomugisan@lemmy.dbzer0.comN This user is from outside of this forum
                  [email protected]
                  wrote on last edited by
                  #20

                  Time to start hosting Trojans on your website

                  • sxan@midwest.socialS [email protected]

                    You know how popular VPNs are, right? And how they improve privacy and security for people who use them? And you're blocking anyone who's exercising a basic privacy right?

                    It's not an ethically sound position.

                    T This user is from outside of this forum
                    [email protected]
                    wrote on last edited by
                    #21

                    Absolutely; if I was a company, or hosting something important, or something that was intended for the general public, then I'd agree.

                    But I'm just an idiot hosting whimsical stuff from my basement, and 99% of it is only of interest for my friends. I know ~everyone in my target audience, and I know that none of them use a VPN for general-purpose browsing.

                    As it is, I don't mind keeping the door open to the general public, but nothing of value will be lost if I need to pull the plug on some more ASNs to preserve my bandwidth. For example, when a guy hopping through a VPN in Sweden decides to download the same zip file thousands of times, wasting terabytes of traffic over a few hours (this happened a week ago).
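An incident like the repeated zip download could also be spotted from the access log before reaching for ASN-wide blocks. The following is a rough sketch and not what the poster describes doing: it assumes logs in the common/combined format, uses a hypothetical threshold, and only reports offenders, leaving the blocking decision to the admin.

```python
import re
from collections import Counter

# Matches the start of a common/combined-format access log line.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) '
)


def repeated_downloads(lines, threshold):
    """Count successful requests per (client IP, path) and return the
    pairs seen at least `threshold` times. Unparseable lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match["status"].startswith("2"):
            counts[(match["ip"], match["path"])] += 1
    return {key: n for key, n in counts.items() if n >= threshold}
```

A per-IP count like this will not catch a client that rotates addresses on every request, which is one reason an admin might still fall back to blocking whole networks.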

                    • J [email protected]

                      This is signal detection theory combined with an arms race that keeps the problem hard. You cannot block scrapers without blocking people, and you cannot inconvenience bots without also inconveniencing readers. You might figure something clever out temporarily, but eventually this truism will resurface. Excuse me while I solve a few more captchas.

                      irmadlad@lemmy.worldI This user is from outside of this forum
                      [email protected]
                      wrote on last edited by
                      #22

                      Excuse me while I solve a few more captchas.

                      Buster for captcha.

                      • sxan@midwest.socialS [email protected]

                        You know how popular VPNs are, right? And how they improve privacy and security for people who use them? And you're blocking anyone who's exercising a basic privacy right?

                        It's not an ethically sound position.

                        E This user is from outside of this forum
                        [email protected]
                        wrote on last edited by
                        #23

                        You had me until the "ethically sound position" part.

                        You're saying that Joe Blogger is acting unethically because he doesn't allow VPN users to visit his site. C'mon, brother.

                        • E [email protected]

                          You had me until the "ethically sound position" part.

                          You're saying that Joe Blogger is acting unethically because he doesn't allow VPN users to visit his site. C'mon, brother.

                          sxan@midwest.socialS This user is from outside of this forum
                          [email protected]
                          wrote on last edited by
                          #24

                          You're saying targeting people who are taking steps to improve their privacy and security is ethical? Or do you just believe that there's no such thing as ethics in CIS?

                          • T [email protected]

                            Absolutely; if I was a company, or hosting something important, or something that was intended for the general public, then I'd agree.

                            But I'm just an idiot hosting whimsical stuff from my basement, and 99% of it is only of interest for my friends. I know ~everyone in my target audience, and I know that none of them use a VPN for general-purpose browsing.

                            As it is, I don't mind keeping the door open to the general public, but nothing of value will be lost if I need to pull the plug on some more ASNs to preserve my bandwidth. For example, when a guy hopping through a VPN in Sweden decides to download the same zip file thousands of times, wasting terabytes of traffic over a few hours (this happened a week ago).

                            sxan@midwest.socialS This user is from outside of this forum
                            [email protected]
                            wrote on last edited by
                            #25

                            I know that none of them use a VPN for general-purpose browsing.

                            Interesting. The most common setup I encounter is when the VPN is implemented in the home router - that's the way it is in my house. If you're connected to my WiFi, you're going through my VPN.

                            I have a second VPN, which is how my private servers are connected; that's a bespoke peer-to-peer subnet set up in each machine, but it handles almost no outbound traffic.

                            My phone detects when it isn't connected to my home WiFi and automatically turns on the VPN service for all phone data; that's probably less common. I used to just leave it on all the time, but VPN over VPN seemed a little excessive.

                            It sounds like you were a victim of a DoS attack - not distributed, though. It could have just been done directly; what about it being through a VPN made it worse?

                            • oyzmo@lemmy.worldO [email protected]

                              Thanks, great site! 😊

                              klu9@lemmy.caK This user is from outside of this forum
                              [email protected]
                              wrote on last edited by
                              #26

                              You're welcome.

                              I believe I found it originally via the "distribuverse"... specifically, ZeroNet.

                              • sxan@midwest.socialS [email protected]

                                They block VPN exit nodes. Why bother hosting a web site if you don't want anyone to read your content?

                                Fuck that noise. My privacy is more important to me than your blog.

                                klu9@lemmy.caK This user is from outside of this forum
                                [email protected]
                                wrote on last edited by
                                #27

                                A problem with this approach was that many readers use VPNs and other proxies that change IP addresses virtually every time they use them. For that reason and because I believe in protecting every Internet user's privacy as much as possible, I wanted a way of immediately unblocking visitors to my website without them having to reveal personal information like names and email addresses.

                                I recently spent a few weeks on a new idea for solving this problem. With some help from two knowledgeable users on Blue Dwarf, I came up with a workable approach two weeks ago. So far, it looks like it works well enough. To summarize this method, when a blocked visitor reaches my custom 403 error page, he is asked whether he would like to be unblocked by having his IP address added to the website's white list. If he follows that hypertext link, he is sent to the robot test page. If he answers the robot test question correctly, his IP address is automatically added to the white list. He doesn't need to enter it or even know what it is. If he fails the test, he is told to click on the back button in his browser and try again. After he has passed the robot test, Nginx is commanded to reload its configuration file (PHP command: shell_exec("sudo nginx -s reload");), which causes it to immediately accept the new whitelist entry, and he is granted immediate access. He is then allowed to visit cheapskatesguide as often as he likes for as long as he continues to use the same IP address. If he switches IP addresses in the future, he has about a one in twenty chance of needing to pass the robot test again each time he switches IP addresses. My hope is that visitors who use proxies will only have to pass the test a few times a year. As the whitelist grows, I suppose that frequency may decrease. Of course, it will reach a non-zero equilibrium point that depends on the churn in the IP addresses being used by commercial web-hosting companies. In a few years, I may have a better idea of where that equilibrium point is.
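The flow quoted above can be sketched roughly as follows. This is not the author's code (the article mentions PHP); it is a Python illustration that assumes the whitelist is a file of nginx `allow` directives pulled in with `include`, which the quoted text does not confirm. The function names and the whitelist path are hypothetical; only the `nginx -s reload` command comes from the article.

```python
import ipaddress
import subprocess
from pathlib import Path


def add_to_whitelist(ip: str, whitelist: Path) -> bool:
    """Append an nginx `allow` line for `ip`. Return True if a line was added,
    False if the address was already listed. Raises ValueError on a bad IP."""
    addr = ipaddress.ip_address(ip)  # validates before anything is written
    line = f"allow {addr};"
    existing = whitelist.read_text().splitlines() if whitelist.exists() else []
    if line in existing:
        return False
    with whitelist.open("a") as fh:
        fh.write(line + "\n")
    return True


def reload_nginx() -> None:
    """Ask nginx to re-read its configuration (the command quoted in the article)."""
    subprocess.run(["sudo", "nginx", "-s", "reload"], check=True)


def unblock_visitor(ip: str, passed_robot_test: bool, whitelist: Path,
                    reload=reload_nginx) -> bool:
    """Whitelist the visitor's IP only if the robot test was passed.
    nginx is reloaded only when the file actually changed."""
    if not passed_robot_test:
        return False
    if add_to_whitelist(ip, whitelist):
        reload()
    return True
```

The IP is taken from the connection rather than typed in by the visitor, which matches the article's point that the visitor never needs to know it. Validating the address before writing it matters here because the file ends up inside the nginx configuration.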

                                • S [email protected]

                                  The admin could use a CDN and not worry about it, if it's just static content.

                                  klu9@lemmy.caK This user is from outside of this forum
                                  [email protected]
                                  wrote on last edited by
                                  #28

                                  I believe using a CDN would defeat the author's goal of not being reliant on third-party service providers.

                                  • sxan@midwest.socialS [email protected]

                                    You're saying targeting people who are taking steps to improve their privacy and security is ethical? Or do you just believe that there's no such thing as ethics in CIS?

                                    E This user is from outside of this forum
                                    [email protected]
                                    wrote on last edited by
                                    #29

                                    You're putting words in my mouth. I didn't say that. Targeting sounds like specifically doing it with an agenda.

                                    What you're saying is the equivalent of being offended that you can't bring guns inside someone's private property. "It is not ethical that you forbid me to exercise my constitutional right to bear arms in your house. How dare you not allow me to put my AK-47 on your kitchen counter!"

                                    Nope. I said that if someone doesn't want to deal with VPN users because it's more hassle than it's worth (e.g. bots), then so be it. Joe Blogger may get 20 visitors a month instead of 24. Oh the horror!

                                    I am a huge advocate of privacy laws. But if Joe Blogger doesn't allow me in his personal website, eh. I might try archive.org.

                                    • E [email protected]

                                      You're putting words in my mouth. I didn't say that. Targeting sounds like specifically doing it with an agenda.

                                      What you're saying is the equivalent of being offended that you can't bring guns inside someone's private property. "It is not ethical that you forbid me to exercise my constitutional right to bear arms in your house. How dare you not allow me to put my AK-47 on your kitchen counter!"

                                      Nope. I said that if someone doesn't want to deal with VPN users because it's more hassle than it's worth (e.g. bots), then so be it. Joe Blogger may get 20 visitors a month instead of 24. Oh the horror!

                                      I am a huge advocate of privacy laws. But if Joe Blogger doesn't allow me in his personal website, eh. I might try archive.org.

                                      sxan@midwest.socialS This user is from outside of this forum
                                      [email protected]
                                      wrote on last edited by
                                      #30

                                      Hold on a tick.

                                      Specifically blacklisting a group of users because of the technology they use is, by definition, "targeting", right? I mean, if not, what qualifies as "targeting" for you?

                                      And, yeah. Posting a sign saying "No Nazi symbolism is allowed in this establishment" is - I would claim - targeting Nazis. Same as posting a sign, "no blacks allowed" - you're saying that's not targeting?

                                      I know we're arguing definitions and have strayed from the original topic, but I think this is an important point to clarify, since you took specific objection to my use of it in that context; and because I'm being pedantic about it.

                                      • sxan@midwest.socialS [email protected]

                                        Hold on a tick.

                                        Specifically blacklisting a group of users because of the technology they use is, by definition, "targeting", right? I mean, if not, what qualifies as "targeting" for you?

                                        And, yeah. Posting a sign saying "No Nazi symbolism is allowed in this establishment" is - I would claim - targeting Nazis. Same as posting a sign, "no blacks allowed" - you're saying that's not targeting?

                                        I know we're arguing definitions and have strayed from the original topic, but I think this is an important point to clarify, since you took specific objection to my use of it in that context; and because I'm being pedantic about it.

                                        E This user is from outside of this forum
                                        [email protected]
                                        wrote on last edited by
                                        #31

                                        Specifically blacklisting a group of users because of the technology they use is, by definition, "targeting", right? I mean, if not, what qualifies as "targeting" for you?

                                        You may be right. I guess it's a matter of semantics. But the way you described it sounded more nefarious. "I'll target this group of VPN users because fuck them, I hope they all die in a tsunami!!!!" when it's more like "ugh, another VPN bot. The 9th this hour and I'm hungry. You know what - I'll just block VPN altogether and go fix me a sandwich." Maybe that's just my perception.

                                        But anyway - it's Joe Blogger's machine, at his home, for him to do whatever he likes. Some rando from the street knocks on the door and says "excuse me, do you mind if I send an e-mail from your computer?" Joe Blogger can perfectly say no, not even an excuse is owed.

                                        You'd have a point if it was a business or a corporation. Some home machine? Out of billions? Why bother?

                                        I guess we're two pedantic folks. I enjoy these discussions. I sometimes gain some new knowledge out of them.

                                        • E [email protected]

                                          Specifically blacklisting a group of users because of the technology they use is, by definition, "targeting", right? I mean, if not, what qualifies as "targeting" for you?

                                          You may be right. I guess it's a matter of semantics. But the way you described it sounded more nefarious. "I'll target this group of VPN users because fuck them, I hope they all die in a tsunami!!!!" when it's more like "ugh, another VPN bot. The 9th this hour and I'm hungry. You know what - I'll just block VPN altogether and go fix me a sandwich." Maybe that's just my perception.

                                          But anyway - it's Joe Blogger's machine, at his home, for him to do whatever he likes. Some rando from the street knocks on the door and says "excuse me, do you mind if I send an e-mail from your computer?" Joe Blogger can perfectly say no, not even an excuse is owed.

                                          You'd have a point if it was a business or a corporation. Some home machine? Out of billions? Why bother?

                                          I guess we're two pedantic folks. I enjoy these discussions. I sometimes gain some new knowledge out of them.

                                          sxan@midwest.socialS This user is from outside of this forum
                                          [email protected]
                                          wrote on last edited by
                                          #32

                                          But the way you described it sounded more nefarious

                                          Oh. Yeah, I don't think they're being malicious; I just get frustrated with that sort of behavior. The primary DNS servers for usps.com, neakasa.com, and vitacost.com all block DNS queries from Mullvad's DNS servers, and one of them blocks all traffic from at least some of Mullvad's exit nodes. It means I have to waste time working around these blocks, because I'll be damned if I'm going to take down the house VPN just to visit their stupid sites. So, I hard-code DNS entries for them, and route traffic to the one through one of my VPSes. It's annoying, a waste of my time, and I'm just generally offended by the whiff of surveillance state about it, even when that's not the reason why they're doing it.

                                          Really, it boils down to the fact that I'm offended by the presumption that their (not OP, but VPN-hostile companies in general) anti-spam or whatever they're trying to accomplish takes priority over my right to privacy. So, yeah; I generally have a bone to pick with any site that's hostile to VPNs.

                                          Maybe that's just my perception.

                                          I have no doubt at all that you're right. And, they have no obligation to accommodate me (which I think is not true for companies I'm trying to do business with).

                                          I'm just uppity about the topic, is all.

                                          I enjoy these discussions. I sometimes gain some new knowledge out of them.

                                          I'll happily have a cordial disagreement with anyone arguing in good faith. It's echo-ey enough, and these are good conversations.
