FediDB has stoped crawling until they get robots.txt support
-
From your own wiki link
robots.txt is the filename used for implementing the Robots Exclusion Protocol, a standard used by websites to indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit.
How is FediDB not an "other web robot"?
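And honoring it isn't much work on the crawler's side either. A rough sketch in Python, purely for illustration (the hostname, endpoint, and user agent string are invented, and I have no idea what FediDB actually runs internally):

    # Rough sketch only: hostname, endpoint and user agent are made up.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.social/robots.txt")
    rp.read()  # fetch and parse the file once

    if rp.can_fetch("FediDB", "https://example.social/api/v1/instance"):
        print("robots.txt allows it, go ahead")
    else:
        print("disallowed, skip this instance")

That's the whole protocol from the crawler's side: one extra GET and a lookup before you touch anything else.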
-
stoped
Well, they needed to stope. Stope, I said. Lest thy carriage spede into the crosseth-rhodes.
-
Ooh, nice.
-
We can't afford to wait at every sop, yeld, or one vay sign!
-
Ok if you want to focus on that single phrase and ignore the whole rest of the page which documents decades of stuff to do with search engines and not a single mention of api endpoints, that's fine. You can have the win on this, here's a gold star.
-
Whan that Aprill with his shoures soote
-
if you run a federated service... Is that enough to assume you consent?
If she says yes to the marriage, that doesn't mean she permanently says yes to sex. I can run a fully air gapped "federated" instance if I want to.
-
Hmmh, I don't think we'll come to an agreement here. I think marriage is a good example, since it comes with lots of implicit consent. First of all, you expect to move in together after you get engaged. You do small things, like expect to eat dinner together. It's not a question anymore whether everyone cooks their own meal each day. And it extends to big things. Most people expect one party to care for the other once they're old. And stuff like that. And yeah, intimacy isn't granted. There is a protocol to it. But I'm way more comfortable making moves on my partner than, for example, placing my hands on a stranger on the bus and seeing if they take my invitation...
Isn't that how it works? I mean, going with your analogy... Sure, you can marry someone and never touch each other or move in together. But that's kind of a weird one, in my opinion. Of course you should be able to do that. But it might require some more explicit agreement than going the default route. And I think that's what happened here. Assumptions have been made, those turned out to be wrong, and now people need to find a way to deal with it so everyone's needs are met...
-
Going by your example
Air gapping my service is the agreement you're talking about in this analogy, but otherwise I do actually agree with you. There is a lot of implied consent, but I think we have a near-miss misunderstanding on one part.
In this scenario (analogies are nice but let's get to reality) crawling the website to check the MAU, as harmless as it is, is still adding load to the server. A tiny amount, sure, but if you're going to increase my workload by even 1% I wanna know beforehand. Thus, I put things on my website that say "don't increase my workload" like robots.txt and whatnot.
Other people aren't this concerned with their workload, in which case it might be fine to go with implied consent. However, it's always best to follow best practices and just check with the owner of a server that it's okay to do anything to it, IMO.
-
Okay,
So why reinvent a standard when one already exists that serves functionally the same purpose, instead of going with one of implied consent?
-
How is it air gapped and federated? Do you unairgap it periodically for a refresh then reairgap it? I’ve not heard of airgapped federated servers before and am intrigued. Is it purely for security purposes or also bandwidth savings? Are there other reasons one may want to run an air gapped instance?
-
I don't think that'll work. Asking for consent and retrieving the robots.txt is yet another request with a similar workload. So by that logic, we can't do anything on the internet, since asking for consent is work and that requires consent, which requires consent... And if you're concerned with efficiency alone, cut the additional asking and complexity by just doing the single request straight away.
Plus, it's not even that complex. Sending a few bytes of JSON with daily precalculated numbers is a fraction of what a single user interaction does. It's maybe zero point something of a request, or with a lot more zeros in between if we look at what a server does each day. I mean, every single refresh of the website or me opening the app loads several files and API endpoints, regularly loads hundreds of kilobytes of JavaScript, images etc. There are lots of calculations and database requests involved to display several posts along with votes etc. I'd say a single pageview of mine counts like FediDB collecting stats each day for like 1000 years.
I invented these numbers. They're wrong. But I think you get what I'm trying to say... For all practical purposes, these requests are free and have zero cost. Plus, if it's about efficiency, it's always a good idea not to ask but to just do it and deal with it while answering. So it really can't be computational cost. It has to be consent.
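To put a shape on it, here's roughly what a single stats pull could look like, assuming the crawler goes through the standard NodeInfo well-known endpoint (I don't know FediDB's exact implementation, so treat the hostname and field access as illustrative):

    import json
    import urllib.request

    def fetch_monthly_active_users(host: str) -> int:
        # The well-known document just points at the real NodeInfo location.
        with urllib.request.urlopen(f"https://{host}/.well-known/nodeinfo") as resp:
            nodeinfo_url = json.load(resp)["links"][0]["href"]
        # One small JSON document with precalculated usage numbers.
        with urllib.request.urlopen(nodeinfo_url) as resp:
            return json.load(resp)["usage"]["users"]["activeMonth"]

    print(fetch_monthly_active_users("example.social"))

Two tiny GETs and a few hundred bytes of JSON, versus the hundreds of kilobytes a single page view pulls in.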
-
In this scenario, I have multiple servers which are networked together and federated via ActivityPub but the server cluster itself is air gapped.
As to your questions about feasibility and purposes, I will admit I both didn't think about that, and should have been more clear that this air gapped federated instance was theoretical lol
-
You're definitely right that I went a bit extreme with what I used as a reason against it, but I feel like the point still stands about "just ask before you slam people's servers with yet another bot on the pile of millions of bots hitting their F2B system"
-
It was a good read. Personally speaking, I think it probably would have been better to just block GoToSocial until proper robots.txt support was provided; I found it weird that they paused the entire system.
-
It is not possible to detect bots. Attempting to do so will invariably lead to false positives, denying access to your content to what is usually the most at-risk & marginalized folks.
Just implement a cache and forget about it. If read only content is causing you too much load, you're doing something terribly wrong.
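For the "just implement a cache" part, it really can be this small. A bare-bones time-based cache sketch (names and the 300-second TTL are arbitrary, not taken from any particular server software):

    import time

    _cache: dict[str, tuple[float, object]] = {}

    def cached(key: str, ttl: float, compute):
        now = time.monotonic()
        hit = _cache.get(key)
        if hit and now - hit[0] < ttl:
            return hit[1]          # still fresh, serve it as-is
        value = compute()          # the expensive part runs at most once per TTL
        _cache[key] = (now, value)
        return value

    # Every bot and every page view gets the same blob for five minutes.
    stats = cached("instance-stats", 300, lambda: {"activeMonth": 1234})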
-
Thank you for providing the link.
-
Then I’m not sure what point you were trying to make in the above conversation lol.
-
The point was "don't add another bot into the pile of millions of bots that hit people's servers every day unless you're gonna be polite about it"
-
While I agree with you, the quantity of robots has greatly increased of late. While still not as numerous as users, they are hitting every link and wrecking your caches by not focusing on hotspots like humans do.