FediDB has stopped crawling until they get robots.txt support
-
-
Did someone complain? Or why stop?
-
This looks more accurate than FediDB TBH. The initial surge from Reddit back in 2023. The slow fall of active members. I personally think the reason the number of users drops so much is that certain instances turn off the ability for outside crawlers to get their user info.
-
No idea honestly. If anyone knows, let us know!
I don't think it's necessarily a bad thing. If their crawler was being too aggressive, it could accidentally DDoS smaller servers. I'm hoping that is what they are doing: respecting the robots.txt that some sites have.
-
I think it's just one HTTP request to the nodeinfo API endpoint once a day or so. Can't really be an issue regarding load on the instances.
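For context, here is a minimal sketch of what such a once-a-day check involves, assuming the standard NodeInfo discovery flow (a `/.well-known/nodeinfo` document that links to a versioned schema URL). The sample documents and the `pick_nodeinfo_href` and `user_counts` helper names are hypothetical, no network request is made, and this is not FediDB's actual code:

```python
import json

# Hypothetical sample of a /.well-known/nodeinfo discovery response.
DISCOVERY = json.loads("""
{"links": [
  {"rel": "http://nodeinfo.diaspora.software/ns/schema/2.0",
   "href": "https://example.org/nodeinfo/2.0"},
  {"rel": "http://nodeinfo.diaspora.software/ns/schema/2.1",
   "href": "https://example.org/nodeinfo/2.1"}
]}
""")

# Hypothetical, trimmed sample of the NodeInfo document itself.
NODEINFO = json.loads("""
{"version": "2.1",
 "software": {"name": "example", "version": "0.0.0"},
 "usage": {"users": {"total": 120, "activeMonth": 35, "activeHalfyear": 60}}}
""")

SCHEMA_PREFIX = "http://nodeinfo.diaspora.software/ns/schema/"


def pick_nodeinfo_href(discovery):
    """Return the href of the highest schema version listed, or None."""
    candidates = []
    for link in discovery.get("links", []):
        rel = link.get("rel", "")
        if not rel.startswith(SCHEMA_PREFIX) or "href" not in link:
            continue
        try:
            version = tuple(int(p) for p in rel[len(SCHEMA_PREFIX):].split("."))
        except ValueError:
            continue  # skip schema versions that are not plain numbers
        candidates.append((version, link["href"]))
    return max(candidates)[1] if candidates else None


def user_counts(nodeinfo):
    """Extract the user statistics that a stats site would typically chart."""
    users = nodeinfo.get("usage", {}).get("users", {})
    return {key: users.get(key) for key in ("total", "activeMonth", "activeHalfyear")}


print(pick_nodeinfo_href(DISCOVERY))
print(user_counts(NODEINFO))
```

So the load really is small: one discovery request plus one document fetch per instance. Whether that is acceptable without checking robots.txt is a separate question.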
-
GoToSocial has a setting in development that is designed to baffle bots that don't respect robots.txt. FediDB didn't know about that feature and thought GoToSocial was trying to inflate their stats.
In the arguments that went back and forth between the devs of the apps involved, it turned out that FediDB was ignoring robots.txt, i.e., it was badly behaved.
-
-
It's not about the impact, it's about consent.
-
lol FediDB isn't a crawler, though. It makes API calls.
-
-
Because of AI bots ignoring robots.txt (especially when you don't explicitly mention their user-agent and instead use a * wildcard), more and more people are implementing exactly that, and I wouldn't be surprised if that is what triggered the need to implement robots.txt support for FediDB.
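To illustrate the wildcard point, here is a small sketch of how a well-behaved client could check a `*` rule before fetching anything, using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleStatsBot` user-agent, and the `allowed` helper are all hypothetical; this is only one way to do the check, not how FediDB does or will do it:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that only uses the wildcard user-agent.
ROBOTS_TXT = """\
User-agent: *
Disallow: /nodeinfo/
Disallow: /api/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in-memory text, no network
    return parser.can_fetch(user_agent, url)


# The bot is never named, but the * entry still applies to it.
print(allowed(ROBOTS_TXT, "ExampleStatsBot", "https://example.org/nodeinfo/2.1"))
print(allowed(ROBOTS_TXT, "ExampleStatsBot", "https://example.org/about"))
```

A client that only looks for its own name in robots.txt would miss the first rule entirely, which is the behaviour being complained about.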
-
-
Why invent implied consent when explicit consent has been the standard in robots.txt for ages now?
-
-
ANY SITE THIS SOFTWARE IS APPLIED TO WILL LIKELY DISAPPEAR FROM ALL SEARCH RESULTS.
I’m sold
-
-
-
You can consent to a federation interface without consenting to having a bot crawl all your endpoints.
Just because something is available on the internet it doesn't mean all uses are legitimate - this is effectively the same problem as AI training with stolen content.
-
Yes. I wholeheartedly agree. Not every use is legitimate. But I'd really need to know what exactly happened and the whole story to judge here. I'd say if it were a proper crawler, they'd need to read the robots.txt. That's accepted consensus. But is that what's happened here?
And I mean the whole thing with consensus and arbitrary use cases is just complicated. I have a website, and a Fediverse instance. Now you visit it. Is this legitimate? We'd need to factor in why I put it there. And what you're doing with that information. If it's my blog, it's obviously there for you to read it... Or is it!? But that's implied consent. I'd argue this is how the internet works. And most of the time it's super easy to tell what's right and what is wrong. But sometimes it isn't.