AI crawlers cause Wikimedia Commons bandwidth demands to surge 50%.

? Offline

I agree with that assessment!

? Offline

To just have the most recent data within reasonable time frame is one thing. AI companies are like "I must have every single article within 5 minutes they get updated, or I'll throw my pacifier out of the pram". No regard for the considerations of the source sites.

[email protected]

Sure but we're in the comments section of an article about wikipedia being crawled, which is silly because they could just download a snapshot of wikipedia

[email protected]

Apparently the dump doesn't include media, though there's ongoing discussion within wikimedia about changing that. It also seems likely to me that AI scrapers don't care about externalizing costs onto others if it might mean a competitive advantage.

[email protected]

Understood and I do. I try to tweak it a little to my own style. But it helps write the hundreds of cover letters I’m submitting a day. Looking for work. This usually took me hours for just one submission. Now I can fly through.

[email protected]

There's a chance this isn't being done by someone who only wants Wikipedia's data. As the amount of websites you scrape increases, your desire to use the easy tools loses out to creating the most general tool that can look at most webpages.

[email protected]

There's also Anubis, but it uses proof of work not a maze.

https://anubis.techaro.lol/

[email protected]

Scraper bots don't read instructions, they just follow links

[email protected]

That's right. It's not humans making careful decisions about what to download. It's a program that follows links and saves files.

[email protected]

Charcoal is more dusty and more conductive than pencil "lead", which is pretty much processed charcoal and glue.

agnos.is Forums

AI crawlers cause Wikimedia Commons bandwidth demands to surge 50%.