AI crawlers cause Wikimedia Commons bandwidth demands to surge 50%.

[email protected]

AI has niches but they're exactly that: Niches. Small duct tape tasks for fudging over "hard problems" where manual code would result in a worse outcome and take far more time. Little esoteric problem spaces, which notably don't actually require you to use several states worth of electrical power training on a 50PB dataset of anime titties.

An example: I have a name generator in my game that strings together several consonant+vowel phoneme pairs into a name. This means that the names are always pronounceable, but often the spelling looks really unintuitive. Eg Joosiffe, which the player would likely pronounce as Joseph. However, the leap we do in our head between those two spellings is a process of declassifying phonemes and then re-classifying phonemes, and is actually a "hard problem" from a coding perspective due to the unintituive, multifarious complexities of written, spoken, and conceptualized human language. Adding this step to my name generator in code would be a project of it's own, larger than the game itself, and wouldn't ever work nearly as well as it needed to. But relatively small (30MB) AI models that do this with something like 99.8% satisfaction already exist. They didn't require a data center's worth of resources to train, and since they're academic projects they have licenses that allow them to be used for free in a game.

[email protected]

This is a cool use case. Just make sure you retain your own voice! If you read an AI-generated sentence out loud and think "I'd have said it this way instead", you should absolutely then change it to be that way.

[email protected]

AI bros aren't that smart.

[email protected]

We're going to be entering a golden age of hacks in the next 5 years, I'm calling it now. All this copy-pasted bad ChatGPT code is going to be used in ways that generate security holes the likes of which we've never seen before.

? Offline

Governments aren't fundamentally bad. Having a governing body along with laws and regulations is a good thing when done beneficently. For example, government is responsible for access to public education, libraries, banking, worker safety, and hospitals - all of which are objectively good things to have as a society. The problems usually occur when some individuals have more power/influence than others to choose what the government does, which is what's happening in much of the world right now.

? Offline

Not in this case, to be fair. The only concern is cost - since Wiki wouldn't be opposed to them getting their actual data - and AI mazes are designed to safeguard more sensitive data, not reducing cost

[email protected]

These fucking companies.. downing a torrent of annas archive but crawling wikipedia scourge of mankind

[email protected]

To have the most recent data?

[email protected]

Bots lie about who they are, ignore robots.txt, and come from a gazillion different IPs.

[email protected]

Feel like this belongs in [email protected]

Think I should cross-post?

? Offline

I don't know about stopping entirely. I built a pretty cool RAG system for internal use in my company, it very much facilitates navigating very large amounts of text data.

[email protected]

Google automatically gives me ai search results that are piss poor.

And these results are taken at face value by a shocking number of people. I’ve gotten into niche academic arguments where someone just copy and pasted the AIs completed hallucinated response as “evidence.”

I experimented with using AI to generate basic quizzes for students on concepts like atomic theory or conservation of energy, but maybe 2/20 questions it came up with were any form of accurate/useful. Even when it’s not making shit up entirely, the information is so shallow as to be useless.

[email protected]

Go on, my brother.

[email protected]

That's what ddos protection is for.

? Offline

Nice analysis. Need more smart people like you in the world

[email protected]

When I imagine a future with AI ruining the world, I always thought it was going to be some Skynet/CABAL/HAL9000 type of thing

Not this sad, boring, depressing type shit

? Offline

I agree with that assessment!

? Offline

To just have the most recent data within reasonable time frame is one thing. AI companies are like "I must have every single article within 5 minutes they get updated, or I'll throw my pacifier out of the pram". No regard for the considerations of the source sites.

[email protected]

Sure but we're in the comments section of an article about wikipedia being crawled, which is silly because they could just download a snapshot of wikipedia

[email protected]

Apparently the dump doesn't include media, though there's ongoing discussion within wikimedia about changing that. It also seems likely to me that AI scrapers don't care about externalizing costs onto others if it might mean a competitive advantage.

agnos.is Forums

AI crawlers cause Wikimedia Commons bandwidth demands to surge 50%.