'Meta Torrented over 81 TB of Data Through Anna's Archive, Despite Few Seeders' * TorrentFreak
-
I’m not sure I’m following you either, it appears to me that you don’t see a difference between tax and theft. People used to outgrow this belief, but it appears to be common now. I’ll try to explain.
When Meta takes from everyone it’s a bully that takes from the weak who can’t fight back. Meta does it so that they become the biggest fish in the pond as an end goal.
When a state takes from everyone, and the rich in particular, it’s because we don’t want to have this kind of big fish in the pond. We just want to chill.
-
Oh look, another tech giant treating open knowledge initiatives like their personal data buffet. Let me translate this corporate nonsense for you:
Meta: "We need training data for our AI!"
Also Meta: Let's leech 81.7TB from a community project without contributing anything back.
The absolute audacity of downloading terabytes through torrents while their employees were internally admitting it was "legally problematic". And the best part? They couldn't even be bothered to seed properly - just grab and go, classic corporate behavior.
Remember when companies actually contributed to open source instead of just parasitically consuming it? But no, they'd rather burden volunteer-run projects with massive bandwidth costs while their lawyers probably bill more per hour than these projects' entire monthly budget.
Pro tip Meta: If you're going to pilfer knowledge from the commons, at least seed back properly. Your "move fast and break things" motto isn't supposed to apply to community archives.
-
I’m not sure I’m following you either, it appears to me that you don’t see a difference between tax and theft.
That's an odd thing to write. Why do you believe that?
When Meta takes from everyone it’s a bully that takes from the weak who can’t fight back. Meta does it so that they become the biggest fish in the pond as an end goal.
When a state takes from everyone, and the rich in particular, it’s because we don’t want to have this kind of big fish in the pond. We just want to chill.
Ok, I think I get this now. You believe in far-reaching intellectual property, and that property is inviolable, except to limit inequality. So, you reject US-style Fair Use, which has a public benefit in mind. Instead, copying doesn't require permission only if the rights-owner is wealthier than oneself. So, most people could freely copy Taylor Swift songs but perhaps not songs by some street musician. Does that cover it?
-
It’s not Meta vs us, but opensource vs Google and Openai.
I never said it's Meta vs us. It's Meta vs (in this particular case) the book publishing industry. You can't reduce the whole situation to open source vs closed source, there's other "axes" at play here as well.
They are being sued for copyright infringement when it’s clearly highly transformative
They downloaded the entire Libgen and more. Going by the traditional explanations of piracy, that's like stealing several hundred bookstores' worth of books all at once, and then claiming it's alright because your own writing is not plagiarised from any of the books you've stolen. (Piracy is not the same as actual stealing of course, but countless people have been legally bullied and ruined with that logic.) Meta also got its data from Internet Archive; unless they only obtained materials that are in the public domain or under a similar license, they've obtained a lot of material that IA was found liable for lending without limits back in 2020 (if you've followed the Hachette v. Internet Archive case). The brainfucking conclusion of your and Facebook's case is that using illegal services is perfectly legal as long as you sufficiently transform the results of the illegal activity.
The rules are fine as is
Actually they're not. Copyright law is insanely restrictive, and I don't think you've dealt much with media if you think it's fine (but I don't wish to delve into this further as it's beyond the scope of discussion).
Meta isn’t the one trying to change them
Of course they're not trying to change them, that's the point, they will get away with breaking them while being perfectly fine with other actors not being able to do so.
-
When you're shilling for copyright, at least pick a lane. Are they bad for "pirating" or bad for not supporting "piracy"?
I guess it doesn't matter as long as the owners collect their rent.
-
My seedbox is locked and loaded, please point me to the torrent in need.
Archive team assemble!
-
This is the website listed in the article
-
I’m a reasonable man so I’ll allow it.
-
I've wasted enough time that could be spent more productively. I’m pretty sure you’re not really interested.
-
An alternate domain https://annas-archive.li/
-
All LLMs and Gen AI use data they don't own. The Pile is all scraped or pirated info, which served as a starting point for most LLMs. Image gen is all scraped from the web. Speech to text and video gen mainly uses YouTube data.
So either you put a price tag on that data, which means only a handful of companies can afford to build these tools (including Meta), or you understand that piracy is the only way for most to acquire this data, but since it's highly transformative, it isn't breaching copyrights or directly stealing from them as piracy "normally" is.
I'm being pragmatic.
-
Not seeding is crazy ...
-
It's complicated.
I know Stable Diffusion best so I'll speak to that. They used the LAION-5B dataset, which is, in practice, freely available to download and use:
https://www.kaggle.com/code/vitaliykinakh/guie-laion-5b-collect-and-download
https://github.com/opendatalab/laion5b-downloader
It's also on HuggingFace but it's unavailable.
https://huggingface.co/datasets/danielz01/laion-5b
But you can use this smaller newer version:
https://huggingface.co/datasets/laion/relaion2B-en-research
Whether it's appropriately licensed is an unsolved question though.
The dataset itself and the text portion of the text-image pairs needed for training is CC-BY-SA; the newer versions linked above are CC-BY-4.0. https://creativecommons.org/licenses/by/4.0/deed.en
The images, however, are technically under their own copyright, which in practice means each of the billions of images may or may not have a licence that implicitly or explicitly forbids AI training use, or forbids it only for commercial use.
Whether such a license is legally binding is at present unknown though, since licenses primarily deal with reproductions. The pro-AI folks argue that training isn't reproduction, and that training NNs is more akin to viewing an image and memorising the patterns and relationships within, like a person viewing it.
That would make it non-infringing and therefore the model itself libre. In that case Mistral and LLaMa are also libre as long as the model itself is open source, which in this case really means "open weights", so not like GPT and anything by """OpenAI""".
Weights are essentially the result of a model being trained. They're the key bit that makes or breaks it and determines how it works. Given the weights, and knowing the structure of the model and the framework used, you can refine, modify and distribute it.
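To make "weights" a bit more concrete, here's a toy sketch of my own (nothing to do with how Stable Diffusion or LLaMA are actually trained, and all the names in it are made up): a two-number "model" is trained, its weights are shared as plain data, and someone else reloads and refines them on their own data.

```python
import json

def predict(weights, x):
    # The "model structure": a straight line y = w * x + b.
    return weights["w"] * x + weights["b"]

def train(weights, data, lr=0.05, epochs=1000):
    # Plain gradient descent on mean squared error, starting from the given weights.
    w, b = weights["w"], weights["b"]
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return {"w": w, "b": b}

# Original training: the data follows y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
weights = train({"w": 0.0, "b": 0.0}, data)

# "Open weights": the trained numbers are published as plain data.
shared = json.dumps(weights)

# Someone else loads them and refines on their own data (y = 2x + 2),
# without ever seeing the original training set.
reloaded = json.loads(shared)
refined = train(reloaded, [(0.0, 2.0), (1.0, 4.0), (2.0, 6.0)])
```

Real models have billions of these numbers instead of two, but the idea is the same: the published file holds only the numbers, not the training data.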
Those against AI will say that it's more akin to file compression and that in one form or another it's misuse. That would make the model an infringing derivative work and as such not libre, even if the model weights are open source.
In a way though, you could argue that me vaguely memorising the imagery of a dude dressed in white holding a laser sword is just a lossy compressed copy of the copyrighted work of Star Wars. It'd be absurd to think that's a violation; infringement only occurs if I commercially reproduce a work of substantial similarity from that memory.
That's my own personal stance on the legal side of things, so up to you how you see it.
-
Yes please support annas-archive!! It is a wonderful project. I can essentially get an epub file for any book (including banned books) I want. They have so much more than that too.
-
They are pirating, while also DOSing the providers.
-
And I'd guess all that money would then go to military funding, with Anna's Archive, again, getting nothing out of it?
-
It would go to... Uh...
HEY SOMEONE PUT A DEAD CAT ON THE TABLE!