'Meta Torrented over 81 TB of Data Through Anna's Archive, Despite Few Seeders' * TorrentFreak
-
When you're shilling for copyright, at least pick a lane. Are they bad for "pirating" or bad for not supporting "piracy"?
I guess it doesn't matter as long as the owners collect their rent.
-
My seedbox is locked and loaded, please point me to the torrent in need.
Archive team assemble!
-
This is the website listed in the article
-
I’m a reasonable man so I’ll allow it.
-
I've wasted enough time that could be spent more productively. I’m pretty sure you’re not really interested.
-
An alternate domain https://annas-archive.li/
-
All LLMs and Gen AI use data they don't own. The Pile is all scraped or pirated info, which served as a starting point for most LLMs. Image gen is all scraped from the web. Speech to text and video gen mainly uses YouTube data.
So either you put a price tag on that data, which means only a handful of companies can afford to build these tools (including Meta), or you understand that piracy is the only way for most to acquire this data, but since the use is highly transformative, it isn't breaching copyright or directly stealing from rights holders as piracy "normally" is.
I'm being pragmatic.
-
Not seeding is crazy ...
-
It's complicated.
I know Stable Diffusion best so I'll speak to that. They used the LAION-5B dataset, which is, in practice, freely available to download and use:
https://www.kaggle.com/code/vitaliykinakh/guie-laion-5b-collect-and-download
https://github.com/opendatalab/laion5b-downloader
It's also on HuggingFace but it's unavailable.
https://huggingface.co/datasets/danielz01/laion-5b
But you can use this smaller newer version:
https://huggingface.co/datasets/laion/relaion2B-en-research
Whether it's appropriately licensed is an unsolved question though.
The dataset itself and the text portion of the text-image pairs needed for training is CC-BY-SA; the newer versions linked above are CC-BY-4.0. https://creativecommons.org/licenses/by/4.0/deed.en
The images however are technically under their own copyright, which in practice means each of the billions of images may or may not have a licence that implicitly or explicitly forbids AI training use, or forbids it only for commercial use.
Whether such a license is legally binding is at present unknown though. Licenses primarily deal with reproductions, and the pro-AI folks argue that training isn't reproduction: training NNs is more akin to viewing an image and memorising the patterns and relationships within, like a person viewing it.
That would make it non-infringing and therefore the model itself libre. In that case Mistral and LLaMa are also libre as long as the model itself is open source, which in this case really means "open weights", so not like GPT and anything by """OpenAI""".
Weights are essentially the result of a model being trained. They're the key bit that makes or breaks it and determines how it works. Given the weights, and knowing the structure of the model and the framework used, you can refine, modify and distribute it.
Those against AI will say that it's more akin to file compression and that in one form or another it's misuse. That would make the model an infringing derivative work and as such not libre even if the model weights are open source.
In a way though, you could argue that me vaguely memorising the imagery of a dude dressed in white holding a laser sword is just a lossy compressed copy of the copyrighted work of Star Wars. It'd be absurd to think that's a violation; infringement only occurs if I commercially reproduce a work of substantial similarity from that memory.
That's my own personal stance on the legal side of things, so up to you how you see it.
-
Yes please support annas-archive!! It is a wonderful project. I can essentially get an epub file for any book (including banned books) I want. They have so much more than that too.
-
They are pirating, while also DOSing the providers.
-
And I'd guess all that money would then go to military funding, with Anna's Archive again getting nothing out of it?
-
It would go to... Uh...
HEY SOMEONE PUT A DEAD CAT ON THE TABLE!