Internet Archive played crucial role in tracking shady CDC data removals
-
[email protected]replied to [email protected] last edited by
Here's how to help them: https://github.com/ArchiveTeam/warrior-dockerfile
-
[email protected]replied to [email protected] last edited by
A couple of years ago I read that Filecoin had teamed up with the Internet Archive to synchronize the data on the blockchain. I'm not sure how far along they are, but it's something that could work if it doesn't turn out to be just crypto hype in the end.
-
[email protected]replied to [email protected] last edited by
Although it could have been avoided, it raises the question of who was behind the attacks.
I think we can safely say it was Peelon Shmusk, the world's worst spy!
-
[email protected]replied to [email protected] last edited by
When the Internet Archive was attacked a few months ago we were like "who would be dumb and mean enough to do that?" We have new suspects!
-
[email protected]replied to [email protected] last edited by
Oh, cool, didn't know about this, throwing it on my home lab now.
-
[email protected]replied to [email protected] last edited by
The problem is you'd need to split it down to an amount that people would be happy hosting and then host it multiple times in case any node goes offline.
Another comment in the thread says it's likely over 100 PB today (100,000 terabytes). I'd say 4 copies (spread over different time zones) is a relatively minimal level of redundancy, since people may host on machines that aren't powered all the time, and you'd probably get the most participants at around 150 GB per node.
That comes to nearly 3 million participants needed.
Which isn't insurmountable, but also not remotely easy to get going from nothing.
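For anyone who wants to check the math or plug in their own numbers, here's a rough back-of-the-envelope sketch. The function name and the decimal unit conversion (1 PB = 1,000,000 GB) are my own assumptions, and it ignores real-world overhead like metadata, uneven node sizes, and churn.

```python
def nodes_needed(archive_pb: int, copies: int, per_node_gb: int) -> int:
    """Estimate how many volunteer nodes are needed to hold every copy.

    Assumes decimal units (1 PB = 1,000,000 GB) and that every node
    contributes exactly per_node_gb. Both are simplifications.
    """
    total_gb = archive_pb * 1_000_000 * copies
    # Integer ceiling division: a partially filled node still counts as a node.
    return -(-total_gb // per_node_gb)


# 100 PB, 4 copies, 150 GB per node (the figures from the comment above)
print(nodes_needed(100, 4, 150))  # 2666667

# Same redundancy using the 152 PB figure cited elsewhere in the thread
print(nodes_needed(152, 4, 150))  # 4053334
```

So "nearly 3 million" holds for the 100 PB estimate, and it goes past 4 million if the 152 PB figure is closer to reality.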
-
[email protected]replied to [email protected] last edited by
That's not the Internet Archive; that's a separate group (ArchiveTeam). They use the Internet Archive for storage but are otherwise completely unrelated. The data archived by Archive Team Warrior does not go into the Wayback Machine.
-
[email protected]replied to [email protected] last edited by
This comment from 8 months ago says 152PB: https://www.reddit.com/r/DataHoarder/comments/1cu79ke/the_archiveteam_has_a_cost_shameboard_of_the_top/l4om4m6/
-
[email protected]replied to [email protected] last edited by
These guys seem cool, but they're not the archive.org from the OP article.
-
[email protected]replied to [email protected] last edited by
As I understand it, their data does in fact enter into the Wayback Machine. It is just also available as direct WARC archive files (which IMO sounds beneficial to the idea of exporting in bulk to another backup host). At least that's how their FAQ reads.
And given that they focus on web crawling, and not other arbitrary data formats that IA accepts, 2.8% of over 100 petabytes is still a respectable amount of data.
That said, help is help. If another archival project team wants me to run a worker node so they can distribute load and dodge crawler blocks, let me know, I've got space.
-
[email protected]replied to [email protected] last edited by
It's a team of volunteers who help scrape and upload things to archive.org.
-
[email protected]replied to [email protected] last edited by
It does go into the Wayback Machine AFAIK.
-
[email protected]replied to [email protected] last edited by
Need an archive of the archive
-
[email protected]replied to [email protected] last edited by
It doesn't help that people put silly things onto the IA. I've seen things like YouTube videos that really didn't need to be there (objectively, they have nothing valuable enough to warrant taking up space on servers that could be used for more important material).
-
[email protected]replied to [email protected] last edited by
I literally posted a comment saying "sure is odd that this is happening right before the election. Not saying it means anything, but maybe it's not a coincidence?" and got downvoted to hell lmao.
-
[email protected]replied to [email protected] last edited by
If they added download options on different taxonomies, I'd try to grab some things to archive.
-
[email protected]replied to [email protected] last edited by
Yeah, educational material is vital to archive, but some Twitch streamer playing some MMO/shooter/scary game isn't something I'd consider imperative to back up.
-
[email protected]replied to [email protected] last edited by
"As I understand it, their data does in fact enter into the Wayback Machine"
Thanks for the info! It never used to, so I guess that changed at some point.