Confessions of a Data Hoarder

Okay, so I’m not actually sure I would consider myself a data hoarder, but I thought the title was pretty eye-catching, and archival is close enough to data hoarding to only be moderately clickbaity. (I actually think I’ve got a pretty well organized and reasonably trimmed home directory.) Anyway, I thought I would go over a couple of related topics, specifically my backup strategy, some data archiving, and me getting back into using optical storage.

Backups:

You need to have backups, period. In all reality, if you’re reading some random person’s writing about backup strategies and data hoarding then you’re probably keeping good backups, but I like to write about random stuff and this is my blog so you can’t stop me. But, to repeat the golden rule of backups: you need to have at least three copies of data in at least two locations.

My backup strategy for a while has been to have backup drives and cloud storage. Every so often I copy everything to my computer (e.g. photos from my phone), then run FreeFileSync to copy the important directories in my home directory to an encrypted backup drive. I have two drives, one that resides with me and one that I keep at my parents’ house out of state. In this way I’m pretty much guaranteed to have a physical backup, and in the past when I’ve needed it (e.g. new PC, broken HD/SSD, re-installed OS) I could literally just flick a switch in FreeFileSync and have it pull all the data back into my home directory.

Cloud storage wise I use IceDrive. IceDrive offers E2E encryption, though it’s only 128-bit and they use Twofish instead of AES on the weird logic that “The US government recommends AES so we won’t.” For online storage I usually just zip & encrypt my potentially sensitive files, then upload them with my browser. It’s a little inefficient to re-upload existing data (e.g. re-uploading a copy of all my documents despite only ~1% of the folder being new) but it still works. Less sensitive data like my photos gets uploaded unzipped and unencrypted to that year’s photo folder, and content like my game saves gets uploaded zipped but not encrypted. As far as I remember I’ve never actually restored files from cloud storage, since using a drive is always faster, but it’s very nice to have something not managed by me and accessible anywhere as long as I’ve got two encryption passphrases.

With this backup strategy I think I have most of the basics covered. I’ve got two “offsite” copies, so if everything in my apartment got destroyed or stolen I’d still be safe. I also always have at least one of the backup drives disconnected, so if I got hit by ransomware or another destructive hack that destroyed my PC’s storage, cloud storage, and a plugged-in backup drive, I’d still be good with another drive safely unplugged in another state.

Throwing optical media into my backup strategy

Finally, as of fairly recently, I’ve also started using optical media again for backups. Like (I assume) a lot of people who were backing up files in the 2000s/2010s, I used to use CDs and DVDs, but stopped as hard drives and flash media got cheaper/easier. I’ve started using it again recently (albeit DVDs and Blu Rays instead), and for my important backups it’s for two reasons: better redundancy and longevity. Better redundancy comes into play because my other methods are a bit more “live” for lack of a better term. Every so often I mirror my important data on my computer’s drive to them, and there’s the fear that important files get overwritten/corrupted/deleted on my main drive and then I overwrite all my backups with the bad copies before I realize it. With optical media it’s burnt once and then left alone, so it’s a permanent snapshot for as long as the optical media lasts (something that could get expensive on other forms of media). On the topic of how long it lasts, I’ll delve into that in the next part.

Setting aside various archival projects, my strategy is pretty simple: take my most important files, encrypt anything sensitive, and stick it on an archival grade Blu Ray or DVD. I’ll also throw in some par2 files for redundancy (fancy math called Reed-Solomon coding that lets you repair corrupted data) and then stick it on a shelf.

Optical Media:

Dropdown: Is Optical Media Obsolete?

Am I being a silly goose starting to use optical media again? I don’t believe so. If you think it could be useful for your computing purposes, but fear it’ll be obsolete right after you start using it then I wouldn’t worry too much. First off, optical is still the longest lasting media by far (talking easily 10-200 times longer than hard disks) - which is why it’s still used by governments, companies, and individuals for archival.

It’s also still the primary form of physical media. Even though it’s taken a back seat to streaming services of various kinds, there’s still plenty of physical media being bought and sold. I’m willing to bet that since it works, but it’s playing second fiddle so-to-speak, it’s prime circumstances for things to stay the same for a while.

And last, legacy equipment always sticks around. There are legitimately floppy disks being used still, despite them not having any benefit over other forms of media. Even if it does wind up becoming obsolete there’ll be plenty of time to copy stuff over to the media form that surpasses it.


So you might be wondering, why start using optical media again, especially for a Gen Z who hadn’t burnt a disc in a long time? A large reason is longevity: I’ll feel safe from bitrot on a hard disk in cold storage for like five, maybe 10 years - or non-volatile NAND/Flash for 1-2 years, but that’s about it. I can either spend like $5-$10k to get a tape writer and tape that lasts 10-30 years, or I can spend like $50 to get a Blu Ray burner and archival Blu Rays that last up to a hundred years. If you want to archive content for long periods you can always re-write the data on other forms of storage, but it becomes a hassle and/or gets forgotten and corrupted. Optical media is also extremely durable for electronic storage.

I can hear somebody yelling about scratches already, but keep in mind optical media isn’t in any form of housing as opposed to other storage devices. Get your grubby little sausage fingers on an HD’s magnetic disk or a NAND drive’s internals, or try washing them off with water, and see how well they read afterward.

As I mentioned above, I have the two main drives and cloud backups that I keep regularly updated, but I like the idea of having some media I can stick on a shelf and forget about. Focusing on my important stuff (photos, documents, encryption keys, etc.) I was able to throw all of that stuff on just a couple of archive grade discs (2 DVDs and 1 Blu Ray) and can stick it in a closet or on a shelf and leave it pretty much forever. Even if the stuff rated for 100 years only lasts 50, assuming I’m still around and assuming the nukes don’t fly, I’d be in my 70s and wouldn’t have any problem re-burning media or shifting formats to whatever is new by that time.

Bonus: If the nukes do fly (or we had some sort of Carrington event) and you’re a hardcore doomsday prepper, optical media is the only thing that wouldn’t be affected by electromagnetic interference. Plus it’s probably the only medium that could outlast an apocalypse. Not that I would expect data to be anyone’s top priority in such a case though.

I’ve also thrown together a few DVDs with backups that are a little less critical. As I’ll touch on later, I’ve thrown some games or downloaded videos on DVDs as well. I didn’t use the archival grade ones, just Verbatim DVD+Rs. I am fairly partial to Verbatim media: it’s got a really stellar reputation and has worked great for me so far (without too much extra pricing - on a medium that’s already really cheap).


Dropdown: Physical Media

It also doesn’t hurt to be able to consume media in CD/DVD/Blu Ray form as well. I gave away my old DVD reader a long while back, but streaming services are expensive and everyone in this space probably has complained at some point about the idea you don’t really own your stuff (even if you ‘purchased’ it outright on streaming services). I’ve recently purchased a few pieces of physical media again, and with the rising prices and fragmenting of streaming services I wonder if physical media could actually make a comeback. The average American household spends $61 a month on streaming services, and while I certainly wouldn’t allocate that much of a budget to media, with prices of used movies and TV seasons running as cheap as ~$2.50-$6 it’s weird seeing such a vast price difference. Even a $13 brand new box set from Time Warner is like the price of a month’s worth of Netflix. And that comes with piracy levels of owning media.

As for the inconvenience of physical media, it’s pretty easy to just rip an ISO and copy it wherever. Using VLC I can copy an ISO to any device (even my phone) and play it just like I would any other video file. Some people go the extra mile and set up something like a Plex server or NAS to effectively make their own streaming service with ripped media and/or torrents, but I don’t watch nearly enough media for something like that to be worth it.

If you’re looking for cheap used media, ThriftBooks and Decluttr both seem to have really good prices for what they have in stock, especially with TV seasons. eBay, if you sort by cheapest, seems to have the best prices for individual movies and anything else that comes on a single disc.

Types of Optical Media, and reviewing past burnt discs

First off, if you’re looking into what to use there are CDs, DVDs, and Blu Rays. The long and short of it is CDs are pretty small and there are not a lot of benefits to them nowadays. CD/DVD readers/burners are cheap, and DVDs offer a decent amount of options and decent space for the price. Blu Rays are the cheapest per GB and most convenient for storing larger amounts of data on fewer discs, but Blu Ray burners are a bit more expensive.

The types of Blu Rays and DVDs you buy for storage have a very big impact on their longevity (which is why I’m using them), and it’s definitely worth considering before purchasing some. The Canadian Archive has a convenient cheat sheet on their longevity that I’ve referenced a lot. A lot of people think it underestimates their longevity to be on the safe side, and it doesn’t take specialty media into account (M-Discs come to mind), but it’s definitely worth referencing. A few types of media worth considering include:

  • M-Discs - They’re rated for up to a thousand years, but have a price tag to go with it. The M Disc company recently went bankrupt, but you can still find M-Discs for sale or find licensed M-Discs by Verbatim. Still a bit too expensive for my taste and I don’t really have a need to store data for that long.
  • JIS X6257 Discs - Japan has laws mandating 100 year archives of certain documents, and any DVD or Blu Ray that complies with this standard is expected to last for at least that long. As far as I’m aware, they’re difficult to find in the West.
  • Verbatim Archival Blu Rays - If you have a Blu Ray capable burner they’re a very good option in my opinion. They’re rated for up to 100 years, and run 3.2 cents per gigabyte of storage on Amazon.
  • Verbatim Archival DVDs - These ones have a corrosion resistant gold foil layer and are also rated for up to 100 years. They’ll run you about 50 cents per gigabyte, but they’re compatible with a lot more (and cheaper) drives.
  • Generic Blu Rays - you can find non-archival Blu Rays for as cheap as 1.6 cents per gigabyte of storage, making them the cheapest form of storage by far. Just no guarantee how long they’ll last if you’re planning on long term backups/archival.
  • DVD+R - These will still last 20-50 years according to the Canadian Archive chart, though the Verbatim DVD+Rs claim a lifetime warranty and run about 6 cents per gigabyte.
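
To put those per-gigabyte prices in per-disc terms, here’s a rough bit of arithmetic. The prices are the ones quoted in the list above; the capacities (25 GB for a single-layer Blu Ray, 4.7 GB for a single-layer DVD) are an assumption about which discs are being compared, so treat the output as ballpark figures only.

```python
# Rough cost per disc from the per-gigabyte prices quoted above.
# Capacities are assumptions: single-layer Blu Ray = 25 GB, single-layer DVD = 4.7 GB.
DISCS = {
    "Verbatim Archival Blu Ray": (3.2, 25.0),   # (cents per GB, capacity in GB)
    "Verbatim Archival DVD": (50.0, 4.7),
    "Generic Blu Ray": (1.6, 25.0),
    "Verbatim DVD+R": (6.0, 4.7),
}


def cost_per_disc(cents_per_gb: float, capacity_gb: float) -> float:
    """Return the approximate price of one disc in dollars."""
    return round(cents_per_gb * capacity_gb / 100, 2)


for name, (cents, capacity) in DISCS.items():
    print(f"{name}: ~${cost_per_disc(cents, capacity):.2f} per disc")
```

Swap in current prices and the capacities of the discs you’re actually looking at before drawing any conclusions from it.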

Disc Burning, Storage, & Care

A few tidbits I picked up that could come in handy if you are to burn a disc for archival purposes.

The first thing worth mentioning is my use of par2, which can generate parity data to recover corrupted data if some is lost. It uses Reed-Solomon coding, which is some form of magic math that lets you generate parity data sized as a percentage of the original file, and use that to recover data if it’s lost. The cool thing, however, is that you can generate say 5% parity and then recover a corrupted file as long as roughly 95% or more is still there REGARDLESS of what part of the file was corrupted. The CLI version of par2 was pre-installed on Debian, but I believe there’s a GUI version out there as well and it’s available for Windows too. I generally generate parity files using the par2 CLI, then stick them in a /par/ folder on the disc in case I need to recover some corrupted data.
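
par2 itself uses Reed-Solomon coding, which I won’t try to reproduce here. As a much simpler stand-in, the sketch below uses a single XOR parity block. It is not what par2 does internally, and it can only rebuild one lost block, but it shows the same core idea: the parity doesn’t care which block went missing. The block size and function names are made up for illustration.

```python
from functools import reduce
from typing import List, Optional


def split_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Split data into fixed-size blocks, zero-padding the last one."""
    padded = data + b"\x00" * (-len(data) % block_size)
    return [padded[i:i + block_size] for i in range(0, len(padded), block_size)]


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover_block(blocks: List[Optional[bytes]], parity: bytes) -> bytes:
    """Rebuild the single missing block (marked None) from the survivors plus parity."""
    survivors = [block for block in blocks if block is not None]
    return xor_blocks(survivors + [parity])


data = b"important family photos, probably"
blocks = split_blocks(data, 8)
parity = xor_blocks(blocks)          # one extra block of "parity"

damaged = list(blocks)
damaged[2] = None                    # pretend this block rotted away
print(recover_block(damaged, parity) == blocks[2])
```

Reed-Solomon generalizes this so that several parity blocks can repair several lost blocks, which is what makes the percentage-based redundancy in par2 possible.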

Next, it’s important to keep acid away from discs. Not just comically large vats of acid, but that includes most ink and paper as well. A cheap acid free felt tip pen or acid free marker is the way to go to mark your discs. It’s also important to keep the discs out of sunlight, humidity, and extreme temperatures. Further, if you’re burning a large number of files (say, photo backups) you’ll likely want to zip them (or rar, 7z, or your preferred container format) because doing too many individual files can cause burning to fail.

Before burning I also like to set up a text document to be added to the disc. Sometimes I go all out and include some ASCII art, but it’s mostly just to have some important info such as names + hashes of files, the date I burnt it, and an explanation of the disc.
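
If you’d rather not type the hashes out by hand, something like the sketch below could generate that text document. It assumes everything destined for the disc sits in one staging folder; the function names, the SHA-256 choice, and the layout of the text are all just one way of doing it, not a standard.

```python
import hashlib
from datetime import date
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives don't have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(staging_dir: Path, description: str) -> str:
    """Return README text: a description, the burn date, and 'hash  name' lines."""
    lines = [description, f"Burnt on: {date.today().isoformat()}", ""]
    files = sorted(p for p in staging_dir.rglob("*") if p.is_file())
    for path in files:
        relative = path.relative_to(staging_dir).as_posix()
        lines.append(f"{sha256_of(path)}  {relative}")
    return "\n".join(lines) + "\n"
```

Calling `build_manifest(Path("staging"), "2024 photo backups")` (both arguments hypothetical) returns the text, which you can then save into the staging folder before burning. The hash lines use the same two-space layout that `sha256sum` prints.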

Finally, I always find it useful to verify data after burning. I use K3b, so it’s mostly just a matter of selecting the files to burn and then checking the verify box. I’ve only had one burn fail (I tried burning my entire photo library unzipped to a Blu Ray), and it failed before it finished writing and got to the verification check, but having the extra validation that nothing got burnt incorrectly is still nice.
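
K3b’s verify box handles this at burn time, but for re-checking a disc months or years later, a rough sketch like the one below could compare the mounted disc against the original folder. The function names and paths are hypothetical, and it reads each file fully into memory, so it’s only meant to show the idea.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_hashes(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    hashes = {}
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        relative = path.relative_to(root).as_posix()
        # read_bytes() loads the whole file; fine for a sketch, slow for huge archives.
        hashes[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return hashes


def compare_trees(source: Path, disc: Path) -> List[str]:
    """Return files that are missing from the disc or whose contents differ."""
    source_hashes = file_hashes(source)
    disc_hashes = file_hashes(disc)
    return [name for name, digest in source_hashes.items()
            if disc_hashes.get(name) != digest]
```

An empty list from `compare_trees(Path("staging"), Path("/media/cdrom"))` (example paths, adjust to wherever your disc mounts) would mean every file on the disc still matches the original.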

Reviewing past burnt discs

Covering this topic I thought I would go over a few different CDs/DVDs that I burnt in the past and still have around. As you’ll probably notice, I used a lot of cheap DVD-RWs, which are by far the worst kind of DVD for longevity. Before getting a proper storage case I also stored them in the worst possible condition: loosely on a desk, getting scratched up with the data side upward and exposed to light. As a bonus, most of these were probably rewritten numerous times to update backups. I don’t have any pre-2014 discs, and I do recall destroying most of my out-of-date backups, but at least some of the 2014 ones were probably initially much older and were re-written numerous times. At least the three 2017 ones were unscratched and likely new when I burnt them, then were immediately stashed into a jewel case under my bed at my parents’ house. Outside of the three 2017 ones, however, 2014 was probably my last round of backups on optical media until recently.

  • 2014 Clone of an Office 2012 installer - Unbranded DVD-RW - Heavily scratched but fine
  • 2014 Ubuntu 14.04 LTS - Memorex DVD-RW - Heavily scratched but still bootable
  • 2014 Minetest 0.4.9 - Memorex CD-R - Heavily scratched but fine
  • 2014 Photo Backups Pt. 1 - Memorex DVD-RW - Heavily scratched and unreadable: looked like the marker ate through the top of the DVD and was just as visible on both the top and bottom. I never ran this through recovery since it was getting late - although with my large sloppy writing I’d bet a large swath of data would have been unrecoverable.
  • 2014 Photo Backups Pt. 2 - Memorex DVD-RW - Slightly scratched but fine
  • 2014 Clone of my Thumb Drive - Memorex DVD-RW - Slightly scratched but fine
  • 2016 Ubuntu 16.04.1 LTS Live Image - Unbranded DVD-RW - Slightly scratched but bootable
  • 2017 Chromium Browser + Source Code - CD-R - Unscratched and fine
  • 2017 Photo Backups Pt. 1 - Memorex DVD-RW - Unscratched and fine
  • 2017 Photo Backups Pt. 2 - Memorex DVD-RW - Unscratched and fine
  • 2017 Video Backups - Memorex DVD-RW - Unscratched but corrupted. 99.99 percent of data was recoverable with GNU Disc Recovery.
  • 2017 Clone of my thumb drive - Office Max DVD-RW - Unscratched and corrupted: was able to recover 100% of data with GNU Disc Recovery

Archival Projects:

Beyond regular backups, I’ve also taken to making a few archives of some other stuff. In all reality this is a lot less important; some of it was just stuff to work on when I had a whole bunch of very long work shifts in a row during quiet hours. Still, it usually doesn’t hurt to have a copy of data stashed away, and it might be worth the 25 cents of a DVD+R disc to hold onto. I don’t have quite the same level of redundancy as my important backups here, just one copy of these on some Verbatim DVD+Rs instead of using the premium archival grade stuff (although I may switch to using Verbatim Archival BRs since they’re cheaper for large clumps of data).

However, despite the saying, the internet isn’t actually forever, so it might not hurt to hoard a copy of anything you’d miss if it went offline. Websites, accounts, and services disappear all the time in normal times, and having a stashed copy can come in handy, and has for me in the past. Nowadays, however, there’s even more of a risk of things going missing. Web3 is really starting to kick off, and large platforms are having potential financial trouble thanks to economic issues, realizations that infinite growth is impossible, and realizations that you do actually need to have a monetization plan once you do hit market saturation. Throw in a bonus that Archive.org decided a pandemic invalidated copyright laws, and they might not be around for long either. Lose the mainstay services and archives of them and there’s a chance that the 2000s/2010s internet could be lost over the coming decade or so.


Dropdown: Web3

Most of the blogs I follow, and presumably most of the people who follow me, are on the fediverse. Great people and all, but I can imagine that when I said ‘Web3’ a million iPhone microphones picked up something along the lines of this as is customary if the word ‘Web3’ is uttered on the fediverse. The long and short of it is that Web3 refers to greater decentralization of the internet. ‘Web1’ was the early internet, ‘Web2’ was when big platforms came in and consolidated most of everything, and ‘Web3’ refers to the idea that people are starting to push away from the more centralized platforms into a more decentralized ecosystem. That includes this blog, ActivityPub & other decentralized social media protocols, most any decentralized tool that uses cryptographic identifiers (e.g. IPFS), blockchains (a.k.a. distributed databases not controlled by a single authority), and a whole lot more.

If you’re thinking NFTs when you hear ‘Web3’, that’s probably because NFTs’ only marketing gimmick was that they were a certificate to an image sold using ‘Web3’ & adjacent technologies (a cryptographic hash stored in a blockchain). A lot of people probably first heard the term ‘Web3’ when learning about NFTs or something. A lot of the comparatively older blogger types use the term ‘IndieWeb’ or ‘Smolweb’, which is more or less synonymous albeit more focused on personal websites.

Games

I purchase most of my games on Steam. I know, shame on me when I could get them DRM free on GOG. Still, there’s actually a decent amount of games on Steam that are DRM free - especially a lot of older games and indie games. As somebody who always seems to be going back to older games, a lot of my favorites can be backed up without any special process. In those cases, install the game and go to the game’s files by right clicking, then zip the folder and place the zipped folder wherever you want to store your backup. You can then run your backup by opening the executable, or if it’s a Windows game on 🐧 open it with Wine or import it with Lutris.
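
If you have more than a couple of games to zip up, the same step can be scripted. This is a minimal sketch using Python’s standard library; the function name is made up, and the install path in the usage note is only an example of where Steam might keep a game on Linux.

```python
import shutil
from pathlib import Path


def archive_game(install_dir: Path, backup_dir: Path) -> Path:
    """Zip a DRM-free game's install folder into backup_dir and return the zip's path."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    archive_base = backup_dir.resolve() / install_dir.name   # ".zip" gets appended
    archive_path = shutil.make_archive(
        str(archive_base),
        "zip",
        root_dir=str(install_dir.parent),   # paths inside the zip start at the game folder
        base_dir=install_dir.name,
    )
    return Path(archive_path)
```

For example, `archive_game(Path.home() / ".steam/steam/steamapps/common/Morrowind", Path("game-backups"))` would produce `game-backups/Morrowind.zip`, assuming that folder exists on your machine. This only makes sense for games that run without the Steam client.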

It also never hurts to throw in bonus stuff as well. In the case of Morrowind, in addition to burning the zipped folder of the installation, I also included the Windows and Linux versions of OpenMW (the open source Morrowind engine). Or maybe you also want to throw in a backup of your save files for a game, mods you like, or other related things. And of course there’s more than just Steam: GOG installers would be easy to back up, I can’t imagine a single open source game that comes with any form of DRM, and I’m betting something like Prism for Minecraft would let you back up an AppImage alongside a signed-in account with game files downloaded to run offline.

Am I going to lose access to my Steam library or Minecraft anytime soon? I doubt it. The probability isn’t zero, however, especially when considering my projected lifespan as a timeframe. Still, it’s mostly just the feelgoods of actually “owning” a game. That, and maybe some random day my internet will be down and I’ll want to play some Far Cry 2 and install it with my archive.

Software

Software is another thing that might be handy to keep around, and something I also stashed a bit away on optical media. Obviously plenty of software like web browsers are useless once they are out of date, and libre software especially is pretty easy to get ahold of if you need to re-download it. Depending on the type of software, however, it might not hurt to have a local copy. I’m betting there were a lot of people disappointed when they went to download Yuzu or Citra and found it gone. And sorry Nintendo, but sometimes I like to open up Dolphin and play some games I played when I was younger, so I might as well have a copy of Dolphin and some ROMs stashed away.

My Ventoy thumb drive takes the place of having the random recovery or GParted discs to boot into, but installation media is another form of software that might be handy to stash away. It never hurts to have some older stuff, like a Windows XP installation CD for nostalgia or software compatibility in virtual machines. I’m also thinking about setting up some future proofed installation media with modern Windows/Linux ISOs and basic software (e.g. Java runtime environments) for a potential future when modern stuff becomes like XP/7, only to be run in a virtual machine for nostalgia or compatibility with current software.

Even entire LLMs can fit on a 25GB Blu Ray alongside an AppImage of LM Studio, and some can even fit on a DVD.

DRM Free Online Videos (e.g. YouTube)

Back in 2009 Machinima released a video titled “SM64: the secret of L is real,” and while it’s not exactly a cinematic masterpiece my brother and I found it so funny at the time that we still occasionally reference it to this day. Machinima has since privated their videos after being bought and sold once or twice. Furthermore, if I ever have grandkids doing a report on what happened with GameStop I’ll absolutely have to show them a copy of Big Boss’s video “The Absolute Chaos of r/Wallstreetbets Part 2 | GME”. Anything like Netflix is going to require Widevine DRM, but most social media style video and audio platforms (from YouTube to Facebook to Funkwhale) don’t have such restrictions.

For YouTube specifically as well as hundreds of other audio/video platforms YT-DLP is your friend. You can point it at a video, audio stream, playlist, or even an entire account and set it loose to grab whatever you want. For YouTube specifically I’ll place a dropdown below with two slightly more complicated command templates and an explanation.


Dropdown: YT-DLP Templates

Quick note: all these commands work for me because I installed YT-DLP with pipx. If you’ve installed it with a different method you may need to replace “yt-dlp” with whatever you need for the way you installed it.

yt-dlp -f "bv*[height=360][ext=mp4]+ba*[ext=m4a]" --embed-thumbnail --output "%(upload_date)s-%(channel)s-%(title)s.%(ext)s" [url] is my default download command for anything I would want to archive. It grabs the thumbnail and names it with a nice name for archiving. Change the height from “360” to whatever resolution you’re actually looking for.

yt-dlp -f "bestaudio[ext=m4a]" --embed-thumbnail --sponsorblock-remove all --output "%(artist)s - %(title)s.%(ext)s" [url] is my go-to for archiving songs and audio content on YouTube.

The simplest way to download with YT-DLP is to just go yt-dlp [link] --list-formats, then choose the format you like and run yt-dlp [link] -f [format] which is handy for a one off video or if you’re on a random non-YouTube website trying to grab something you don’t know the format of. However, you can add many additional arguments to the command like a custom name to save the file with or a set of rules to automatically convert the file after download.

--sponsorblock-remove [what-to-remove] is a great way to cut out sponsorblock marked segments, from non-music portions of music videos to your generic sponsor segments if you don’t want to include those as well. --embed-thumbnail includes the thumbnail in the file, which is a great little addition to an archived piece of media. --extract-audio --audio-format m4a isn’t usually needed on YouTube, but it’s great to pull a podcast or some other audio based content from other platforms that don’t have audio only formats to download. Generally all these little pieces of commands can be mixed and matched, and don’t need to be placed in the command in any particular order.
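
If you end up archiving a lot of videos, it can be handy to wrap the video template in a small script so the resolution is just a parameter. The sketch below only assembles the same arguments as the archive command above and hands them to yt-dlp; the function names are made up, and it assumes `yt-dlp` is on your PATH.

```python
import subprocess
from typing import List


def archive_command(url: str, height: int = 360) -> List[str]:
    """Build the yt-dlp argument list for the video archive template."""
    return [
        "yt-dlp",
        "-f", f"bv*[height={height}][ext=mp4]+ba*[ext=m4a]",
        "--embed-thumbnail",
        "--output", "%(upload_date)s-%(channel)s-%(title)s.%(ext)s",
        url,
    ]


def download(url: str, height: int = 360) -> None:
    """Run yt-dlp with the template; raises CalledProcessError if the download fails."""
    subprocess.run(archive_command(url, height), check=True)
```

Passing the arguments as a list means there’s no shell quoting to worry about around the format string or the output template.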

Finally, if you’re looking to avoid the command line, there are a handful of pieces of software that can let you download from within a GUI. Seal is a graphical tool for Android based on YT-DLP, and third-party apps like NewPipe/LibreTube/GrayJay (Android) and FreeTube (Win/Linux) can also download individual videos or audio files from within the platform(s) they support. Third party front ends like Invidious and Piped can also download videos if enabled by the person who’s hosting the instance, and if all else fails you can always resort to one of the YT-DL/YT-DLP based websites.


Websites

I used to use read-it-later apps, but I’ve since switched to just grabbing a copy of a page instead and it’s worked great for both archival and general use. I saved pages as PDFs for the longest time, but fairly recently switched to using the extension “SingleFile” on Mobile/Desktop to save web pages to read later. I can toss it in a Syncthing folder and read it on any device offline, or use something like Librera to have it read out to me like a robot voice podcast on mobile. While it’s mostly for reading things later, something like that is a great way to keep a permanent copy of a page I want to keep, and it already helped me once or twice when a blogger I followed paywalled all her content.

Again, SingleFile (especially with Adblock) gives a perfect snapshot of a page for later viewing. Print to PDF, especially in LibreWolf/Brave’s reader mode, also does a great job at grabbing all the text into an easily consumable document. For the scrapers out there you can use wget (on any Desktop OS or Android via Termux) and run wget -mpEk "[url]" to grab an offline copy of an entire domain.
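
For anyone wondering what the letters in -mpEk stand for, the sketch below spells out the same wget call with the long option names. The function name is made up; it only builds the argument list, so nothing is downloaded until you pass it to something like `subprocess.run`.

```python
from typing import List


def mirror_command(url: str) -> List[str]:
    """Long-form equivalent of `wget -mpEk [url]`."""
    return [
        "wget",
        "--mirror",            # -m: recursive download with timestamping
        "--page-requisites",   # -p: also fetch the images/CSS/JS each page needs
        "--adjust-extension",  # -E: save pages with a matching .html extension
        "--convert-links",     # -k: rewrite links so the copy works offline
        url,
    ]
```

Be considerate with this one: mirroring a large site can mean a lot of requests to somebody else’s server.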

Misc Archives (e.g. Kiwix)

The last thing still sitting on the old hard drive I was using for archival data that I might switch to optical media is various archives like the ones you can find in Kiwix. Kiwix, if you’re unaware, is a tool that lets you download and view libraries of specific information for offline consumption - from Wikipedia to medical texts, Stack Exchange answers, and Project Gutenberg’s catalog. I keep a couple smaller archives on my phone, but I’ve hoarded a couple hundred gigs of Kiwix archives on that old hard drive I mentioned and might end up transferring the ones that fit on optical media to a Blu Ray or two.

Like the other stuff, a lot of it is more about adding to a small collection of data I want to have than having a utilitarian tool. Still, I have used my Kiwix archives before (mostly on my phone), and when the internet is down and I’m bored that’s the most likely time I’m looking to browse Project Gutenberg’s catalog for something interesting. Besides, if the zombie apocalypse ever happens I’ve got lots of digital goodies and reference materials.