WHATEVER YOU DO, DON'T REBOOT

I experienced my first hard-drive failure last week. A four-month-old
WD SN770 BLACK NVMe SSD in my Dell laptop failed without any
discernible notice or reason. Everything about my computer was fine
about 7 hours earlier when I hibernated to disk for the workday. When
I lifted the lid upon my return and proceeded to edit some Emacs
buffers, I found that I was unable to save my work! Emacs complained
about the files being read-only.

I investigated this behavior a little, but not quite enough. Instead
of looking more deeply into the current state of my computer and
figuring out why it might have become read-only, I decided to do a
power cycle[1]. I thought a quick reboot would fix everything...

Big mistake. The computer no longer recognized the drive. The BIOS
wouldn't get past a warning: the PCIe slot was "empty". As soon as I
realized what had happened my heart sank and my pulse quickened. I
had just experienced my first hard-drive failure[2].

Of course I had a backup. Yes. Luckily, I'd gotten into the practice
of using `rsync' to copy my home folder to an external drive (which
is then backed up to a second external drive). Unfortunately, since
installing Void over the summer I'd gotten lazy about backups. I
hadn't set them to occur on a schedule, so my most recent backup was
from 5 days earlier, and the one before that went a month back.

I'm not a prolific coder, so 5 days lost was not a terrible amount of
work gone. In fact, I'd been pretty sedentary on my computer those
last few days. This is all that I'd lost:

- Work on a Gopher front-end for libgen (~8 hours)
- Various notes in my personal knowledge repository (unsure of scale)
- Org markup for my previous phlog post

Once I accepted what I'd never get back, I set my mind to getting my
computer operational again. The experience turned from a shock into
a thrill. If you can believe it, I was actually excited that my
hard-drive failed.
I'd been given a reason to battle-test my ability to restore from a
backup. Again: this was my first hard-drive failure. Up until this
point the purpose of backups had been purely theoretical.

So I went out and bought a new SSD. I chose the cheapest compatible
option I could find at the computer store, a 1TB Kingston SKC3000[3].
I installed the SSD, flashed the Void installer onto a drive, and set
up my machine by following the Void full-disk encryption guide. I did
all the usual system configuration for users, groups, packages, and
services. I used `rsync' to bring the home folder over (using all the
switches that seemed relevant, `pEogtUr'). I also ran `stow', a
newcomer to my computing life for which I am very grateful, to set up
links in `/etc' and my home folder.

After three and a half hours I was back up and coding! Honestly, it
was like I hadn't missed a beat.

Most surprising about this experience is how I reacted. I thought I'd
be angry, frustrated, or under the covers in tears. In actuality I
was excited (finally my hard-drive fails! A rite of passage! A new
experience!). I guess it shows that I've matured and gathered some
perspective in my life. Things could have been worse; they weren't.
Things could have been better, but not by much.

I'm now happily back in my computer sanctuary publishing new phlog
posts, chatting with people on IRC, and learning more about how these
incredible machines can work to make my life better.

Reflections
----------------------------------------------------------------------

I've already made some changes to my situation to better protect
against hard-drive failure. I've set up a backup schedule that runs
every night. I've written some small scripts to make the system
configuration of users, groups, packages, and services automatic. And
I've figured out what I need to do next if I really want to take safe
data storage seriously:

- Find an off-site storage solution.
- Use Blu-ray discs as more reliable, long-term backup media.
- Replace `rsync' with a program that can recover deleted files.

It's possible to weather unplanned events, whatever they may be. But
resilience in the face of these calamities takes careful planning and
consistent procedures. In other words: predictability.

Footnotes
----------------------------------------------------------------------

[1] I would later learn that NVMe drives can go into read-only mode
    when they fail. This "feature" buys time to move the data to
    another drive. If I had known this, I would have attempted to
    mount my external drive at `/tmp', which would have still been
    writable since it's on RAM. Then I would have proceeded with my
    regular backup routine and tried my luck at a `dd' of the full
    drive.

[2] I'd always heard that hard-drives fail. I must have been lucky
    most of my life. This experience has given me a new outlook:
    don't trust a drive to last more than the current day.

[3] Ironically, the three clerks I asked for help at the computer
    store didn't know much, if anything, about SSDs. I had to give
    myself a quick refresher on the technical jargon and what it all
    meant. I learned that: NVMe is the protocol, M.2 specifies the
    connector, 2280 specifies the dimensions, PCIe 4.0 is the
    interface and it's backwards compatible with 3.0.
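
The rescue described in footnote [1] could be sketched roughly as
follows. It has not been run against a real failing drive. The
device names (`/dev/sda1', `/dev/nvme0n1'), the home path, the
`/tmp/rescue' mount point, and the `dd' options are assumptions for
illustration; the `rsync' switches are the ones quoted in the post,
assumed here to match the regular backup routine. By default it only
prints the commands instead of running them.

```python
import shlex
import subprocess

RSYNC_FLAGS = "-pEogtUr"  # the switches quoted in the post


def rescue_commands(external_dev, failing_dev, home, mount_point="/tmp/rescue"):
    """Return the rescue steps from footnote [1] as argv lists, in order.

    All arguments are hypothetical examples; check the real device
    names (e.g. with lsblk) before trusting any of them.
    """
    # Name the raw image after the failing device, e.g. nvme0n1.img.
    image = f"{mount_point}/{failing_dev.rsplit('/', 1)[-1]}.img"
    return [
        # /tmp lives in RAM, so a mount point can still be created there.
        ["mkdir", "-p", mount_point],
        ["mount", external_dev, mount_point],
        # First the usual home-folder copy, since it matters most.
        ["rsync", RSYNC_FLAGS, f"{home.rstrip('/')}/", f"{mount_point}/home/"],
        # Then try for a full image; noerror,sync keeps going past bad reads.
        ["dd", f"if={failing_dev}", f"of={image}", "bs=4M", "conv=noerror,sync"],
    ]


def run(commands, dry_run=True):
    """Return each command as a shell line; execute them unless dry_run."""
    lines = []
    for argv in commands:
        lines.append(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)
    return lines


if __name__ == "__main__":
    steps = rescue_commands("/dev/sda1", "/dev/nvme0n1", "/home/user")
    for line in run(steps):
        print(line)
```

With `dry_run=False' the steps would need root, and the external
drive would need at least as much free space as the failing drive's
full capacity to hold the image.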