Manually preventing data corruption in relatively static files

By Brian Tomasik

First published: . Last nontrivial update: .

Summary

This article is for people who want to prevent silent data corruption and data loss for relatively static files on their hard drives using a beginner-friendly and transparent procedure. If you have a Mac or one of many versions of Linux, this procedure requires no new software—just the shasum tool that comes preinstalled on your computer. The process involves storing two copies of each file that you don't want to get corrupted, creating a list of SHA-512 hashes for the files, checking the hashes on a regular basis, and correcting any instances of file corruption that appear over time. Alternatively, you can do this process without creating two copies of each file on the same hard drive as long as you have old enough external backups to restore from if need be.

A good explanation of file fixity for absolute beginners is Rudersdorf (2017, "Active Management").

Note: I'm a novice on this topic, so what I write here may be naive. This article represents my attempt to create a very simple anti-bit-rot system for people like me who don't yet know how to use fancier tools.

The procedures discussed in this piece require some manual steps. For a somewhat more automated approach to fixity checking, see the "Fixity checking" section of Tomasik ("How I organize ...").

Background

Digital data can be corrupted in all sorts of ways (Wikipedia "Data corruption"). Sometimes the human user doesn't notice the data corruption, making it "silent". Another name for silent data corruption is "bit rot", and I use those terms interchangeably in this piece. "File fixity" is the opposite of data corruption—it means a file remains bit-for-bit the same over time. One way to monitor fixity is with file checksums. In this piece I use "checksum" interchangeably with "hash". Finally, I use the words "folder" and "directory" interchangeably. (There's no particular reason for using both words. I was just too lazy to edit the article to consistently use just one word.)

For dramatic forms of data loss, such as failure of an entire hard drive, the solution is backups: regularly create additional copies of that data to restore from in case the master copy fails. However, what about smaller-scale data loss? For example, suppose a portion of a single file gets messed up, preventing the file from being opened. Unless it's a file you use on a regular basis, you might not notice the problem for many years. Therefore, unless you have intact backup copies of the data from many years ago, by the time you notice that this file is corrupted, you may not have any uncorrupted versions of it to restore from. This is what makes silent data corruption in some ways more insidious than large-scale data loss: large-scale data loss is easier to notice and fix right away.

Barthe (2014) describes his personal experience with bit rot on the Mac file system HFS+:

I have a large collection of photos, which starts around 2006. Most of these files have been kept on HFS+ volumes since their existence.[...]

The photos were taken between 2006 and 2011, most of them after 2008. There are 15264 files, which represent a total of 105 GiB. 70% of these photos are CR2 raw files from my old EOS 350D camera. The other photos are regular JPEGs which come from the cameras of friends and relatives.

HFS+ lost a total of 28 files over the course of 6 years.

Most of the corrupted files are completely unreadable. The JPEGs typically decode partially, up to the point of failure. So if you’re lucky, you may get most of the image except the bottom part. The raw .CR2 files usually turn out to be totally unreadable: either completely black or having a large color overlay on significant portions of the photo. Most of these shots are not so important, but a handful of them are. One of the CR2 files in particular, is a very good picture of my son when he was a baby. I printed and framed that photo, so I am glad that I did not lose the original.

This not an issue specific to HFS+. Most filesystems do not include checksums either. Sadly…

In a discussion about Barthe's article, user "timothy" says: "I wish I'd lost only 28 files over the years" (Slashdot "One Developer's ..."). Indeed, 28/15264 is only ~0.2%, or 0.03% per year over the 6 years. One could imagine much worse rates of bit rot, perhaps due to particularly faulty hardware.

There are many technical approaches to combating data corruption, such as the ZFS file system and Parchive software. As of early 2019, I haven't yet had a chance to explore these tools. In the meantime, I was thinking about what a poor man's version of these tools could be, for people who are newbies to this field and have relatively simple data-preservation needs.

Another option is to use a cloud-storage service that guarantees data integrity. Cloud storage comes with its own set of pros and cons compared with local storage. In this piece I'll assume you're storing files on your local computer. If you do use cloud backup, I would think it's safer to upload the files as a one-time operation rather than keeping them synced with your local computer, because if you accidentally delete a local file, the synced cloud copy disappears too. Question: would silent bit rot within a file on your local machine also sync to the cloud? Or would it not, because the file system doesn't know the file has changed? Of course, if a cloud-synced file rots locally and then you edit it locally, at that point the hidden rot would go up to the cloud and clobber the unrotted version. Therefore, if you rely on the cloud for protection against bit rot and want to edit a file, I imagine it would be best to download it, edit it, and then reupload it?

Restricting the scope of our task

One of the reasons it's challenging to combat data corruption is that if you plan to edit the files regularly, you need a way to distinguish legitimate edits from illegitimate corruption. The anti-bit-rot tool by Langa et al. (2013-2018) does this by storing each file's modification time alongside its SHA-1 hash in a SQLite database. As best I understand it, the procedure is as follows. The next time you run the program to check file fixity, if a file's modification time differs from (and is presumably more recent than) the one stored in the database, the program assumes the user legitimately edited the file; a hash mismatch is then expected, so no error is reported, and the database is updated. However, if the file's modification time is the same as what's in the database while the hash still doesn't match, that looks like data corruption.
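
To make the idea concrete, here's a minimal sketch of that logic in Bash. This is just my illustration, not the actual tool's code, and the stat flags differ between Mac and Linux:

# A sketch of the mtime-plus-hash idea, assuming you previously stored
# $stored_mtime and $stored_hash for the file in $file:
current_mtime=$(stat -f %m "$file")          # on Linux, use: stat -c %Y "$file"
current_hash=$(shasum -a 1 "$file" | cut -d ' ' -f 1)   # SHA-1, mirroring the tool's choice
if [ "$current_hash" != "$stored_hash" ]; then
    if [ "$current_mtime" != "$stored_mtime" ]; then
        echo "modification time changed too: presumably a legitimate edit, so update the stored values"
    else
        echo "same modification time but different hash: possible corruption"
    fi
fi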

This is probably the "correct" way to do things, but what if you're a bit lazy? Maybe you're not extremely worried about bit rot for most of your data. For example, if you have MP3 songs from iTunes purchased for less than $1 each, or if you have a library of scholarly PDF articles, then it's not disastrous if a few of these files rot over the years, since either they can be replaced or they're not that important to begin with. Data corruption mainly matters for files that you strongly don't want to lose.

In my case, the files where bit rot seems most worrisome are large videos (the originals of the videos that I upload to my YouTube channel) and zipped archives (such as zip files of website backups). Because of their size, the probability of corruption happening somewhere or other in these files over time seems higher than for small files. Perhaps some bit flips in video files would be undetectable, but it would be bad if the entire video could no longer be opened for some reason. Fortunately, these large archival files are also not edited very often, meaning that if you store hashes of them, those hashes won't need to be updated very often, if ever.

If you only have a few tens or hundreds of files that you really want to secure against corruption, you could do the manual file-fixity procedure that I'll describe in the rest of this piece. I haven't actually tried it out for very long yet as of early 2019, so I can't say for sure how well it works.

Setup for our example

In the rest of this article, imagine that we have the following file and folder structure:

-- my_static_files/
    | -- my_videos/
    |     | -- video.mp4
    |     | -- readme.txt
    | -- my_audio/
    |     | -- lecture.wav
    | -- my_websites/
          | -- a_website.zip
          | -- notes.txt

Assume we really want to protect the .mp4, .wav, and .zip files but don't care as much about the .txt files because they contain relatively little content.

Step 1: Duplicate the data on the main hard drive

Create a copy of each important file and store it on the same hard drive as the original file. One approach is to create a little subfolder inside each existing bottom-level folder to store the copies. Let's call these little folders abrc, which stands for "anti-bit-rot copy". I prefer to add a date to the beginning of the file name of the copy to clarify when it was copied. After this step, our files look like this:

-- my_static_files/
    | -- my_videos/
    |     | -- video.mp4
    |     | -- readme.txt
    |     | -- abrc/
    |           | -- 2017-04-18__video.mp4
    | -- my_audio/
    |     | -- lecture.wav
    |     | -- abrc/
    |           | -- 2019-08-04__lecture.wav
    | -- my_websites/
          | -- a_website.zip
          | -- notes.txt
          | -- abrc/
                | -- 2020-02-22__a_website.zip

If the abrc/ copies are date-stamped, you can add multiple versions of the file over time there if you want.
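
For example, the abrc/ copy of video.mp4 above could be created from the terminal like this (a sketch; mkdir and cp are standard tools, but adjust the paths and date to your own files):

# Run from inside my_static_files/:
mkdir -p my_videos/abrc
cp -p my_videos/video.mp4 my_videos/abrc/2017-04-18__video.mp4   # -p preserves the file's timestamp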

This duplication step alone provides almost all the bit-rot protection, and you could potentially skip the second step if you want to be even lazier. It's fairly unlikely that any single file will get corrupted, so the probability that two will both get corrupted is even lower. (That said, if both files are stored in a similar place on the hard drive, maybe it's not too unlikely that corruption in one will imply corruption in the second one...?)

Often, when people do backups, they blindly copy the latest version of each original file to the backup medium. However, doing this means that bit rot on the original will propagate to the backup. As user "mercenary_sysadmin" notes in VFON (2015): "you'll most likely overwrite all your backups with the rotted versions of some of your data before you realize you've got a problem." What I've proposed in this section is different, because you have two independent copies of the file on the original hard drive, both of which will get backed up, with neither one blindly overwriting the other. Of course, this doubles the storage requirements for the files that use this precaution. And it also makes editing the files more annoying, because the same change has to be made to both copies of the file, unless you don't mind and won't be confused by having outdated copies lying around; therefore, this solution is most suitable for very static files.

Step 2: Create a checksums file

If you're only concerned about data corruption so severe that it prevents the file from opening at all, then after you've done Step 1, you could theoretically check for bit rot on your files by opening them periodically. This would tell you whether a file had rotted and if so, whether the rot occurred on the original version or the abrc/ version. However, just being able to open the file doesn't show there wasn't any bit rot. Perhaps some bits got changed without rendering the file too corrupt to open. So what if you want to ensure that files don't have any bit flips at all?

One initial idea might be to diff the two files and ensure they match. However, if they don't match, you can't necessarily tell which is the good one and which one is corrupted, especially if they're binary files, which are harder to compare than text files. Theoretically you could overcome this by storing at least three copies of the file and then choosing whichever two files out of the three are the same as each other. This is the method of Sesame Street's "One of these things is not like the others" song.
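
For reference, comparing two copies byte for byte can be done with the standard cmp tool (no output means they're identical), though as just noted, a mismatch alone doesn't tell you which copy is the good one:

cmp my_videos/video.mp4 my_videos/abrc/2017-04-18__video.mp4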

However, the more standard method is to compute a checksum on each file and make sure the checksum doesn't change. NDSA (2014), p. 3: "If you have multiple copies of digital objects and you have stored fixity information to refer to you can refer to that information to verify which copy is correct and then replace the corrupted digital files or objects."

You could name the checksum file as something like sha512.txt and put it in the my_static_files/ folder. (In the "BagIt" specification (Wikipedia "BagIt"), this file would be called manifest-sha512.txt, but I don't want to call it a "manifest" because it doesn't necessarily point to all the files in the subdirectories. In our example, I'm not storing duplicate copies or creating hashes for readme.txt or notes.txt.)

To compute the hashes, you can use shasum, which is preinstalled on Mac and some Linux distros as of the late 2010s. If you want to add SHA-512 hashes for video.mp4 and abrc/2017-04-18__video.mp4 to the checksums file, you can do that by typing the following into the terminal after moving into the my_static_files/ directory:

shasum -a 512 my_videos/video.mp4 >> sha512.txt
shasum -a 512 my_videos/abrc/2017-04-18__video.mp4 >> sha512.txt

The >> operator appends output to the given file. Make sure not to type just >, since that will overwrite the entire existing file, as user "Cyclonecode" notes in Dchris and Community (2013).

After running those commands, you'll see this in the sha512.txt file:

d78abb0542736865f94704521609c230dac03a2f369d043ac212d6933b91410e06399e37f9c5cc88436a31737330c1c8eccb2c2f9f374d62f716432a32d50fac  my_videos/video.mp4
d78abb0542736865f94704521609c230dac03a2f369d043ac212d6933b91410e06399e37f9c5cc88436a31737330c1c8eccb2c2f9f374d62f716432a32d50fac  my_videos/abrc/2017-04-18__video.mp4

At any later point, you can check that the current (i.e., recomputed) hash of each file matches the older stored hash by running

shasum -c sha512.txt

You could filter that output to show only hash mismatches as follows:

shasum -c sha512.txt | grep -v ": OK"

You could use the simpler grep -v OK, although this would also filter out lines for files whose names contain "OK", such as OK_State_University.pdf.

You can add as many checksums as you want to sha512.txt and check them as often as you want. Digital Preservation Coalition (n.d.): "Files should be checked against their checksums on a regular basis. [...] As a general guideline, [...] checking hard drive based systems might be done every six months. More frequent checks allow problems to be detected and fixed sooner, but at the expense of more load on the storage system and more processing resources." A question I have: Is verifying checksums too frequently a bad idea because it wears down the relevant parts of the hard drive faster? Or maybe if you're merely reading data, the amount of wear and tear caused is negligible?

If only one of video.mp4 or abrc/2017-04-18__video.mp4 is corrupted while the other matches the checksum, get rid of the corrupted version and copy over the uncorrupted version again. If both files are corrupted, you could look in your backup files to see if any of those versions are still uncorrupted.
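
For instance, if the original rotted but the abrc/ copy still matches its stored hash, the repair could look like this (a sketch using the example file names from earlier):

# Replace the corrupted original with the intact copy, then re-verify:
cp -p my_videos/abrc/2017-04-18__video.mp4 my_videos/video.mp4
shasum -c sha512.txt | grep -v ": OK"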

Ideally, every time you're about to edit an important file, you'd want to verify its checksum first, since if the file already has bit rot, that bit rot would hide amongst the legitimate file changes that you're making. However, since this check takes time, it seems like overkill to do it before you edit anything but your most precious files.
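
If you do want to verify one file before editing it, you can pull just its line out of the checksum file (a sketch; the path is only an example, and the same temp.txt trick reappears later in this piece):

grep "my_videos/video.mp4" sha512.txt > temp.txt
shasum -c temp.txt
rm temp.txt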

Adding lots of checksums at once

If you want to add a whole directory of files at once to your checksums file, you can use the following command, which I revised a bit from the command used by Barthe (2014):

find . ! -name ".DS_Store" ! -name "sha512.txt" -type f -exec shasum -a 512 {} \; >> sha512.txt

This says to find everything recursively in the current directory that doesn't have the name .DS_Store, that also doesn't have the name sha512.txt, that is a file (rather than a directory, symbolic link, etc.), and then execute the specified command on each one. The .DS_Store omission is only relevant for Mac users. And omitting sha512.txt is a hack I made up because if you don't do it, the sha512.txt file that you're writing to gets included in the list of files that have their hashes computed, even if the sha512.txt file didn't exist before running this command.

If you only want to match certain types of files, you can modify this command. For example, if you wanted to match only .mp4 or .wav or .zip files (case-insensitive), but not .txt or other files, you could run

find . ! -name ".DS_Store" ! -name "sha512.txt" \( -iname *.mp4 -o -iname *.wav -o -iname *.zip \) -type f -exec shasum -a 512 {} \; >> sha512.txt

Watching progress as shasum -c runs

If you have a lot of big files to check, running shasum -c sha512.txt may take a while, perhaps several hours. You may want to watch its progress, but if you filter out the OK lines by running shasum -c sha512.txt | grep -v ": OK", then if all lines are OK, you won't see any output and won't have any indication of progress until the command finishes. One way around this problem is to instead run this command:

shasum -c sha512.txt | tee temp.txt

This will print progress to the screen and save it to the file temp.txt. Then, when that command finishes, run

cat temp.txt | grep -v ": OK"

to see if there were any non-OK lines. Once you're done, you can delete temp.txt.

Choice of hashing algorithm

You could use other hash functions besides SHA-512, but since computational speed isn't very relevant for a small number of files, I don't see much reason to use anything less powerful. For example, if you used the weaker MD5 algorithm, there would be an (extremely tiny) chance that a malicious actor could replace your legitimate file with a bad version whose hash matches (see the comment by Alexander Duryee in Owens (2014)). Of course, even a cryptographically safe hash like SHA-512 is useless if the malicious actor could just edit the list of checksums to match the newly added bad file. So in the unlikely event you have reason to worry about this risk, you'd presumably want to store the checksums file separate from the data files, in a place where malware or rogue employees couldn't reach it.

In 2019, I ran on my computer a simple comparison of MD5, SHA-1, SHA-256, and SHA-512 when computing the hash for a 6.5 GB file. I ran the computations a few times, flipping around the order on different runs. I measured the literal amount of wall-clock time it took the operations to complete, though this of course includes time spent on other tasks that my computer was running. The results varied a bit from one run to the next, but a reasonable average might be that MD5 and SHA-1 each took ~2.5 minutes, SHA-512 took ~3 minutes, and SHA-256 was somewhere in between. So differences in computational cost seem pretty negligible. In fact, it's possible that any observed differences here were just noise. Langa et al. (2013-2018) claims that "bandwidth for checksum calculations is greater than your drive's data transfer rate", which I interpret to mean that hard-drive speed is the bottleneck? Ma (2015) agrees: "almost all time are on reading the file content. The algorithms and the tools themselves are not yet the limitation. The disk I/O speed is."
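
If you want to run a similar comparison yourself, the standard time utility is an easy way to do it (a sketch; the file name is a placeholder, and on Linux the MD5 command is md5sum rather than md5):

time md5 some-large-file.iso
time shasum -a 1 some-large-file.iso
time shasum -a 256 some-large-file.iso
time shasum -a 512 some-large-file.iso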

Alternative: Just do Step 2

Just as you could do Step 1 without Step 2, you could also do Step 2 without Step 1. Rather than duplicating files on the same hard drive, you could replace files on your main hard drive that have become corrupted using your backups on other storage media—as long as you still have old enough backups to perform such a restoration. For example, if you verify checksums once per year, then you should remember to always keep old backups for at least a year, ensuring that you have a copy from before the bit rot was introduced. Tunstall (2016) makes the same point: "You should probably scrub your data at least as often as your oldest tapes [i.e., backup copies] expire so that you have a known-good restore point."

As long as you have the discipline to perform checksum verifications regularly, then doing just Step 2 without Step 1 is arguably better than doing both Step 1 and Step 2, because doing Step 2 without Step 1 saves on storage space and is easier (since you don't have to create abrc/ folders). Of course, doing both Step 1 and Step 2 seems slightly less likely to lose data, just because you have more total copies of the data lying around, both on your main hard drive and in the backups.

Keeping your checksums file updated

Maybe the most important thing to remember with the manual file-fixity system outlined in this piece is that you need to keep the sha512.txt checksum file updated with any changes in the files to which it points. If you merely move the location or change the name of a file without altering its contents, you can just edit the sha512.txt file directly to change the file's path. If you change the actual contents of the file, you have to delete the old line from sha512.txt and rerun the shasum command to add the new hash.
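
As a sketch of that last case (the path is just an example; you could equally well delete the stale line in a text editor):

# Remove the outdated line for the edited file, then append a fresh hash:
grep -v "my_videos/video.mp4" sha512.txt > temp.txt && mv temp.txt sha512.txt
shasum -a 512 my_videos/video.mp4 >> sha512.txt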

If you forget to update sha512.txt after editing a file, then when you later run shasum -c sha512.txt, you might mistakenly think data corruption has occurred when it hasn't. One clue that this might have happened is if the "Last Modified" date of the file according to your file system differs from the date stamp in the file name of the abrc/ version of the file. For example, suppose you create video.mp4 and abrc/2017-04-18__video.mp4 on 2017 Apr 18. Then you mindlessly update video.mp4 on 2017 Jun 05. When you later get a hash mismatch for video.mp4, you can check its "Last Modified" date, see that it's 2017-06-05 rather than 2017-04-18, and conclude that video.mp4 was legitimately modified.

What if you're skipping Step 1 and don't have files in abrc/ folders? One solution could be to keep a record of the date every time you run shasum -c sha512.txt . Each time you run it, you should fix any issues until all the hashes match. Then, suppose that the next time you run shasum -c sha512.txt a few months later, you get a hash mismatch. Is it bit rot, or did you just edit the mismatching file and forget to update its hash? To help figure this out, you can look at the "Last Modified" date of the file and see if it's more recent than the date when you last ran shasum -c sha512.txt . If so, then you probably just edited the file.
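
In either case, the "Last Modified" date is easy to check from the terminal if you prefer that to a file browser (the path is just an example):

ls -l my_videos/video.mp4               # the date shown is the last-modified time
stat -f "%Sm" my_videos/video.mp4       # Mac; on Linux use: stat -c %y my_videos/video.mp4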

The annoyance of keeping the checksum file updated when you edit files to which it points is a main reason you may not want to add checksums for simple readme.txt kinds of files. Not only are they generally less important than the main data they describe (meaning that the benefits of fixity checking are lower), but they're also more likely to be edited over time than are large binary files (meaning that the costs of fixity checking are higher).

Duryee (2017) confirms (p. 1) the idea that renaming files or folders can mess up a fixity-checking system that identifies files by their paths:

Consider the following common scenario: an archive decides to implement a new directory standard for its digital repository, and changes the locations of its digital assets accordingly. Most checksum tools, which use filepaths as the unique identifier of a file, will cease to recognize the digital objects post-move - they will report that one object was deleted and one object was created.

Duryee (2017) goes on to describe how the "Fixity" tool created by the firm AVPreserve avoids this problem.

If you want to move or rename a lot of files at once, then instead of updating the path of each file in sha512.txt, you could do the following. Run shasum -c sha512.txt to make sure all the files are currently ok (or, if not, fix them). Then move or rename the files, delete the old sha512.txt, and regenerate a new sha512.txt, since hopefully merely moving or renaming the files shouldn't have corrupted them. (When you "move" a file on the same hard drive, the file's contents don't actually move on the physical disk, so I assume the moving process shouldn't introduce data corruption.) If you want to be extra sure that nothing went wrong, you could take the list of hashes (without file paths) from the old sha512.txt file, sort them, take the list of hashes from the new sha512.txt file, sort them, and diff the lists to ensure they're the same, thereby showing that the files all remained the same in their contents. Extracting just the hashes and sorting them can be done in a program like Excel, or on the command line like this:

cut -d ' ' -f 1 sha512.txt | sort > sorted-hashes.txt
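
Applied to both checksum files, the whole comparison might look like this (a sketch that assumes you kept the old file under a name like sha512_old.txt rather than deleting it):

cut -d ' ' -f 1 sha512_old.txt | sort > old-hashes.txt
cut -d ' ' -f 1 sha512.txt | sort > new-hashes.txt
diff old-hashes.txt new-hashes.txt   # no output means all file contents are unchanged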

A similar idea could apply if you want to edit a bunch of files at once. First check that all the files are ok, then make the edits, and then regenerate the checksum file immediately afterward. In this case, the list of checksums won't be the same before vs. afterward.

A review of the procedure

Here's a summary of how to implement this system. I'm writing it from the perspective of only doing Step 2, not Step 1, since I like that method the best.

Marking files for inclusion or exclusion

In the earlier section "Keeping your checksums file updated", I explained that if you want to move or rename a lot of files at once, rather than editing the sha512.txt file by hand, you can run shasum -c sha512.txt to verify that all files are uncorrupted, then move the files around, delete the old sha512.txt, and regenerate a new sha512.txt using a command that creates the whole sha512.txt file at once, as described in the earlier section "Adding lots of checksums at once".

This same approach is also useful if you want to add a bunch of new files to sha512.txt without having to append them one by one. As you accumulate new files, you can put them in your folder system without bothering to append them to sha512.txt individually. Instead, once every few months or whatever, you can add the whole bunch of new files at once, by

  1. running shasum -c sha512.txt to verify that all the files you're already monitoring are uncorrupted
  2. deleting the old sha512.txt
  3. regenerating a new sha512.txt that includes all the files within your folder system, both the old files and the new ones
  4. rerunning shasum -c sha512.txt to make sure the new checksum file is good.

If you want, rather than deleting the old sha512.txt right away, you could rename it to something like sha512_old.txt and then diff it against the new sha512.txt.
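
Put together, the procedure might look like this (a sketch; it reuses the basic find command from the "Adding lots of checksums at once" section, and sha512_old.txt is just an example name):

shasum -c sha512.txt | grep -v ": OK"   # step 1: investigate anything this prints
mv sha512.txt sha512_old.txt            # keep the old file around for comparison
find . ! -name ".DS_Store" ! -name "sha512.txt" ! -name "sha512_old.txt" -type f -exec shasum -a 512 {} \; >> sha512.txt
shasum -c sha512.txt | grep -v ": OK"   # step 4: should print nothing
diff <(sort sha512_old.txt) <(sort sha512.txt)   # optional; differences should just be newly added files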

What if you want to exclude certain files from sha512.txt, perhaps because you edit them regularly? If you're regenerating sha512.txt in bulk to include all subfiles of some folder, then every time you regenerate sha512.txt, you have to manually remove the lines for the files you don't want to monitor. You could write a simple script with a blacklist to automate this removal. But if you're able to change the names of folders and files, another approach is the following.

For any file or folder that you don't want to include in sha512.txt, include _dicft in its file or folder name. For example: readme_dicft.txt or temp-backups_dicft/. This annotation is short for "don't include checksum for this". I intentionally made this initialism strange and unpronounceable so that it's less likely to already appear by chance in file and folder names. Of course, you could make up any other convention you want.

Then, to exclude these items from sha512.txt, you would add

! -path "*_dicft*"

in the find command that's used to generate sha512.txt. For example:

find . ! -path "*_dicft*" ! -name ".DS_Store" ! -name "sha512.txt" -type f -exec shasum -a 512 {} \; >> sha512.txt

Using -path rather than just -name here means the pattern is matched against a file's whole path, so it catches annotated folder names as well as file names. That way, you can exclude all the files in a folder just by annotating the folder name, without having to annotate each of the possibly hundreds or thousands of files within the folder.

Excluding selected files and folders is good if you want to include most of your items in sha512.txt, but what if you only want to include a small fraction of items in sha512.txt? Rather than labeling items to exclude, you could label items to include, by annotating them with _yicft, which is short for "yes, include checksum for this". And to your find command you would add

-path "*_yicft*"

If you want, you can use both _dicft and _yicft annotations and add the following to your find command:

-path "*_yicft*" ! -path "*_dicft*"

Different conditions in a find command are implicitly separated by a logical AND operator, so both conditions have to be met for a file to be included. Therefore, including _dicft anywhere in the path acts as a "veto" against inclusion, regardless of whether the path also contains _yicft . This means you could mark some folder for inclusion with _yicft while still excluding some subfiles and subfolders within that folder. For those who want to use this method, I'll write out the full command so that you can just copy and paste it from here:

find . -path "*_yicft*" ! -path "*_dicft*" ! -name ".DS_Store" ! -name "sha512.txt" -type f -exec shasum -a 512 {} \; >> sha512.txt

Checking a subset of files

Suppose you have a single large sha512.txt file, and running shasum -c on the whole thing takes several hours. Maybe you just want to check a small subset of the files quickly. Because sha512.txt files have no header row and consist only of a bunch of independent lines, this is easy to do by selecting specific lines.

As an example, imagine that you have a top-level folder my-documents/, which contains your sha512.txt file. Within my-documents/ is a folder named multimedia-files_yicft/, and within that is a folder named good-music/. Suppose you want to rename good-music/ to good-audio/. Before renaming it, you want to check that all files in it are ok, without having to run shasum -c on the much larger set of files within my-documents/. To do this, run the following from within the my-documents/ directory:

cat sha512.txt | grep good-music/ > temp.txt
shasum -c temp.txt | grep -v ": OK"

Make sure the shasum -c line returns no results, indicating no problems. Then you can delete temp.txt.

Since you're renaming a folder, you also need to modify the main sha512.txt file to update its paths. One way is to do a global find+replace in the file from good-music/ to good-audio/. Another option is to temporarily remove the good-music/ lines and then add back the checksums after you've renamed the folder:

# Filter out the "good-music" lines from sha512.txt
cat sha512.txt | grep -v good-music/ > temp.txt
rm sha512.txt
mv temp.txt sha512.txt

# Do the folder renaming
mv multimedia-files_yicft/good-music/ multimedia-files_yicft/good-audio/

# Add the "good-audio" files back to sha512.txt
find . -path "*good-audio*" -path "*_yicft*" ! -path "*_dicft*" ! -name ".DS_Store" ! -name "sha512.txt" -type f -exec shasum -a 512 {} \; >> sha512.txt

This approach of removing and then adding back the checksums is unnecessarily complicated for this particular example, since it's easier to just find+replace from good-music/ to good-audio/ in the sha512.txt file. However, removing and then adding back checksums is the easier method in other situations, such as if you wanted to make lots of customized tweaks to individual file names within good-audio/, without having to manually make all those same changes within the sha512.txt file.

By the way, in the above list of commands, technically I could have omitted the line

rm sha512.txt

because the subsequent mv command should already overwrite the old version of sha512.txt. I just prefer to be more explicit, especially in case mv is aliased on the current machine to mv -n (meaning "do not overwrite an existing file").

Testing if the system works

lrq3000 (2015-2018) provides a number of useful file-fixity tools. One of them:

filetamper.py is a quickly made file corrupter, it will erase or change characters in the specified file. This is useful for testing your various protecting strategies and file formats (eg: is PAR2 really resilient against corruption? Are zip archives still partially extractable after corruption or are rar archives better? etc.). Do not underestimate the usefulness of this tool, as you should always check the resiliency of your file formats and of your file protection strategies before relying on them.

TODO: I'd like to eventually try this tool to test out the file-fixity approach presented in this piece, though my approach is so simple that I'd be surprised if it didn't work.

I'm also curious how resilient plain-text files are to bit rot.

Noticing accidental deletions

If a file has been moved or deleted, shasum warns you that "1 listed file could not be read". This is actually a really nice guard against accidental deletion of files—whether by you or by a buggy program or script. One could even argue that this functionality is more useful than the checksums themselves, since, in terms of the average amount of silent data loss it causes, accidental file deletion is plausibly at least as severe as data corruption. (Accidental deletions are plausibly less frequent than ordinary bit rot if you're pretty careful when managing your files, but if accidental deletion does happen, you could lose a decent chunk of files all at once, possibly without noticing the problem for a while.) If files have been deleted (rather than just moved or renamed), you can hopefully restore them from a backup before the backups containing them expire.

This check of whether files have been accidentally deleted is only possible when using a checksum file separate from the files it points to. If the cryptographic hash of file X were stored in file X's own extended attributes, as is done in the approach of Bartlett (2014), then if file X is accidentally deleted, I assume there's no readily available external record to show that it's missing. Also, extended attributes are less portable across file systems than an external .txt file is. (In fairness, Bartlett (2014) does also include an option to export hashes to an external file.)

Using shasum to merely check file attendance

"File attendance" is a term that I mostly see used by AVPreserve people with reference to their "Fixity" tool. Rudersdorf (2017, "AVPreserve's ...") at 4m56s: "when tested over time, checksums can also monitor for file attendance. That's to make sure that files have not moved from their intended location. That's what 'file attendance' means. The files are where they're supposed to be." Lyons (2019) explains (p. 62): "Checksums can also be used as an inventory to monitor file attendance or identify if a file is new (the checksum signature has never been produced before), removed (a checksum is missing from a list), or moved (the checksum appears with files in another location)."

I'm not sure exactly how broad the term "file attendance" is, but I'll use the term in its most simple sense, to mean "checking whether a file at a given path exists."

When manipulating files and folders on your computer, there's often a small chance that you'll delete the wrong thing, causing you to lose some files that you intended to keep. If you don't use those files regularly, you might not notice this accidental deletion for a long time. As one guard against this, you could add, say, ~20 randomly selected files scattered throughout your folder system to a checksum file and verify them periodically, just to make sure they exist (proving that you haven't accidentally deleted a folder that contains them). If the files aren't static, then doing this would incur the hassle of keeping the hashes up to date. However, you could just ignore hash mismatches and only use shasum -c to check file attendance for these ~20 random files. In other words, just make sure that when you run shasum -c, there's no message that "WARNING: N listed files could not be read" for some N. If you do this, you might want to use a different checksum file from your fixity-checking checksum file, so that you can distinguish when you care about hash mismatches from when you don't. For example, you could call this separate file sha512-for-attendance-only.txt. When running shasum -c sha512-for-attendance-only.txt, you can ignore hash mismatches. Of course, you would still have to keep the file paths updated in sha512-for-attendance-only.txt if files are renamed or change their location.
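
In terms of commands, the attendance-only check could look like the following sketch (the exact wording of shasum's output might differ slightly across versions, so treat the grep pattern as an assumption to verify on your machine):

shasum -c sha512-for-attendance-only.txt | grep "FAILED open or read"
# Any line printed here means a listed file is missing or unreadable; plain hash
# mismatches (reported as just "FAILED") are filtered out and thus ignored.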

Using checksums when moving files to a new computer

Suppose you're moving to a new house. Before you depart your old house, you might make an inventory of the items you have (at least the most important ones). Then after you've reached the new house and begin unpacking, you can check that you didn't lose anything. While it's perhaps overkill in real life, we could imagine that you also take a picture of each item before moving so that you can verify by eye that the item wasn't damaged during the move.

Checksums enable the same sort of verification for transporting collections of digital items, such as when you move your files from an old computer to a new one. You can put all your files in a folder on your old computer (let's call it my-docs/), run the command in the "Adding lots of checksums at once" section within my-docs/ to checksum all the files, move my-docs/ to the new computer, and then run shasum -c on the checksums file to make sure that no files got omitted or damaged during the moving process. (This is the basic idea behind the "BagIt" convention.)

I call the checksum file used for moving sha512-of-everything.txt because it contains checksums for all files, as opposed to the regular sha512.txt file that only has checksums for files whose fixity you're tracking permanently. sha512-of-everything.txt can be deleted once you successfully move to your destination.
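
Concretely, the two ends of the move might look like this (a sketch reusing the find command from earlier with the new file name swapped in):

# On the old computer, from inside my-docs/:
find . ! -name ".DS_Store" ! -name "sha512-of-everything.txt" -type f -exec shasum -a 512 {} \; >> sha512-of-everything.txt
# On the new computer, from inside the transferred my-docs/:
shasum -c sha512-of-everything.txt | grep -v ": OK"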

If you want, you could use the same idea of generating a sha512-of-everything.txt file for all your data prior to a major update to your operating system, just to check that you didn't lose data during the update. This may be overkill for all but the largest operating-system updates, though.

Checksumming all files to verify attendance

If you're particularly worried about safeguarding your data, you could actually keep around the sha512-of-everything.txt file on a permanent basis rather than deleting it after migrating to a new computer. As you edit and move around your personal files, sha512-of-everything.txt would accumulate errors. To deal with this problem, you could run shasum -c on it regularly (say, once a month) and make sure that the errors it shows aren't unexpectedly numerous. You could also skim the errors and verify that they look roughly consistent with the edits you made over the past month. For example: "Yep, I recently edited that file that has a hash mismatch. And yep, I recently moved that file to a new location." Then you could generate a refreshed version of sha512-of-everything.txt and use that for your check next month.

In my opinion, the main value of doing this would be as a sanity check on file attendance: making sure you haven't lost large numbers of your files for some reason. Unless you carefully pore over the sha512-of-everything.txt errors, it's less likely this approach would help verify file fixity apart from attendance, because in the course of quickly skimming the list of errors, you could easily miss a hash mismatch that's actually due to bit rot.

Using Bash functions

Some of the Terminal commands presented in this piece are long to type, and while you can copy-paste them, it can be more convenient if you run them a lot to create Bash functions for them, so that you can run a command by typing a short function name. In Tomasik ("How I organize ..."), in the section "Links to my scripts", you can see a file called the-actual-bash-aliases-file.txt . It contains some custom Bash functions that I use, including a few related to checksums.