How I organize and back up data on my computer

By Brian Tomasik

First published: . Last nontrivial update: .

Summary

This page describes various conventions and procedures I've developed for structuring the files stored on my computer's hard drive. I also share a Python script for automating most of the steps involved in backing up my hard-drive data to an external drive.

This piece assumes a Mac or Linux operating system, but many of the principles would transfer to Windows. I've optimized my approach for my own particular needs, not really worrying about whether it's useful for anyone else. So I see this article as mainly a possible source of ideas for readers rather than something they'd necessarily want to imitate fully.

I remain a novice in this area, and suggestions on how to do things differently are welcome.

Notes about this piece

This article is a brain dump of conventions and ideas about organizing personal data that I've accumulated over time. I often made up my own terminology and guidelines based on my own experience and experimentation. However, I imagine that a lot of these ideas have been invented many times before—and some of the ideas are so obvious that there was hardly anything to invent.

Here are some limitations of the approach I present in this article:

In this piece, I use the terms "folder" and "directory" interchangeably, for no reason other than that I'm not disciplined about sticking to one word.

Overview of my folder hierarchy

To start off, I'll give a picture of the structure of my folder system, which I'll refer to throughout the subsequent discussion as the "overview diagram". Folder names end with the / character. The ~ at the top is the home folder on Mac or Linux. Indentation shows that files and folders are nested inside other folders. Items are sorted in alphabetical order, except that all folder names sort before all file names. I omit most of the contents of most of the folders, just showing a few things for illustration. I made up a lot of the more deeply nested file and folder names; I'm just using them as examples.

~/
    .bash_aliases
    files/
        debs/
            processed/
                backups_ntmc/
                    GitHub/
                        2017-04-29/
                    Google-Drive/
                        2018-08-16/
                        2019-04-03/
                    my-websites/
                    the-Wikipedia-page-on-woodlice_s2017-04-18.html
                my-YouTube-videos/
                    originals-of-individual-videos/
                        2014/
                            Artificial-general-intelligence-and-international_trunc/
                    readme.txt
            unprocessed/
                bug-videos-to-do/
                home-videos/
        temp/
        yebs/
            processed/
                career/
                computer/
                    data-storage/
                        backups-to-external-drives/
                            dlbuod_2019-07-01_eom.txt
                    programs/
                        active/
                            7-Zip/
                                custom-settings.txt
                                installation.txt
                                notes.txt
                        no-longer-used/
                effective-altruism/
                financial/
                    taxes/
                        ESPP/
                            how-to-manually-calculate-compensation-income_ntor.txt
                        tax-year/
                            2013/
                                charity-donation-statement_scdi2025.pdf
                        Mars-taxes.txt
                        Venus-tax-guidelines_r2018-07-02.pdf
                letters/
                    2008/
                        card-from-John-Doe_i2008-02_p2016-01-04.pdf
                    2017/
                        Question-about-insect-suffering.html
                    unknown-year/
                        birthday-card-from-grandma_npy.pdf
                medical/
                    issues/
                        active/
                            2018_dry-eye/
                                warm-compress/
                                    usage-instructions-from-Amazon_s2018-11-03_ncsd.txt
                misc/
                    check-me-often/
                        calendar.txt
                        files-to-reopen.txt
                        important-recurring-reminders.txt
                scripts_ytmc/
                    monitor-data-changes/
                        ignore.txt
                        run.py
                    vxhx/
                        backup_vao2019-10-02T13-27.py
                        backup_vao2019-10-05T05-02.py
                    backup.py
                    create-a-version-snapshot.py
                    dir-splitter.py
                    normalize-file-and-folder-names.py
                    Swiss-archive-knife.py
                    the-actual-bash-aliases-file.txt
            unprocessed/
                documents/
                downloads/
                todos/
                    2019/
                        02a.txt
                        02b.txt
                        02c.txt
                        03a.txt

In later sections I'll explain various things to notice about this organizational system.

As you can see, I use various abbreviations when naming things. Abbreviations can be useful to concisely describe some concept or label that you write frequently. That said, if you use abbreviations, you'll probably want to record their meanings so that 10 years from now you aren't puzzled by them. For me, this article serves the purpose of documenting some of my abbreviations.

In the overview diagram, you can see a folder scripts_ytmc . It stores various scripts I wrote, some of which I'll mention later in this article. Following are links to uploaded versions of these scripts:

If you were to run the scripts on your computer, they should just have the .py extension, but my uploaded versions have a .py.txt extension so that they open in the browser as text files that you can read, rather than causing the browser to ask if you want to download them. If you do want to download a script, you can right-click and save the file. Once downloaded, you can remove the .txt extension.

Keep in mind that some of my scripts are pretty hard-coded to my particular setup, because I'm optimizing for convenience for myself and don't know if anyone else will find these scripts worthwhile. I offer my scripts just in case there are parts of them that you'd find useful—and so that people more experienced than I am can offer suggestions on how to improve them.

I'm hosting my scripts here rather than on GitHub in order to keep my data all in one place (on my website) and avoid taking a dependency on a third-party service that might disappear or at least severely degrade in quality within 10 or 20 years. A given script is entirely contained in one file, so downloading it from my site is trivially easy.

My .bash_aliases file

In the subsequent discussion I occasionally refer to my Bash aliases and functions, which I explain in this section.

In the overview diagram, you can see a .bash_aliases file under the home directory. My .bash_aliases file actually only consists of the following line:

source ~/files/yebs/processed/scripts_ytmc/the-actual-bash-aliases-file.txt

This line loads aliases and functions stored in the-actual-bash-aliases-file.txt . The reason I point to another file rather than writing the aliases and functions directly in .bash_aliases is that I back up files within the files/ directory but not files outside of it, and I want my custom aliases and functions to be backed up. When I'm setting things up on a new computer, once I have my data under files/ in place, all I need to do is create the one-line .bash_aliases file that points to the-actual-bash-aliases-file.txt in order to make my aliases and functions work.

Rather than source'ing the-actual-bash-aliases-file.txt from .bash_aliases , you could alternatively source it from .bashrc . However, depending on your system, .bashrc may already exist and have code in it, while .bash_aliases is probably empty or nonexistent by default. So by creating and editing .bash_aliases you can avoid editing an existing file that has default contents written by other people.

In the previous section I linked to an uploaded version of the-actual-bash-aliases-file.txt . If you read that, you can see that I put "my" at the beginning of the name of each function, as an easy way to guarantee that their names won't collide with existing commands, and to show that they're functions I wrote.

The last line of the-actual-bash-aliases-file.txt loads another Bash-aliases file that stores additional Bash functions intended just for my use rather than for sharing publicly. If you use my the-actual-bash-aliases-file.txt but don't have this additional-Bash-aliases.txt file on your computer, you'll get a warning about not being able to find it. If so, you can either create your own additional-Bash-aliases.txt file or comment out the final line of the-actual-bash-aliases-file.txt .
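A third option is to guard that final line with a file-existence check, so that the extra file is only loaded if it's actually present. Here's a minimal sketch of the idea (the path is just an example; put your additional file wherever you like):

# Only load the extra aliases file if it exists, to avoid a warning on
# machines that don't have it.
if [ -f ~/files/yebs/processed/scripts_ytmc/additional-Bash-aliases.txt ]; then
    source ~/files/yebs/processed/scripts_ytmc/additional-Bash-aliases.txt
fi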

Explaining the folders

The files/ directory

I use the files/ directory to hold all the data I care about on my computer, with a few exceptions like .bash_aliases that have to lie outside it. I can ignore anything outside of files/ when thinking about backups and what would be lost if, say, my hard disk were to die suddenly. Ordinarily computers include built-in folders like ~/Desktop/ , ~/Documents/ , ~/Downloads/ , and so on, but I don't use them, or if I do use them, it's only very temporarily, and then I move the files into my main folder hierarchy. That's because data not inside files/ isn't backed up and therefore is at risk of disappearing.

Virtue (2010-2017) recommends the same idea in his Tip #3. He explains his point in the context of Windows, but the same idea applies to any operating system. He advises: "Every data file (document, photo, music file, etc) that you create, own or is important to you, no matter where it came from, should be found within one single folder[...]. In other words, do not base your folder structure in standard folders like 'My Documents'. [...] If you only have one hard disk (C:), then create a dedicated folder that will contain all your files – something like C:\Files."

Instead of the built-in ~/Downloads/ folder, I have my own downloads/ folder, as you can see in the overview diagram. Its path is ~/files/yebs/unprocessed/downloads/ . You can change your browser's settings to automatically save files to this folder rather than ~/Downloads/ .

If you nest your downloads/ folder several layers deep inside files/ , navigating to it takes longer than if it's right under your home folder. One solution to this problem can be to create symbolic links, shortcuts, bookmarks, etc to commonly used internal folders. Another option is to create a Bash alias or function that navigates to the internal folder; as you can see if you open up the-actual-bash-aliases-file.txt , this is done with the myd function. I like this latter approach because then I never have to worry about symbolic links, which add complexity relative to a hierarchy that only consists of files and folders. (If you browse the man pages for various command-line utilities, you'll often see special cases where symbolic links are involved. I prefer to avoid worrying about that altogether.)
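In case it helps to see the shape of such a function, the core of a navigation shortcut like myd is just a cd wrapped in a Bash function. This is a simplified sketch rather than my exact definition:

# Jump to my downloads folder from anywhere.
myd() {
    cd ~/files/yebs/unprocessed/downloads/ || return
}

Once your aliases file has been reloaded (e.g., by opening a new Terminal window), typing myd from any directory takes you straight to the downloads folder.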

Custom settings for programs

Your hard drive stores various settings and other files relevant to installed programs, which are not contained in files/ . You could back up these other files and maybe even restore them directly on a new hard drive if your current one fails, or when migrating to a new computer or reformatting your current one. I've never done this, because pointing a newly installed program to an old configuration file sounds potentially brittle and error-prone. I prefer to instead manually keep a record of any time I change the custom settings of a program I use. That way, on a new computer, I can freshly install my programs without any legacy baggage, and I can just reproduce by hand whatever settings changes I care about.

In the overview diagram, you can see the path ~/files/yebs/processed/computer/programs/ . This is where I store my custom settings, installation instructions, and other information about various programs. The active/ subfolder houses documentation about all the programs that I'm actively using (rather than old programs or programs I'm considering trying in the future). When I get a new computer, I can carry over this folder of active/ programs and follow the installation.txt and custom-settings.txt files to see how to install and configure my programs on the new machine. In the overview diagram I use 7-Zip as an example program for which one might have those .txt files. I also show a notes.txt file that records random other notes about the program.

In his Tip #34, Virtue (2010-2017) mentions the same idea of recording "Notes you've made about all the specific customizations you have made to a particular piece of software (so that you’ll know how to do it all again on your next computer)".

Here's an example of the kinds of instructions I have in mind that you would write in a custom-settings.txt file for a program, which in this case we can imagine is some particular text editor:

In File > Preferences:
* change the default text size from 12 to 16
* change the tab width from 4 to 2

In the plugins menu, remove all the preinstalled plugins except:
Foo Plugin
Bar Plugin
Baz Plugin

yebs/ and debs/

yebs stands for "yes, encrypt before saving", and debs stands for "don't encrypt before saving". Here, "saving" means "backing up to some external location, such as an external hard drive, a Blu-ray disc, the cloud, etc." It's good to encrypt at least some parts of your data in case bad guys get their hands on your backups. A flash drive or external hard drive could be lost or stolen. If you back up to a cloud service, your account could be hacked, either in a targeted fashion or if there's a widespread security flaw or data breach on the part of the cloud provider.

The yebs/ directory includes most of my notes on various topics, my todo lists, documents and letters I've scanned, tax information, medical information, and a variety of other things. By default I put files in yebs/ unless I know they don't contain sensitive information, so that I won't have to think too much about whether they do or not. This is similar to the idea of shredding all the letters you get in the mail so that you don't have to think about which ones do and don't contain info that could be used by identity thieves and the like.

Even though yebs/ is my default directory, it's pretty small—just a few gigabytes. That's nice because then zipping and encrypting the folder only takes ~10 to ~20 minutes. If I have a large folder of data I don't access very often, I can pre-encrypt it and then store it in debs/ , so that I don't have to re-encrypt it every time I make a backup. I add the _ee tag to the end of files that have been pre-encrypted, so that I can find them all in the event that I, say, need to change the password. I would find them by going to the files/ directory and entering this into Terminal:

find . | grep _ee

(I called the _ee part of a file name a "tag" because it serves the purpose of annotating the file. It indicates that the file has the _ee property—i.e., that the file is already encrypted. I'll discuss various other file tags I made up throughout this article.)
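To make the "zip and encrypt" step more concrete, here's one way it could be done using 7-Zip's command-line tool, 7z . This is just an illustration with placeholder names, not necessarily how my own backup script does it:

# Create a password-protected archive of the yebs/ folder. The destination
# path and archive name here are placeholders. The -p flag prompts for a
# password; -mhe=on also encrypts the file names inside the archive.
cd ~/files
7z a -p -mhe=on /path/to/external-drive/yebs-backup.7z yebs/

Letting 7z prompt for the password (rather than typing it as part of the command) keeps the password out of your shell history.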

The debs/ directory is backed up unencrypted. This means it doesn't have to be zipped up before backup, which is good because mine is over half a terabyte. Most of that size is accounted for by video and audio, including digitized homemade tapes from when I was a kid, originals of the videos on my YouTube channel, raw video footage not processed yet, and so on. Data that's not personal can usually go here, such as a music library or a collection of academic papers.

There are approaches to encryption that overcome the issue of slowness when re-encrypting data each time it's backed up. I think there are encryption-aware sync tools that only make updates to what has changed, even for encrypted data. Another option is to use full-disk encryption on the external drive. Either of these methods would avoid the need to separate yebs/ from debs/ .

Regarding the option of full-disk encryption of an external drive, I'm wary of trusting random hardware manufacturers with the task (Nichols 2018), but tools like VeraCrypt would presumably work. I haven't yet looked into this topic in much depth, and maybe it would make things easier. However, I'm wary of encrypting more data than necessary because it seems to also increase the risk of making the data unrecoverable. For example, if the external drive gets damaged or suffers bit rot, could this prevent you from recovering any data on it, due to the full-disk encryption? (I don't have enough expertise to comment on that.) In contrast, with an unencrypted drive, in the worst case you could probably take it to a data-recovery service to pull the non-damaged data off the drive, including any small encrypted backup files it may contain. Of course, one reply to this argument could be that you should just have lots of full-disk-encryption backup drives, so that a failure in any one or two isn't catastrophic. That works as long as the hypothetical catastrophe for full-disk encryption isn't systematic across all instances of it.

I don't have settled views on this and may change my mind in the future as I learn more. I'm wary of things I don't yet understand, and that includes how likely data loss is with full-disk encryption. By the way, I think it's a good idea to use full-disk encryption on your computer's internal hard drive. I'm not overly worried about data loss in this case because if the full-disk encryption on your computer runs into problems, you should be able to restore your data from backups.

Note that some of the content in debs/ , while unencrypted, might nonetheless be zipped up, to avoid having massive numbers of individual files on your computer. Having huge numbers of individual files is perhaps most likely if you have numerous website or source-code backups. As an example of why this could matter, I found that my Blu-ray burner kept spinning indefinitely when I tried to burn several hundred thousand files to a disc (Tomasik "Archiving ..."). Zipped files are good candidates for monitoring with checksums (Tomasik "Manually ..."), so that you can detect whether bit rot has corrupted them before you need to open them.
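Generating and later verifying a checksum for such a zipped file is a one-liner each way. Here's a minimal example using a placeholder file name ( sha256sum is the Linux tool; on a Mac you can use shasum -a 256 instead):

# Record a checksum alongside the archive:
sha256sum website-backup.zip > website-backup.zip.sha256

# Later, check that the archive still matches the recorded checksum:
sha256sum -c website-backup.zip.sha256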

Why did I include a "yes" at the beginning of "yes, encrypt before saving"? Why not just call the folder ebs/ ? It doesn't matter much either way, but one benefit of yebs is that it's not a substring of debs . If you used the name ebs , then when grep'ing or Ctrl+f'ing your notes or scripts for ebs , you would also match debs . That's not true if you use the name yebs .

temp/

Sometimes I want to play around with how a Terminal command works or write a quick Python script to test something out. I do these things in the temp/ folder. I like keeping those experimental activities far away from data I care about, although maybe the risk of accidentally messing up my data is roughly the same regardless of where I do such things.

temp/ can also be used when testing that zipped and/or encrypted files actually open properly. For example, if you have an old encrypted archive on an external drive, you might want to every so often make sure that you could get the data out from it if need be. (This is similar to doing periodic fire drills, making sure your emergency/backup system works.) To do this, you could pull the archive off the external drive, put it in temp/, extract it, make sure it opens, spot check a few extracted files to make sure they open, verify checksums if you have any, and then delete both the zipped and extracted data when you're done.
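Written out as commands, such a fire drill might look roughly like the following; the drive path, archive name, and extracted folder name are all made up for illustration:

cd ~/files/temp/
cp /media/my-external-drive/old-notes_ee.7z* .   # the archive plus its .sha256 file, if any
sha256sum -c old-notes_ee.7z.sha256              # verify the checksum, if you kept one
7z x old-notes_ee.7z                             # prompts for the password
# ...open a few of the extracted files by hand to spot check them...
rm -r old-notes_ee.7z old-notes_ee.7z.sha256 old-notes/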

temp/ is the only folder in files/ that I don't back up, and ideally it's empty most of the time.

processed/ and unprocessed/

Many times you get a dump of files to your computer without having time to organize them right away. For example, imagine that you scan some old paper documents. You might create 50 scanned PDF files that you'd then need to label and put in the right places.

My unprocessed/ folder stores various documents that are in an unpolished, disorganized state. If and when I have time to go through and clean them up, I move them to a final home somewhere in the processed/ folder. "Processing" a file can include various things, such as

Files in the unprocessed/ folder are all implicit "todo" items because something should be done with them, even if that only means moving them to an appropriate location in processed/ .

Note that, in my way of doing things, "processed" files can also contain "todo" tasks and reminders within them. What distinguishes processed from unprocessed todo lists is that the processed ones are somewhat organized and somewhat readable, while the unprocessed ones are just randomly jotted thoughts. Unprocessed items are similar to the "capture" step (step 1) of the Getting Things Done workflow.

Virtue (2010-2017) discusses an idea similar to my unprocessed/ folder in his Tip #13, calling it an "Inbox" folder. He recommends clearing it out regularly and making sure it has fewer than 30 files in it, but for me that's impossible. I have too much of a backlog of random stuff.

I generate todo items for myself faster than I can finish or even fully organize them. (Yes, I know I have a problem...) For now I collect newly captured todo ideas in files in yebs/unprocessed/todos/ , as you can see in the overview diagram. I create a new .txt file for them every 1/3 of a month to avoid letting the files grow unreasonably big. This also helps me keep track of when they were written, which can be useful for remembering the context of cryptically written notes later on.

Scanned files, pictures, and various other non-todo-list files can await processing in yebs/unprocessed/documents/ .

If I haven't finished processing a file but want to at least put it in its proper long-term storage location, I save it somewhere under the processed/ folder but add to its file name the tag _npy , which means "not processed yet". You can see an example of this in the overview diagram: birthday-card-from-grandma_npy.pdf . To find all files with this tag, you could run

find ~/files | grep _npy

vxhx/

If you have a particularly important file that you want to make sure doesn't get messed up, it may be helpful to store previous versions of it, so that you can diff the current version against previous versions to make sure all the changes made to it were ok. This is common for code files but could also be true for certain .txt files or for draft essays in .html files.

Ordinarily this problem would be solved by a version-control system, but I don't have enough files like this to make it worthwhile to set up an actual version-control system. Instead I use a poor man's version control: I store past versions of a file in a vxhx/ folder that sits next to the current version. vxhx stands for "version history", where I added x's after the v and h letters in order to make the abbreviation distinctive enough to avoid false-positive matches. One reason this matters is that you might sometimes recursively grep through your folders looking for some keyword, and you only want matches from current files, not old versions. In this case you could type something like

grep -rin mysearchword . | grep -v vxhx

which filters out results from version-history files. If the version-history folders were just called vh/ , there would be a lot more false-positive matches for the grep -v part.

(You'll see that a lot of my abbreviations in this piece use collections of letters that are unlikely to occur by chance in natural-language text. I do this for the same reason that I added x's in vxhx : to reduce the risk of false-positive matches when trying to search for a given abbreviation.)

The goal of filtering out version-history directories can also be achieved with a command like

grep -rin --exclude-dir=vxhx mysearchword .

in which case it's maybe less important for the directory name to be distinctive.

The most common case where I want to search my files is to find a keyword in my todo lists, in order to find todo items related to that word. My todo lists are almost always in .txt format. By default, grep searches through all files, including large binary files like videos, which can take a long time. So searching just through .txt files can dramatically speed things up and reduce false-positive matches. You can search through just .txt files by adding --include \*.txt in the grep command.

Typing out a grep command with all these options would be cumbersome to do on a regular basis, so I created a Bash function called mygrep as a quicker way to do a search with these options. You can see mygrep in the-actual-bash-aliases-file.txt . As an example of how it works: to search for the keyword "dentist", you would just type mygrep dentist from within an appropriate folder, and the search covers all of that folder's subfolders.
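Stripped to its essentials, such a function amounts to roughly the following (my actual version may differ in details):

# Recursive, case-insensitive, line-numbered search of .txt files under the
# current directory, skipping vxhx/ version-history folders.
mygrep() {
    grep -rin --include \*.txt --exclude-dir=vxhx "$1" .
}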

An example version-history file is backup_vao2019-10-02T13-27.py . The vao part is an abbreviation I made up that stands for "version as of". In addition to the day, the datestamp also includes hours and minutes, in case you have multiple versions in the same day. T separates the date from the time; this particular convention isn't one I made up but rather is widely used, such as in ISO 8601.

It's easy enough to create a version by hand, by typing something like this:

cp myfile.txt vxhx/myfile_vao2019-08-02T01-17.txt

However, since I save snapshots fairly regularly, I created a Python script that automates the process. It's the script create-a-version-snapshot.py that you can find here. As a hacky way to allow for running the script from anywhere, I created a Bash function myvers , which you can see defined in the-actual-bash-aliases-file.txt . Then, from any directory, I can type

myvers myfile.txt

and this will automatically copy a timestamped version of myfile.txt to vxhx/ . The vxhx/ directory will be created if it doesn't exist yet.
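The real script is written in Python, but the core operation is small enough to sketch as a shell function. This simplified version assumes the file has a single extension and that you run it from the file's own folder:

# Simplified illustration of taking a version snapshot (not my actual script).
myvers_sketch() {
    mkdir -p vxhx
    local name="${1%.*}"    # file name without its extension
    local ext="${1##*.}"    # the extension itself
    cp "$1" "vxhx/${name}_vao$(date +%Y-%m-%dT%H-%M).${ext}"
}

For example, myvers_sketch myfile.txt run at 01:17 on 2019-08-02 would create vxhx/myfile_vao2019-08-02T01-17.txt .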

check-me-often/

In the overview diagram you can see a check-me-often/ directory that contains a few text files. I created a Bash function myoft to navigate to this directory in Terminal.

calendar.txt is a plain-text calendar as described in the "Using a text file as a calendar" section of Tomasik ("How I use ..."). The Bash function mycal opens it from anywhere.

I often have a bunch of files open at times when I need to restart my computer. files-to-reopen.txt records files that I had to close prematurely before restarting and that I should reopen some time to finish work on them. Usually these are todo-list .txt files, but they could be anything—even web pages you didn't finish reading (although I tend to store web pages I need to reopen in unprocessed/todos/ files instead).

important-recurring-reminders.txt records things I need to remember to do on a regular basis. Usually I remember them on my own without needing to look them up on this list, but it's nice to have a list to refer to every once in a while to make sure I haven't been forgetting something on it. Here are a few example lines from my important-recurring-reminders.txt file:

# Every week

Use my retainer overnight.


# Every month

Run antivirus scan of full computer overnight.


# Every year

(at end of year) Sell some stocks to generate capital gains (https://briantomasik.com/saving-taxes-earning-case-premium-tax-credit/).

I try to only put the most important reminders in this list so that reviewing it isn't overly burdensome. I scatter other, less important reminders in other .txt files throughout all my folders, as I'll explain in a later section.

Organizing by content

When I was a kid, I went to the library and looked for books about dinosaurs. All I had to do was look up the location for one dinosaur book, and all the other dinosaur books in the library would be right next to it on the shelf. This is a main advantage of organizing information by topic as much as possible—everything related to a given project or theme is (in the ideal case) all in one place. Organizing your files by file type, size, date, or other properties doesn't seem nearly as useful (and besides, you can always do special searches based on file type, size, or date if you need to).

Unfortunately, my organization system isn't purely content-based, because of the yebs/ versus debs/ division and the processed/ versus unprocessed/ division. However, after those non-content-based distinctions are over with, I try to organize roughly by topic, although not in an overzealous way. (For example, I group all of my YouTube videos together, rather than putting different videos into different regions based on their topics. The fact that they are my YouTube videos seems to be their most relevant attribute, and I'm organizing based on that.)

There's a perennial debate among information organizers about the relative merits of using folder organization to find files versus using search based on keywords or other metadata. There are good arguments on both sides, and different approaches work better for different people. I fall strongly on the "folder organization" side of the debate for my own important files. I basically never use the built-in file-search functionality on my computer, and I only occasionally need to search for something using grep .

I think there are some domains where searching for a document is better than finding it based on a taxonomic-organization system. A classic example is web search. Searching for something on the web is generally easier than navigating a hierarchy. Putting a document into a topic hierarchy requires collapsing down the content of a document into one or a few categories. In contrast, searching against the full text of a document allows for more precise matching of the query against the document. One clear example of this idea is that you can search for a quote from a document and find it, while you generally can't search a category hierarchy for a quote and find the document that contains the quote.

Using search to find information probably also makes a lot of sense if you're looking for a document within the internal files of a large company, if you're unfamiliar with how its organization system works. The documents of this company are kind of like a smaller version of the world wide web.

When the number of documents is relatively small and I'm already familiar with them, I think finding files directly by navigating to them beats search. I already basically know what's going on with this small set of files, and search is usually unnecessary, especially when you have guidance from your folder structure. Searching for a file in this case would be like getting GPS directions to tell you how to get to a store in a small town that you've lived in your whole life.

If you have large volumes of data dumped somewhere and don't have a strong mental map of the contents, then I think search is useful. For example, if you have 10,000 relatively unimportant email messages, then organizing them and becoming familiar with their contents is probably unnecessary, and you should find emails by searching. In contrast, if you have 1,000 important files related to ongoing projects, organizing things carefully can be more valuable, and you're likely to be able to find many of these files by directly navigating to them.

Jacobson (2018) defends searching for documents rather than using a folder system. She begins with the following analogy:

Think about the last time you tried to find a baking dish or a set of utensils in a friend’s kitchen. If you didn’t have your friend standing over your shoulder and telling you where to look, you probably had to search through a few drawers or cabinets to find what you were looking for. Everyone has their own system for organizing their kitchen, and it may not make immediate sense to their visitors.

Yes, I can imagine that if it's someone else's kitchen. But what if it's your kitchen? Then you can know roughly where everything is.

One of the things I like about a deep folder structure is that it's a way to easily add lots of implicit metadata to a file. Barker (2012): "at the end of the day, why can't a folder structure just be considered another form of metadata? Putting a file 'in' a folder – how is that different from applying a tag?" For example, consider the document charity-donation-statement_scdi2025.pdf in the overview diagram. Its full path is

~/files/yebs/processed/financial/taxes/tax-year/2013/charity-donation-statement_scdi2025.pdf

We could think of most of the higher-level folders as metadata tags that one might wish to apply to the document. If we were writing metadata as hashtags, we might write

#yebs #processed #financial #taxes #2013

Just by plopping the file into that location of the folder hierarchy, we've essentially added all those tags for free. If you have 100 image files and put them in a folder nested five layers deep, that's kind of like adding 5 * 100 = 500 metadata tags all at once. User "xrimane" makes a similar point in Reddit ("Zen ..."): "You remember that picture you once[ took], but don't know neither the date nor which camera? A folder 2009-12-05 Eiffel tower will be helping tremendously without obliging you to tag each and every picture." Of course, I assume there are ways to add tags in bulk in tag-based organization systems too.

If you want to find a document with a given metadata keyword—say, "taxes"—you can either navigate to its corresponding folder (namely ~/files/yebs/processed/financial/taxes/ ) or do something like this:

find ~/files/ | grep -i taxes

If you wanted to add more tags to a file than were implicit in its parent folders, you could store them in the file name. For example, you could name a file like this:

charity-donation-statement_scdi2025_paperwork_records_scanned-documents.pdf

Then this document would be included in the results for the following search:

find ~/files/ | grep -i paperwork

The program TagSpaces (which I've never used) also stores tags in file names. TagSpaces (n.d.):

for example if you tag the file "img-9832.jpg" with the tags "sunset" and "bahamas" it will be renamed to "img-9832 [sunset bahamas].jpg". The Pro version of TagSpaces, has an option to save the tags in sidecar files.

Personally I find that tagging documents with keywords is usually a waste of time, because I rarely need to find documents by keyword search, and when I do, grep'ing through file bodies usually works well enough. That said, I do tag files with other kinds of metadata besides keywords; the _scdi2025 thing in the aforementioned file name is one example. I'll explain this and other similar tags later in this article.

Following are two reasons I don't like relying on search for finding important files.

1. Search can be messy (unless aided by additional systematizing)

Imagine that you have a bunch of different backup snapshots of a website on your hard drive. Suppose you want to find the HTML file for the page called "All about dreams" in the backup snapshot from 2018-04-22. The easy way to do this would be to have the snapshots organized by date, so that you could visit the snapshot for the desired date and then navigate to the file that we'll imagine is called

all-about-dreams/index.html

But suppose you're opposed to organizing using folders and instead just plop all your website backups into one big pile. You try to find the desired file using your file explorer's search box. You don't want to search for "index.html" because that would bring up way too many false-positive results. Maybe you could search for "all-about-dreams". But even then, you have to find the version of that folder corresponding to the correct snapshot date. If you're just plopping files randomly into a big pile and relying on search to find stuff, you may not have properly recorded the snapshot date. If you did record the snapshot date, that's great, but doing that presumably required as much work as putting the snapshot into a folder would have required.

The way I see it, unless you're ok with your files being a big mess, then you either have to use a systematic folder structure or you have to use systematic tagging, to make sure you can clearly tell what's what. Entering this metadata seems like roughly the same amount of work either way. Barker (2012): "Complaining about people sorting things into folders but pretending that they'll be diligent enough to valiantly apply metadata is probably a little optimistic, isn't it?"

Sometimes it is ok for your data to be a big mess, such as if you have a huge collection of low-value files that are too unimportant for it to be worth the time to organize them, whether using folders or metadata tags. In cases like this, imprecise search is fine because that's the best you can do if you want to try finding something.

2. Search assumes you know what to look for

Suppose you're writing a survey of the animal-welfare movement. You've spent several months taking notes, collecting documents, and interviewing people. Now it's time to assemble that information into a coherent article.

If I were doing this, I would create a folder for the current article and put all relevant documents inside that folder (possibly using subfolders as well). However, let's imagine that you don't bother organizing your files and instead rely on search to find them. You might search for keywords like "animal", "welfare", "suffering", etc. Doing that will bring up many of the relevant results (along with false-positive matches). But what if some of your documents didn't contain any of those words? Maybe you remember that you had a conversation with Bob Smith, and you can search for his name to find the notes from that interview. But maybe you also had a conversation with Jane Baker that you forgot about. Because you forgot about the Jane Baker interview, you don't remember to search for her name, and that document gets overlooked. Maybe there was also an image file associated with the Bob Smith interview that searching for "Bob Smith" doesn't pull up because it contains no text and you didn't write "Bob Smith" in its file name. Or maybe you wrote "Robert Smith" in its file name, and your search didn't match that. And so on.

This situation is a huge mess because while search can be pretty good at finding a document that you know exists, it doesn't necessarily find documents that you've forgotten about. It doesn't allow for enumerating all the documents relevant to some project at once. The exception would be if you were systematic about adding a standardized tag to all relevant documents, like #animal-welfare-article . Doing this would be essentially the same as putting everything in an animal-welfare-article/ folder.

A common complaint about the modern web is that it tends to create echo chambers where people are exposed to opinions similar to ones they've already heard, and people may not learn about new topics beyond what they're already looking for. Searching for files has the same problem: you can find what you know about, but you miss things you've forgotten about (unless those things happen to be returned during a search). In contrast, if you organize documents into folders, you can systematically review the files in that folder, including ones that you didn't realize existed.

letters/

The letters/ directory stores selected letters, emails, and other conversations that seem worth saving. The word "messages" might be more appropriate than "letters" because this folder can include digital messages as well as paper letters. However, if I named the folder messages/ , it would share its first two letters with the medical/ directory, which would make navigating more cumbersome on the command line, because I would have to type three characters of the directory name before tab completion could uniquely identify the directory I intended.

In terms of the "folders versus search" debate, my letters/ directory falls largely on the "search" side of things, unlike most of my other directories, which are more organized. The only folders within letters/ are for the year when the message was initially sent. Beyond that, the messages are just in a big pile, unless I choose to create a few subfolders within a given year for certain topics. If I have several different files that are all part of the same conversation, it helps to create a subfolder to put them into.

The reason not to bother organizing most messages is that these messages are generally not that important, and for any given message, the probability that I'll ever need to refer back to it is pretty low, so it's usually not worth taking the time.

Mann (2007) argues that in the case of email, it's not worth spending time to categorize everything into an orderly system. He says (at 18m49s): "Be honest with yourself: what is the payoff? [...] I'm not a librarian."

However, there are some messages that I do organize in a more librarian-like way, by putting them in other folders than the letters/ folder. For example:

For these kinds of messages that contain specific information that I definitely want to have access to, I like putting them into their proper locations because

  1. These messages are more important than most other messages, so the benefits of having them organized are more likely to outweigh the time cost required to organize them.
  2. It's easier to access the information contained in these messages when they're right next to the other files about the same topic.
  3. If I didn't organize these messages into their proper folders, I would probably forget that many of them existed. Unless I were actively searching through all my messages for a given topic, I would never find the message and therefore wouldn't be able to review the information it contained. Plus, searching for old messages doesn't always work, such as if you don't know the right keyword(s) or if the target file has no searchable text (as is the case for many scanned documents, screenshots, etc).

Unfortunately, the fact that I organize a few but not all messages in a librarian-like way means that there may be some uncertainty about whether a given message is in letters/ or whether it's somewhere else. A blog post I can't now find argued against organizing any of your emails for this reason; if they're all put in one place, you just need to search that place. I understand that perspective, but I think this problem is not a big deal for me because I don't search for archived emails very often, and if I were unsure whether an email had been put in letters/ or elsewhere, I could just grep -rI for it from a higher-level folder, like ~/files/ , rather than from within the letters/ folder.

The easiest messages to organize are those that don't exist. If there's minimal chance that I'd ever want to reread a given message or search for information it contains, I can just delete it and never need to deal with it again.

Explaining the file names

Characters in file names

I use only alphanumeric characters, dashes, underscores, and periods in file names. I use periods only for file extensions (possibly including multiple extensions like in myfile.tar.gz). I avoid spaces in file names because they can cause annoyances on the command line. I avoid any other characters just in case they might cause problems of one sort or another.

My Bash function mynn calls my Python script normalize-file-and-folder-names.py , which automatically renames all file and folder names under the current directory so that they have only alphanumeric characters, dashes, and underscores in the non-file-extension part of the name. This is particularly useful if you download a dump of files from the web that don't fit your naming rule, or if you have a bunch of files from your past self before you adopted this naming rule. The script lets you confirm each proposed renaming before doing it, but you can also choose to let the script just run on its own.
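The real script does more (recursing into subfolders, preserving file extensions, asking for confirmation), but the core renaming rule is easy to illustrate. This sketch merely previews, for the current folder, what the normalized names would look like:

# Preview how names in the current folder would be normalized: runs of
# disallowed characters (spaces, punctuation, etc.) become single dashes.
for f in *; do
    new=$(printf '%s' "$f" | sed -E 's/[^A-Za-z0-9._-]+/-/g')
    if [ "$f" != "$new" ]; then
        printf '%s -> %s\n' "$f" "$new"
    fi
done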

Formerly I used underscores to separate words in a file name. Perhaps this was based on my habit of using underscores in variable names when programming, since programming languages usually don't allow dashes in variable names. However, I've since moved to using dashes to separate ordinary words in file names, so that I can use underscores to separate more logically distinct parts of a file name instead. One example of this we saw earlier was backup_vao2019-10-02T13-27.py . Other merits of dashes are that they can be typed without pressing Shift and that dashes rather than underscores are the preferred way to separate words in URLs.

I prefer using lowercase letters in file names because then I don't have to press the Shift key as much when navigating through folders in the Terminal. So, for example, I name files readme.txt rather than the more common README.txt . I still use capital letters where they would be required in English. You can see this in the overview diagram in cases like GitHub/ and Google-Drive/ .

As a side note, while I didn't always do this, I now prefer navigating through my folders in the Terminal rather than in a GUI file manager. A main reason is that I worry less about the scenario of accidentally messing things up by clicking or dragging the wrong thing; in Terminal, you have to press Enter before anything happens. I also find that navigation through folders can sometimes be faster in the Terminal, especially if a GUI file manager would be slow in rendering lots of thumbnails (although thumbnails can be turned off). Finally, Terminal allows for quick access to lots of command-line tools that the GUI doesn't have.

trunc

Long, descriptive file and folder names can be nice, but unfortunately there are various limits on the maximum length of a file name and of a file path, depending on the file system. It seems generally good not to go overboard with the length of a file or folder name, since even if your current file system supports it, maybe you'll want to move the file to another device that doesn't, or maybe some applications choke on long file names or paths. I add _trunc to the end of a file or folder name to indicate that it has been truncated.

I save the original version of each of my YouTube videos in a folder named after the title of the video. However, some video titles are rather long, so I truncate the folder name. An example you can see in the overview diagram is the folder Artificial-general-intelligence-and-international_trunc/ , which is the truncated folder name for a video whose full title is "Artificial general intelligence and international cooperation [audio, poor quality]". That full title is 83 characters long, while my truncated version is only 55 characters (ignoring the ending / ).

One could argue that it's not important to explicitly mark a name as having been truncated, since most of the time that fact should be pretty obvious. One reason this label may be useful even if the truncation is obvious is that it makes clear that I did the truncation myself on purpose, rather than the truncation having been done by an automated process that could have cut off valuable information. For example, when you export Google Drive documents using Google Takeout, long file and folder names get truncated without warning; see the section "Truncation of file and folder names" of Tomasik ("Some tips ..."). This is bad because you might not realize the truncation happened, and it might cut off tags or other important metadata about the file. If you ever notice a truncated name that doesn't end with _trunc , you should investigate whether the name was cut off in a way that you didn't intend.

i, p, r, and s

Here's a summary of the tags discussed in this subsection, where the example file names using the tags are taken from the overview diagram:

Tag | Example use in a file name                       | Example use in a text document | Meaning of the tag
i   | card-from-John-Doe_i2008-02_p2016-01-04.pdf      | i:2007-02-17                   | initial creation date
p   | card-from-John-Doe_i2008-02_p2016-01-04.pdf      | p:2015-11                      | date processed
r   | Venus-tax-guidelines_r2018-07-02.pdf             | r:2019-10-08                   | date of last complete review of this file
s   | the-Wikipedia-page-on-woodlice_s2017-04-18.html  | s:2006                         | date on which I saved this content from the web

In the overview diagram, one made-up file name is card-from-John-Doe_i2008-02_p2016-01-04.pdf . We can imagine that this is a scanned version of a handwritten card sent in the mail by a notional John Doe. The _i part of the file name indicates when the letter was written. In this example, I imagined that I only knew the year (2008) and month (Feb), not the exact day. Meanwhile, _p indicates when I finally moved the file into some location within my processed/ folder.

_i can often be omitted if the document itself contains the information, such as is the case for most emails or non-handwritten letters.

_p is also often not that important. You already can tell that a document has been processed if it lies under the processed/ folder (and doesn't have _npy in its file name), and usually it's unnecessary to know the exact date when that processing occurred. That said, recording this date can sometimes be useful. As an example, suppose you're processing video files with movie-maker software. Imagine that in May 2016, you learn some new fact about the correct way to use the movie maker, and videos you processed before that date are buggy and should be fixed. If you have _p stamps on your videos, you can see which ones were done before 2016-05 and therefore need to be fixed.

While "processing" a file means cleaning it up and making sure it's readable for the long term, a "complete review" of a file means reading the entire thing in order to remind myself about the information it contains and to fix any errors or outdated information. Knowing the date of last complete review may be useful because you then have some idea of how likely information in a file is to be stale, or whether you're overdue for re-reading a file that contains some reminders to yourself. This date isn't something you can tell just from the "last modified date" that your computer tracks in file properties, since you might have modified the file in trivial ways without actually rereading it, or you might have reread the file without modifying it.

To illustrate the difference between _p and _r , imagine that you bought a new electronic device, and you scan its user manual into a PDF so that you don't have to keep around the paper version of the manual. If you properly rotate the scanned pages to make them upright, name the PDF file, and move it to its proper storage location, the document has been "processed". If this was done on, say, 2016-02-19, you could add _p2016-02-19 to the file name if you wanted. However, you probably haven't actually read through the whole manual, so there hasn't been a complete review of the document.

On the other hand, imagine that you have a messy collection of notes that you'd like to clean up eventually but don't have time to organize now. You might read through those notes in order to remind yourself what they say and correct any errors. In this case, you could label the notes as having undergone a complete review even though you can't yet label them as processed.

The articles on this website have two dates in the byline section: "First published" and "Last nontrivial update". The "First published" date is essentially equivalent in meaning to an _i date. In an ideal world the "Last nontrivial update" date would be equivalent to an _r date, meaning that the article has been fully reread to make sure it's up to date. However, in practice, I usually don't have time to completely reread an article while making an update to it. Therefore, the "Last nontrivial update" date is closer in meaning to the "Last modified" date that your computer would track automatically for a file, except that I only change the "Last nontrivial update" date when there's a "nontrivial" change to the web page, such as adding a few new paragraphs.

As you can see in the above table, the _s tag refers to when I saved something from the Internet to my computer. That date is probably different from the date when the content itself was initially created by its author. And it's not necessarily the date I processed or reviewed the file (though it may be). For web pages, knowing when a snapshot was saved is potentially useful information because it tells you how old the snapshot is, whether you might want to get a fresher snapshot, etc.

If you save a lot of files, applying these tags to all of them would be a lot of work and may not be worth the effort. If you're saving or processing numerous files at once and they all go together in the same folder, you could apply _s or _p just to the folder that contains them rather than to each individual file.

All of these tags are highly optional, and I may not bother adding them to file names unless doing so seems useful. In general these tags should just be seen as possible ways to label data that you particularly care about, not an albatross around one's neck. If there's data you don't care much about, you can either delete it (to avoid any further maintenance costs associated with managing it) or just quickly save it without any special tags.

So far in this subsection I've been talking about applying these tags to file or folder names. However, if the file is editable, you can alternatively store this information in the body of the file itself. The third column of my summary table shows how I format metadata written inside the file. As an example, imagine that I'm writing a file car-notes.txt that stores things I want to remember about my car. The file might look like this:

i:2018-02-23
r:2019-03-18


# Basic info

My car model is blah blah blah ...

My engine is blah blah blah ...

The initial creation date and date of last complete review are stored in the text file's contents rather than in its name.

For .txt files I prefer storing metadata inside the file rather than in the file name because at least the date of last complete review is likely to change a lot. If you store this information in the file name, the path of your file will regularly change, and you have to close the file from your text editor before making the file-name change.

For more static file types like MP4 and PDF, I usually record this metadata in the file name because it's more cumbersome to add this information into the body of the file. You could add a text box inside a PDF document to store this metadata, but that can be messy. I also incline toward storing this kind of metadata about HTML files in the file name itself—even though an HTML file is readily editable—because it would be easy to miss this metadata information in the jungle of tags and <head> information that an HTML file often contains, especially if you view the HTML file in a browser rather than a text editor.

Why do I not put a space after the colon when writing something like i:2018-02-23 ? I think it doesn't matter much either way, but below are a few reasons to omit the space.

If you want to make note of information like the date of last complete review of a file, you have to record it manually. A file property like atime ("time of last access") doesn't work because you might open the file just to look at something briefly, not to review the whole thing. In general, I dislike relying on a file's properties as tracked by my computer to store any information I care about, not just because those properties aren't very informative but also because I assume these properties could be lost if I move to a new operating system or whatnot. For the same reason, I don't want to manually color-code my folders; maybe there's a way to transfer a folder's colors if I move from Windows to Mac or vice versa, but I don't want to have to worry about that. And colors or other properties are perhaps even less likely to transfer if I upload a folder to cloud storage and view it in a browser-based user interface, while information encoded in file and folder names would survive even in this scenario. I prefer to indicate things using bare text that's guaranteed to persist through whatever transformations the file undergoes over the coming years and decades.

ntmc

The tag _ntmc on a file or folder stands for "not the master copy". Its complement is _ytmc for "yes, the master copy", although _ytmc is the default for most files on my computer, so it's rarely necessary to label them as such.

Below I've selectively quoted Spacey (2019) to explain the way I use the term "master copy":

A master copy is an original work from which copies are made.

[...] all changes are applied to the master. Changes to the master may be released, at which point all copies should be updated. [...] Any changes made to these copies is considered invalid.

Essentially, "master copy" is the opposite of "backup copy". All changes to data (file edits, file creation, file deletion, or moving a file) should be made to the master copy of the data, and eventually backup copies will be updated to reflect those changes as well. (However, backups should not be synced instantaneously to the master copy, or else they're not really backups, since any mistakes, accidental deletions, etc on the master would also happen on the so-called backups right away, preventing you from restoring to an older version.)

Keeping track of this distinction is really important. The reason is that when you know something is _ntmc (i.e., is a backup), you can delete it and replace it with a new backup version without worrying about whether it contains anything you need to save or review first (unless, of course, you discover problems with the master copy that call for restoring from a backup). If something isn't explicitly known to be a mere backup, then you might wonder whether it has something unique in it that's not present anywhere else. You might feel the need to go through it and check if it has anything unique before you can delete and replace it.

I've learned this lesson the hard way a few times over the years. One example was in 2018 when I was preparing old home VHS and audio cassette tapes for digitizing. In several cases I found tapes that had duplicate content, with one presumably having been copied from the other. However, because I wasn't super careful about labeling everything back when I created those tapes (in the late 1990s and early 2000s), I couldn't now be sure whether the seemingly duplicative tapes were actually identical or whether I might have, say, added some extra stuff to the end of one of them that I didn't add to the other. So I had to fast-forward through each tape to see what was on them. (Unlike with digital files, you can't do random access on tapes, making this fast-forwarding process somewhat time-consuming.) What I should have done when I first created the tapes was to mark the copy tapes as _ntmc , which would have been an indication that "No, there's nothing uniquely interesting on this tape, because it's merely a backup copy of some other tape that you also have. You need only digitize the original."

And of course, to follow this instruction, you should remember to never add anything unique to an _ntmc location, only to the master copy. You can picture an _ntmc location like a garbage can, perhaps one that takes many years to be emptied. While the garbage can is full you can retrieve stuff from it if need be. But eventually that stuff will be gone, so you better not put the sole copy of anything that you want to keep permanently in the garbage can.

In the overview diagram, you can see the _ntmc label used for the folder backups_ntmc/ , which stores backup copies of various cloud data. For example, you might have some Google Docs on Google Drive. The versions on Google Drive are the master copies, since they're the "latest and greatest" versions that you make changes to. If you use Google Takeout to export a backup of those files (Tomasik "Some tips ..."), that backup is _ntmc . Any edits you might make to the offline export won't be saved, because you're not editing the master copy. (Note that if you store some backups of local hard-drive data in Google Drive, then those files aren't master copies. They should be stored in an _ntmc folder on Google Drive.)

What was once _ntmc can become _ytmc if the master copy disappears. For example, if you export some Google Docs to your computer and then delete them from Google Drive, the offline export has just become the master copy, like a prince who becomes king when his father dies.

To take another example, if you receive a paper letter in the mail and scan it, the paper letter is still the master copy, with the scan being _ntmc . However, the moment you shred the original paper letter, the scanned document becomes the master copy. It's the new "latest and greatest" version that you'll be working with going forward.

Applying _ntmc to a folder name means that everything under that folder is not a master copy, unless an item is explicitly marked with _ytmc to indicate otherwise. There might also be some implicit, common-sense cases where files within an _ntmc folder are obviously _ytmc . For example, you might have a collection of _ntmc website backups that contain some readme.txt files, which are documentation about the backups and are only present next to the backups, not on the original website. The readme.txt files ought not be lost when the backups themselves are being replaced with newer versions.

In the overview diagram, you can see that I added a _ytmc tag to the scripts_ytmc/ folder. This isn't strictly necessary because most data on your main hard drive is already implicitly _ytmc . However, I added that tag just to make it super clear to myself that the master copy of those scripts is in that folder, rather than, say, here on my website, where I've uploaded copies of many of those files.

All data on all backup drives, discs, etc is likely to be _ntmc . I don't explicitly label data on backup drives as _ntmc because I want to keep the names of the high-level folders the same on backup drives compared with my internal hard drive (see Tip #12 of Virtue (2010-2017)). However, I implicitly know that backup drives don't hold the master copy of my data. Even if you see a _ytmc subfolder on a backup drive, don't believe it: that folder is really _ntmc by virtue of being on a backup drive. In other words, the explicit _ntmc and _ytmc labels should only be regarded as meaningful on your internal hard drive (or wherever you store the master copy of your files). These tags are just a guide, helping you keep these issues in mind, rather than something to take overly literally. At the end of the day, you have to use common sense.

One arguably pedantic question is whether data created by other people counts as _ntmc . For example, suppose you download the PDF of an academic paper and save it on your hard drive.

In my opinion, the best answer is to regard the saved PDF as a master copy unless you plan to redownload the paper and replace your existing version at some point in the future (to get any new edits that may be made to the online version), in which case the PDF on your hard drive is _ntmc . Part of what the _ntmc label implies is that "this is just a copy that can and should be overwritten in the future with a fresher version". If you're never going to refresh the PDF, then it's essentially a master copy to you.

In any case, I don't think the issue is that important, partly because you can generally be less cautious when handling copies of data made by other people: if you mess something up, you can just redownload it. You need to be more careful with files you created yourself, because you're the one responsible for making sure nothing bad happens to them.

ncsd

The _ncsd tag stands for "no change since downloaded". This means that I haven't edited, customized, annotated, etc a file. This is useful to know because it means that if I ever stop needing a file and decide to get rid of it, I don't need to look back in the file to see if it has any notes/edits that I made that I want to save. Since the content is entirely the way I got it when I downloaded it, I can just throw it out, especially since I can probably just download it over again if need be. In contrast, if I added custom edits or notes to a file, I'd want to make sure none of them need saving before trashing the file.

_ncsd is actually the default for most files I download from the web, such as music audio files, PDFs of academic papers, email discussions, and so on. Usually those aren't edited or annotated. I don't bother adding the _ncsd tag 99% of the time because it's pretty self-evident, but I might add it sometimes, especially if the file I've "downloaded" is a .txt file containing text I copy-pasted from a website. Usually .txt files on my computer contain custom notes that I wrote, so it seems useful to indicate when a .txt file is not my own writing and is purely something I got from the web.

In the overview diagram you can see the file usage-instructions-from-Amazon_s2018-11-03_ncsd.txt . It contains instructions on how to use a "warm compress" to treat dry eye. I copied the instructions from part of the Amazon product page into a .txt file on 2018 Nov 03. I mark the file as _ncsd to clarify that there's no custom commentary from me in the file. That means if I read the .txt file, I know that everything it says is exactly as it was written on the web page from which I copied it. It also means the file doesn't contain any unique-to-me content that I might want to hold on to for the long term.

One exception to the rule about not doing any custom edits in an _ncsd file is that I still might add a brief explanation of the file using a custom XML tag called <metadata> . For example, here's what the top of the usage-instructions-from-Amazon_s2018-11-03_ncsd.txt file might look like:

<metadata>This text is copied from [the Amazon product page](https://smile.amazon.com/gp/product/B004385RPS/).</metadata>


Directions
Use daily to relieve dry, irritated eyes.
...

However, this metadata isn't the kind of unique-to-me content that I would want to make sure gets saved for the long term.

The opposite of _ncsd would be _ycsd , meaning "yes, this file was changed since it was downloaded". Like with _ncsd , it might be pretty obvious when this label applies to a file, so I would mainly only use it when there's ambiguity and I want to make things extra clear. For example, if I've made annotations to a PDF document downloaded from the web, I might label the file _ycsd so that I don't later casually delete the PDF on the mistaken assumption that it contains no original content of mine.

_ncsd is sort of similar to _ntmc in that both labels say "yes, you can blindly get rid of this in the future without checking that it contains anything needing saving". However, the labels are also somewhat different. _ntmc is used when I plan to refresh the content in the future from the master copy, overwriting older versions. _ncsd content might well be the master copy (relative to me) because I don't plan to ever download it again, but I still want to indicate that it can be casually deleted if I want to in the future.

scdi

I actually have a third label saying that the given content can be deleted without further review: _scdi . This stands for "saved; can delete in [some year]" or, more verbosely: "I've checked that all content worth saving from this file, if any, has already been saved elsewhere, so after holding on to this file for a time, I can delete it (if it's a digital file) or dispose of it via shredding (if it's a physical paper) in [some year]."

The _scdi label is useful because it lets you clearly mark something as done and not needing to be looked at again while the contents of the document are fresh in your mind. If you just deposit the document in a (digital or physical) folder and come back to it a decade later, you might have to skim the document over again to see what it is and whether it contains anything in need of saving. It seems better to do that work ahead of time, while you already know what the document is and that it has nothing further of value. The _scdi label is kind of like a long-term version of your computer's Trash bin. You can "trash" an item as soon as you know you're done with it, but you hold on to it for a while longer just in case you ever find that you need it again.

A classic case where this would apply is tax documents, which you need to hold on to for some number of years before you can clean them out. In the overview diagram I have an example file charity-donation-statement_scdi2025.pdf , where the _scdi tag says it can be deleted in the year 2025. I calculated 2025 as follows for someone living in the USA. The containing folder for the PDF document is for tax year 2013. Those taxes would be due in Apr 2014. In Mengle and Lankford (2019), 10 years is the longest period listed for which you might want to keep tax-related records, with some exceptions. Usually you can get rid of records much sooner than 10 years, but to keep things simple one could just use 10 years as a conservatively long retention period. 10 years from Apr 2014 is Apr 2024. Adding a few extra months, you can finally delete those records starting in 2025.
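
To make the arithmetic concrete, here's a tiny Python sketch of the retention calculation described above (purely illustrative; it assumes US taxes and the conservative 10-year retention period):

def scdi_year(tax_year, retention_years=10):
    # Taxes for a given US tax year are due in April of the following year.
    due_year = tax_year + 1
    # Count forward the retention period, then round up to the next calendar
    # year to add "a few extra months" of margin.
    return due_year + retention_years + 1

print(scdi_year(2013))  # prints 2025, matching the example above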

Mengle and Lankford (2019) list some cases where you might keep records longer than 10 years:

Probably there are further situations like these.

Then there's also the matter of the tax returns themselves. Bieber (2018): "You should also keep copies of your tax returns forever so you can prove you filed your taxes if the IRS says you didn’t, as there’s no time limit on the IRS bringing a case for failure to file."

To be on the super safe side, maybe you'd want to just keep all government-relevant financial records indefinitely? Even if so, there are plenty of other cases where the _scdi tag applies. For example, you might get rid of your old apartment's lease agreement a few years after you move out of the apartment.

If your records are on paper rather than being digital files, you can write something like "scdi2025" directly on the paper.

Instead of labeling each file individually, you could instead create a special (digital or physical) folder labeled scdi2025 and put all the relevant files in it. However, this approach requires that you take the files out of their current locations, which may (or may not) make it harder to find them if you end up needing them.

Sometimes I write temporary scripts to do one-off things on my computer. For example, I might use a script to rename a group of files in a systematic fashion. When I'm done using the script, I could delete it. But given that I spent some effort writing it and checking its correctness, I don't want to just delete it right away, in case I might need to refer back to it. And in fact, some of the code I used in this script might come in handy for some future script, so maybe it'd be worth keeping the script for a while. At the same time, I want to clearly indicate that the script is no longer needed, so that I don't later stumble upon it and wonder if there's something I'm supposed to do with it. In a case like this, I could keep the script but add the tag _scdw , which stands for "everything worth saving is saved; can delete whenever I want". In other words, I don't need the file anymore, and I could delete it anytime or not at all. Here's a hypothetical file name with this tag: moving-script_scdw.py . This is a subtly different tag from _ntor (which I discuss next), because even if a file is tagged with _ntor , I still intend to keep it indefinitely and don't want to delete it, because it may contain some useful reference information. In contrast, a _scdw file has no specific important information and can be deleted any time I feel like it.

If you want to indicate that a script or other file is outdated and shouldn't be used again but you want to keep it indefinitely just for reference or sentimental value, then you wouldn't apply an _scdw tag, because you don't want to delete the file. Instead you could apply the tag _tio , which stands for "this is outdated". Anything tagged with _tio should not have any ongoing todo items or reminders in it, which means that a _tio item is also implicitly an _ntor item, which is a tag I explain next.

ntor

_ntor stands for "no todos or reminders". Its opposite is _ytor , meaning "yes, this document has active todos or reminders".

In spring 2019, I had to figure out how to do taxes for sale of employee stock purchase plan (ESPP) shares, which is surprisingly complicated. During that process I created several .txt files of notes to myself. Some contained todo items about my plans for selling ESPP shares in future years. Some contained reminders about how to do taxes for ESPP sales. And some of those .txt files were just "reference material"—notes about things I did or discovered that I wanted to save in case they might be useful but that didn't constitute ongoing action items. An example you can see in the overview diagram is the file how-to-manually-calculate-compensation-income_ntor.txt . This records what I discovered about how to calculate so-called "compensation income" when doing ESPP taxes. I labeled this file as _ntor because I don't actually have to reread it in the future. I created other todo documents that explain to my future self what numbers to put in what boxes when I'm doing taxes in subsequent years. However, I wanted to save these notes in case I would ever have reason to look that information up at some point.

_ntor is a way to say to your future self: "When I come back to this project later, I don't have to spend time rereading this particular file unless I want to. It's here strictly for reference." Like _ntmc , _ncsd , and _scdi , it's a way to save reading time for your future self by preemptively saying "Nah, you don't have to revisit this" while you have it fresh in your head what the contents of the file are. However, unlike with _scdi , by default you don't plan to ever delete an _ntor file, since it might indeed be useful in the future.

In my way of doing things, a file can go under the processed/ directory even if it has todo items or reminders in it, as long as the file is written clearly enough to be understood by my future self even if I don't get back to it for a few years. However, one could imagine a stricter definition of processing, according to which a file isn't really processed unless all the todos and reminders in it are closed out. That's what _ntor is: it indicates that a file has been "fully processed" in this strict sense. (One could imagine creating a new high-level folder like fully-processed/ to store fully processed files separate from the regular processed/ files. However, I think this would be cumbersome because then you'd have to mirror your folder structure in two places. For example, you'd need a taxes/ESPP/ folder under processed/ to store the ongoing todo items and another taxes/ESPP/ folder under fully-processed/ to store the fully processed items. You might have to constantly switch back and forth between the two versions of taxes/ESPP/ to open the relevant files. It seems simpler to me to keep everything under one processed/ folder and just use _ntor to make this distinction instead.)

Most non-.txt files are already implicitly _ntor . For example, a backup of a website or a home movie I filmed when I was in middle school has no ongoing action items or reminders associated with it, once it's filed in the proper place, appropriately documented, and so on. I only apply the _ntor tag when it's not obvious whether a file is _ntor , which means I usually only apply it to .txt files.

Similarly, many .txt files are implicitly _ytor , so I don't bother labeling them with that tag. The _ytor tag is mainly useful for items that you wouldn't normally expect to contain todos, such as a video file or scanned letter.

Earlier I mentioned that _tio implies _ntor , i.e., all outdated items have no active todos or reminders. However, not all _ntor items are _tio . For example, a photo of a sunrise from 2010 falls into the _ntor category, because there are no todos or reminders associated with it. However, it's not "outdated", because "outdated" means something is not applicable or appropriate to use anymore. Examples of outdated items are old Python scripts that won't ever be executed again, information that you know is no longer correct, and so on. In contrast, an old sunrise photo may well be used again. For example, you could upload the photo to your website and add it to some article.

Todos and reminders

Following is the full text of the fictional file called Mars-taxes.txt :

i:2019-09-14
r:2019-12-04


Here are some notes about Mars taxes. As [one web page](https://www.youtube.com/watch?v=dQw4w9WgXcQ) noted:

    > Martian taxes are some of the most complicated in the solar system.

Martian tax brackets are 11%, 22%, and 44%.

pri1remr: Pay Martian taxes every October.

pri1todo: Ask whether I qualify for the low-income-astronaut tax credit. @interplanetary-tax-office

pri3todo: In 2018, I used the MarTax software to file my returns, but I could see if there are alternatives. i:2018-07-17

For background reading on Mars taxes, see the Wikipedia article.

As you can see, reminders and todo items are prefixed respectively by priNremr: and priNtodo: for some value of N . In my system, N can be 1, 2, 3, or 4. 1 means that the item is of high importance, and I should try to make sure it happens, hopefully soon. 2 means the item is of medium importance, and while I'd like to get to it within a few years, it's not terrible if I don't. 3 means the item has low importance, so it's probably ok if I never get to it. And 4 means the item is unimportant, such that I probably shouldn't bother doing it, but I'm writing it down just in case I change my mind or in case the idea somehow proves useful in the future.

As you can see in my fictional text file, not every line in a todo text file needs to have a priNremr: or priNtodo: prefix. You can also write lots of exposition, copy quotes from web pages, and so on. The prefixed lines just highlight the most actionable parts to look at in the file.

The @interplanetary-tax-office tag is a "context tag" as explained in Karbassi (2017), indicating where the task will be completed. I usually omit this because most of my tasks are done at home.

The i:2018-07-17 thing means the todo task was written on 2018 Jul 17. I omit this date from more than 99% of my todo tasks, because usually it's unnecessary, and writing it would be a waste of time. However, it's occasionally useful to include a date for context. For example, if you have a todo item that says "reply to Bill's email", then adding a date to this todo item will ensure that you can figure out which email from Bill is being referenced.

You could also add any other tags you want, such as #hashtags for keywords. The bottom of Karbassi (2017) suggests writing a due date like this: due:2010-01-02 . I generally prefer to keep todo items simple so that they're quick to write. The whole point of a todo line in a file is to complete it and then delete it, not to make the todo description fully comprehensive.

The benefit of these standardized conventions is that they allow for systematic searching. For example, if I wanted to look up all of my highest-importance reminders throughout all my .txt files, I could type this in Terminal:

cd ~/files/
mygrep pri1remr

To search for all low-importance reminders or todo items containing the substring "Mars" or the substring "financial" or the substring "tax credit", and not containing the substrings "2016" or "2017", I could do this:

cd ~/files/
mygrep pri3 | grep -e Mars -e financial -e "tax credit" | grep -v -e 2016 -e 2017

If you're planning to visit the interplanetary tax office and want a list of all things you should do there, type:

cd ~/files/
mygrep @interplanetary-tax-office

And so on. Note that if you're searching for a hashtag using mygrep , you need to enclose it in quotation marks so that the shell doesn't treat the # and everything after it as a comment, like this:

cd ~/files/
mygrep "#taxes"

A shorter way to write todo prefixes

In 2021, I changed the way I write the todo and reminder prefixes. Instead of writing

pri1todo: reply to Bob's email

pri3remr: Avoid candies containing shellac.

todo: look into the tax question for Jane

I now would write

1t: reply to Bob's email

3r: Avoid candies containing shellac.

_t: look into the tax question for Jane

In other words, priNtodo: is shortened to Nt: , and priNremr: is shortened to Nr: . If I don't want to have to bother picking a priority, instead of writing todo: I write _t: , and instead of writing remr: I write _r: .

The reason I made this switch was to reduce the effort of writing these prefixes. Also, when the prefixes are short, I find that they're a bit less visually distracting, allowing for more quickly reading the content of the todo or reminder itself.

However, this new convention has some downsides too, which explains why I didn't use it from the beginning.

With the old convention, if you wanted to list all todos of any priority, you could just do mygrep todo: . With the new convention, you could try mygrep t: , but it would likely return way too many false-positive matches. So probably you have to search for each priority separately: mygrep 1t: and mygrep 2t: and so on. Or do this:

grep -rn -e 1t: -e 2t: -e 3t: -e 4t: -e _t: .

Fortunately, I found that I rarely wanted to do a search for all todos of any priority, so this downside isn't too significant for me. Usually I most want to search for priority-1 todos because they're the most pressing. I created a Bash function my1t for when I want to list all my priority-1 todos.

The other downside of the shorter convention is that you can no longer do something like this to show all priority-1 items whether todos or reminders:

mygrep pri1

Instead you have to combine the results of two searches:

mygrep 1t:
mygrep 1r:

But I rarely do searches of this type either, and doing two searches like this instead of one isn't onerous.

When I first wrote this article I used the longer priNtodo: convention, so that's what you'll continue to see in my examples in the remainder of this piece. I haven't bothered to update this article to the shorter Nt: convention, especially because some people might prefer the longer convention anyway.

A list of some context tags

As mentioned above, context tags allow you to annotate a line of text as something you should do or remember in a given context. For example, you might use the context tag @dad to annotate topics to bring up the next time you talk with your dad. Before the conversation, you could run mygrep @dad to list the relevant items.

If you plan to rely on a context tag as a guaranteed way to find todo items, then you should make sure you standardize what the tag will be called so that you'll write it the same way each time. If you sometimes write a tag as @dad and sometimes write it as @father because you haven't settled on a standard tag name, then when you search mygrep @dad , you won't find the @father items.
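
As a sanity check on your own consistency, you could occasionally inventory every @-style tag that appears in your .txt files. Here's a rough Python sketch of that idea (the ~/files/ location and the tag pattern are just assumptions for illustration):

import collections
import pathlib
import re

# Matches tags like @dad, @JohnSmith, or @interplanetary-tax-office. This will
# also pick up things like the domain part of email addresses, so treat the
# output as a rough inventory rather than an exact list.
TAG_PATTERN = re.compile(r"@[A-Za-z][\w-]*")

counts = collections.Counter()
for path in pathlib.Path.home().joinpath("files").rglob("*.txt"):
    try:
        text = path.read_text(errors="ignore")
    except OSError:
        continue
    counts.update(TAG_PATTERN.findall(text))

for tag, count in counts.most_common():
    print("%5d  %s" % (count, tag))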

Following is a list of context tags that I invented for myself. When I suggest a generic template rather than a specific tag, I use [square brackets].

Context tag: Description
@[person]: This is used to tag things to ask a person, do with a person, etc the next time you talk or meet. Examples: @mom , @dad , @bro , @sis , @grandma , @Bob , @JohnSmith
@[online service]: Used to tag things to do while logged in to the given online service. For example, @FB tags something to look at or do when you're on Facebook. @Amazon tags a note about something to look up or buy on Amazon.
@[doctor type]: Questions to ask your doctor on the next visit, or other things to do related to that doctor. Examples: @primcare for your primary-care physician, @optometrist , @dentist
@[store or other location]: Things to buy or do the next time you visit the given store or other location. Examples: @grocery , @postoffice
@mysite: Things to do on my websites, such as making an edit or addition to some article.
@shower: Topics to think about in the shower. See the section "Planned shower thoughts" in Tomasik ("A Collection ... (2017)") for more discussion.
@check: Something to check. For example, I might need to review a diff of changes I've made to a Python script to make sure the changes all look ok. This is best done when I'm in a "checking mood", which happens when I'm alert enough to competently review something but not motivated enough to do a more ambitious task.
@read: Something that I should read when I'm in a reading mood.
@jff: This stands for "just for fun" (Urban Dictionary "jff"). It designates a task that I could do when I'm in the mood for a fun break, though it's not something that ever needs to be done. For example, if there's an interesting Reddit thread that I don't want to waste time reading at the moment, I could save a link to it with the @jff tag. This means I could return to it at a time when I'm in the mood for reading something interesting, but there's no particular thing I actually need to get from reading it, so it's fine if I never read it. If I do actually need to read something, I would use the @read tag instead.
@mindless: A task saved for when I'm tired, lazy, or need to kill a few minutes waiting for something else. An example of a mindless task would be deleting unimportant emails from my inbox.
@music: A task at my computer that's sufficiently mindless that I could listen to music while doing it without reducing my performance at the task. (Such tasks are very rare because I'm bad at multitasking. As a result, I rarely listen to music except as part of movies or TV shows.)
@podcasts: A task I can do while listening to podcasts or YouTube videos. Examples: trimming my toenails, washing dishes, unclogging the sink drain.
@housework: A chore to do around the house. I like saving these up because then I can do them when I'm in the mood, such as when I'm mentally tired or want to think about something.
@eat: Something in my fridge that I should remember to eat soon before it goes bad. For example, I might write: @eat the rest of the lentils

As we can see, context tags can refer to a physical place (such as the grocery store), a digital place (such as Facebook), or a mental/mood place (such as being in the mood for mindless work).

Following are some examples of these context tags in action. Often when I write a todo item, I'm too lazy to add the priNtodo: prefix, especially if I expect to finish the todo item quickly or if it's in a .txt file that consists entirely of todo items. So many of the below examples omit the priNtodo: prefix.

clean hair out of my razor @podcasts

@jff https://en.wikiquote.org/wiki/David_Pearce_(philosopher)

pri2todo: Ask new @optometrist about my level of eyelash crust.

Finish reviewing @FB and @Slack notifications from 2019-10-02

in https://reducing-suffering.org/food-waste/ , "Washington" -> "Washington state" @mysite

That last line is telling me that in that article on my website, I should search for where I say "Washington" and consider replacing it with the text "Washington state" (to make it clear that I don't mean Washington, D.C.). I would review each change before making it rather than blindly doing a "replace all".

The @jff line means I could read the linked page some time if I want. Putting a bare URL next to @jff basically always means that the proposed action is to read the article. If the task were, say, to edit the article, I wouldn't use the @jff designation because this would then be a task that I actually wanted to complete, whereas with @jff tasks, it's fine if I don't complete them, and I probably won't ever get around to many of them.

Whenever a todo item has only one context tag, I could theoretically accomplish the same purpose as the tag by instead collecting all the relevant todo items into a file (or set of files) designated for that purpose. For example, I could have a jff/ folder containing a 2019.txt file that stores all @jff items from 2019. In an ideal world this might be a more elegant solution than tagging each todo line with @jff , but in practice I often don't do this because I would have to open the jff/2019.txt file every time I wanted to save something to it. If I want to just quickly jot something down in my current unprocessed/ todo list, I can use the @jff designation instead. In principle you could write a script to periodically move all @jff items to their own file, but I'm not convinced this would be worth the effort. Another possibility could be to write a Bash function that would, using a short Terminal command, open your jff/2019.txt file, so that you don't have to navigate to find the file each time, but I haven't (yet) done this.
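
For example, a script to sweep @jff lines into their own file might look roughly like this Python sketch (the file paths here are made up for illustration; adjust them to match your own folder layout):

import pathlib

# Hypothetical locations, purely for illustration.
todo_file = pathlib.Path.home() / "files" / "yebs" / "unprocessed" / "todo.txt"
jff_file = pathlib.Path.home() / "files" / "yebs" / "processed" / "jff" / "2019.txt"

lines = todo_file.read_text().splitlines(keepends=True)
jff_lines = [line for line in lines if "@jff" in line]
other_lines = [line for line in lines if "@jff" not in line]

# Append the @jff items to their dedicated file and remove them from the todo list.
jff_file.parent.mkdir(parents=True, exist_ok=True)
with jff_file.open("a") as f:
    f.writelines(jff_lines)
todo_file.write_text("".join(other_lines))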

I often omit context tags from todo items out of laziness. I may not want to expend cognitive effort coming up with the tag for a quick todo task that will be done soon anyway. Being thorough in applying context tags seems most important for contexts like a doctor visit where you want to be able to generate a comprehensive list of your questions, because if you forget one of them, you may have to wait another ~year for the next doctor visit before you can ask the question (unless it's particularly important, in which case you could call the doctor or schedule another appointment).

Other conventions

Sometimes I have a group of todo items that go together. It can be useful to indicate that they're part of a group so that I know they're related and so that I can tell when all the todo items applicable to that issue have been completed. I group items using an HTML-style tag <g> (which is a tag I made up that's short for "group") like this:

Look into bird-repelling stickers for the window.

pri2todo: Try logging in to my old Mars atTax account.

<g>
Learn about soundproofing tips for my room.

Look up ideas for reducing sound coming through door cracks.

Maybe call the person my doctor recommended for ideas on alternatives to earmuffs.
</g>

pri1remr: "Put down the duckie if you wanna play the saxophone."

If it's not obvious what the group is about, you could add an explanatory description inside an e="" attribute (which I made up and as far as I know is not part of real HTML). For example:

<g e="things to do at the rock store">
Does the store have any of https://en.wikipedia.org/wiki/Larimar ?

Ask the price to rent a shelf for displaying rocks

Sell them the fossil I found on 2014-07-18
</g>

In general, you can use HTML-style opening and closing tags when you want to make it clear where something starts and ends. Another example of this that comes up a lot for me is indicating updates to old todo items. I find that it's often easier to keep an old todo item's text and write updated information within it rather than having to rewrite the whole thing when I want to update it. For example, suppose I have this todo item:

pri3todo: Renew my interplanetary passport by 2021. I think I left my current one in my spaceship, so I should look to see if it's there.

Then suppose that I visit the website of the interplanetary government and learn that I actually have until Apr of 2022 to renew the passport. I could just rewrite the original todo item with the corrected information, but another option is to write the updated information inside a <ud> tag (which is a tag I made up that's short for "update"):

pri3todo: Renew my interplanetary passport by 2021. <ud>Actually, the site says I have until 2022-04 to renew.</ud> I think I left my current one in my spaceship, so I should look to see if it's there.

If you want to record the date on which you wrote that update, you could include an i: tag in it:

pri3todo: Renew my interplanetary passport by 2021. <ud>Actually, the site says I have until 2022-04 to renew. i:2019-10-18</ud> I think I left my current one in my spaceship, so I should look to see if it's there.

I find that many of my updates to older todo items involve just appending a quick note to the end of the original item, like this:

see if I need more fuel for my spaceship <ud>yup, I do -- I should get some soon</ud>

Unlike when you add an update to the beginning or middle of an existing todo item, the closing </ud> tag isn't really useful in a case like the above one because the update is the entire rest of the line. Writing the closing </ud> tag can be annoying, not just because doing so requires typing a few extra characters but also because I have to remember to add it at the end of the line, and keeping that fact in my brain's working memory requires a little bit of cognitive effort. Alternatively, I could type the closing tag ahead of time, before I write what goes in between the starting and closing <ud> tags. But in that case I'd have to press the left-arrow key on my keyboard a few times or use my mouse to put the cursor back in between the <ud> and </ud> tags. It would be preferable to not have to write the closing </ud> at all. For that reason, I made up the <udv> tag, which is a so-called "void tag" version of the update tag, lacking a closing tag. For example:

see if I need more fuel for my spaceship <udv> yup, I do -- I should get some soon

In standard HTML, <hr> and <br> are examples of void tags.

One other HTML-style tag I made up is <rttd> , which stands for "read this, then delete". Occasionally I encounter a few paragraphs of text that I'd like to read some time but don't want to read right now. I may want to close out the web page where the text is rather than leaving that page open until I read it. In such a case, I can copy the text to my current unprocessed todo list and enclose it within <rttd> tags, like this:

<rttd>
Thank you for signing up for Mars Maps.

...[more text here]...

We hope you enjoy our service.
</rttd>

As the tag says, once I'm done reading the text, I can delete it, since I don't need to keep it.

Fixity checking

Data that you're storing could get messed up in a variety of ways. For example, you or a buggy program could accidentally delete folders. You might accidentally edit or move a file that was supposed to remain fixed. Bit rot might corrupt a small number of your files over time.

Some cloud-storage services protect against bit rot, but data stored on such services would still be at risk of accidental deletion or accidental edits (or malicious destruction, if someone hacks your account). There are cloud options for immutable storage, but then you can't edit or delete a file even if you want to.

If you want to allow legitimate changes to your data while also watching out for accidental changes and bit rot, you need to review the changes to your data on a regular basis—at some level of thoroughness depending on how much you care about the data. In Tomasik ("Manually ..."), I described manually monitoring the fixity of a select group of files on your hard drive. As I played around with this process over time and learned the rhythm of fixity checking, I developed a plan for how to automate it. This resulted in the Python script run.py in the monitor-data-changes/ directory. You can find the link to the script in this subsection.

As you can see in the-actual-bash-aliases-file.txt , I run the script by typing mymonitor anywhere in the Terminal. That Bash function moves me to the monitor-data-changes/ directory and then runs python3 run.py . Theoretically you could keep run.py in any location on your computer, but you have to always run it from the same place so that the script has access to its history files.

I drew inspiration for the script from, among other places, the "Fixity" tool created by AVPreserve (which changed its name to AVP in 2017). Rudersdorf (2017, "AVPreserve's ...") is a tutorial on AVP's Fixity tool that introduces a lot of the basic concepts in play, although my script diverges from the Fixity tool in some ways. Most of the general statements I make about AVP's Fixity tool in the subsequent discussion are based on what I learned from Rudersdorf (2017, "AVPreserve's ...").

I wanted to create my own script rather than using AVP's tool because the basic task is actually pretty simple, and I didn't need the GUI or other bells and whistles offered by their tool. I wanted to optimize the workflow of the script for the exact way that I do fixity checking. AVP's Fixity tool works on Windows and Mac but not Linux (GitHub "Linux/Ubuntu ..."). My script was built on Linux. It may also work "out of the box" or with a few tweaks on Mac, though I haven't tried it.

My script includes some global variables (in all capital letters) at the top, which you can tweak if need be. By default I monitor everything under the ~/files/ directory, but you can change that to monitor anywhere else instead. If you want to test the script on a small amount of data before running the real thing, you can change the monitored directory to point to your test directory. AVP's Fixity tool can monitor multiple different directories, but my script monitors just one.

Often you have files that you don't care much about. For example, in my case, I don't care about doing fixity checking for items in ~/files/temp/ . My script filters out files you don't care about by seeing if the full file path contains any of a list of given ignorable substrings. For example, by default, my script knows that the substring /files/temp/ can be ignored, so any file path matching that substring (such as, say, /home/yourusernamehere/files/temp/test-files/foo.tar ) can be ignored. You can add your own ignorable substrings by creating a file ignore.txt next to run.py (which you can see in the overview diagram at the top of this piece). Each line of ignore.txt should be a single ignorable file-path substring. For example, your ignore.txt might look like this:

/delete-me/
/my-website-from-2018/2018-02-08/
.DS_Store

Any file whose path contained at least one of those substrings would be ignored (including any .DS_Store files if you're on a Mac). By "contained" I mean that the substring can appear anywhere in the file's path. So for example, if you had a folder named I-think-.DS_Store-files-are-weird/ , files in that folder would be ignored when using the above ignore.txt file.

(AVP's Fixity tool also allows for filtering out ignorable files; see p. 18 of AVP (2018). Just from reading the user manual, I can't tell whether the filter substrings match only file extensions, file names, or full file paths.)
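
To illustrate the idea, here's roughly what this kind of substring filter could look like in Python (a simplified sketch, not necessarily the exact code in run.py):

import pathlib

def load_ignorable_substrings(ignore_path="ignore.txt"):
    # One ignorable file-path substring per line; blank lines are skipped.
    path = pathlib.Path(ignore_path)
    if not path.exists():
        return []
    return [line.strip() for line in path.read_text().splitlines() if line.strip()]

def is_ignored(file_path, ignorable_substrings):
    # A file is skipped if any ignorable substring appears anywhere in its full path.
    return any(substring in file_path for substring in ignorable_substrings)

print(is_ignored("/home/yourusernamehere/files/temp/test-files/foo.tar", ["/files/temp/"]))  # True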

AVP's Fixity tool allows for running scans on a set schedule, but I prefer to do scans manually. (AVP's Fixity tool can be run on demand, not just on a set schedule.) Using my script, a scan of lots of large files may take an hour or so and will hog disk resources, so I prefer to run it at a time when I'm not otherwise using my computer. Since I sometimes use my computer at any given hour of the day or night, I can't guarantee that an automated scan on a fixed schedule would occur when the computer is idle.

When you first run my script, you're asked if you want to monitor file types that are often small (os) or often large (ol). ol file extensions are specified as a global variable at the top of the script. Video files like .mp4 , audio files like .wav , image files like .png , and so on are classified in the ol bucket, with everything else going in the os bucket. The reason to make this distinction is that running a fixity check on large files takes a while: maybe ~1 hour for 500 GB of data, or more or less depending on your computer's speed. (Computation is bottlenecked by checksum calculations, which seem to be bottlenecked by disk I/O. I've found that checksum calculations are roughly twice as fast on an SSD compared with a regular hard-disk drive, if I recall correctly.) Meanwhile, running the check on generally small files may take less than a minute, depending on the size of your data. Yet those small files (like .txt or .html files) are also the files that are most likely to change a lot due to being edited. Therefore, these are the files that you'll probably want to review more often.
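
The classification itself only needs to check each file's extension against a list of "often large" extensions, roughly like this sketch (the extension list below is just an example; the real list in run.py may differ):

import os

# Example list of "often large" extensions.
OL_EXTENSIONS = {".mp4", ".mov", ".avi", ".mp3", ".wav", ".flac", ".png", ".jpg", ".jpeg", ".iso"}

def file_type_class(path):
    extension = os.path.splitext(path)[1].lower()
    return "ol" if extension in OL_EXTENSIONS else "os"

print(file_type_class("clip.mp4"))   # ol
print(file_type_class("notes.txt"))  # os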

I run my script with the os choice roughly once a week. It's good to run the script frequently while you still roughly remember what files you've edited recently, so that if you see that a file has changed its checksum, you can quickly recognize that that's ok because you recently edited that file. Meanwhile, I run the script with the ol option about once a month—or even less often if I haven't really made many changes to ol-type files.

If you plan to move a lot of files at once (such as by renaming a relatively high-level folder), you may want to run the script first, make sure everything is good, do the renaming, and then rerun the script. You should be able to breeze through the second review of the changes because probably all that changed is that a bunch of files moved. Running the script before moving a bunch of files allows you to review changes that you actually might care about, without having them mixed together with the big mass of moved files you don't need to look at closely.

The reason that this script separates os and ol based on file types rather than the actual size of the file is that I wanted to avoid the confusion of a situation where a file could move from one class to the other merely by being edited: for example, imagine that editing a file increased its size so that it crossed a file-size threshold and moved from the small-files class to the large-files class. When I use file extensions to partition files into the os class or the ol class, a file would have to at least be renamed before it could switch classes from os to ol or vice versa.

If a file does change its file-type class, say from os to ol , this script will say the original file can't be found when you run the script for os files. That's because the file is no longer in the universe of os files that the script is looking at. Hopefully this doesn't happen very often, and if it does happen, you can just realize that the file wasn't actually deleted.

During the first run for a given file-type class (os or ol), my script records information about the files, including their paths and SHA-512 checksums, and saves all the information to a big .json file in a newly created "history" directory. (This .json file is similar in spirit to AVP's .tsv history file.) Then on the next run of the script for the same file-type class, the most recent history file is loaded, to allow for comparison of the previous information against newly computed checksums and other statistics based on the current state of your files.
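
Here's a simplified sketch of what taking such a snapshot involves; the real run.py records more information and handles the os/ol split, ignorable substrings, and so on:

import datetime
import hashlib
import json
import os

def sha512_of_file(path, chunk_size=1 << 20):
    # Read the file in chunks so that large video files don't have to fit in RAM.
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def take_snapshot(monitored_dir):
    snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(monitored_dir):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            stat = os.stat(full_path)
            snapshot[full_path] = {
                "checksum": sha512_of_file(full_path),
                "mtime": stat.st_mtime,
                "inode": stat.st_ino,
            }
    return snapshot

if __name__ == "__main__":
    snapshot = take_snapshot(os.path.expanduser("~/files/"))
    history_name = "history_%s.json" % datetime.date.today().isoformat()
    with open(history_name, "w") as f:
        json.dump(snapshot, f, indent=2)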

A comparison of the most recent history file against the current state of your data begins with some overview information. The script alerts the user if key parameters have changed since last time, such as the directory being monitored or the list of ignorable file-path substrings. The script then shows changes in file and folder counts from last time to this time, broken down in various ways. The purpose of this is to provide high-level sanity checks regarding whether major data loss may have occurred, before worrying about which individual files may have moved or changed. I wrote these sanity checks as a replacement for the more manual data sanity-checking method I proposed in Tomasik ("Sanity ..."); I no longer bother doing that more manual method.

After finishing the sanity checks, the script lists files that have changed in various ways since the last time.

When a file's checksum has changed since last time, my script looks at whether the file's mtime (date and time of last modification) has also changed to a more recent value. If so, then probably the change in checksum is due to editing the file (hopefully intentionally). If the file's mtime is still the same as it was last time, then the difference in checksums may indicate bit rot—silent corruption of data on disk. I present these two cases separately because while you may want to just quickly skim through the list of files that have merely changed due to edits, you'll want to pay close attention to any files where bit rot might have occurred. The idea of using mtime to make this distinction was taken from Langa et al. (2013-2018).

Below are the four categories of changed files that I report. I refer to these as "category #1" through "category #4".

  1. possible bit rot
  2. files that were edited but not moved
  3. files that were moved but not edited
  4. files that can't be located (including deleted files)

I don't bother reporting files that neither moved nor changed checksums. Unlike AVP's Fixity tool, I don't bother listing files that were newly added, since there's nothing really to check about them until next time. (A newly added file has no existing checksum to verify.)

You can see the exact logic in the review_changes function in my script, but here's a pseudocode summary of how I categorize files based on whether they've changed in various ways:

if file's path is the same as before:
    if file's checksum is the same as before:
        Great. Nothing changed. No need to report this case.
    else:
        if file's mtime is the same as before:
            Report this as possible bit rot. (category #1)
        else:
            Hopefully the file was legitimately edited. (category #2)
else:
    if some file has the old inode value:
        if the checksum of the file with that old inode is the same as before:
            The file just moved. (category #3)
        else:
            Can't confidently find the original file. (category #4)
    else:
        Can't confidently find the original file. (category #4)
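
Translated into runnable Python, that logic might look roughly like the function below, assuming each snapshot is a dictionary mapping a file's path to its recorded checksum, mtime, and inode (as in the snapshot sketch earlier). The actual review_changes function may be organized differently and also prints human-readable output.

def categorize(old_snapshot, new_snapshot):
    # Index the current files by inode so moved files can be located.
    inode_to_new_path = {info["inode"]: path for path, info in new_snapshot.items()}
    categories = {}  # maps old path -> category number, for changed files only
    for path, old in old_snapshot.items():
        if path in new_snapshot:
            new = new_snapshot[path]
            if new["checksum"] == old["checksum"]:
                continue  # nothing changed; not reported
            if new["mtime"] == old["mtime"]:
                categories[path] = 1  # possible bit rot
            else:
                categories[path] = 2  # edited but not moved
        else:
            new_path = inode_to_new_path.get(old["inode"])
            if new_path is not None and new_snapshot[new_path]["checksum"] == old["checksum"]:
                categories[path] = 3  # moved but not edited
            else:
                categories[path] = 4  # can't be located (possibly deleted)
    return categories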

If a lot of files have moved or changed, the list of files printed to the screen during the review process will be quite long. Usually the files that changed are mostly within one or a small number of folders. On the review screen, you can enter f to type a substring of a file path that you want to filter out from the output, so that there's less to look at. (This has no lasting effects; it just declutters the output screen.) For example, if you moved a bunch of files from a folder named foo/ to a folder named bar/ , you could filter out the substring /foo/ from the list of moved files so that you wouldn't have to see them anymore, allowing you to focus on the smaller number of moved files that weren't part of the foo/-to-bar/ move. You can filter out multiple substrings if there are multiple classes of file paths you want to avoid seeing. This filtering is mainly just a convenience, but in a situation where so many files have moved that printing them out would more than fill your Terminal's entire scrollback history, filtering would be necessary in order to shrink the output size to something you could see.

If at any point during this review process you indicate that something doesn't look right (e.g., a file was deleted accidentally, or there's suspected bit rot), you can tell this to the script, causing it to abort without writing anything to disk. You would then go try to fix the problem, such as by restoring good copies of the files from backups. When you think your files are all back to being ok again, rerun the script and go through the review process all over again. Once it looks like everything is ok, the script gives you a final confirmation screen, and then the new state of your files is written to disk as the newest history .json file. This becomes the new baseline to check against on the next run of the script in the future.

Because this history file is only saved after you confirm that everything looks good, all the history snapshots stored on your hard drive should represent states of your data where everything was ok as far as data integrity. In other words, I avoid writing a "bad" history file where data is in a "bad" state, since you don't want that to become a baseline that you compare against in the future. (The exception would be if your data gets slightly messed up and you aren't able to fix the problem, such as because you don't have old enough backups to restore from. In this case you would bite the bullet and pretend to my script that everything is ok, since you have no other choice than to accept the current state of your data as the new baseline.)

AVP's Fixity tool doesn't have the same guarantee that all saved history snapshots represent good states; rather, I think their tool just runs, saves the results to disk, and waits for the user to inspect them. That's necessary if fixity checking is run on a fixed schedule in the background. Because I run my script manually in the foreground, I can review the reported changes while the script is still running and decide whether to save the new history snapshot to disk.

Because my script only saves a history snapshot after everything looks ok, you can determine when you last verified your os or ol data by looking at the datestamp of the most recent history file in the os or ol history directory, respectively.

When AVP's Fixity tool does a comparison of the most recent stored snapshot against the current state of your data, it saves this comparison as a "report" .tsv file. My script doesn't bother saving reports; the comparison is all done interactively in the Terminal as you're confirming that things look ok. I figure you just need to confirm that things look ok once and then move on, and it's not necessary to save a record of this process. In the worst case you could always reconstruct the comparison from two saved history files, although my script doesn't have functionality to do that out of the box. (You could do the comparison by hand or modify my script to do it.)

By default, history files are saved indefinitely, but you can delete old ones if you want. The script only uses the most recent history file, so theoretically you just need to keep that. I think it's reasonable to keep history files going back a year or two in case they'd come in handy, though perhaps you'll never use them.

In the rare event that one of your files is identified as potentially bit-rotted, you may need to restore a good copy of the file from a backup. However, rather than doing this blindly, you'd probably want to investigate whether the file really does have bit rot or if it just got edited without the mtime value updating. You could try opening the file to see if it opens and what it looks like. You could grab a copy of the file from a backup and compare that against the current copy, either by eye or with diff in the case of plain-text files. After this review, you can decide if you want to keep the current version of the file or restore to the backup version. You can manually check whether the backup version of the file has a checksum that matches the checksum recorded in your most recent history file. Run shasum -a 512 yourbackupfilenamehere to get the SHA-512 hash of the backup file. You can open the most recent history file and search for the name of your file to see what its checksum value should be. I think it's fine if this process of dealing with potential bit rot requires some slow, manual steps because you're unlikely to do it very often—plausibly only once every few years or so, depending on your situation and how many files you have.
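
If you'd rather script part of this comparison than run shasum and search the history file by hand, something like the following could work, assuming the history file uses the format from the snapshot sketch above (hashlib.file_digest requires Python 3.11 or newer); the file names here are made up:

import hashlib
import json

history_file = "history_2019-07-01.json"  # hypothetical name of your most recent history file
backup_copy = "/media/backupdrive/files/yebs/processed/career/foo.txt"        # hypothetical
original_path = "/home/yourusernamehere/files/yebs/processed/career/foo.txt"  # hypothetical

with open(history_file) as f:
    recorded = json.load(f)[original_path]["checksum"]
with open(backup_copy, "rb") as f:
    actual = hashlib.file_digest(f, "sha512").hexdigest()

print("Checksums match." if actual == recorded else "Checksums differ!")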

My script identifies files based on absolute rather than relative paths. One advantage of doing this is that it makes it possible to change the directory being monitored (say, from /home/yourusernamehere/files/ to just /home/yourusernamehere/ ) while still recognizing files from the previous history file (because the absolute path of each file is still the same). If my script used paths relative to the monitored directory, that wouldn't work. However, using absolute paths poses a problem if you move your files to a new computer with a new username, since the username part of the absolute path will now be different. For example, /home/mynameisjoe/files/ on the old computer might change to /home/JoeShmoe/files/ on the new computer. To solve this problem, after I look up a file using its actual path name, I replace the /home/mynameisjoe part with ~ before comparing against the previous history file, so that a change in the username is harmless. (Langa et al. (2013-2018) solves this problem in a different way, by just using relative paths: "All paths stored [...] are relative so it's safe to rescan a folder after moving it to another drive.")
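
Here's a minimal sketch of that username-proofing step (my own illustration of the idea; run.py's actual handling may differ):

import os

def normalize_home(absolute_path):
    # Replace the current user's home-directory prefix with "~" so that a
    # future change of username doesn't make every stored path look different.
    home = os.path.expanduser("~")
    if absolute_path == home or absolute_path.startswith(home + os.sep):
        return "~" + absolute_path[len(home):]
    return absolute_path

# On a machine where the home folder is /home/mynameisjoe, this prints
# ~/files/yebs/processed/career/foo.txt :
print(normalize_home("/home/mynameisjoe/files/yebs/processed/career/foo.txt"))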

My script uses SHA-512 hashes, rather than less computationally intensive hashes like MD5. I originally chose SHA-512 because it seemed more generally useful. For example, in the unlikely event that you want the hashing algorithm to be cryptographically secure, you'd want SHA-256 or SHA-512 rather than a weaker algorithm. Also, since disk I/O is probably more of a bottleneck than computation, using a costlier algorithm seemed unlikely to affect the running time of the script. On the other hand, one could possibly make an argument against SHA-512 on the grounds that it unnecessarily wastes small amounts of electricity and therefore generates a tiny bit more greenhouse gases than something like MD5 would. I haven't bothered to estimate the size of this effect, but I imagine it's pretty minimal. Given the time that would be required to change my code and documentation to something other than SHA-512, I haven't made the switch. I think this would mainly only be relevant if I expected my script to be widely used, which I don't.

While my script doesn't use OpenSSL for hashing, you could use it for a rough speed comparison on your machine, by running something like this:

openssl speed md5 sha1 sha256 sha512

My script lists the files and folders on the hard drive and reads through the files in the relevant file-type class (os or ol). It seems safest not to mess with your files or folders while the script is running. For example, if you move a file between when the script identifies its path and when the script actually reads it to compute a checksum, the script won't be able to find the file. You could probably edit files (without moving them) while the script is running and not cause problems, but I haven't checked the code to verify this claim. I would rather just not touch my hard-drive files while the script is running. You can still browse the web or do other things on your computer without a problem. That said, I usually run the slower ol version of the script overnight, so that I won't be doing anything else on my computer while it runs. When I wake up in the morning, the script is waiting for me at the confirmation stage. At this point, it's done reading from my hard-drive files, and I think all the rest of the computation just uses data structures in RAM until the output file is written. So once you get to the stage of confirming that things look ok, if you don't want to walk through the confirmation steps right away, I think you can use your computer normally—doing other things, including editing and moving files—until you're ready to return to the confirmations.

Imperfect move detection

My script looks for files based on their file paths, such as ~/files/yebs/processed/career/foo.txt . If a path that was recorded in the history file is still present in the current state of your files, then the script can look at whether the checksum is the same as before. But what if there isn't any longer a file with that file path? This means the file was either moved or deleted. (I consider renaming a file to be a kind of moving. After all, renaming can be done with the mv command in Terminal.)

My script tries to locate a moved file by looking at the original inode value of the file and seeing if any new file has that inode value. If yes, then I compare the checksum of the file I found having that inode value against the original checksum. If the checksum also matches, it's very likely that I've found the moved file.

The idea of using inodes to detect moved files is explained by AVP in Duryee (2017). Since inodes are specific to Unix-style file systems, I assume my script wouldn't work out of the box on Windows. (Windows has its own version of inodes, so it could be done, but I haven't bothered.)

Unfortunately, inodes are not a perfect solution because a file's inode can change. Duryee (2017) explains (p. 2):

There are some notable exceptions to the handling of index values [i.e., inodes] within file systems. The behavior of certain programs will change the index location of a file. Many programs, such as text editors, do not edit files in place on the filesystem. Instead, they create temporary copies of the file, which then store the changes to the file being made within the program. When the changes are committed to disk, the temporary file (containing the changes) is copied over the original. This process is used by programs in certain cases because it allows for greater crash resilience. However, the temporary copy is a completely new file and it is given a new file index location. As such, after using a program such as TextEdit, you may find that Fixity gives an unexpected result for the file. This is due to the editing program’s behavior, and not an issue with Fixity or your file.

Not all text editors change inodes. For example, I've found that the GNU nano editor doesn't seem to. But other editors do.

Most of the time this changing of inodes isn't a problem because my script looks up a file based on its file path first, ignoring the inode. It's only problematic for a text editor or other program to change a file's inode if the file also moves at the same time, which means I can't find the new file based on either its path or its original inode. The moved+edited file would then be indistinguishable from a deleted file in the eyes of my script. AVP (2018) mentions the same issue (p. 27), using the term "index location" instead of "inode":

Most files maintain their index location value when they are edited. Certain applications (such as many text editors) will instead edit a temporary version of a file and copy that over the original upon saving. This will change the checksum and the index location, which will report correctly as a changed file. However, if this were coupled with a rename or move it would result in a report of 1 removed file and 1 new file.

(My script doesn't report on new files, so my script would just note the one file that may have been deleted.)

One could imagine fancy ways to try to tell that a file was merely moved+edited rather than deleted, such as looking for an apparently newly created file that's approximately the same as the old and apparently deleted one according to some definition of "approximately the same", such as in terms of file name, file size, most common words in the file if it's a text file, etc. Implementing such logic would add complexity to the script, and I'm doubtful it would be particularly valuable, so I haven't done it. In practice, the main thing to look out for when reviewing the list of files that have changed in various ways is any file that you know you haven't touched at all recently.

Suppose my script looks up a file based on file path and can't find it. Then it looks up the old file's inode and does find a file with a different path that has that inode. Why can't I then automatically conclude that the file moved? Why do I also verify that the checksum is the same before concluding that the file moved? The reason is that inodes may be recycled, so that another file might inherit the old, vacated inode of a previous file. On my own computer as of 2019, this inode recycling happens right away. You can try it for yourself and see if you find the same thing. In a temporary directory, run

touch one.txt
touch two.txt
ls -i

The last command will show the inodes of the files in the current directory. For simplicity, suppose that one.txt has an inode value of 1, and two.txt has an inode value of 2. Now open one.txt in a text editor, type a letter or two, and save. Running ls -i shows that the inode for one.txt has changed to some new value like, say, 3. Now open two.txt, make some random edit, and save. Rerun ls -i . What I see (and what you might see) is that the inode for two.txt is now 1. two.txt immediately recycled one.txt's old inode after it became available. So we can't just use inodes to identify moved files, since in this case we would conclude that one.txt moved to two.txt , when in fact what happened was that they were both just edited.

GitHub ("Polling ...") presents another example to illustrate why we can't use just an inode value to identify a moved file if inodes are recycled.

Duryee (2017) discusses the inode recycling issue (p. 2), suggesting that empirically it seems not to be a problem on Windows and Mac because "most filesystems assign index values sequentially, only repeating after exhausting all possible values". A comment on Stack Overflow ("Does recreating ...") reports that it seems like inodes are reused right away on ext4, while new (not immediately recycled) inodes are used on ZFS.

By the way, I assume inode values may not transfer from one computer to the next? Therefore, if you want to use my script to verify your files after a move to a new computer, you should probably first run it on the old computer right before moving, leave all files where they are, migrate to the new computer, and rerun my script right away. All the files should be verified properly because my script doesn't use inodes if a file hasn't changed its file path.

As I explained, my script can't find a file that has both moved and changed its checksum at the same time, because the original checksum is used to confirm the identity of the moved file. This implies that if a file moved and has bit rot, my script won't be able to flag the bit rot, since my script won't be able to tell for sure that it's the same file as it knew about last time. Hopefully the inability to detect bit rot in this situation won't be a huge risk, because probably most of your files don't move most of the time, meaning that most instances of bit rot are likely to be detected. Also, if you remember for certain that you didn't edit a given file, then the fact that the file falls into category #4 (can't be confidently found) rather than category #3 (moved without being edited) would suggest that the file may have been corrupted.

One might wonder why I use both inodes and checksums to find moved files. Why not just use checksums? The reason is that you might have multiple files that have the same checksum. This problem would be especially severe if you have a number of empty text files throughout your different folders. Duplicates could also occur if you have multiple backups of your data. For example, if you have several non-zipped backups of a website on your computer, then many of the HTML files, image files, etc in those backups will be the same from one backup to the next.
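
Putting these pieces together, the move-detection logic works roughly like the following sketch. This is simplified relative to my actual script, and the helper names are made up.

import hashlib
import os

def sha512_of_file(path):
    h = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def find_moved_file(old_entry, inode_to_path):
    """old_entry: dict with the 'inode' and 'sha512' values recorded in the
    history file for a path that no longer exists.
    inode_to_path: dict mapping inode -> path for files currently on disk,
    e.g., {os.stat(p).st_ino: p for p in current_paths}.
    Returns the new path if the file appears to have moved, else None."""
    new_path = inode_to_path.get(old_entry["inode"])
    if new_path is None:
        return None  # no current file has the old inode
    # Inodes can be recycled, so also require a matching checksum before
    # concluding that the file merely moved.
    if sha512_of_file(new_path) == old_entry["sha512"]:
        return new_path
    return None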

Simulating bit rot

I define "bit rot" as a situation where data on disk gets corrupted without the user having deliberately or accidentally edited the data. Fortunately, bit rot is fairly rare, and I didn't observe it organically while testing out the mymonitor script. Instead, I did one simple test to simulate bit rot, and I found that my script did correctly identify the simulated bit rot.

My method of simulating bit rot was partly inspired by the test_helper.bash file of Langa et al. (2013-2018). I identified a file—let's call it file.html—on my hard drive that was already being tracked by my fixity-checking script. My goal was to modify some bytes in this file while keeping the file's mtime constant, since my script identifies bit rot when a file has changed checksum while keeping the same mtime value.

I first transferred file.html's access time and modify time values to some other file—let's call it temp.html—as follows:

touch temp.html -r file.html

Then I modified some bytes in file.html . I probably could have done this in a text editor, but just in case it mattered, I used a more raw approach instead, like Langa et al. (2013-2018)'s test does. I ran this command:

dd if=/dev/zero of=file.html bs=1 count=3 conv=notrunc

which I think writes three bytes of 0s over the beginning of the file. Then, I needed to reset the access and modify times back to the original values:

touch file.html -r temp.html

When I ran mymonitor again, my script correctly identified file.html as possibly having bit rot because its mtime value was the same as before while the checksum had changed.
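
The same simulation can be done from Python if you prefer, capturing the original timestamps with os.stat and restoring them with os.utime after corrupting a few bytes. This is just a sketch with a made-up file name, not part of my scripts.

import os

path = "file.html"  # a file already tracked by the fixity-checking script

st = os.stat(path)  # capture the original access and modify times
with open(path, "r+b") as f:
    f.write(b"\x00\x00\x00")  # overwrite the first three bytes with zeros
os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns))  # restore the timestamps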

Backing up the data

Types of backups

There are many possible ways to store backed up data, including flash drives, external hard drives, optical discs, network-attached storage, cloud storage, or even printing out certain texts to paper. It's probably wise to use several methods, especially for data you can't stand to lose.

I use an external hard drive or external SSD as my primary backup "workhorse", using other methods as supplements from time to time. Hard drives can handle a much larger volume of data than Blu-ray discs or DVDs. It's very fast to transfer data to a local hard drive relative to uploading it to the cloud.

There are various types of backups one can create: full, differential, incremental, etc. So far I only use full backups because they seem sufficient as long as you have several full copies of varying ages and because incremental backups would add complexity to my workflow. However, many people do use incremental backups.

It's good to have an inventory of where all your backups (and original copies) live to keep track of them.

In the rest of this piece, I focus on how I do hard-drive backups. Because I do these the most often, they're the most in need of automation, whereas other, less frequent backup methods can often be done in a more ad hoc way.

Backup frequency

In my opinion, external hard drives used for backups should be disconnected from a computer 99% of the time, only being plugged in when needed. If you keep your external drive plugged in all the time, then while you're safe from certain classes of risks, such as failure of your computer's internal hard drive in many cases, you're still vulnerable to other risks, such as

If you keep your external drive disconnected most of the time, you can't do backups totally on autopilot, since the drive needs to be plugged in before a backup. However, because I encrypt some of my data, there's already a manual step in my workflow to enter the encryption password, so I'm already signed up for some amount of manual effort when doing a backup. (It seems to me like a very bad idea to store your encryption password on your computer for access by an automated backup script, unless there are clever security measures to protect it that I'm not aware of.) If you wanted, I guess you could have one external hard drive that's always plugged in to your computer for receiving automatic scheduled backups and then have other external drives that are disconnected; this seems like overkill to me, and it still runs into the problem of needing to be around to enter your encryption password even for the automatic backups.

If backups require manual intervention, it's probably wasteful of effort to do them daily. I find that doing them roughly every two weeks seems reasonable. The rest of this subsection presents an example calculation for deciding on your backup frequency.

Suppose that each manual backup takes you ~15 minutes of human labor. (The computational processing may run for more or less time than that.) If you do N backups per year, you'll spend N * (15/60) = N/4 hours per year on these hard-drive backups.

A new backup occurs every 365/N days. On average, a disaster would strike midway between two backups (sometimes sooner after your last backup, sometimes later), causing you to lose in expectation (1/2)*(365/N) days of work. This "work" that you lose will be a combination of stuff you do for your job (if you use your own computer for your job) as well as stuff you do as a hobby, for entertainment, etc. Assuming you're awake for ~14 hours per day, this means losing the new files, edits, and deletions that you did over the course of (1/2)*(365/N)*14 hours. Of course, many hours of your day aren't spent changing data on your computer's hard drive. You also spend time reading, socializing, sending emails, editing Google docs, and so on. The productive output of most of those activities isn't usually lost when your computer's hard drive dies. Let's assume that only 1/5 of your waking hours actually go toward creating or changing data on your computer's hard drive. (This fraction depends a lot on your job and lifestyle. For example, the fraction might be less than, say, 1/20 for a restaurant waiter or above 1/2 for a novelist.) So losing the data on your computer's hard drive since the last backup actually only costs (1/2)*(365/N)*14*(1/5) hours of time.

Suppose the probability of loss of the data on your computer's internal hard drive due to any number of factors (mechanical failure, theft, ransomware, etc) is about 20% per year. That estimate seems reasonable to me based on my past experience, though maybe you could reduce this number if you're particularly careful to keep your computer safe and check your disk's health from time to time. Then in expectation you'd lose (1/2)*(365/N)*14*(1/5)*0.2 = 102/N hours per year.

We want to choose N to minimize the sum of the backup time cost plus the data-loss time cost. Define

f(N) = N/4 + 102/N

In reality, N can only take on non-negative integer values, but we can pretend that the function f is continuous on the non-negative real numbers so that we can use calculus to find the optimum. Set the derivative equal to 0:

f'(N) = 1/4 - 102/N^2 = 0
1/4 = 102/N^2
N^2 = 408
N ≈ 20

We can verify that this is indeed a minimum rather than a maximum of f by seeing that the second derivative at N = 20 is positive:

f''(N) = (-2) * (-102) / N^3 = 204/N^3 > 0

So you should do ~20 backups per year, which is close to one every half of a month.
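
If you'd rather skip the calculus, you can also just evaluate f(N) at integer values and pick the smallest result, which confirms the same answer:

def f(n):
    return n / 4 + 102 / n  # backup time plus expected data-loss time, in hours/year

best_n = min(range(1, 101), key=f)
print(best_n, round(f(best_n), 1))  # prints: 20 10.1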

One complication with the above calculation is that I counted all hours of time as equal, when in fact, some activities are more tiring than others. For example, some of the time one spends on leisure can be seen as an extra time cost due to time spent on cognitively demanding work (Cotton-Barratt 2015). That said, it's plausible this complication wouldn't affect my calculation very much because the time costs I compared (doing backups versus editing data on one's computer) are both for activities that are often reasonably cognitively demanding, so they may both require similar amounts of recovery time per hour of work.

Regularly vs occasionally used backup drives

I recommend backing up to at least two different external hard drives, ideally of different types and from different manufacturers in case one specific kind of drive turns out to be defective.

One reason this is a good idea is that it provides additional redundancy. For example, suppose that while you're backing up to your main external hard drive, a software or user error occurs that deletes the data from both your internal hard drive and the external hard drive that's plugged in. Or maybe some physical accident occurs while you're plugging in the external drive that harms both it and your computer. In cases like this, you'd still have a third copy of the data on another external drive.

Having at least two different external hard drives for backups is also useful because you can assign one to be backed up regularly (every half month), while the other one is only backed up occasionally (say, every 3 months). I'll refer to these as the "regular drive" and "occasional drive" hereinafter. This distinction is useful because if there's a problem with your data that you don't notice until, say, a month after it occurs, the data on the regular drive backed up every half month will already be overwritten, but the data on the occasional drive probably won't be, unless you just did a backup to that drive within the last month.

I maintain a policy of always running a fixity check on both os and ol files before backing up to the occasional hard drive, in order to check if there are data problems before syncing with that drive. If there are problems, you can restore data from the occasional drive as needed before it gets overwritten. After you verify that your computer's internal hard drive has its data in good shape, you can then overwrite the older version of data on the occasional drive.

In the overview diagram, you can see a file dlbuod_2019-07-01_eom.txt , which is short for "Date last backed up to the occasional drive: 2019-07-01. End of message." This empty file stores in the file name itself the last date when you did a backup to the occasional (every-3-months) drive. The _eom tag indicates that all the content of the file is in its file name; see the "eom" section of Tomasik ("How I use ...") for more explanation. After you do a new backup to the occasional drive, you can update the date in this file name.

The concern about overwriting old data too soon before you've discovered problems with the newer data can be alleviated further if you have more total copies. Unfortunately, my backup drives as of 2019 only have enough space for one copy of my debs/ data. It would be nice if I could have several snapshot copies over time of the debs/ data on each drive. Fortunately, because my yebs/ data is so small, I do have space for multiple versions of it on each drive. I might keep roughly 10 total snapshots of it going back in time. Having more snapshots is also nice because it makes it more likely that at least one of the encrypted archives will open properly and not be corrupt.

I use a different file system on my regular drive versus my occasional drive as a further way to have diversity in my backups. If the software on my computer that creates and works with one kind of file system happens to be buggy in some subtle way that I wouldn't notice until trying to restore files from the backup, then at least the other file system is likely to work. So, for example, I could format my regular drive as exFAT and my occasional drive as NTFS. Or I could format my regular drive as ext4 and my occasional drive as exFAT. And so on.

Setting up the backup drive

When you get a new external drive to use for backups, make sure it's formatted with the file system you want. Once you're ready to add data to the drive, then on Linux, you'd run the following commands to set up the folder structure into which your backup data will be placed. For the first command, replace the username and drive name with whatever they're called on your system.

cd /media/your_username/your_drive_name/
mkdir files
mkdir files/yebs
mkdir files/debs

Unlike on our internal hard drive, we don't need the temp/ directory on a backup drive. We need to create the yebs/ directory because we'll be plopping snapshot backup archives into it. And we need the debs/ directory to already be in place as the destination directory in our rsync command. (More on that later.)

My backup script

I wrote a Python script to automate the process of backing up my data to an external hard drive. It's the script backup.py that you can see in the overview diagram. As you can see in the-actual-bash-aliases-file.txt , I created a Bash function mybackup that allows for running this script in the Terminal from any location.

The script is written for Linux but could be made to work on Mac with a few tweaks, such as changing the /media/yourusernamehere path to /Volumes (or something like that; I haven't tried it). You'd also need to install command-line versions of GPG and 7-Zip on your Mac.

Before running the script, you should have already plugged in the external drive to your computer. The script assumes there's only one external drive plugged in at once (to keep things simple) and will harmlessly fail if there's more than one drive in your /media/yourusernamehere folder.

Sanity checks

A main priority with this script was making sure it wouldn't mess up my data, especially the master copy of my data on my computer. So I included numerous assertions and user confirmations throughout.

The script begins by checking that the required dependency programs are installed and fails if not. The script tries to figure out the computer's user name and asks for verification that it guessed correctly. The script checks that it's running in the right folder (the files/ folder), that the right subfolders exist, and so on. It then looks at the /media directory, guesses the name of the inserted drive, and asks for confirmation of that too. The script assumes that only one external drive is plugged in at a time and fails if 0 or 2 or more are plugged in.
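
For illustration, the external-drive check could be implemented along these lines (a sketch, not the exact code from backup.py):

import getpass
import os
import sys

username = getpass.getuser()
media_dir = os.path.join("/media", username)

drives = [d for d in os.listdir(media_dir)
          if os.path.isdir(os.path.join(media_dir, d))]
if len(drives) != 1:
    sys.exit("Expected exactly one drive in %s but found %d: %s"
             % (media_dir, len(drives), drives))

input("Back up to drive '%s'? Press Enter to continue or Ctrl+C to abort. " % drives[0])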

Before doing anything further, the script creates a list of all the files and another list of all the folders in the user's home directory (except for some ignorable ones, like items in .cache/ ). At the very end of the script, these same lists are created again, and the script checks that they're equal to the original ones. In other words, no files or folders in your home directory were accidentally deleted or created. If the lists aren't equal, the script prints out which files and folders were added and/or deleted. This check at the end is in a finally statement, so it should run even if something fails earlier in the script. If the script does fail in the middle of running, you may have one or two stray files you need to clean up manually, and you'll be able to see which ones by comparing the files and folders from the beginning of the script to the end. That said, this script is unlikely to fail if you have everything set up correctly, and I put in lots of checks at the beginning to fail right away if things don't seem to be set up properly. Even if the script fails during the more important operations, I don't think this would result in harm to the data on your computer's internal hard drive. The only files the script deletes are files it created in the first place. Still, there's always room for oversights in the logic, which is why I included this check that your home directory's file and folder paths didn't change from beginning to end.
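
The before/after comparison of home-directory paths could be structured roughly like this (a simplified sketch; the real script also ignores some directories such as .cache/):

import os

def list_home_paths(ignore_names=(".cache",)):
    """Return the set of all file and folder paths under the home directory,
    skipping a few ignorable subtrees."""
    home = os.path.expanduser("~")
    paths = set()
    for root, dirs, files in os.walk(home):
        dirs[:] = [d for d in dirs if d not in ignore_names]
        for name in dirs + files:
            paths.add(os.path.join(root, name))
    return paths

before = list_home_paths()
try:
    pass  # ... the actual backup work goes here ...
finally:
    after = list_home_paths()
    if before != after:
        print("Paths added:", sorted(after - before))
        print("Paths deleted:", sorted(before - after))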

The script also does a slightly less rigorous check regarding files and folders on the external drive. It counts the number of files and folders in each of the debs/ and yebs/ folders before and after doing the backup, so that you can compare the numbers. Unlike your computer's home directory, the external drive should change in terms of its numbers of files and folders, as you add and remove files during the backup process. But at least the script lets you know how these numbers changed, so you can see if anything is unexpected. During the backup process, the yebs/ folder should gain no new folders and exactly one new file—namely, the newly added encrypted archive. The script fails if either of these isn't true. Changes for files in debs/ are less predictable, so the script doesn't fail on anything in this case; it just prints out the before versus after numbers for you to review.

Backing up debs/

The script asks you if you want to back up debs/ or not. If you decline, then only yebs/ will be backed up. Declining to back up debs/ makes sense if your debs/ data is larger in size than your backup drive, such as if you're backing up to a small flash drive. Even if you don't back up debs/ , the sanity checks earlier in the script required that a debs/ folder existed on the external drive. Therefore, you have to at least create a dummy, empty debs/ folder on the external drive regardless of whether it will be used. The reason for this is just that I didn't want to have to write special cases in the sanity-checking logic depending on whether you're doing a debs/ backup or not.

If you are doing a debs/ backup, the process is pretty easy because no zipping or encryption is involved. All I do is use rsync so that changes that have been made on the computer's internal hard drive version of debs/ get synchronized on the external drive's version. I first run rsync in "dry run" mode to generate a list of what files will get deleted and added. The user presses Enter to confirm, and then rsync runs again, this time for real.

rsync is called using Python's subprocess module. I also use that to run a few of the commands for the yebs/ backup.
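
The two-pass rsync could look something like the following sketch. The paths are illustrative, and the exact flags my script passes may differ.

import subprocess

src = "/home/yourusernamehere/files/debs/"
dest = "/media/yourusernamehere/your_drive_name/files/debs/"
base_cmd = ["rsync", "--archive", "--delete", "--itemize-changes", src, dest]

# Pass 1: dry run, so the user can review what would be added and deleted.
subprocess.run(base_cmd + ["--dry-run"], check=True)

input("Press Enter to run rsync for real, or Ctrl+C to abort...")

# Pass 2: the real sync.
subprocess.run(base_cmd, check=True)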

Backing up yebs/

Before compressing and encrypting the yebs/ folder, I first create a checksum file for all its contents, called sha512-of-everything_made-just-before-backup.txt . On my computer, it takes a few minutes for this to finish. My script then checks the sha512-of-everything_made-just-before-backup.txt file. If there are errors, the script quits and lets the user figure out what's wrong. Otherwise, the script continues on to the step of compressing and encrypting.
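
For illustration, a checksum file in the same format that shasum -a 512 produces can be generated with a loop like the one below. This is a sketch rather than my actual code, and I'm assuming here that the verification step behaves like shasum -a 512 -c (i.e., it recomputes each hash and compares).

import hashlib
import os

def write_checksum_file(top_dir, out_name="sha512-of-everything_made-just-before-backup.txt"):
    """Write one 'hexdigest  relative-path' line per file under top_dir."""
    lines = []
    for root, dirs, files in os.walk(top_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            h = hashlib.sha512()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1024 * 1024), b""):
                    h.update(chunk)
            lines.append("%s  %s" % (h.hexdigest(), os.path.relpath(path, top_dir)))
    with open(os.path.join(top_dir, out_name), "w") as out:
        out.write("\n".join(lines) + "\n")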

A main concern I have when encrypting data is making sure I'll be able to open it again later, even as software tools evolve over time. One way to ensure that is to use free and open-source tools, so that old source code will always be available even if the software stops being maintained (such as in the case of TrueCrypt). Another way to help ensure the archives will still be able to open properly in the future is to use a diversity of different ways of compressing and encrypting them. In particular, my script chooses from among five different combinations of compression and encryption (although I only use two different encryption tools):

  1. 7-Zip compression and symmetric encryption
  2. tar the folder first without compression, then do 7-Zip compression and symmetric encryption
  3. tar with gzip compression, then symmetric encryption with GPG
  4. tar with bzip2 compression, then symmetric encryption with GPG
  5. zip compression, then symmetric encryption with GPG.

My script chooses one of these five options randomly unless you specify a particular one. That way, over time, I'll have a diverse collection of archive formats on my backup drive.
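
As an example, option 3 (tar with gzip compression, then GPG symmetric encryption) boils down to commands roughly like these, here wrapped in subprocess calls. The file names are illustrative, and the exact flags my script uses may differ.

import os
import subprocess

archive = "yebs_2019-07-01.tar.gz"

# Create a gzip-compressed tarball of the yebs/ folder.
subprocess.run(["tar", "-czf", archive, "yebs"], check=True)

# Symmetrically encrypt it with GPG; this prompts for a passphrase.
subprocess.run(["gpg", "--symmetric", "--output", archive + ".gpg", archive], check=True)

# Delete the unencrypted intermediate archive.
os.remove(archive)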

My script waits for the user before running the encryption step, for the following reason. rsync'ing your debs/ data and waiting for the yebs/ checksum file to be created can both take several minutes or more. You might walk away from your computer while those operations are running. However, if you let the script go straight on to the encryption step, it might fail, because GPG's password-entry prompt times out after a minute, so you have to enter your password soon after it appears. That's why my script waits for the user to press Enter before proceeding with encryption. This waiting to press Enter has no timeout; you could come back a day later, and it would still be waiting for you to proceed.

After the new encrypted archive is added to the external drive, the script deletes the checksum file that it created.

My script doesn't automatically delete old archives from the external drive. I want to let the user do that manually, so that the user can decide how many old versions going back how far are needed. You can delete several old yebs/ backups in bulk, not needing to do it manually each time you make a backup, so automating the deletion process seems less useful than automating the creation process.

This script only works out of the box with an external drive, but you can still use it to create encrypted archives that you then back up to other places as well, such as the cloud. Just run the script with an external drive plugged in, and then copy the encrypted archive it generates onto your computer so that you can back it up in another way as well.

Checksums for backups

As mentioned above, I create a sha512-of-everything_made-just-before-backup.txt file for the yebs/ data before backing it up. The purpose of this file is that when you later pull a yebs/ encrypted archive off of the backup drive and extract it on your computer, you can verify the integrity of the extracted files. To do this verification, you would go into the newly extracted folder and run

myshacheck sha512-of-everything_made-just-before-backup.txt

If this command returns no output, then everything was OK.

Theoretically I could do the same process of computing a checksum file for all the debs/ files before backing them up, but due to the volume of data, creating that checksum file would take at least an hour, and I don't want to have to do it each time I make a backup.

It's also worth noting that you already have checksums for your files in the "history" folders created by mymonitor . If you pulled yebs/ and debs/ data off an external drive and onto a new computer using the usual folder structure (i.e., both of those folders being located in ~/files/ ), you could run mymonitor (doing both an os run and an ol run) in order to compare the current checksums of the files against the previously recorded checksums saved in the latest history file. However, keep in mind that unless you ran mymonitor right before backing up your data, some of the file paths and checksums in your most recent history file will be outdated. So it may look like there was some data corruption on the external drive's data when in fact that's just because you edited the file since last running mymonitor . The sha512-of-everything_made-just-before-backup.txt file has the virtue that all checksums should be fresh, because it's created right before backing up. (One exception to this would be if you edit a file in yebs/ while the backup script is running, after checksums are computed but before the data is zipped up. In any case, it seems generally safest not to edit files on your computer while the backup script is running.)

Backups checklist

Before you run the backup script, you need to do a few other things first. This section presents a checklist to follow when backing up to an external drive. First I present a concise checklist of steps. This is what you would actually follow along each time you do the procedure. Then I later provide more detail explaining some of the steps. You wouldn't want to read this explanation each time you follow the procedure, which is why I'm separating it out.

The procedure includes lots of "if" statements. You can skip over anything indented below an "if" statement if the statement evaluates to false.

The procedure

Notes on some of the steps

Some of the steps in the checklist are self-explanatory or were already explained by previous discussion. Following are a few notes on some of the less obvious steps.

I suggest opening an old encrypted archive as practice to make sure your memory of the encryption password hasn't gone off track. It would be bad if your memory of the password mutated over time, such that you started encrypting new files with a wrong password (and then couldn't open them later using what you thought was the correct password), or such that you could no longer open older encrypted files that used the original, correct password. It's important to use an old archive rather than a recent archive for this test, since recent archives may have been encrypted with the new, wrong password in your head.

I'm not sure how necessary it is to ground yourself before handling external drives, but I did find several sources warning about the issue. For example, Menzies and Shallcross (2019) say: "Static electricity CAN damage flash drives! Before inserting a USB flash drive, be sure to discharge any electrostatic charge you may be caryying [sic] by touching something metal. Be especially careful during dry conditions." I assume the same idea would be true for external hard drives too. Menzies and Shallcross (2019) explain: "Internal and external hard drive housings will protect against most [electrostatic discharge] ESD, but a wrist cable and/or ESD protection mat should be used." You can search the web for {electrostatic discharge "flash drive"} for more sources. I found several sources saying electrostatic charge is a risk for storage media and several others saying it's not. TODO: Maybe some day I'll compile a list of more of these opinions. I assume like 99% of people don't worry about this issue, and I don't think it's a huge cause of data loss.

When testing that encrypted archives can open, you should pull them onto your computer and open them there, rather than opening them directly on the external drive, unless your external drive has full-disk encryption. That's because even if you immediately delete the decrypted files from the external drive, the decrypted data itself is still there hiding in the bits of the storage medium, until it gets overwritten by something else. (That said, it's unlikely that most bad guys would expend the effort to look for deleted-but-not-overwritten files.) In any event, pulling the data off the external drive and onto your computer seems like a better test to begin with, because it verifies that the data can in fact be pulled off the external drive and put onto another drive (namely, your computer's internal hard drive). As you can see in the-actual-bash-aliases-file.txt , I created a Bash function mysak that makes it very easy to extract encrypted archives.

The warning to only plug external media into computers you trust not to be malicious seems like generally good advice; there are particular scenarios where plugging a drive into an untrusted machine could be risky, though I don't know how big the risk is. Hoffman (2017): "an infected computer could reprogram a connected USB device’s firmware, turning that USB device into a malicious device. That device could then infect other computers it was connected to, and the device could spread from computer to USB device to computer to USB device, and on and on."

Burning optical discs

By the late 2010s, optical discs (CDs, DVDs, and Blu-rays) had become relatively uncommon, and they can be a pain to use. Optical discs tend to be slower, noisier, and less capacious than flash drives or external hard drives. However, optical discs might still be useful as an additional way to store data to make one's backups more resilient. For example:

It's unclear to me if these advantages of using optical discs as part of a backup strategy outweigh the hassle, because probably hard disks, flash drives, and SSDs would already be pretty robust if stored indoors in a place where temperature, humidity, vibrations, etc aren't too extreme. But if you're sufficiently paranoid about losing your data, doing optical backups as well could be useful.

I've written elsewhere about tips for burning data on Blu-ray discs (Tomasik "Archiving ..."). In this section I'll focus just on one part of the process of burning optical discs that I didn't discuss in that article: splitting one's data into chunks that can fit on a disc. This is a task sometimes known as "disc spanning". For example, if you're burning to 50 GB Blu-ray discs but you have 600 GB of data to back up, you need to split your data into at least 12 or 13 chunks.

One way to do this splitting is by hand. You can look at the sizes of various folders and find a subset of folders small enough that they can fit on a disc. If there's a single folder that by itself is more than 50 GB in size, you'd have to manually split up the files within it into chunks that are each less than 50 GB. If you only burn your data to optical discs once every several years, this hassle of manually splitting up your data might be tolerable because you don't do it very often. And if optical storage is just a "backup of last resort" for you, then probably you don't need to back up more than once every few years. Still, I like the idea of not having to do this manual splitting, so I looked for other options.

A second idea could be to use 7-Zip, which can split its archives into chunks of a given size. This would work, but I worry about what might happen if there was data corruption somewhere. Would that render the entire archive spanning all the optical discs broken? Or would it be possible to ignore the corrupted spot and still put the rest of the archive back together again? I don't know, and that topic sounds potentially complicated to understand for a novice like me.

A third solution is to split the data into folders of individual files (rather than a giant 7-Zip archive) in an automated way rather than by hand. That's what my script dir-splitter.py does. You can find it here. Before running it, you should examine and maybe adjust the global variables at the top of the script. You can set the size of your target disc (e.g., 50 GB), how much less than the target you want your data size to be (e.g., 0.9 to only collect 45 GB of data per chunk), what directory to split (the default is ~/files/debs/ ), and in what directory to put the script's output (the default is ~/files/temp/ ).

Your debs/ directory might have some data that you don't really need to back up. For example, I have some raw footage from filming bugs around my house that I don't really care about, as well as some old podcasts downloaded from the web that I'm unlikely to ever listen to. It may be a waste of effort to burn data like this to a Blu-ray disc. You can omit data you don't want to burn by adding a substring of the relevant path to the list OMIT_PATHS_WITH_THESE_SUBSTRINGS . You can see examples in the script itself.

The script lists all non-omitted files within the specified directory in alphabetical order. It runs through them one by one, looking at the size of each file, and grouping them together into a chunk until the size hits the threshold (e.g., 45 GB), at which point a new chunk is started.
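
The grouping itself is a simple greedy pass over the sorted file list. Here's a simplified sketch of the idea (not the exact code from dir-splitter.py, and the names are made up):

import os

CHUNK_LIMIT_BYTES = int(50e9 * 0.9)  # e.g., aim for 45 GB per chunk on 50 GB discs

def split_into_chunks(file_paths, limit=CHUNK_LIMIT_BYTES):
    """Group files (already sorted alphabetically) into chunks whose total
    size stays under the limit. Fails if any single file exceeds the limit."""
    chunks = [[]]
    current_size = 0
    for path in file_paths:
        size = os.path.getsize(path)
        assert size <= limit, "File too big to fit on one disc: " + path
        if current_size + size > limit and chunks[-1]:
            chunks.append([])  # start a new chunk
            current_size = 0
        chunks[-1].append(path)
        current_size += size
    return chunks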

In principle this script could directly copy the chunks of files into their own folders destined for burning. But doing this could overflow your computer's internal hard drive. For example, if your computer's hard drive is 1 TB and your debs/ directory is 600 GB, then it wouldn't be possible to copy all of that data and store it in chunked form on the same hard drive, since you'd need at least 600 GB + 600 GB = 1200 GB in total. Instead of copying the data on its own, this Python script merely outputs Bash scripts that you can use to copy one chunk of data at a time later. For example, if you have 13 total chunks of data, then the output of this Python script will be 13 Bash scripts (shown here with an example creation date of 2020-12-07 in the file names):

part-01-of-13_created2020-12-07.sh
part-02-of-13_created2020-12-07.sh
...
part-13-of-13_created2020-12-07.sh

If you then run the command

bash part-01-of-13_created2020-12-07.sh

you'll create a folder part-01-of-13_created2020-12-07/ and copy the first chunk of debs/ data into that folder. If you do this one chunk at a time, you won't overflow your internal hard drive unless it's almost full already.

Each Bash script starts by creating the folder into which to put the chunk of files. It then uses cp --parents to copy the files there while preserving their folder structure. Finally, my script enters the created folder and runs myshaofeverything to create a list of checksums of all the files in the folder; this list is named sha512-of-everything.txt . The script then runs myshacheck sha512-of-everything.txt to make sure these checksums are ok. Later on, if you want to check that the data on your disc hasn't been corrupted, you can rerun that myshacheck sha512-of-everything.txt command.

Finally, the newly created folder (e.g., part-01-of-13_created2020-12-07/ ) is ready to burn to an optical disc. I recommend writing this same name (i.e., part-01-of-13_created2020-12-07 ), along with a note that it's from the debs/ directory, with a Sharpie on the external case of the optical disc (not the disc itself) in order to label it. I included the date in these names because it may be useful to know that later on when you're looking at the name you gave to the optical disc.

To run my script after adjusting its global variables, just go to the scripts_ytmc/ folder and type python3 dir-splitter.py .

This script can be used for more than just Blu-ray discs. For example, if your debs/ directory isn't very large and could be backed up on DVDs, you can use dir-splitter.py in that case as well by setting SIZE_OF_TARGET_DISC_IN_BILLION_BYTES to 4.7 or 8.5 or whatever size the DVDs are. Note that dir-splitter.py fails if there's a file larger than the SIZE_OF_TARGET_DISC_IN_BILLION_BYTES threshold. For example, if you have a 6 GB file and are trying to back up to 4.7 GB DVDs, my script will fail.

If your yebs/ directory isn't too large, you don't need the dir-splitter.py script. You can just copy an encrypted yebs archive that you created using backup.py and burn that onto your disc, perhaps along with a checksum for the encrypted archive file, which you can create using my Bash function myshaforonefile . If yebs/ does need splitting, you can use dir-splitter.py to split it and then encrypt each of the parts separately (perhaps using the mysak Bash function).

Moving to a new computer or reformatting your current one

This section gives a checklist to follow when moving your data from an existing computer to a new computer, or when reinstalling your operating system from scratch on your current machine. The fixity-checking and backup tools we developed earlier can make this process easier.

You might reinstall your operating system if it's running slowly, if it may have been infected with malware (Indiana University Knowledge Base 2015-2018), or just to generally keep things fresh. Reinstalling will get rid of extraneous programs and data that you no longer use. Going through the process of reinstalling the operating system from scratch can also be an exercise to make sure that everything you care about is actually included in your backups. That's a good thing to verify anyway because you want your backups to be comprehensive in case you need them, such as if your computer dies and you're forced to set up everything from scratch on a new computer. That said, reinstalling the operating system and setting up all your programs from scratch takes a lot of time, so I wouldn't do it very often. If you move to a new computer every few years anyway, you may never need to reinstall your operating system on your current machine.

There are tools to make moving to a new computer easier. For example, Mac has a Migration Assistant app. Windows 10 allows for syncing settings and data across computers using your Microsoft account. I worry that trying to automatically transfer everything over to a new computer might sometimes run into compatibility problems? I feel like doing the migration manually is more certain to work properly, and if I only do it once every few years, the time cost of doing it manually isn't that severe. I'd rather not migrate anything more than my raw data because I'd rather have fresh versions of programs that aren't burdened by old baggage. Also, if there might be malware somewhere on your old machine, I imagine that it's more likely to transfer to the new machine if you sync everything rather than if you just move data files? (If you sync apps from the old to the new machine, that includes syncing a malicious app.)

Apparently I'm not alone in preferring to migrate manually. Collins (2015) says (at 4m21s):

When we talk about doing backups in Linux, what's really important to back up are your personal files. The configuration of the software on your computer is really not that big a deal. As a matter of fact, if you did do a complete and total system backup [...], chances are it would take longer to do that than it would to just reinstall Linux and move your personal data from your "home" folder backup into the system. So I don't even bother with [full-system backups].

That said, I prefer to migrate even less than the entire home folder, which contains a lot of random data created by programs that I mostly don't need to keep. I migrate just yebs/ and debs/ .

The following subsections explain steps to manually migrate to a new computer. When I say "new computer" in the below procedure, it can equally well mean your current computer after reinstalling the operating system from scratch.

Step 1: Collecting all important data

The first step is to find everything on your computer that you want to save and make sure it's included in one of your backed-up folders, i.e., yebs/ or debs/ .

Since the contents of ~/files/temp/ won't get backed up, clean those files all out. Probably you can delete most of them. If anything needs saving, move it to yebs/ or debs/ . Or, if you want to delete the data from temp/ but don't have time to do it properly now, move the data to yebs/ or debs/ temporarily and deal with it after you've moved to your new computer.

Browse around the rest of your computer to see if there are any stray files and folders that you want to save. For example, look at Desktop/ , Documents/ , Downloads/ , Pictures/ , etc. See if there are any special folders that you want to keep that are in your home directory due to being created by a specific program for storing that program's data. Use ls -A when listing items so that you can see any hidden files or folders. If you find anything that needs saving, move it to yebs/ or debs/ .

Earlier I suggested making a folder in ~/files/yebs/processed/computer/programs/active/ to store information about each program (and each browser plugin) that you had to install yourself, explaining how to install it and what customizations you made to it. I also store notes about any preinstalled programs that I tweaked as well. This active/ folder will be your guide to setting up those programs again on the new computer. If you diligently recorded every program you installed and tweaked while you were using your computer for the past few years, then the documentation in the active/ folder should already be up to date. If you think you may have forgotten to make a documentation folder for one of your programs or forgotten to record some custom settings, then you can look through the programs on your computer and check that you have the needed documentation for all the ones you care about. If you miss one program or a few custom settings, it's not disastrous because you will probably notice that you need those things once you get to the new computer and can figure out how to reproduce them. At that point, you can document the program in your active/ folder so you'll have the information for next time. If there's a program in active/ that you don't want to reinstall because you no longer use it, then you can move its documentation folder to ~/files/yebs/processed/computer/programs/no-longer-used/ .

For each program you use, think briefly about whether it has data that you want to save. For example, your browser may have bookmarks that you'll want to export, assuming you don't have them cloud-synced. Perhaps you have browser plugins that have their own data that should be exported. For example, if you have an ad-blocking / content-blocking plugin, you likely have custom rules, whitelisted domains, etc that should be saved.

If you've never used a given program on your computer, or if you only played around with it but never did anything meaningful with it, then you should be able to ignore it, because it shouldn't have any important custom settings or data needing to be saved, and you don't need to reinstall it.

Step 2: Backing up the data

Now that everything needing to be saved is either in yebs/ or debs/ , you can back up those two folders.

First, run mymonitor for os files and then run it again for ol files to make sure that the data on your computer is in good shape and that you have checksums recorded for everything (except ignorable files that you don't care much about). If there's a problem, fix it, and then rerun mymonitor until everything looks ok.

The checksums that you just saved with mymonitor will be used to verify the integrity and attendance of your files once you move to the new computer, so don't modify any of your monitored files from this point forward, or else the checksums won't match. Or if you do modify a file, make a mental note that you did so, so that you won't be surprised when the checksum doesn't match later.

Now do the backup procedure to put the data on an external drive. (When you go through that procedure, obviously you don't need to repeat the mymonitor stuff that you just did.) Actually, I recommend backing up to at least two external drives, just to be extra sure that you have a good copy of the data. Test opening some files in debs/ on the external drives to make sure they work properly, and maybe pull the encrypted yebs/ archive off an external drive back to your computer to test opening it, to be sure it's ok.

It's most important to check that the data on the external drives is in good shape if you're going to be reformatting your current computer, since in that case, you won't have any copies of the data on your computer to fall back on in the event that the external-drive versions turn out to be bad. So make super sure that the external-drive data will work. If you have a spare computer that you trust, you could try pulling your data off an external drive onto that computer and opening some files, including the encrypted yebs/ archive, to make sure everything works.

Step 3: Before (re)installing the operating system

Among my folders in ~/files/yebs/processed/computer/programs/active/ is a folder documenting steps to follow when installing my operating system, including which options to select on the setup menus. I need to be able to read these documentation files while doing the installation. If you're setting up a new computer that has its own screen, such as a laptop, then this isn't a problem, since you can open the documentation files on the old computer while doing the steps on the new computer. However, if you're getting a new desktop computer that will use the old computer's monitor, or if you're reinstalling the operating system on an existing machine, then you'll need to put these documentation notes somewhere else to be able to use them. You could print them out. Or you could copy them temporarily onto another computer (such as a laptop or tablet) so that you'll have them while your main computer monitor is occupied with the installation process.

When doing this, you should print or copy not just notes about installing the operating system but also notes about any other programs you need in order to be able to read your data backups. For example, if your yebs/ data is stored in an encrypted 7-Zip file on the external drive that you're using to transfer the data, then you should have notes on how to install 7-Zip on the new computer, since you'll need that to open the encrypted yebs/ archive.

At this point you can go ahead and install or reinstall your operating system. You can look up how to do that elsewhere.

If you're following along with the steps on this web page, then go off and do the installation, and then come back to this page when your new computer has its operating system installed and can browse the web.

Step 4: Setting up the new computer

Now that you've installed the new computer's operating system and have come back to this page, it's time to restore your files on the new computer. Run these commands:

cd ~
mkdir files
mkdir files/temp
touch .bash_aliases

Go into your files/ directory. Drag off the yebs and debs data from your backup drive. For debs data, grab the whole debs/ folder. For yebs , grab only the most recent backup snapshot. The yebs data needs to be decrypted and extracted on your computer after being pulled off the external drive. (Don't decrypt it on the external drive itself unless you have full-disk encryption there.)

Unfortunately you can't use the mysak command to decrypt and extract the yebs archive because you don't have your custom Python scripts set up yet, since the Python scripts are in the yebs archive itself. Instead, you'll have to decrypt and extract the yebs archive by hand. You can look up what command(s) to use by consulting the code of the Swiss-archive-knife.py script. Once the yebs/ directory is extracted, you can delete the archive file(s) from which it was extracted.

Next, add the one line to .bash_aliases that I described earlier. Quit Terminal and reopen it so that your custom Bash aliases and functions will now be loaded.

If the newly extracted yebs/ directory has a sha512-of-everything_made-just-before-backup.txt file from when it was backed up, run

myshacheck sha512-of-everything_made-just-before-backup.txt

to test the checksums. If that command returns no output, everything is good, so you can delete the sha512-of-everything_made-just-before-backup.txt file.

Run mymonitor for os files and then run it again for ol files to make sure the data transferred to the new computer correctly. In the unlikely event that there are problems, like missing or corrupted files, try to fix them using the data on your other backup drive or the original computer's internal hard drive if it still exists. (Or if you're nervous about the overall quality of the data transfer due to errors, then you might start the data-transfer process over again by doing a new backup on the old computer, assuming you still have the old computer intact.)

Test opening a few random files throughout yebs/ and debs/ , and make sure they look ok, as a further check that your data migrated properly to its destination.

If file and folder permissions got changed to 777 by being on the external drive's file system, see Tomasik ("Changing 777 ...").

At this point, your data should be restored. Now you can go through the folders in ~/files/yebs/processed/computer/programs/active/ and install each of those programs, making adjustments to the settings as needed. This is a big task, and you're unlikely to finish in one sitting. One way to keep track of what you have and haven't done is to copy the entire active/ directory to some temporary location, and as you do each step, you can delete the corresponding folder, file, or line of a file. Once the copied active/ folder is empty, you're done. If during this process you discover any changes that need to be made to your notes, make those changes on the original version of your notes (not the temporary copy).

You may want to do security-relevant setup steps first, such as turning on your software firewall if need be and maybe installing antivirus software.

In addition to customizing settings, remember to restore any data that your programs need, such as browser bookmarks.

You may want to keep around your old computer for a few weeks or months following the migration, so that if you forgot to transfer anything, you can go get it from the old computer. (Unfortunately, this doesn't work when reinstalling the operating system on your current machine, unless you saved a full-system backup.) Once it looks like the old computer is no longer needed, securely wipe and factory-reset it before donating, selling, or recycling it. You can look up guides on how to prepare your type of computer for transfer to someone else.

Things I would have done differently

If I were starting the process of writing this article and its attendant Python scripts over from scratch, I would make a few small changes. I don't intend to make these changes now because doing so would involve a bit of extra work for relatively little payoff. Once my code has stabilized and been well tested, I'm reluctant to change it for fear of messing something up. I think fixing the following things would only be worthwhile in the unlikely event that my organization conventions and scripts found widespread adoption, in which case even little tweaks could be pretty valuable. However, I'm almost certain that my idiosyncratic conventions won't become widely used.

How organized should you be?

Keeping one's data organized and backed up comes with costs and benefits. One has to try to find the optimal balance between being too sloppy and being too perfectionist. This section discusses some personal thoughts on that question.

I enjoy some things that other people consider boring, such as reading fine print and carefully checking my work. The fact that I enjoy perfectionism was a main reason I did well in school, because school grades reward people who are perfectionist about schoolwork to the neglect of other aspects of life. One friend in high school asked me why I regularly spent ~7 hours taking detailed notes on ~20 pages of reading from the US-history textbook, and one reason I gave was that "Taking detailed notes is too much fun."

At the same time, excessive perfectionism is often a waste of time, so I now try to rationally assess roughly how perfectionist I should actually be about a given activity, and I often consciously choose to be less perfectionist than I'm naturally inclined to be. For certain kinds of work I still err more on the side of quality over quantity than I should, partly just because I like perfectionism, but I recognize that excessive attention to detail is a form of selfishness rather than something that's necessarily optimal from an altruistic utilitarian perspective.

I suspect that many people avoid this problem by getting bored, but I don't really get bored (other than getting fatigued after doing the same thing for a while, which I don't consider the same as boredom). I love understanding every last detail of something.

I think the appropriate degree of perfectionism varies a lot depending on the activity. For example, when you're setting, updating, and storing passwords to online accounts, you should pay extremely close attention to what you're doing and avoid losing track of the password. This is especially true when encrypting your own files, because there is no "Forgot your password?" reset button for an encrypted file. It seems important to at least be highly organized regarding your passwords, keys for two-factor authentication, backup drives, and such, to avoid locking yourself out of accounts or losing your data. Regularly testing your backups to make sure they work also seems like a low cost to incur in exchange for the benefit of reducing the risk that your backups won't work when you finally need them.

On the other hand, some ways of staying organized on the computer have much less value. One example I discussed earlier was that most archived messages in ~/files/yebs/processed/letters/ probably aren't worth organizing.

I have a lot of files in my unprocessed/ folders because I've never gotten around to organizing them into their proper resting places, and for some of them, maybe I never will. Perhaps it's ok to have some dumps of low-importance data that you may not ever sort through or review. There are a hundred sources of digital clutter, and the firehose of new data never stops. We have to figure out which clutter is worth addressing and which can be left alone.

I'm often astonished at how most people breeze through setting up online accounts or installing software without paying much attention. I probably put more care into these things than I should, but I've also learned through experience that paying attention when setting things up can pay dividends. Once, in 2013, I was trying to reduce perfectionism and quickly clicked through an install screen for a popular program. Soon thereafter I discovered that my browser was messed up with a crappy toolbar. It turned out that the installation process had included a fine-print checkbox regarding this toolbar, which I had missed. Since then I've carefully read the fine print during installations and account setups (except for the full legal terms and conditions). Account setup often includes checkboxes about getting marketing emails from an online service that you have to uncheck, which you'd miss by just rushing through the process. In cases like this, it's not clear if "perfectionism" is actually a net waste of time, because deleting and unsubscribing from junk emails also takes time.

Similarly, several of the organizational ideas I discussed in this piece are designed to save you time in the long run, such as marking files with _scdi to tell your future self that you never need to reread them before deleting them. In these cases, being organized may be a short-term cost with a long-term payoff, similar to reducing technical debt in software engineering. On the other hand, it's not always clear how big the savings actually are. Maybe even without the _scdi tag, it would still be obvious that you could delete a tax document in the future without needing to look at it. Or maybe you could just leave it sitting around in your files forever rather than ever deleting it. There are a lot of disorganized people in the world who still manage not to lose that much time due to being disorganized, which can lead one to question just how important organization is.

Perfectionism about being organized can sometimes save time if it makes automation easier. It's more feasible to write a script to do something on your computer if all the files and folders being operated on follow a rigid pattern. Heterogeneity and clutter are enemies of automation.

If you do like being perfectionist, then it's good to also be minimalist. In other words, if you emphasize quality over quantity, you can't afford much quantity. I try to minimize signing up for new online accounts, installing new software, getting new devices, and so on, in order to reduce the number of things I have to maintain. In the past, I created lots of online accounts without thinking much of it, not realizing at the time how much long-term investment I was taking on. Here are some costs of having a non-throwaway online account:

One of the reasons I like storing data in my own formats offline is that I avoid all of these costs. There's no account to set up or close out. There are no marketing emails. The marginal costs of password protection and backups are basically zero because you're already doing that for the rest of your data. Of course, managing one's own data rather than using a cloud service also adds some costs. For example, hypothetically a cloud service could forward-migrate the file formats for your data automatically, while you'd have to do this manually if you store the data yourself.

I think it's good to avoid self-identifying as a perfectionist, because too much perfectionism is something one should try to avoid, although one also has to respect the limits of one's psychology. Part of the problem with perfectionism is that it fixates on the small flaws in whatever one is currently working on while ignoring opportunity costs. Every hour of every day, possible opportunities to create value are slipping away forever. That is itself a major "problem" too, and one that ideally should be included within the scope of problems that a perfectionist worries about. Of course, fretting too much about missed opportunities can itself be psychologically stressful and debilitating. Somehow one has to find a "middle way" between these extremes, though doing so is hard, and I often don't succeed at it.

I think I'm pretty good at avoiding perfectionism when blogging. I don't require that I thoroughly understand a topic before writing about it, because if I had to be an expert on a topic to write about it, I would write about hardly anything at all. Writing about big-picture topics by its nature implies that you can't be a thorough expert on the subject matter, because there's too much to know and too few years in a human lifespan.