Sanity checking the number and size of your personal files

By Brian Tomasik

First published: . Last nontrivial update: .

Summary

This page suggests a super simple (perhaps obvious) routine you could follow to measure the size of your personal data over time, as a way to help detect any major accidental data loss.

Motivation

There are some dumb ways in which you could potentially lose large amounts of data from your computer, such as accidentally deleting a directory without realizing it or having a cut-and-paste move of files fail partway through.

If a directory of your data is unchanging over time, a good way to ensure you haven't lost data is to create checksums for all the files in that directory, such as using the shasum tool (Tomasik "Manually ..."). However, often your files aren't static, and constantly updating checksums to track changes to the files would be too much work.
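For example, here's a rough sketch of how you might do that for every file under the current directory (assuming the shasum tool is installed, as it is by default on macOS and many Linux systems). The checksum list is written one level up so that it doesn't end up listing itself:

find . -type f -exec shasum -a 256 {} + > ../checksums.txt

Then, running this later from the same directory verifies that no file has changed or gone missing:

shasum -c ../checksums.txt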

A simpler, less accurate alternative is to measure the total size of your files and make sure it doesn't change drastically over time, unless drastic changes are expected. You can also count the total number of individual files and total number of individual directories, and make sure those don't change drastically either, except when you expect them to. Then, if you do notice any dramatic, unexpected data loss, you can hopefully restore the lost data from older backups while you still have those backups.

If you access the files regularly to use them, then you're likely to already notice any major data loss when you try to navigate to a familiar location and don't see it. So the sanity check I'm proposing here is mainly for data that you don't look at very often but that isn't quite static enough for it to be convenient to use checksums to verify data fixity.

Or, if the data loss doesn't consist in wiping out an entire large directory but rather takes the form of losing, say, 5% of the files within your directories, you might not notice the data loss on your own, and a check like this would be useful.

Procedure

As an example, suppose you store your data in two main directories: Desktop/ and Documents/. Within each directory, you can create a file called hofds.txt, which stands for "history of file counts, directory counts, and space used". I would store the data in this file in "comma-separated values" (CSV) format, maybe with a space after each comma to make things more readable. The first row contains the headers:

Date, File count, Directory count, Disk usage, Explanation for any large changes

Subsequent rows will contain the numbers that you measure. You can take these measurements periodically (say, every 6 months) in order to make sure you haven't accidentally lost a bunch of data. If you make big changes to your number or total size of files that are expected, you can document that in the "Explanation for any large changes" column.

First, navigate to the directory of interest—e.g., the Desktop/ directory. Then, on Mac and Linux, you can use this command to measure file count:

find . -type f | wc -l

This command measures directory count:

find . -type d | wc -l

And this command measures disk usage:

du -sh

Of course, even without using a command prompt, you can probably find the total size of and number of items in a directory using a GUI file manager.

You can manually copy and paste these numbers into hofds.txt. It's probably most convenient to put the newest date first so that if you have a lot of rows in the file, you won't have to scroll all the way down to see the latest info.
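If copying the numbers over by hand feels error-prone, a command along these lines (just a sketch; tweak to taste) appends a row in the same format with the explanation column left blank, though it adds the row to the bottom of the file rather than just below the header:

printf '%s, %s, %s, %s, \n' "$(date +%Y-%m-%d)" "$(find . -type f | wc -l | tr -d ' ')" "$(find . -type d | wc -l | tr -d ' ')" "$(du -sh . | cut -f1)" >> hofds.txt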

Here's an example of what your history might look like after three rounds of measurement:

Date, File count, Directory count, Disk usage, Explanation for any large changes
2020-04-02, 8216, 101, 2.1G,
2019-12-14, 8148, 99, 2.0G, deleted a bunch of junk
2019-07-01, 11989, 130, 4.2G,

Of course, you could alternatively store this information in a spreadsheet rather than a .txt file, but I like not needing to open bulky spreadsheet applications.

You could write a program to automate these measurements, but you'd still need to manually review them, unless you set up an even fancier system to send alerts if and only if one of the metrics decreases by more than some threshold percentage.
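As a rough sketch of what such an alert could look like, assuming the newest-first layout above (latest measurements on line 2 of hofds.txt, previous measurements on line 3), this awk one-liner prints a warning if the file count dropped by more than 10% since the last measurement:

awk -F', ' 'NR==2 {new=$2} NR==3 {old=$2} END {if (old > 0 && new < 0.9 * old) print "WARNING: file count fell from " old " to " new}' hofds.txt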

Including sizes of subdirectories

When you only keep track of statistics at the top level of a directory tree, you lose granularity about what's happening in different subdirectories. As an example, imagine that within Desktop/ you have one subdirectory for videos/ and another for todo-lists/. Probably the videos/ directory will be vastly larger than todo-lists/, and it may be hard to see how the size of todo-lists/ is changing because it's a drop in the bucket compared with videos/.

One solution to this problem is to track the sizes of subdirectories as well as the top-level directory. The du utility can show sizes for all subdirectories up to a given depth using the -d option. Say you want to get sizes up to the second level of subdirectories. You would use this command:

du -h -d 2
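To make that concrete, if Desktop/ contained the videos/ and todo-lists/ directories from the earlier example, plus a hypothetical todo-lists/2020/ subfolder, the output might look something like this (with sizes made up for illustration):

1.8G	./videos
 20K	./todo-lists/2020
 36K	./todo-lists
2.1G	.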

There's a tradeoff between tracking too many subdirectories and too few. Tracking too many makes the sanity checks more time-consuming, and you'll see more cases where sizes fluctuate drastically just because you moved lots of files from one subdirectory to another. Tracking too few can hide potentially significant changes, because you're only seeing an aggregate metric dominated by the largest files.

Since the output of du -h -d 2 will likely contain several lines, you may not want to use a CSV format to store this information, since the number of columns won't necessarily be fixed from one run to the next, and you'd have to copy and paste a lot of numbers by hand. Instead, you could just dump the output of the command directly into a file. You could name this file with the date and compare it to previous files that store the results of previous runs. You could collect all these files under a hofds/ directory.

In fact, you could dump all the metrics into a file at once by running the following command within the directory you want to monitor, assuming you already have the hofds/ subdirectory set up:

(find . -type f | wc -l && find . -type d | wc -l && du -h -d 2) > hofds/`date +'%Y-%m-%d'`.txt

For example, if you ran this command on 2019 Sep 15, it would create a file hofds/2019-09-15.txt. If you care to record the reasons for any big changes in these metrics from last time, you could open hofds/2019-09-15.txt and manually add that information to the bottom.

While you can compare a newly created hofds/ snapshot file against the previous one by looking at the two side by side, it's easier to compare them using a "diff"ing program, especially a graphical one. This allows you to focus just on changes: increases or decreases in counts and sizes, addition of subdirectories, and removal of subdirectories.
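For example, supposing the previous snapshot had been taken on 2019 Mar 15 (a made-up date), you could run:

diff hofds/2019-03-15.txt hofds/2019-09-15.txt

Lines prefixed with < come from the older snapshot, lines prefixed with > come from the newer one, and subdirectories present in only one of the files show up as additions or removals.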

In principle, you could set the above command to run on a schedule, perhaps every day. Then you'd accumulate a bunch of snapshots over time. This might be slightly more helpful than just taking snapshots every once in a while, since it would allow you to zero in on exactly when some big change took place. In principle, you could even write a program that would parse the numbers from these daily snapshot files and plot them as a graph to make visualization easier. I worry that these additional ideas would be overengineering what should be a relatively simple check, so for now I prefer to just create and compare the snapshot .txt files by hand once every few months. Maybe there are existing programs that already do these more sophisticated forms of monitoring of your data size?
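If you did want to try the scheduled approach, a crontab entry along these lines would take a daily snapshot at noon (just a sketch: the directory and time are placeholders, and note that % characters have to be escaped in crontab files):

0 12 * * * cd "$HOME/Desktop" && (find . -type f | wc -l && find . -type d | wc -l && du -h -d 2) > hofds/$(date +\%Y-\%m-\%d).txt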