Some tips on backing up your data

By Brian Tomasik

First published: 2016 Feb 29. Last nontrivial update: 2019 Feb 14.

Summary

This page explains how to export data for backup from Google and Facebook, as well as how I back up my websites.

Summary
Introduction
Website backups
How to collect data for backup
(Optional) Geomagnetic storms and EMPs
My backup schedule
Quirks and workarounds when downloading Google Drive data
Scratchpads vs. archive documents
I don't depend on others to preserve my writings
Why I upload old school papers
Why I love plain text
Footnotes

Introduction

You've probably had the experience of writing a long, thoughtful comment on a forum, pressing "Submit", and then finding that the comment didn't go through, causing your work to be lost. I find that this is an extremely frustrating experience (and one that has led me to do Ctrl+c on the text of my long comments before submitting them, in case I need to later Ctrl+v to recover the text). Compared with losing a single forum comment, losing your lifetime collection of writings and other content would be many orders of magnitude worse. This is why I'm rather obsessive about backing up content of mine that I value. (Content that I don't care about can be deleted to reduce clutter.)

An important fraction of my extended mind is composed of information in Google services, on my websites, etc. I expect the risk of losing such data is fairly low, but it might happen if someone hacked the accounts and deleted the data, or if I lost all ability to log in to the accounts, or if I accidentally deleted the accounts, or if the sites had major data-storage failures. Since it's cheap to back up the data, it seems reasonable to do so periodically.

Leo A. Notenboom's "You Can’t Assume the Cloud Has Your Back" is one of many introductions to the importance of backing up your data.

Every ~6 months, I do a full backup of my important data, including Google Drive, calendars, etc. If any files have non-public information (such as Social Security numbers or private conversations with friends), you should probably encrypt these files before backing them up.

I store my data both in the cloud and on discs at home for redundancy. This leaves at least one extant copy of the data in the event of a variety of possible disasters: house fire, theft, losing/breaking physical discs, cloud data loss, account hacking, and so on.

Website backups

My websites are the most dense form of important data that I don't want to lose. One copy of them lives directly on the servers. Most pages also have one or more backups on Internet Archive. I sometimes upload new iterations of particular pages to Dropbox, as a crude form of version control. However, it's helpful to have additional backups besides these.

For additional redundancy, I formerly used Dropmysite, which backed up monthly all the websites I maintained. It also allowed me to download zip copies of each site's WordPress database file (which contains essay texts) and WordPress "uploads" folder (which contains images and PDFs). I then uploaded these to the cloud. Note that a WordPress database .sql file contains potentially sensitive information, so if you do this, you should encrypt the file.

As of 2017, I've stopped using Dropmysite to save money and because I'm now just doing website backups by hand using website-downloading tools: HTTrack on Windows or SiteSucker on Mac. I do these downloads every few months and then upload to the cloud. Because these copies of the websites are downloaded from the public web, they don't contain any sensitive information and so can be stored unencrypted (and even unzipped).

I occasionally send copies of my website backups to one or more friends for them to back up so that even if all of my accounts went away simultaneously, someone else would still have the data. Since these sites are downloaded from the public web, there's no need to take precautions to keep the data private.

Finally, every few years, I plan to create paper copies of my websites, as discussed here. I completed my first paper backup of my websites in 2017.

How to collect data for backup

This section describes how to pull your data from various places.

Google Takeout

Google Takeout allows you to download roughly all of your Google content to a collection of zip files.

If you have multiple Google accounts (e.g., one for an organization that you're involved with), remember to do Takeout for all of them if you have permission to do so.

Google Takeout exports both the contents of a Google Doc (as a .docx file, or a .txt file if you so choose) and its comment threads, including resolved threads (as an .html file). The export doesn't seem to include version history for the Google Doc, even though the online Google Doc does have version history.^a

In addition to storing version history for Google Docs, Drive can store version history for regular files that you upload too, if you re-upload a file with the same name as the previous one to the same folder. You can choose for the old version of the file to be kept for 30 days or permanently. Google Takeout appears not to export old versions of a version-historied file—just the latest version of the file—which I tested for myself in 2018 with a simple text file that I replaced on Drive. Thus, if you really care about backing up file version history, you need to find another way to preserve it, since Google Takeout won't do so. (Of course, a hacky solution that works fine if you don't have immense amounts of version history to worry about is to datestamp files and upload all copies, such as uploading both myfile_2014Jun11.txt and myfile_2014Sep09.txt.)

Google Calendar is also included in Google Takeout, but if you want to download it individually, you can go to "Settings" -> "Calendars" -> "Export calendars".

Folders/documents you don't own

I think Google Takeout only exports Google Drive documents for which you're the owner. Unless you transfer ownership, you're the owner of the documents you've created.

What if you want to export all documents in a given folder, not just those you own? Or what if you want to export a whole folder that you don't own? In these cases, you can download a folder by clicking the down arrow on the folder path name -> "Download", which will download the whole folder as a zip file. Make sure you have permission to store the data.

YouTube

YouTube data can be included in Google Takeout. If you want to back videos up individually, you can download your own videos. However, if you download your videos one by one, their size is generally several times smaller than the size you uploaded them as. For example, I uploaded a 275-MB .mp4 video to YouTube, and when I downloaded it back from YouTube, the .mp4 file was only 43 MB. However, Google Takeout seems to return the original video file rather than the shrunken version.

Personally, I prefer to store on my own the original copies of the video files I create, rather than using Google Takeout from YouTube to get the video files. One reason is to ensure that I have two copies of the video files on the cloud rather than only having the single copy on YouTube. In addition, Google Takeout will refuse to export a video that contains even a small amount of copyrighted content; you'll get a message saying "Unfortunately the video [your video name here] was marked as un-exportable by YouTube so Takeout was not able to include it in your export."

I still do run Google Takeout on my YouTube data periodically in order to get the metadata about the videos, especially title and description. I upload these metadata files to Google Drive. Your thumbnail images sadly aren't included in the Google Takeout export, but you can download the thumbnail images based on the .json metadata files using a script that I wrote.

Facebook

Facebook also has a takeout feature. Click the down arrow in the upper right of any Facebook page -> "Settings" -> "General" -> "Download a copy of your Facebook data." Facebook groups aren't backed up in this process.

GitHub

You can search for some systematic ways to back up your GitHub content, but if you only have a few repositories, a simple solution is just to clone all of them and then back up those folders.

Stray documents on your desktop

I try to keep all important data in the cloud in case my laptop crashes, but you can also make sure that any data on your desktop that's not already stored with Google gets included in your backup.

(Optional) Geomagnetic storms and EMPs

This discussion has moved to its own page.

My backup schedule

Here's a summary of how I do periodic backups.

Do every 3 or 4 months:

Use SiteSucker to download the latest versions of important websites. I enter the following urls one by one:

- https://reducing-suffering.org/
- https://briantomasik.com/
- http://www.simonknutsson.com/
- https://foundational-research.org/
- https://casparoesterheld.com/
- http://www.wallowinmaya.com/
- http://prioritizationresearch.com/
- http://s-risks.org/
- https://was-research.org/

Once this completes, quickly spot check the downloads to make sure the needed content was retrieved. Zip the folders, name the zip file with the date of download, and upload to the cloud. I also keep at least one copy of my website downloads unzipped on the cloud just in case zip files for some reason fail to uncompress in the future. Plain text is relatively robust against data corruption.

Let me know if you'd like me to add your site to the list of sites I download periodically.

Every once in a while, I should also export and back up the redirects for my sites.

Do every ~6 months:

Send the latest zip file of the backed-up websites to a friend for him/her to back up as well.
Back up Google data. When doing Google Takeout, I only care about including the following products: Blogger, Calendar, Contacts, and Drive. I also export YouTube in order to get the metadata (as described above in the "YouTube" section of this piece); I don't need the Takeout-exported video files themselves, because I already back up the original version of any video file that I create immediately after I create it.
If I've been active on other platforms recently (such as GitHub), download that content as well.

Do every ~5 years

Create new paper printouts of my websites.

Quirks and workarounds when downloading Google Drive data

Multiple download chunks

With a default 2 GB size of download files, you may have multiple parts to your Google Takeout download. Unfortunately, these separate chunks may confusingly contain files from a variety of Google products and from a variety of different Drive folders. As someone explained on the Google Drive Help Forum: "I downloaded several folders with subfolders from Google Drive and I ended up with about 10 zip files on my MacBook. When I extract the zip files, I see that every zip file has a few files from each of the folders in my Google Drive."

To keep different products in separate chunks, you could export all the small products first (Blogger, Calendar, Contacts) and then export Drive separately. Better yet, you can use a bigger size for the download files than 2 GB so that everything fits in one download chunk; you may need to download as a .tgz file rather than a .zip file to make this work.

If your data is bigger than the 50 GB limit on how big a download file can be, one solution is to arrange your Drive data into top-level folders that are each less than 50 GB in size. While by default Google Takeout exports your entire Drive, you can also choose to export only one or a few top-level folders at a time. If each top-level folder is within the 50 GB limit, you can do Takeout on one top-level folder at a time, which downloads that whole folder in one piece rather than intermingled with other folders.

Truncation of file and folder names

Another annoying quirk of Google Takeout is that (at least as of Feb. 2018) it seems to truncate long file and folder names (those longer than 50 characters, or 54 characters if the file has 4 characters for an extension, such as the 4 characters in .txt). I verified that this problem occurs both for .zip and .tgz Takeout export formats. I don't think it matters how deep within Drive folders the file is, because I get the same problem even if the Takeout-exported file is in a top-level Drive folder. And while one might hope that Google Takeout would store untruncated file names in the overview index.html file that links to the individual files in your archive, that's not the case: this index.html file also truncates file names.

While file-name truncation may be merely annoying for most isolated files, it's a more serious problem in some cases:

If you rely on a file name to encode metadata about a file that's not stored anywhere else, then truncation may lose some of that information. If you use Takeout to export files, you'd want to check that if truncation occurs for a given file, the information in the truncated file name is also contained in the file itself, so that losing the full file name is not disastrous. For example, if you have a PDF file whose file name is the document's title, as long as the document title is also visible in the PDF itself, then truncation isn't disastrous because you could always reproduce the file name from the document title.
Suppose you're doing Takeout on a website folder where you have HTML files that reference other pages or uploads with long file names. For example, your HTML might refer to looooooooooooooooooooooooooooooooooooong_image_file_name.jpg, but Takeout would truncate the uploaded image file to looooooooooooooooooooooooooooooooooooong_image_fil.jpg, which would break the link to the file in the HTML document, causing the image not to be rendered on the web page.

Because of truncation of folder names, you also might have a situation in which two different folders get collapsed into one. For example, if you have a copy of the https://was-research.org/ website on Google Drive, then there will be one folder called analysis-lethal-methods-wild-animal-population-control-vertebrates/ and another called analysis-lethal-methods-wild-animal-population-control-invertebrates/. After truncation, both of these are shortened to analysis-lethal-methods-wild-animal-population-con/. As a result, Google Takeout puts the two index.html files for the two pages into the same truncated folder.

For some reason, Google Drive doesn't truncate file names if you right-click and "Download" a folder rather than using Takeout. Unfortunately, folders retrieved with right-click "Download" are limited to 2 GB, so your downloaded files will be split up into messy, intermingled chunks unless you download individual folders that are each less than 2 GB.

I wrote a simple Python program to search a directory recursively to find file and folder names that I expect would be truncated by Google Takeout. You could run this program within a folder before uploading it to Google Drive, or you could right-click and "Download" a folder from Google Drive and run this program on its contents to identify any files or folders that should be shortened. Of course, you can also just accept Google Takeout's truncation, but if you shorten the names yourself, you can make sure essential information doesn't get lost from the name, or take corrective action if essential information will inevitably be lost. Corrective action might include storing the information that was in the file name inside the file itself, or in the worst case, creating a separate metadata file to store the information.

Changing some characters in file names

Drive may make a few other minor modifications of file names when those files are downloaded in a folder, such as changing the following characters to the underscore character (_): question mark (?), ampersand (&), semicolon (;), and apostrophe ('). These modifications occur even if you right-click and "Download" a folder, not just if you use Takeout. These changes don't occur if you download only a single file, but they do occur when you download two or more files at once, maybe because when you download two or more files at once, they get zipped together?

That said, these changes are generally not too severe in terms of breaking possible HTML links to the relevant website files, since in most cases, few core website file names contain these characters. Plus, these modifications are minor enough that if someone ever did have to restore a website from your website backup, that person could probably notice these file-name modifications and fix the problem.

Potentially imperfect downloads

Google Drive's downloading of files isn't 100% flawless. One time, I downloaded a few tens of Google Drive folders collectively containing many thousands of files. I diffed these files against another copy of the same files that I had. Across those numerous downloads, I found one instance where Google Drive had apparently missed a single file (a .jpg file) from its download folder. The file was visible on Google Drive but somehow didn't make it into the download .zip file. I tried repeating the download, and again that file was missing for some reason. Maybe it was a problem with the file itself, but I could download that individual file just fine; the file just didn't show up when I downloaded a whole folder containing that file. Maybe there was an issue incorporating that file into a .zip archive for download. Anyway, fortunately, almost all of the folders I downloaded were apparently completely error-free.

As another check of the completeness of downloading Drive files, in March-April 2018, I ran Takeout on three very large Drive folders and then reran Takeout a few days or weeks later. In all cases, the Takeout-exported folders only differed in expected ways, according to the output of diff -r. That is, I didn't see any files unexpectedly missing from one or the other Takeout download, and the files themselves only differed if I had edited them in between downloads or if they contained unique per-download pieces of data. This gives some confidence that Takeout is generally pretty consistent at downloading your data (although this test by itself doesn't measure whether Takeout systematically misses some files).

(rare) Missing most files on a few downloads

In April 2018, I did a comparison of 41 Drive folders downloaded either by

right-click "Download" on the individual folder in Drive, or
Google Takeout of the whole set of folders.

For 40 of these 41 folders, the two download methods returned the same number of files in each folder (although for about half of these folder downloads, the reported disk space on my hard drive differed slightly between the two download methods for reasons I couldn't ascertain). However, for 1 of the 41 folders, the right-click "Download" method returned only about 12,000 files, while Takeout returned around 114,000 files.

This was the only folder of mine that contained more than 10⁵ files, so at first I suspected that the problem was caused by the large number of files in the folder. When I split up this 114,000-file folder into several smaller chunks, all of the chunks successfully downloaded fully, except for one chunk (with ~42,000 files), which still downloaded incompletely (returning only ~5,000 to ~6,000 downloaded files out of the ~42,000). I've successfully downloaded other folders with more than ~42,000 files, so maybe the problem was with this particular folder for some reason, rather than the raw number of files.

Why I avoid desktop sync

Google Drive has an option to sync files with your desktop, and I would guess that this would mitigate some of the annoyances of doing Takeout on Drive folders. However, I personally am afraid of desktop sync because if you or someone else (e.g., a laptop thief) deletes the contents of the synced folder on your desktop, all your cloud files get erased. (Thanks to a friend for this observation.) There's an important slogan that "sync is not backup".

I haven't looked into Google sync, but Dropbox enumerates a number of ways in which file syncing could accidentally erase your data from the cloud:

there are a few other causes of deleted or missing files:

Hard disk problems, including bad disks, or disks that are not ready when the Dropbox program runs. If Dropbox doesn't see known files when it starts, it will assume you removed them while Dropbox wasn't running.

Antivirus software that quarantines files. These programs can remove or block access to files.

Disk scanning software (for example, backup software) that locks out Dropbox and makes it appear as if files are missing.

Programs that modify files in your Home folder, which usually includes the Dropbox folder. Some audio and image management software programs do this.

Multiple linked computers with the same name. Perhaps you gave your PC to someone else to use, and they deleted some files. If this PC has the same name as your new computer, you might not realize a second PC is in the picture.

Scratchpads vs. archive documents

I regard email, Facebook, Slack, and other communication platforms as "scratchpads" for temporary discussion rather than formats for long-term archiving of information. You're unlikely to go back and revisit 99.9% of old conversations, so if such conversations contain important insights or information, you should transfer that information somewhere else for long-term storage while the information is still fresh in your mind.

If the information is of general interest, I try to add it to a page on one of my websites. Long conversations should ideally be distilled into more readable and Googleable blog posts. Private reminders can be stored in my private todo lists. And so on.

I don't depend on others to preserve my writings

Websites rise and fall over the years. Volunteers who maintain websites may abandon their projects as they move on to other phases of life. Companies go out of business. And even companies that don't go out of business may shut down particular websites. For example, in 2019, Google+ shut down and deleted all data on consumer accounts (Google 2019).

In other words, there's no guarantee that a website where you contribute content will still be online in 5, 10, or 25 years—unless that website is your own.

For this reason, I prefer to keep all of my important content on my own websites. I can guarantee that my websites will remain online for as long as I'm alive and healthy (and hopefully longer); I can't guarantee that for websites I don't control. If you contribute substantive, non-throwaway content to a forum or other platform owned by someone else, I recommend making a copy of that content on your own website, in case the original site disappears.

You could alternatively just back up your content to your private files and wait until the original site disappears to go ahead and publish the writings on your own site. This works too, but you might forget or be too lazy to restore your content later on. In some cases this approach may indeed be best, such as with video files, which are difficult to host on your own site (Hesketh 2013-2018).

Personally, because I can't control data on websites I don't own, I prefer to write substantive content primarily on my own sites, rather than on other platforms. Then I can link to this content on other platforms if need be.

Why I upload old school papers

I uploaded many of my old school writings to briantomasik.com. Why? A few reasons:

Some people may find them useful. For example, someone once contacted me to ask if he/she could use one of the documents I have here. (I said yes.) Of course, it could be bad if lazy students find my writings to copy them. Gradually over time, I'm trying to reformat my school assignments to remove the essay prompts, to reduce their Googleability.
Perhaps a tiny fraction of people who find my school papers will browse around other parts of the site and become interested in animal suffering or other more altruistically important topics.
These writings have sentimental value to me, and storing them on my site is an organized way to keep track of these files and make sure they're backed up along with the rest of my website. It's especially nice to have old documents converted to HTML because HTML files are human-readable even without browser rendering and are stored as simple text files, which means the documents will be more future-proof against format rot than if they're stored as, say, Word documents.

Suppose you're browsing through your old files, hoping to delete some of them to declutter your life. When deciding which files to preserve, one possible philosophy is that the documents that are worth saving tend to be those that you could make public and share with others. If you can't share the documents, why hold on to them? What value are they providing?

Of course, there are a few exceptions where it does make sense to preserve non-public documents, such as sensitive financial/personal data, private comments from friends that make you happy to reread, random notes and todo lists, etc.

An additional bonus of having valuable files public on the web is that they're more secure against data loss. For example:

Imagine that you have a severe house fire, and all of your computers, paper documents, flash drives, two-factor-authentication keys, etc. are destroyed. Suppose further that you can't remember the password to your cloud files or that you're locked out by two-factor authentication. If you can't plead with your cloud provider to let you in, your private files may be inaccessible at this point, but public files on your website should still be available. You can download your content from the public web and thereby recreate your website.
Imagine that you suddenly die or become permanently incapacitated, and no one else has access to your accounts. Suppose your cloud providers refuse to let your family into your accounts. If your important files are public on the web, your family and friends can still access them. (Hopefully the files will be downloaded for backup before your domain and hosting expire, causing your site to go offline. If not, most of your website content may still exist on the Wayback Machine.)

Of course, there are other mitigation strategies against these risks besides putting your content publicly online.

(Some of the above discussion was inspired by a comment I read long ago and can't now find.)

Why I love plain text

When I was in high school, I heard two people at school discussing how they enjoyed just writing in Notepad, without all the complexities of Microsoft Word. At the time, I thought to myself that they were being stubborn and ignoring the many benefits of Word, including the possibility of nicer formatting of text. Many years later, I too have come to appreciate the simplicity of writing in plain text, in part based on considerations regarding data preservation.

Text files are plausibly the most future-proof file format (other than, say, printing out text on physical paper). Text files are unlikely to become obsolete any time soon, and they can be easily read in raw form by a human. Wikipedia ("Plain text"): "The best format for storing knowledge persistently is plain text, rather than some binary format." Suchanek (2017): "There is one group of file formats that is completely unproblematic for archiving: plain text formats. These are all file formats that you can open with a text editor such as Notepad, VI, or TextEdit, and that are human-readable. These include TXT-files, code files, LaTeX files, CSV files, TSV files, and the like. There is no particular software needed to read them. They are thus completely safe for archiving." (Suchanek (2017) adds two caveats about this statement, though.)

In contrast, the old Word documents that I created in high school require special software to open, and all the fancy formatting that they contain makes it something of a chore to cleanly transfer the content to another format, such as HTML. (You can export a Word document to HTML, but it may be cluttered with lots of CSS styling that you mostly don't need.)

Fancy file formats are also less immediately amenable to manipulation with Unix utilities, Python scripts, etc. For example:

Suppose you wanted to combine together all your files in a given folder into a single file for printing out on paper. Doing this is extremely easy if the files are in .txt format but requires more code plumbing if the files are in .docx format.
It's easy to diff plain text files. diff'ing binary files is hard unless you export the contents to a text format.

Many cloud services allow you to store data in fancy ways. For example, you might have a todo-list web application that stores each todo item in its own card, with various options for annotating items, setting reminders, and so on. While this might seem great at first, I find that this complexity is often not needed, and a barebones list of todo items written in plain text works at least as well. If you end up leaving a fancy todo-list service, you might be able to export your todo items, but they may be in a complicated format. In contrast, the plain todo list never goes out of style and is easy to read and manipulate in a simple text editor. Text files are trivially easy to back up, while with a web application you have to figure out if and how you can export your data.

Writing text files is like painting on a blank canvas: you can create anything you want, in any style or format you want, limited only by your imagination. You're not limited to interacting with your data in fixed, predefined ways.

When writing plain-text files, you can use Markdown syntax to add basic formatting like _italics_ in a more readable way than writing HTML tags. This demo is a great way to play around with Markdown.

Footnotes

In July 2017, I verified this for myself as follows. I created a test Google Doc that contained a very large volume of text and let the Google Doc save itself. Then I deleted most of the text, leaving only a tiny amount of text, and then let the Google Doc save itself again. Then I ran Google Takeout. The downloaded .docx file was smaller in kilobytes than the amount of text I had added to the Google Doc originally, so the downloaded doc must not have contained the full version-history data anywhere. Moreover, the downloaded .docx was smaller than a compressed version of the original text using 7-Zip's most compact compression, so it's unlikely the .docx file was hiding the version-history content in compressed format. The large volume of text was still visible in the online version history for the Google Doc, though. (back)