by Brian Tomasik
First written: 2013 and 29 Feb. 2016; last update: 9 Feb. 2018
Summary
This page explains how to export data for backup from Google and Facebook, as well as how I back up my websites.
- 1 Summary
- 2 Introduction
- 3 Website backups
- 4 Backing up all my data
- 5 How to collect data for backup
- 6 (Optional) Geomagnetic storms and EMPs
- 7 My backup schedule
- 8 Quirks and workarounds for Google Takeout
- 9 I don't depend on others to preserve my writings
- 10 Why I love plain text
- 11 Footnotes
Introduction
You've probably had the experience of writing a long, thoughtful comment on a forum, pressing "Submit", and then finding that the comment didn't go through, causing your work to be lost. I find that this is an extremely frustrating experience (and one that has led me to do Ctrl+c on the text of my long comments before submitting them, in case I need to later Ctrl+v to recover the text). Compared with losing a single forum comment, losing your lifetime collection of writings and other content would be many orders of magnitude worse. This is why I'm rather obsessive about backing up content of mine that I value. (Content that I don't care about can be deleted to reduce clutter.)
An important fraction of my extended mind is composed of information on Google, my websites, etc. I expect the risk of losing such data is fairly low, but it might happen if someone hacked the accounts and deleted the data, or if I lost all ability to log in to the accounts, or if I accidentally deleted the accounts, or if the sites had major data-storage failures. Since it's cheap to back up the data, it seems reasonable to do so periodically.
Leo A. Notenboom's "You Can’t Assume the Cloud Has Your Back" is one of many introductions to the importance of backing up your data.
Website backups
My websites are the most dense form of important data that I don't want to lose. One copy of them lives directly on the servers. Most pages also have one or more backups on Internet Archive. I sometimes upload new iterations of particular pages to Dropbox, as a crude form of version control. However, it's helpful to have additional backups besides these.
For additional redundancy, I formerly used Dropmysite, which backed up monthly all the websites I maintained. It also allowed me to download zip copies of each site's WordPress database file (which contains essay texts) and WordPress "uploads" folder (which contains images and PDFs). I then uploaded these to the cloud. Note that a WordPress database .sql file contains potentially sensitive information, so if you do this, you should encrypt the file.
As of 2017, I've stopped using Dropmysite to save money and because I'm now just doing website backups by hand using website-downloading tools: HTTrack on Windows or SiteSucker on Mac. I do these downloads every few months and then upload to the cloud. Because these copies of the websites are downloaded from the public web, they don't contain any sensitive information and so can be stored unencrypted (and even unzipped).
I occasionally send copies of my website backups to a few friends for them to back up so that even if all of my accounts went away simultaneously, someone else would still have the data. Since these sites are downloaded from the public web, there's no need to take precautions to keep the data private.
Finally, every few years, I plan to create paper copies of my websites, as discussed here. I completed my first paper backup of my websites in 2017.
Backing up all my data
Every ~6 months, I do a full backup of my important data, including Google Drive, calendars, etc. If any files have non-public information (such as Social Security numbers or private conversations with friends), you should probably encrypt these files before backing them up.
I store my data both in the cloud and on discs at home for redundancy. This leaves at least one extant copy of the data in the event of a variety of possible disasters: house fire, theft, losing/breaking physical discs, cloud data loss, account hacking, and so on.
How to collect data for backup
This section describes how to pull your data from various places.
Google Takeout allows you to download roughly all of your Google content to a collection of zip files.
If you have multiple Google accounts (e.g., one for an organization that you're involved with), remember to do Takeout for all of them if you have permission to do so.
Google Takeout exports both the contents of a Google Doc (as a .docx file) and its comment threads, including resolved threads (as an .html file). The export doesn't seem to include version history for the Google Doc, even though the online Google Doc does have version history.[a]
YouTube backups are included in Google Takeout, but if you want to back videos up individually, you can download your own videos.
Google Calendar is also included in Google Takeout, but if you want to download it individually, you can go to "Settings" -> "Calendars" -> "Export calendars".
Folders/documents you don't own
I think Google Takeout only exports Google Drive documents for which you're the owner. Unless you transfer ownership, you're the owner of the documents you've created.
What if you want to export all documents in a given folder, not just those you own? Or what if you want to export a whole folder that you don't own? In these cases, you can download a folder by clicking the down arrow on the folder path name -> "Download", which will download the whole folder as a zip file. Make sure you have permission to store the data.
Facebook also has a takeout feature. Click the down arrow in the upper right of any Facebook page -> "Settings" -> "General" -> "Download a copy of your Facebook data."
Unfortunately, Facebook groups discussions aren't backed up in this process. There are tools for backing up Facebook groups that I hope to explore eventually.
You can search for some systematic ways to back up your GitHub content, but if you only have a few repositories, a simple solution is just to clone all of them and then back up those folders.
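As a minimal sketch of the "clone all of them" approach, the Python below builds one plain `git clone` command per repository and, optionally, runs them. The function names, the destination folder, and the example URLs in the usage note are hypothetical; it assumes `git` is installed and that you list your repository URLs by hand.

```python
import subprocess
from pathlib import Path


def clone_commands(repo_urls, dest_dir):
    """Build one `git clone` command per repository URL."""
    dest = Path(dest_dir)
    commands = []
    for url in repo_urls:
        # Name the local folder after the last component of the URL.
        name = url.rstrip("/").split("/")[-1]
        if name.endswith(".git"):
            name = name[: -len(".git")]
        commands.append(["git", "clone", url, str(dest / name)])
    return commands


def clone_all(repo_urls, dest_dir):
    """Run every clone command; requires git on the PATH."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    for command in clone_commands(repo_urls, dest_dir):
        subprocess.run(command, check=True)
```

For example, `clone_all(["https://github.com/example-user/example-repo.git"], "github_backup")` would leave a working copy in `github_backup/example-repo`, which can then be zipped and uploaded like any other folder.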
Stray documents on your desktop
I try to keep all important data in the cloud in case my laptop crashes, but you can also make sure that any data on your desktop that's not already stored with Google gets included in your backup.
(Optional) Geomagnetic storms and EMPs
This discussion has moved to its own page.
My backup schedule
Here's a summary of how I do periodic backups.
Do every 3 or 4 months:
Use SiteSucker to download the latest versions of important websites. I enter the following urls one by one:
- http://reducing-suffering.org/
- http://briantomasik.com/
- http://briantomasik.com/wp-content/uploads/
- http://www.simonknutsson.com/
- https://foundational-research.org/
- https://casparoesterheld.com/
- http://www.wallowinmaya.com/
- http://prioritizationresearch.com/
- http://s-risks.org/
The "wp-content/uploads" part of briantomasik.com is needed in order to force SiteSucker to get that content for some reason.
Once this completes, quickly spot check the downloads to make sure the needed content was retrieved. Zip the folders, name the zip file with the date of download, and upload to the cloud. I also keep at least one copy of my website downloads unencrypted on the cloud just in case zip files for some reason fail to uncompress in the future. Plain text is relatively robust against data corruption.
Let me know if you'd like me to add your site to the list of sites I download periodically.
Every once in a while, I should also export and back up the redirects for my sites.
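The "zip the folders and name the zip file with the date of download" step can be scripted. The sketch below is one possible way to do it with Python's standard library; the function name and the `<folder>_<YYYY-MM-DD>.zip` naming pattern are my assumptions, not necessarily the naming scheme used for the backups described here.

```python
import shutil
from datetime import date
from pathlib import Path


def zip_with_date(folder, out_dir, day=None):
    """Zip `folder` into `out_dir` as <folder name>_<YYYY-MM-DD>.zip."""
    folder = Path(folder).resolve()
    out_dir = Path(out_dir).resolve()
    out_dir.mkdir(parents=True, exist_ok=True)
    day = day or date.today()
    base = out_dir / f"{folder.name}_{day.isoformat()}"
    # make_archive appends ".zip" and returns the path of the archive it wrote.
    archive = shutil.make_archive(
        str(base), "zip", root_dir=str(folder.parent), base_dir=folder.name
    )
    return Path(archive)
```

Calling `zip_with_date("reducing-suffering.org", "backups")` would write something like `backups/reducing-suffering.org_2018-02-09.zip`, with the site folder as the top-level entry inside the archive. Uploading to the cloud is left as a manual step.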
Do every ~6 months:
- Send the latest zip file of the backed-up websites to at least 3 friends for them to back up as well.
- Back up Google data. When doing Google Takeout, I only care about including the following products: Blogger, Calendar, Contacts, and Drive. I also care about YouTube videos, but I back those up in real time whenever I create a new video and thus don't need to do so using Google Takeout from YouTube.[b]
- If I've been active on other platforms recently (such as GitHub), download that content as well.
Currently I don't back up shared Google Drive folders because other people are on top of that.
Do every ~5 years:
Create new paper printouts of my websites.
Quirks and workarounds for Google Takeout
With a default 2 GB size of download files, you may have multiple parts to your Google Takeout download. Unfortunately, these separate chunks may confusingly contain files from a variety of Google products and from a variety of different Drive folders. As someone explained on the Google Drive Help Forum: "I downloaded several folders with subfolders from Google Drive and I ended up with about 10 zip files on my MacBook. When I extract the zip files, I see that every zip file has a few files from each of the folders in my Google Drive."
To keep different products in separate chunks, you could export all the small products first (Blogger, Calendar, Contacts) and then export Drive separately. Better yet, you can use a bigger size for the download files than 2 GB so that everything fits in one download chunk; you may need to download as a .tgz file rather than a .zip file to make this work.
If your data is bigger than the 50 GB limit on how big a download file can be, one solution is to arrange your Drive data into top-level folders that are each less than 50 GB in size. While by default Google Takeout exports your entire Drive, you can also choose to export only one or a few top-level folders at a time. If each top-level folder is within the 50 GB limit, you can do Takeout on one top-level folder at a time, which downloads that whole folder in one piece rather than intermingled with other folders.
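If you do end up with several intermingled chunks, one way to reunite the files is to extract every chunk into the same destination folder. The sketch below does this for .zip parts. It assumes that each part uses the same internal folder layout, so that files from the same Drive folder land next to each other after extraction; I haven't verified that this holds for every Takeout export. The function name is hypothetical.

```python
import zipfile
from pathlib import Path


def merge_takeout_parts(parts_dir, dest_dir):
    """Extract every .zip in `parts_dir` into one shared `dest_dir`.

    Returns the number of files extracted across all parts.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = 0
    for part in sorted(Path(parts_dir).glob("*.zip")):
        with zipfile.ZipFile(part) as archive:
            # Extracting into the same destination merges folders that
            # appear in more than one part (assuming identical layouts).
            archive.extractall(dest)
            extracted += sum(1 for info in archive.infolist() if not info.is_dir())
    return extracted
```

If two parts contain a file at the same path, the later part silently overwrites the earlier one, so it's worth spot checking the result.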
Another annoying quirk of Google Takeout is that (at least as of Feb. 2018) it seems to truncate long file and folder names (those longer than 51 characters, or 55 characters with a 4-character filename extension).[c] While this may be merely annoying for most isolated files, it's a serious problem if you're doing Takeout on a website folder where, for example, you have HTML files that reference images with long file names. For example, your HTML might refer to looooooooooooooooooooooooooooooooooooong_image_file_name.jpg, but Takeout would truncate it to looooooooooooooooooooooooooooooooooooong_image_fil.jpg, which would break the link to the file in the HTML document, causing the image not to be rendered on the web page. For some reason, Google Drive doesn't truncate file names if you right-click and "Download" a folder rather than using Takeout. Unfortunately, folders retrieved with right-click "Download" are limited to 2 GB compressed, so your downloaded files will be split up into messy, intermingled chunks unless you download individual folders that are each less than 2 GB.
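If you have a local copy of a website folder, you can check ahead of time which names would probably be truncated. The sketch below flags any file or folder whose name, not counting the final extension, is longer than 51 characters. That limit is only what I observed as of Feb. 2018, so treat it as an assumption that may no longer hold; the function and constant names are hypothetical.

```python
from pathlib import Path

# Observed limit as of Feb. 2018; an assumption, not a documented guarantee.
MAX_STEM_LENGTH = 51


def names_at_risk(root, max_stem=MAX_STEM_LENGTH):
    """Return paths under `root` whose name, minus extension, exceeds `max_stem`."""
    risky = []
    for path in sorted(Path(root).rglob("*")):
        # Path.stem drops only the last extension, e.g. ".jpg".
        if len(path.stem) > max_stem:
            risky.append(path)
    return risky
```

Any path this returns is a candidate for renaming before upload, or for downloading with right-click "Download" instead of Takeout.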
Drive may make a few other minor modifications of file names when those files are downloaded in a folder, such as changing the following characters to the underscore character (_): question mark (?), ampersand (&), semicolon (;), and apostrophe ('). These modifications occur even if you right-click and "Download" a folder. That said, these changes are generally not too severe in terms of breaking possible HTML links to the relevant website files, since in most cases, few core website file names contain these characters. Plus, these modifications are minor enough that if someone ever did have to restore a website from your website backup, that person could probably notice these file-name modifications and fix the problem.
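To see which file names would change, you can apply the same substitution yourself. The sketch below replaces only the four characters listed above; Drive may well change other characters that I haven't noticed, so this is an approximation rather than a description of Drive's actual rules. The names are hypothetical.

```python
# Characters reported above as being replaced; the real set may be larger.
REPLACED_CHARACTERS = "?&;'"
_TRANSLATION = str.maketrans({ch: "_" for ch in REPLACED_CHARACTERS})


def predicted_download_name(name):
    """Guess the name Drive would give `name` in a folder download."""
    return name.translate(_TRANSLATION)


def names_that_change(names):
    """Map each original name to its predicted name, keeping only changes."""
    changes = {}
    for name in names:
        new_name = predicted_download_name(name)
        if new_name != name:
            changes[name] = new_name
    return changes
```

Running `names_that_change` over the file names in a website folder gives a list of links that might need fixing after a restore.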
Google Drive has an option to sync files with your desktop, and I would guess that this would mitigate the annoyances of doing Takeout on Drive folders. However, I personally am afraid of desktop sync because if you or someone else (e.g., a laptop thief) deletes the contents of the synced folder on your desktop, all your cloud files get erased. (Thanks to a friend for this observation.)
Finally, I'll mention that Google Drive's downloading of files isn't 100% flawless. One time, I downloaded a few tens of Google Drive folders collectively containing many thousands of files. I diffed these files against another copy of the same files that I had. Across those numerous downloads, I found one instance where Google Drive had apparently missed a single file (a .jpg file) from its download folder. The file was visible on Google Drive but somehow didn't make it into the download .zip file. I tried repeating the download, and again that file was missing for some reason. Maybe it was a problem with the file itself, but I could download that individual file just fine; the file just didn't show up when I downloaded a whole folder containing that file. Maybe there was an issue incorporating that file into a .zip archive for download. Anyway, fortunately, almost all of the folders I downloaded were apparently completely error-free.
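A check like this can also be done in Python rather than with diff. The sketch below is a stand-in for the comparison described above, not the exact command I ran: it walks two folder trees with the standard library's filecmp module and lists anything present in the reference copy but absent from the download. It compares names only, not file contents, and the function name is hypothetical.

```python
import filecmp
from pathlib import Path


def missing_from_download(reference_dir, download_dir):
    """List relative paths that exist in `reference_dir` but not in `download_dir`.

    A folder that is missing entirely is reported once, without its contents.
    """
    missing = []

    def walk(comparison, prefix):
        # left_only holds names found only in the reference copy.
        for name in comparison.left_only:
            missing.append((prefix / name).as_posix())
        # Recurse into folders that exist on both sides.
        for name, sub_comparison in comparison.subdirs.items():
            walk(sub_comparison, prefix / name)

    walk(filecmp.dircmp(reference_dir, download_dir), Path("."))
    return sorted(missing)
```

An empty result means every file name in the reference copy also appears in the download.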
I don't depend on others to preserve my writings
Websites rise and fall over the years. Companies go out of business. Volunteers who maintain websites may abandon their projects as they move on to other phases of life. There's no guarantee that a website where you contribute content will still be online in 5, 10, or 25 years—unless that website is your own.
For this reason, I prefer to keep all of my important content on my own websites. I can guarantee that my websites will remain online for as long as I'm alive and well (and hopefully longer); I can't guarantee that for websites I don't control. If you contribute substantive, non-throwaway content to a forum or other platform owned by someone else, I recommend making a copy of that content on your own website, in case the original site disappears.
You could alternatively just back up your content to your private files and wait until the original site disappears to go ahead and publish the writings on your own site. This works too, but you might forget or be too lazy to restore your content later on. In some cases this approach may indeed be best, such as with video files, which are difficult to host on your own site (Hesketh 2013-2018).
Why I love plain text
When I was in high school, I heard two people at school discussing how they enjoyed just writing in Notepad, without all the complexities of Microsoft Word. At the time, I thought to myself that they were being stubborn and ignoring the many benefits of Word, including the possibility of nicer formatting of text. Many years later, I too have come to appreciate the simplicity of writing in plain text, in part based on considerations regarding data preservation.
Text files are plausibly the most future-proof file format (other than, say, printing out text on physical paper). Text files are unlikely to become obsolete any time soon, and they can be easily read in raw form by a human. Wikipedia ("Plain text"): "The best format for storing knowledge persistently is plain text, rather than some binary format." Suchanek (2017): "There is one group of file formats that is completely unproblematic for archiving: plain text formats. These are all file formats that you can open with a text editor such as Notepad, VI, or TextEdit, and that are human-readable. These include TXT-files, code files, LaTeX files, CSV files, TSV files, and the like. There is no particular software needed to read them. They are thus completely safe for archiving." (Suchanek (2017) adds two caveats about this statement, though.)
In contrast, the old Word documents that I created in high school require special software to open, and all the fancy formatting that they contain makes it something of a chore to cleanly transfer the content to another format, such as HTML. (You can export a Word document to HTML, but it may be cluttered with lots of CSS styling that you mostly don't need.)
Fancy file formats are also less immediately amenable to manipulation with Unix utilities, Python scripts, etc. For example, suppose you wanted to combine all the files in a given folder into a single file for printing out on paper. Doing this is extremely easy if the files are in .txt format but requires more code plumbing if the files are in a fancier format.
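To illustrate how little code the plain-text case needs, here is a minimal Python sketch that concatenates every .txt file in a folder into one file. The function name and the "=====" separator line are arbitrary choices of mine, and it assumes the files are UTF-8.

```python
from pathlib import Path


def combine_text_files(folder, out_file):
    """Concatenate all .txt files in `folder` into `out_file`, in name order."""
    out_path = Path(out_file)
    # Skip the output file itself in case it lives in the same folder.
    sources = sorted(
        p for p in Path(folder).glob("*.txt") if p.resolve() != out_path.resolve()
    )
    with out_path.open("w", encoding="utf-8") as out:
        for source in sources:
            out.write(f"===== {source.name} =====\n")
            out.write(source.read_text(encoding="utf-8").rstrip("\n") + "\n\n")
    return len(sources)
```

For example, `combine_text_files("notes", "all_notes.txt")` would write one printable file with a header line before each note.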
Many cloud services allow you to store data in fancy ways. For example, you might have a todo-list web application that stores each todo item in its own card, with various options for annotating items, setting reminders, and so on. While this might seem great at first, I find that this complexity is often not needed, and a barebones list of todo items written in plain text works at least as well. If you end up leaving a fancy todo-list service, you might be able to export your todo items, but they may be in a complicated format. In contrast, the plain todo list never goes out of style and is easy to read and manipulate in a simple text editor. Text files are trivially easy to back up, while with a web application you have to figure out if and how you can export your data.
Writing text files is like painting on a blank canvas: you can create anything you want, in any style or format you want, limited only by your imagination. You're not limited to interacting with your data in fixed, predefined ways.
When writing plain-text files, you can use Markdown syntax to add basic formatting like _italics_ in a more readable way than writing HTML tags. This demo is a great way to play around with Markdown.
- [a] In July 2017, I verified this for myself as follows. I created a test Google Doc that contained a very large volume of text and let the Google Doc save itself. Then I deleted most of the text, leaving only a tiny amount of text, and then let the Google Doc save itself again. Then I ran Google Takeout. The downloaded .docx file was smaller in kilobytes than the amount of text I had added to the Google Doc originally, so the downloaded doc must not have contained the full version-history data anywhere. Moreover, the downloaded .docx was smaller than a compressed version of the original text using 7-Zip's most compact compression, so it's unlikely the .docx file was hiding the version-history content in compressed format. The large volume of text was still visible in the online version history for the Google Doc, though.
- [b] Actually, I do run Google Takeout on my YouTube videos periodically in order to get the metadata about the videos, especially title and description. I upload these files to Google Drive.
However, for the video files themselves, I prefer to store my original copy of the video files rather than using Google Takeout to get the video files. One reason is to ensure that I have two copies of the video files on the cloud rather than only having the single copy on YouTube.
- [c] I verified that this problem occurs for both .zip and .tgz Takeout export formats. I don't think it matters how deep within Drive folders the file is, because I get the same problem even if the Takeout-exported file is in a top-level Drive folder.