by Brian Tomasik
First written: 2016-2017; last update: 23 Jul. 2017

Summary

One way to back up a website—whether your own or someone else's—is to use a tool that downloads the website. Then you can back up the resulting files to the cloud, optical media, etc. This page gives some information on downloading websites using tools like HTTrack and SiteSucker.

HTTrack

On Windows, HTTrack is commonly used to download websites, and it's free. Once you download a site, you can zip its folder and then back that up the way you would any of your other files.

I'm still a novice at HTTrack, but from my experience so far, I've found that it captures only ~90% of a website's individual pages on average. For some websites (like the one you're reading now), HTTrack seems to capture everything, but for other sites, it misses some pages. Maybe this is because of complications with redirects? I'm not sure. Still, ~90% backup is much better than 0%.

You can verify which pages got backed up by opening the domain's index.html file from HTTrack's download folder and browsing around using the files on your hard drive. It's best to disconnect from the Internet while doing this: when I browsed the downloaded files while online, some pages loaded from the Internet rather than from the local files that I was testing.

Pictures don't seem to load offline, but you can check that they're still being downloaded. For example, for WordPress site downloads, look at the \wp-content\uploads folder.
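As a rough complement to browsing by hand, the following Python sketch counts the files in a download folder by extension, so you can see at a glance how many HTML pages and images were saved. It is only an illustration: the function name `count_by_extension` is my own invention and not part of HTTrack.

```python
from collections import Counter
from pathlib import Path


def count_by_extension(root):
    """Count the files under `root` (recursively), grouped by lowercase extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Lowercase the suffix so ".JPG" and ".jpg" are counted together.
            counts[path.suffix.lower()] += 1
    return counts
```

For a WordPress download, you could call this on the wp-content/uploads folder inside the download folder (the exact location depends on where you told HTTrack to save the site) and check that the counts for extensions like .jpg and .png are nonzero.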

I won't explain the full how-to steps of using HTTrack, but below are two problems that I ran into.

Troubleshooting: gets too many pages

When I used HTTrack's default settings (as of Nov. 2016) to download a single website, I got the website but also stray files from other domains, presumably reached through links on the main domain. In some cases, the number of links that the program tried to download kept growing, and I had to cancel. To download files only from the desired domain, I had to do the following.

Step 1: Specify the domain(s) to download (as I had already been doing).

Step 2: Add a Scan Rules pattern like this: +https://*animalcharityevaluators.org/* . This way, only links on that domain will be downloaded.

Including a * before the main domain name is useful in case the site has subdomains. For example, the site https://animalcharityevaluators.org/ has a subdomain http://researchfund.animalcharityevaluators.org/ , which would be missed if you only used the pattern +https://animalcharityevaluators.org/* .
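To illustrate why the leading * matters, here is a small Python sketch that uses the standard library's fnmatch as a stand-in for wildcard matching. This is not HTTrack's own matcher, and its rules may differ in details. As a simplifying assumption, the sketch compares only the host and path and ignores the http/https scheme. The helper name `matches` is hypothetical.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Patterns from the text, with the scheme stripped (an assumption of this sketch).
WILDCARD = "*animalcharityevaluators.org/*"
EXACT = "animalcharityevaluators.org/*"


def matches(url, pattern):
    """Return True if the URL's host and path match a shell-style wildcard pattern."""
    parts = urlsplit(url)
    target = f"{parts.hostname}{parts.path or '/'}"
    return fnmatchcase(target, pattern)


for url in [
    "https://animalcharityevaluators.org/",
    "http://researchfund.animalcharityevaluators.org/",
    "https://example.com/page",
]:
    print(url, matches(url, WILDCARD), matches(url, EXACT))
```

Under these fnmatch semantics, the subdomain matches only the pattern with the leading *. One caveat of this kind of wildcard is that the * can match any prefix, so a different domain whose name merely ends in "animalcharityevaluators.org" would match too.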

Troubleshooting: Error: "Forbidden" (403)

Some pages gave me a "Forbidden" error, which prevented any content from being downloaded. I was able to fix this by clicking on "Set options...", choosing the "Browser ID" tab, and then changing "Browser 'Identity'" from the default of "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)" to "Java1.1.4". I chose the Java identity because it didn't contain the substring "HTTrack", which may have been the reason I was being blocked.
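The "Browser ID" setting corresponds to the User-Agent header that a client sends with each HTTP request. As a minimal sketch of the same idea outside HTTrack, here is how a request with a custom User-Agent can be built using Python's standard library. The helper name `build_request` and the example URL are placeholders of mine, and whether a given server accepts any particular identity is something you would have to test.

```python
import urllib.request


def build_request(url, user_agent="Java1.1.4"):
    """Build (but do not send) an HTTP request with a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request("https://example.com/")
# urllib stores header names in capitalized form, hence "User-agent" here.
print(req.get_header("User-agent"))
```

Passing the request to urllib.request.urlopen would actually send it. A server that blocks on the User-Agent string would then see "Java1.1.4" instead of Python's default identity.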

SiteSucker

On Mac, I download websites using SiteSucker. A separate page gives the configuration details that I use when downloading certain sites.

Compress archived websites?

Once you've downloaded a website using HTTrack or similar software, should you compress the website folder before backing it up to the cloud? I'm uncertain and would appreciate reader feedback, but here are some considerations.

My impression is that plain text files (such as raw HTML files) are more robust against format rot and bit rot. One description of text files says: "They avoid some of the problems encountered with other file formats, such as endianness, padding bytes, or differences in the number of bytes in a machine word. Further, when data corruption occurs in a text file, it is often easier to recover and continue processing the remaining contents." A Reddit comment says: "Straight up txt files have a very low structural scope / over head, so unless you're doing something funky, a bit error is limited to a character byte."

As a result, I plan to back up my own websites and other important sites mostly as uncompressed files (with some compressed copies thrown into the mix too). However, when backing up lots of other websites that are less essential, compression may make sense. This is especially so if the website download has a lot of redundancy. Following is an example.

Compression example with duplicate content

In 2017, I downloaded www.mattball.org using SiteSucker. With the default SiteSucker settings, the download had a huge amount of redundancy, because each comment on a blog post had its own URL and thus downloaded the blog post again. For example, on a blog post with 7 comments, I got 8 copies of the blog HTML: 1 from the original post, and 1 from each of the 7 comment URLs. The website download also included an enormous number of search pages. Probably I could prevent these copies from downloading with some jiggering of the settings, but I want to be able to download lots of sites with minimal per-site configuration, and I'm not sure that URL-exclusion rules that I might apply in this case would work elsewhere.

In principle, compression can minimize the burden of duplicate content. Does it in practice? During the www.mattball.org download, I checked to see that the raw content downloaded so far occupied ~450 MB. Applying "Normal" zip compression using Keka software gave a zip archive of 88 MB, which is about 1/5 the uncompressed size. Not bad. However, a "Normal" 7z archive of the raw data was only 1.6 MB—a little more than 1/300th of the uncompressed size!
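The ratios above can be checked with a few lines of Python. The sizes are the ones reported in the text (the raw size is approximate), and the helper name `reduction_factor` is just for illustration.

```python
# Sizes reported in the text, in megabytes.
raw_mb = 450.0     # approximate ("~450 MB")
zip_mb = 88.0
sevenz_mb = 1.6


def reduction_factor(original, compressed):
    """How many times smaller the archive is than the original data."""
    return original / compressed


print(f"zip: about 1/{reduction_factor(raw_mb, zip_mb):.1f} of the raw size")
print(f"7z:  about 1/{reduction_factor(raw_mb, sevenz_mb):.0f} of the raw size")
```

The zip archive comes out at roughly 1/5 of the raw size, and the 7z archive at roughly 1/281, which is slightly larger than 1/300.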

Using a simple test folder with two copies of a file, I verified that zip compression doesn't detect duplicate files, but 7z compression does. Presumably this explains the dramatic size reduction using 7z: zip compresses each file separately, whereas 7z can compress multiple files together as one "solid" stream, so content repeated across files can be stored once. Another person found the same: "You might expect that ZIP is smart enough to figure out this is repeating data and use only one compression object inside the .zip, but this is not the case[....] Basically most such utilities behave similarly (tar.gz, tar.bz2, rar in solid mode) - only 7zip caught me [...]."
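A test along these lines can be approximated with Python's standard library. This is only a sketch under stated assumptions: the standard library has no 7z writer, so a .tar.xz archive stands in for 7z here (xz uses LZMA-family compression over the whole archive stream, which is also what 7z archives typically use). The file names and the 256 KiB random payload are made up for illustration.

```python
import io
import os
import tarfile
import zipfile


def archive_sizes(payload):
    """Return (zip_size, tar_xz_size) for archives holding two copies of `payload`."""
    names = ["copy1.bin", "copy2.bin"]

    # Zip compresses each member on its own, so the second copy is stored in full.
    zip_buf = io.BytesIO()
    with zipfile.ZipFile(zip_buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name in names:
            zf.writestr(name, payload)

    # tar.xz compresses the whole tar stream at once, so the second copy can be
    # encoded as references back to the first (if it fits in the dictionary).
    tar_buf = io.BytesIO()
    with tarfile.open(fileobj=tar_buf, mode="w:xz") as tf:
        for name in names:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tf.addfile(info, io.BytesIO(payload))

    return len(zip_buf.getvalue()), len(tar_buf.getvalue())


payload = os.urandom(256 * 1024)  # random bytes: incompressible on their own
zip_size, xz_size = archive_sizes(payload)
print(f"payload: {len(payload)} bytes, zip: {zip_size} bytes, tar.xz: {xz_size} bytes")
```

Because the payload is random, neither format can shrink a single copy. The zip archive should therefore come out at about twice the payload size, while the tar.xz archive should be only slightly larger than one copy.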

Security concerns?

Is it dangerous to download websites because you might make a request to a dangerous URL? I'm still exploring this topic and would like advice.

My tentative guess is that the risk is low if you only download web pages from a given (trustworthy) domain. If you also download pages on other domains that are linked from the first domain, perhaps there's more risk?

HTTrack's FAQ says: "You may encounter websites which were corrupted by viruses, and downloading data on these websites might be dangerous if you execute downloaded executables, or if embedded pages contain infected material (as dangerous as if using a regular Browser). Always ensure that websites you are crawling are safe."

This page says: "SiteSucker totally ignores JavaScript. Any link specified within JavaScript will not be seen by SiteSucker and will not be downloaded." Does this help with security? How much?