Preserving web content never really left my mind ever since taking screenshots of old sites and putting them in my personal museum. The Internet Archive’s Wayback Machine is a wonderful tool that currently stores 748 billion webpage snapshots over time, including dozens of my own webdesign attempts, dating back to 2001. But that data is not in our hands.
Should it be? It should. Ruben says: archive it if you care about it:
The only way to be sure you can read, listen to, or watch stuff you care about is to archive it. Read a tutorial about yt-dlp for videos. Download webcomics. Archive podcast episodes.
This should include websites (and mirrors of your favorite git repos)! And I’m not talking about “clipping” (a portion of) a single page to your Evernote-esque scrapbook tool, but about a proper archive of the whole website. It seems that there are already tools for that, such as ArchiveBox, which crawls and downloads using a browser engine, or Archivarix, an online Wayback Machine downloader, or even just using wget, as David Heinemann suggests.
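As a rough sketch of the wget route, the wrapper below builds a mirroring command. The function names and the default output directory are hypothetical; the flags are standard GNU wget options, but which ones you need depends on the site, so treat this as a starting point rather than the recommended recipe.

```python
import subprocess


def build_mirror_command(url, target_dir="archive"):
    """Build a wget command line that mirrors a site for offline reading."""
    return [
        "wget",
        "--mirror",            # recursive download with timestamping
        "--convert-links",     # rewrite links so they work from local files
        "--adjust-extension",  # add .html/.css extensions where they are missing
        "--page-requisites",   # also fetch images, stylesheets, and scripts
        "--no-parent",         # never climb above the starting directory
        f"--directory-prefix={target_dir}",
        url,
    ]


def mirror(url, target_dir="archive"):
    """Run wget and return its exit code; wget must be installed and on the PATH."""
    return subprocess.run(build_mirror_command(url, target_dir), check=False).returncode


if __name__ == "__main__":
    # Only print the command here; call mirror() to actually download.
    print(" ".join(build_mirror_command("https://example.com/")))
```

Keeping the command construction separate from the call makes it easy to inspect what will run before pointing it at a large site.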
The problems I’ve encountered with personal Wayback Machine snapshots are:
- Proprietary frameworks and their fleeting popularity; e.g. Flash or Applet embeds that break;
- Wayback Machine doesn’t seem to be very fond of correctly preserving all images: CSS backgrounds or .php scripts that embed watermarks in images don’t make it into the archive;
- Binaries are lost. I loved sharing levels or savegames, did not archive everything myself locally, and neither did Archive.org.
To combat these problems, Jeff Huang came up with seven guidelines to design webpages to last:
- Return to vanilla HTML/CSS. See above. Many of my old Wayback snapshots now display a “Something’s wrong with the database, contact me!” message.
- Don’t minimize that HTML. Minifying adds a build step to your workflow that will probably not survive 10+ years.
- Prefer one page over several. Not sure if I agree, but a one-pager is definitely easier to save.
- End all forms of hotlinking. <link/> only to your own local stuff.
- Stick to native fonts. I do ignore this rule: if the font is lost, the content isn’t, and I won’t care.
- Obsessively compress your images. Low Tech Magazine even uses dithering to great effect.
- Eliminate the broken URL risk by using monitoring to check for dead links.
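The hotlinking guideline is one of the easier ones to check automatically. Below is a minimal sketch, using only Python's standard library, that lists resources a page loads from other hosts. The class and function names and the tag table are my own assumptions, and the check is a heuristic: it will also flag things like an external canonical <link>.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Tags that load a resource into the page, and the attribute holding its URL.
# Plain <a href> navigation is deliberately left out.
RESOURCE_ATTRS = {
    "img": "src",
    "script": "src",
    "link": "href",
    "iframe": "src",
    "source": "src",
}


class HotlinkFinder(HTMLParser):
    """Collect (tag, url) pairs for resources served from a foreign host."""

    def __init__(self, own_host):
        super().__init__()
        self.own_host = own_host
        self.hotlinks = []

    def handle_starttag(self, tag, attrs):
        wanted = RESOURCE_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name != wanted or not value:
                continue
            host = urlparse(value).netloc  # empty for relative URLs
            if host and host != self.own_host:
                self.hotlinks.append((tag, value))


def find_hotlinks(html, own_host):
    """Return every resource in html that is not served from own_host."""
    finder = HotlinkFinder(own_host)
    finder.feed(html)
    finder.close()
    return finder.hotlinks
```

Running find_hotlinks over each generated page before publishing would show which images, scripts, or stylesheets need a local copy.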
While writing this article, I explored how others use the Wayback Machine, but surprisingly few seem to mention that they regularly back up their own website, either by saving their own build artifacts somewhere or by leveraging the Wayback Machine. David Mead suggested including a personalized Wayback Machine link in your 404 page, which sounds good but doesn’t really help towards carefully preserving your stuff.
So I wondered: can we self-host the Wayback Machine? Someone on a “datahoarder” subreddit asked that very same question 2 years ago, but never received a reply. I think ArchiveBox comes very close! It has a docker-compose script, so it is dead easy to throw on our NAS. However, this creates another potential problem: will that piece of software still work after 10, 20, or 30 years? The source code is on GitHub, and internally it uses trend-sensitive packages like Django, so you’re still better off simply archiving static HTML yourself, given you’ve got control over the source.
Except that with ArchiveBox, you can archive any website. And you can tell it to archive the same site every week. And it has a clear strategy laid out towards long-term usage. If what you’re looking to download doesn’t exist anymore, I guess your only option is a Wayback extractor like Archivarix (of which the free tier does not save CSS). Wayback comes with APIs, and wrappers around them expose one called “SavePageNow”, but that tells Wayback to archive a page, not to locally download (or what I’d call save) it. Bummer.
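For a single page, a local download can be pieced together from the public Wayback endpoints. The sketch below asks the availability API for the closest snapshot and then fetches it. It rests on two assumptions of mine: the JSON shape shown in the comments, and the "id_" URL modifier returning the page without Wayback's toolbar and link rewriting. All function names are hypothetical, and this saves one page, not a whole site.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(page_url, timestamp=None):
    """Build the availability query; timestamp is YYYYMMDD or longer."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response_text):
    """Return (timestamp, snapshot_url) from the API response, or None.

    Assumed shape: {"archived_snapshots": {"closest": {"available": true,
    "timestamp": "...", "url": "..."}}}
    """
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["timestamp"], closest["url"]


def raw_snapshot_url(timestamp, page_url):
    """URL of the archived bytes; "id_" is assumed to skip Wayback's rewriting."""
    return f"https://web.archive.org/web/{timestamp}id_/{page_url}"


def download_closest(page_url, timestamp=None):
    """Fetch the closest archived copy of page_url as bytes, or None if absent."""
    with urlopen(availability_url(page_url, timestamp), timeout=30) as resp:
        found = closest_snapshot(resp.read().decode("utf-8"))
    if found is None:
        return None
    with urlopen(raw_snapshot_url(found[0], page_url), timeout=30) as resp:
        return resp.read()
```

Images and stylesheets would each need the same treatment, which is roughly the work a dedicated extractor does for you.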
Check out the Web Archiving Community Wiki if you’re interested in more lists of archiving software. I was pleasantly surprised by the amount of existing software and people actively involved in this initiative.
By the way, by limiting our understanding of “archiving webpages” to the HTTP protocol, we’re also ignoring thousands of Gemini and Gopher ones.
The Wayback Machine’s timeline, from which you can pick snapshots, is nice to interact with; it gives an immediate idea of the frequency of archival. Once you select a certain snapshot, it cries “it’s alive!” and serves you the site. What’s missing, though, is screenshots: sometimes it fails to render the site or times out, or it doesn’t have any snapshot stored at all. I think that’s what I tried to do with my personal museum. Unfortunately, even though I have the source, some websites are impossible to revive: either I miss the DB files or I don’t have the right ancient framework versions anymore (and even those are becoming hard to find).
Another fun experiment: here are old bookmarks from 2007. Try randomly clicking on a few of those. 404? Yes? No? I tried creating a script that maps each link to its HTTP response status code, but that doesn’t work: many still return a 200 yet are suddenly infested with smileys, rifle clubs, and other spam junk because the domain was hijacked, or the page just states “database error” (still a 200? Cool!) or “we will return!”. Less than 20% of those links are still fully accessible 15 years later, and those are probably the Amazons.
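A slightly less naive checker could look at the body as well as the status code. This is only a sketch: the phrase list comes from the failure messages quoted above and is purely illustrative, the function names and User-Agent string are made up, and a hijacked domain full of spam will still slip through as "alive".

```python
import re
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Illustrative phrases that suggest a page is broken despite a 200 status.
SOFT_404_PATTERNS = [
    r"database error",
    r"we will return",
]


def classify(status, body):
    """Turn a status code (or None) and page text into a rough verdict."""
    if status is None:
        return "unreachable"
    if status >= 400:
        return "dead"
    text = body.lower()
    if any(re.search(pattern, text) for pattern in SOFT_404_PATTERNS):
        return "suspicious"
    return "alive"


def check(url, timeout=15):
    """Fetch url and classify it; network failures count as unreachable."""
    request = Request(url, headers={"User-Agent": "bookmark-checker/0.1"})
    try:
        with urlopen(request, timeout=timeout) as resp:
            # Read only the first 200 kB; enough to spot an error banner.
            body = resp.read(200_000).decode("utf-8", errors="replace")
            return classify(resp.status, body)
    except HTTPError as err:
        return classify(err.code, "")
    except OSError:
        # Covers URLError, DNS failures, refused connections, and timeouts.
        return classify(None, "")
```

Anything marked "suspicious" still needs a human look, which is probably the honest conclusion of the experiment anyway.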
I’m not sure where this thought experiment is going, but I am sure that Ruben is right: archive it if you care about it.