How does the Library archive websites?

The Library's Web Archiving Section manages the overall program and ensures that selected content is archived and preserved. The Library's goal is to create a reproducible copy of how a site appeared at a particular point in time.

The Library attempts to archive as much of each site as possible, including HTML, CSS, JavaScript, images, PDFs, and audio and video files, to provide context for future researchers. The Library (and its agents) uses special software to download copies of web content and preserve it in a standard format. The crawling tools start with a "seed URL" – for instance, a homepage – and the crawler follows the links it finds, preserving content as it goes. Library staff also add scoping instructions so that the crawler follows links to the seed organization's content hosted on related domains, such as third-party sites, in accordance with permissions policies.
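In outline, the crawl works like a breadth-first traversal of the seed's links. The sketch below is not the Library's software; it is a minimal, hypothetical illustration using only Python's standard library, with a placeholder seed URL and a toy scoping rule that follows links only on the seed's own host (real scoping policies also admit selected related and third-party hosts). A production archiver would also write captures to a standard container format such as WARC rather than hold pages in memory.

```python
import urllib.parse
import urllib.request
from collections import deque
from html.parser import HTMLParser

SEED_URL = "https://example.gov/"  # placeholder seed, not a real collection seed


class LinkExtractor(HTMLParser):
    """Collect href values from anchor tags in a downloaded page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def in_scope(url, seed_host):
    """Toy scoping rule: only follow links on the seed's own host."""
    return urllib.parse.urlparse(url).netloc == seed_host


def crawl(seed_url, max_pages=10):
    seed_host = urllib.parse.urlparse(seed_url).netloc
    queue, seen, archive = deque([seed_url]), {seed_url}, {}

    while queue and len(archive) < max_pages:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                body = resp.read()
        except OSError:
            continue  # unreachable pages are skipped in this sketch

        archive[url] = body  # a real archiver would write WARC records instead

        extractor = LinkExtractor()
        extractor.feed(body.decode("utf-8", errors="replace"))
        for href in extractor.links:
            absolute = urllib.parse.urljoin(url, href)
            if in_scope(absolute, seed_host) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)

    return archive


if __name__ == "__main__":
    pages = crawl(SEED_URL, max_pages=5)
    print(f"Captured {len(pages)} pages")
```

The Library's actual crawls operate at far larger scale and with more nuanced scoping rules than this toy host check, but the basic pattern – fetch a page, preserve it, queue its in-scope links – is the same.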

Archiving is not a perfect process – several technical challenges make some content difficult to preserve. For instance, the Library is currently unable to collect content streamed through third-party web applications, "deep web" or database content requiring user input, data visualizations that dynamically render by querying external data sources, GIS and some interactive maps, and content requiring payment or a subscription for access. In addition, there will always be some websites that take advantage of emerging or unusual technologies that the crawler cannot anticipate. Social media sites and some common publishing platforms can also be difficult to preserve.
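One of these limitations can be made concrete. The hypothetical snippet below shows what a crawler typically captures from a page whose visualization is rendered client-side: the saved markup contains only an empty container and a script reference, while the underlying data would be fetched from an external API only when a browser runs the script, so it never enters the archive.

```python
# Hypothetical capture of a page whose chart is rendered client-side.
# The crawler saves this markup faithfully, but the data behind the chart
# is fetched by the browser at view time and never reaches the archive.
captured_html = """\
<html>
  <body>
    <div id="chart"></div>
    <script src="/static/chart.js"></script>
    <!-- chart.js requests /api/series/latest when the page is viewed -->
  </body>
</html>
"""


def contains_numeric_data(markup: str) -> bool:
    """Naive check: does the captured markup itself hold any data values?"""
    return any(ch.isdigit() for ch in markup)


print(contains_numeric_data(captured_html))  # False: the chart's numbers were never in the HTML
```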


Last Updated: May 01, 2025