The Frontiers of Internet Archiving

The internet is vast and constantly changing, making it difficult to preserve information for future generations. To address this issue, several large-scale internet archiving projects have emerged, each with its own approach to preserving digital content. In this article, we will explore the history, storage devices, costs, and challenges of some of the largest internet archiving projects in the world.

The Internet Archive

Founded in 1996 by Brewster Kahle, the Internet Archive is perhaps the best-known internet archiving project. The organization’s mission is to provide “universal access to all knowledge.” To achieve this, they have created a massive digital library that contains over 70 petabytes of data, including websites, books, music, videos, and more.

To store this vast amount of data, the Internet Archive uses a combination of hard drives, solid-state drives, and magnetic tapes. Hard drives are used for more frequently accessed data, while magnetic tapes are used for long-term storage. The cost of this storage is not publicly disclosed, but the Internet Archive operates on a non-profit model and relies heavily on donations to continue its work.

One of the biggest challenges faced by the Internet Archive is the sheer size of the data they are trying to preserve. As the internet grows, so too does the amount of data that needs to be archived. Additionally, the Internet Archive has faced legal challenges over copyright infringement and the distribution of copyrighted materials.

The Wayback Machine

Perhaps the most well-known feature of the Internet Archive is the Wayback Machine. This tool allows users to browse archived versions of websites from as far back as 1996. As of 2023, the Wayback Machine contains over 450 billion web pages.
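As a rough illustration of how an archived page can be looked up by date, the sketch below builds a Wayback Machine snapshot address and a query for the Internet Archive's availability endpoint, then reads the closest capture out of a response. The helper names (snapshot_url, availability_query, closest_snapshot) are made up for this example, and the response layout is the shape assumed here rather than a guaranteed contract. No network request is sent.

```python
from datetime import datetime
from typing import Optional
from urllib.parse import urlencode

# Public address patterns; treat the exact response fields below as an assumption.
WAYBACK_PREFIX = "https://web.archive.org/web"
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def snapshot_url(page_url: str, when: datetime) -> str:
    """Build a Wayback Machine address for the capture nearest to `when`."""
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{timestamp}/{page_url}"


def availability_query(page_url: str, when: datetime) -> str:
    """Build the availability query URL (the request itself is not sent here)."""
    params = {"url": page_url, "timestamp": when.strftime("%Y%m%d")}
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(response: dict) -> Optional[str]:
    """Return the closest capture's URL from a decoded JSON response, if any."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None
```

A caller would fetch the address returned by availability_query with any HTTP client, decode the JSON body, and pass the resulting dictionary to closest_snapshot. A result of None means no capture was reported for that page.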

The Wayback Machine has faced criticism over the years for not being able to archive every website on the internet. Some websites use technology that makes it difficult or impossible for the Wayback Machine to capture their content accurately. Additionally, some website owners have requested that their sites be excluded from the Wayback Machine due to privacy concerns.

To handle the immense volume of archived web pages, the Wayback Machine employs a distributed storage system. It utilizes a combination of commodity servers, network-attached storage (NAS) devices, and content delivery networks (CDNs) to ensure accessibility and reliability. Challenges faced by the Wayback Machine include the continuous crawling and indexing of web pages, dealing with the dynamic nature of websites, and addressing copyright and legal considerations.

The Library of Congress

The Library of Congress is the largest library in the world, and it has been collecting and preserving books, manuscripts, and other physical media for over two centuries. In recent years, the Library of Congress has also turned its attention to digital archiving.

The Library of Congress’ digital preservation program focuses primarily on audiovisual content, including film, television, and sound recordings.

To store this data, the Library of Congress uses a diverse range of storage devices, including enterprise-grade storage arrays and tape libraries. Challenges encountered by the Library of Congress include the selection and prioritization of websites for archiving, ensuring the authenticity and integrity of archived content, and navigating copyright complexities.

Unlike the Internet Archive, the Library of Congress is a government-funded institution. The exact cost of their digital preservation efforts is not publicly disclosed, but it is estimated to be in the tens of millions of dollars per year.

One of the biggest challenges faced by the Library of Congress is the obsolescence of digital formats. As technology advances, older file formats become increasingly difficult to access and read. To address this issue, the Library of Congress launched the National Digital Stewardship Alliance, a consortium that aims to establish best practices for digital preservation.
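Dealing with format obsolescence starts with knowing what format a file actually is, regardless of its name. The sketch below is a minimal illustration of one common approach, matching the first bytes of a file against known signatures. The short signature table and the identify function are illustrative only and do not describe any particular library's tooling. Production tools rely on much larger signature registries.

```python
from typing import Optional

# A deliberately tiny table of well-known leading byte signatures.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"PK\x03\x04", "ZIP container"),
]


def identify(header: bytes) -> Optional[str]:
    """Return a format name if the leading bytes match a known signature."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return None  # unknown format: a candidate for manual review


def identify_file(path: str) -> Optional[str]:
    """Read only the first few bytes of a file and identify its format."""
    with open(path, "rb") as handle:
        return identify(handle.read(16))
```

Files that come back as None are the ones an archivist would want to inspect first, since an unrecognized format is harder to migrate or render later.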

The British Library

The British Library is the national library of the United Kingdom, and it is responsible for preserving the country’s cultural heritage. In recent years, the British Library has turned its attention to digital archiving.

The British Library’s digital preservation program focuses on a wide range of content, including books, manuscripts, and sound recordings. To store this data, the British Library uses a combination of hard drives, magnetic tapes, and optical disks.

The cost of the British Library’s digital preservation efforts is not publicly disclosed. However, the library operates on a government-funded model.

One of the biggest challenges faced by the British Library is the sheer volume of data it is trying to preserve. As more content is created digitally, the library must find ways to store and preserve this information for future generations.

National Digital Archive of Datasets (NDAD)

The National Digital Archive of Datasets, operated by The National Archives in the UK, focuses on preserving significant datasets that are vital for understanding the country’s history and social fabric. NDAD collects and stores datasets from various government departments, agencies, and organizations. The project employs robust data storage systems, including tape libraries and redundant servers, to ensure the long-term preservation and accessibility of the archived datasets. Access to the NDAD collection is available through The National Archives’ dedicated research facilities and online platforms.

The National Digital Information Infrastructure and Preservation Program (NDIIPP), a separate initiative led by the Library of Congress in the United States, employs a combination of storage devices, including redundant arrays of independent disks (RAID), tape libraries, and cloud storage. Challenges faced by the NDIIPP involve managing the complexity of digital preservation workflows, developing effective metadata and indexing systems, and ensuring long-term accessibility and usability of archived content.

Challenges Faced by Internet Archiving Projects

Internet archiving projects encounter several challenges in their mission to preserve digital content. One significant challenge is the ever-changing nature of the web, with websites frequently undergoing updates, redesigns, or even complete removal. This requires archiving projects to develop sophisticated web crawling techniques that can capture and preserve dynamic web content accurately. Additionally, the sheer scale of data involved poses challenges in terms of storage infrastructure, data transfer speeds, and indexing mechanisms. Ensuring the integrity and authenticity of archived content, particularly in the case of government records and historical datasets, presents another significant challenge. Lastly, funding and sustainability remain ongoing concerns for many internet archiving projects, as the cost of maintaining extensive storage infrastructure and developing advanced archiving technologies can be substantial.
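The integrity concern mentioned above is commonly handled with fixity checks: a checksum is recorded for each file when it enters the archive and recomputed later to detect silent corruption or loss. The sketch below is a minimal version of that idea using SHA-256. The function names and the in-memory manifest are assumptions made for the example, not the method of any project described in this article.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Record a checksum for every file under `root`, keyed by relative path."""
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify(root: Path, manifest: Dict[str, str]) -> List[str]:
    """Compare current files against the manifest and list any problems."""
    problems = []
    for name, expected in manifest.items():
        target = root / name
        if not target.is_file():
            problems.append(f"missing: {name}")
        elif sha256_of(target) != expected:
            problems.append(f"changed: {name}")
    return problems
```

In practice the manifest would be stored separately from the data it describes, and verification would run on a schedule. An empty list from verify means every recorded file is present and unchanged.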

Conclusion

Internet archiving projects play a crucial role in preserving our digital heritage. Through their efforts, vast amounts of information are stored and made accessible to future generations. However, these projects face significant challenges, including the exponential growth of data, legal issues surrounding copyright, and the obsolescence of digital formats.

Despite these challenges, the Internet Archive, the Library of Congress, the British Library, and other similar initiatives continue to push the boundaries of digital preservation. Their work ensures that we can look back at the evolution of the internet, access historical websites, and preserve our cultural heritage for years to come.