Martin Klein, a Los Alamos National Laboratory scientist and New Mexico Consortium affiliate, is working on an internet preservation project titled “Developing Bloom Filters for Web Archives’ Holdings”. The project is an international collaboration with the National and University Library in Zagreb, Croatia.
The ultimate goal of Klein’s project is to provide web archives with a software framework for generating Bloom filters. What is a Bloom filter? It can be thought of as a sitemap for a web archive, listing all (or a subset of) the URLs for which the archive holds one or more archival copies. Unlike in a sitemap, however, URL strings are hashed before being ingested into the Bloom filter, so the URLs are not shared in plain text. The filter can then be queried with a URL to check whether the archive holds one or more archival copies of it.
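The hash-then-query idea can be illustrated with a minimal Python sketch. It is not the project's framework: the class name `BloomFilter`, the method names, the filter size, the number of hash functions, and the use of SHA-256 with double hashing are all assumptions chosen for illustration, and the `example.org` URLs are placeholders. Because a Bloom filter is probabilistic, a positive answer may occasionally be a false positive, while a negative answer is always correct.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over URL strings (illustrative sketch only)."""

    def __init__(self, num_bits=8192, num_hashes=4):
        # Both parameters are arbitrary illustrative values, not project settings.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, url):
        # Hash the URL once, then derive several bit positions via
        # double hashing: position_i = (h1 + i * h2) mod num_bits.
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force h2 to be odd
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, url):
        # Only bits are set; the URL string itself is never stored.
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, url):
        # True means "possibly held" (false positives can occur);
        # False means "definitely not held".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(url)
        )


holdings = BloomFilter()
holdings.add("https://example.org/")
holdings.add("https://example.org/about")

print(holdings.might_contain("https://example.org/"))       # True
print(holdings.might_contain("https://example.org/about"))  # True
```

A real deployment would also have to settle details this sketch ignores, such as how URLs are normalized before hashing and how the filter is sized for the number of URLs an archive holds.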
Bloom filters enable archives to passively share the URLs of their archival holdings. Having this information available makes a variety of third-party applications possible, such as federated search services across archives, archive-agnostic cataloguing services for archived web resources, and synchronized crawling efforts that avoid unwanted duplication.
Why is it important to make web archives’ holdings public? Making them public leads to increased use, demonstrates value and relevance to the community, and leads to more funding sources. It also helps to show an overall picture of what is (and is not) archived, which in turn helps optimize crawling efforts by avoiding undesired duplication.
This project is funded by the International Internet Preservation Consortium (IIPC). The mission of the IIPC is to collect and preserve a diverse body of global internet content so that it can be archived and made available in the future. The IIPC strives for the development and use of common tools, techniques and standards for international archives, and supports research with a focus on preserving the internet.
To learn more about this project see: https://netpreserve.org/projects/bloom-filters/
To learn more about the IIPC see their website at: https://netpreserve.org/
Photo above of Martin Klein.