Copyright © 2022 Webrecorder CC-BY
This document sets out use cases and requirements for a new packaging format for distributed web archive collections.
This document was published by the Webrecorder Project as part of a grant from the Filecoin Foundation. It is being actively developed and feedback on new use cases is encouraged using GitHub Issues.
This section will include a few paragraphs about what we mean by decentralized web archives and why web archives and their decentralization are important. It is also probably important to delimit the scope: archives of HTTP transactions, not new protocols such as IPFS. But leave open the possibility of interactions between HTTP and other protocols?
The goal of this specification is to provide a portable format for web archives...
This section will provide a sketch of related web packaging work. This should include other approaches to storing HTTP transactions such as: HAR, Web Bundle, ZIM, SinglePage, wpack.
Glossary of terms that are useful for understanding the use cases.
Distinguish between uses and threats here. Give some background on how the use cases were collected.
A researcher wants to save an interactive article they're browsing for later use in their research. They’ve bookmarked the page, but want to ensure they have a copy in case it disappears, changes or becomes paywalled. They use the browser to create a web archive (saved in the browser as well) and then download a local copy of the article as a file just in case. They do not intend to share this article with anyone else.
A creator of a digital humanities project no longer has resources to maintain an aging web application. They use an automated crawler to produce a high fidelity web archive, and host the web archive in their digital repository system without any additional costs, providing access through replayweb.page.
Requirements: List, Search, Description, Automatic, Public, Published
A journalist uses their browser and the Tor network to create archives of primary source material, documenting sources which may be sensitive and may be deleted in the future, including social media posts. They ensure the data is archived and store the file in encrypted storage. The archive indicates that Tor was being used to access the web content.
Requirements: Technical, Description, Encryption, Authenticity, Private, Share
A net artist meticulously browses their interactive art piece to fully archive their work, which includes an old Flash piece. The resulting archive is then donated to a museum for preservation.
Requirements: List, Description, Authenticity, Aggregated, Manual, Public, Share
A crawling service runs repeated crawls of a list of seeds, producing a weekly crawl. Each crawl is stored as its own file. The entire collection of crawls is available for access together, or one crawl at a time as a file. Users can download their archive files at any time.
Requirements: List, Search, Technical, Description, Aggregated, Logs, Automatic, Public, Published
A news site uses the browser to create an archive of a tweet that contains a video they would like to use in a news story, downloading the file and embedding it into their page instead of the live tweet, which may be deleted. The news site hosts the file in their CMS along with other content.
Requirements: Authenticity, Manual, Public, Published
A researcher requests web archive data for a specific domain from a larger national archive, which has been crawling the site for years. The archive has the data in WARCs, indexes in a different system, metadata in its own custom database, and full text search in Solr. The web archive extracts just the requested data from its multi-PB archive, and offers it to the researcher as a file so that they may easily browse the archive using a standard viewer.
Requirements: Technical, Description, Authenticity, Aggregated, Automatic
A lawyer requests web archive data or archives web material as part of an ongoing civil or criminal case. Whether building a defense for the defendant or a case for the plaintiff, the archive must support efficient traversal after the capture process, and support exporting in multiple formats (including paper formats) to be entered as evidence. Additional chain of custody and/or forgery defenses may also be beneficial.
Requirements: List, Search, Technical, Description, Authenticity, Logs, Manual, Automatic
A disinformation platform deploys web archives to help informants produce evidence of coordinated international propaganda campaigns. Web archives are provided as one of many options for “tip submission,” and may be submitted alongside any other form of evidence (screenshot, photograph, text description, blobs shared between apps, etc).
Requirements: Search, Technical, Description, Screenshots, Manual
A website owner retrieves lost website content from one or more web archives and assembles this into a local, consolidated archive (e.g., like Warrick, except saving to a more robust archival format). Ideally, this operation would leverage Memento, both for the best possible coverage and to assemble the most temporally coherent archive.
A user wants to retrieve a bundle of webpages from a web archive (e.g., via batch downloading URLs identified and filtered using the Wayback CDX Server API) so that full-text search can be more easily performed upon the corpus (i.e., where full-text search isn't otherwise offered by the web archive platform).
Court staff archive open web resources cited in opinions issued by that court. The citations in the published opinions link to the archived versions rather than the live web resources. This mitigates content drift and ensures that the court continues to point to the versions of the cited resources that it considers to be canonical for the purposes of the opinion.
Requirements: Authenticity, Manual, Public, Published
Students are given laptops with WACZ versions of web-based curriculum materials and a replay tool preinstalled. Instructors can update materials as needed by using a self-directed archiving tool.
As someone looking at a web archive in the context of a web post (e.g. blog or entry on CMS), I want to see who created the archive (not just who is publishing it) to establish whether I trust them. It would also be useful to see who else trusts the creator.
Requirements: Authenticity, Manual, Automatic
A human rights activist wants to be able to archive a website that is at risk due to political instability, censorship, and war in the area where the website is being served from. The activist would like to repatriate this content to the original publishers and community once the political instability has ceased. Ideally the content could be delivered via the network or on physical media such as an external hard disk or thumb drive.
Requirements: Authenticity, Share, Consent
A member of a harassment community collects the output of a targeted individual. Here, the goal is to develop a richly detailed database about said individual, covering both their own output and material related to them (online criminal records, sensitive photos, previous aliases, abandoned online presences, etc).
Requirements: Search, Aggregated
A state actor is aware of a team of investigative reporters using web archives, and seeks to leverage the format as an unusual vector for deploying web-based attack payloads.
Requirements:
A political operative continually monitors the material produced by their opponents as a form of surveillance. Working solo or as part of a larger organization, they store large collections of material to recall, publish, manipulate or doctor later. Here, metadata and data integrity may not be important; instead they will use the web archive to hit their targets based solely on the content of the media or its potential with minor changes. They may even produce a customised web archive-supporting tool to simplify this process.
Requirements: List, Aggregated, Public, Share
An archivist for an activist organization is responsible for reviewing content that has been archived by an automated data collection process. They open the web archive and are immediately presented with a large amount of traumatic imagery related to the recent death of a fellow activist. This experience continues as they browse through the archive, until they are unable to continue to review the content and they close it, leaving it to the next archivist to process.
Requirements: Description, Authenticity, Screenshots, List
An investigative journalist uses their browser to create archives of a social media post that requires a login. The archive is then published and a malicious actor uses information contained in the archive to create fraudulent social media posts that appear to be authored by the journalist.
Requirements: Authenticity, Published, Security
2.2 Social media user preserves and deletes account
A social media user wants to use their browser to archive their public profile, and make it available on their personal website, before canceling their account.
Requirements: Description, Aggregated, Manual, Public, Published