Webrecorder Recommendation
Copyright © 2021 Webrecorder CC-BY
WACZ is a media type that allows web archive collections to be packaged and shared on the web as a discrete file. A WACZ file includes all the data needed to render archived content as well as the contextual information required for users to interpret it. Rendering software can obtain this data on demand using HTTP Range requests, without requiring the entire file to be retrieved or mediated by specialized server-side software.
As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.
The key words MAY, MUST, MUST NOT, and SHOULD in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.
This is a stable version of the WACZ standard and is in active use by the Webrecorder project. Please open GitHub issues for questions and suggestions.
This section defines the terms used in this specification and throughout web archives infrastructure. A link to these terms is included whenever they appear in this specification.
This specification defines a directory structure and ZIP format specification for sharing and distributing web archives. ZIP files using this format can be referred to as WACZ (Web Archive Collection Zipped).
This specification provides a portable format for web archives in order to achieve two broad goals:
Social: to provide an interoperable way of sharing web archive collections that includes the contextual information needed for users to interpret and meaningfully interact with them.
Technical: to provide an efficient way to dynamically load small amounts of data from a remotely hosted file on static storage, without requiring the entire file to be downloaded or the intervention of specialized server-side applications.
To use and make sense of a web archive collection, it is necessary to have the archived web content together with contextual information that describes what the collection contains and when and how it was created. The collection also requires a set of entry points or pages to use for browsing the collection.
All of this data needs to be packaged together so that the various pieces can be easily copied and transferred without accidentally separating them. This data package needs to be easily transported from one storage system to another, sent as an attachment in an email, placed on a thumb drive, and hosted by simply serving it at a given URL as a static document, possibly from cloud object storage or a CDN.
Hosting web archives currently requires complex server infrastructure (e.g. a Wayback Machine) to serve WARC data in a way that can be viewed in the browser. The WACZ format provides a storage approach optimized for efficient random access to packaged WARC data, which allows the browser to render a page by fetching only what is needed for that particular page. This is done by leveraging the ZIP format's built-in index to locate the contents of the web archive and its constituent metadata.
WACZ is not designed to replace other web archiving formats. Rather it establishes a file packaging convention for all the data needed by a browser for efficient rendering of a web archive collection, and its contextualization.
The py-wacz repository contains a reference implementation for creating WACZ files from existing WARC files, and validating them. Parts of the specification are also implemented and in use by wabac.js and ReplayWeb.page.
A WACZ object consists of the following:
A datapackage.json file for recording technical and descriptive metadata, as specified in [FRICTIONLESS-DATA-PACKAGE].
An extensible directory and naming convention for web archive data.
A method for bundling the directory layout in a ZIP file.
A WACZ contains a directory structure that holds web archive collection data and MUST conform to the [FRICTIONLESS-DATA-PACKAGE] specification. This directory structure looks like:
├── archive
│   └── data.warc.gz
├── datapackage.json
├── datapackage-digest.json
├── indexes
│   └── index.cdx.gz
└── pages
    └── pages.jsonl
The archive directory MUST contain one or more files in the [WARC] format. The files SHOULD use the .warc file extension unless they are GZIP encoded, in which case they MUST use the .warc.gz file extension.
archive
└── data.warc
The indexes directory MUST include one or more indexes for the WARC data stored in archive. These index files allow clients to efficiently look up a URL to see if it is contained in the WACZ. Index files MUST contain CDXJ data and MAY be gzip compressed [PYWB-CDXJ].
indexes
└── index.cdx.gz
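As a rough illustration of how a client might read one index entry, the sketch below splits a CDXJ line into its parts. It assumes the pywb-style layout of a sort key, a timestamp, and a JSON object separated by single spaces. The sample line and its offset, length, and filename fields are illustrative assumptions, not values defined in this specification.

```python
import json


def parse_cdxj_line(line):
    """Split one CDXJ line into (sort key, timestamp, JSON fields).

    Assumes three space-separated parts, with the JSON object last.
    """
    key, timestamp, fields = line.strip().split(" ", 2)
    return key, timestamp, json.loads(fields)


# Hypothetical index entry; the field names are assumptions for illustration.
sample_line = (
    'com,example)/page 20201007212236 '
    '{"url": "https://www.example.com/page", "offset": "784", '
    '"length": "1228", "filename": "data.warc.gz"}'
)

key, timestamp, fields = parse_cdxj_line(sample_line)
print(key, timestamp, fields["filename"])
```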
The pages/pages.jsonl file MUST be present and include a list of 'Page' objects as [JSON-Lines], where each line MUST contain at least the following properties:
url - a URL for the page
ts - a [RFC3339] datetime string
Each entry in the [JSONL] file MAY contain the following properties to aid in navigating a web archive collection:
title - a string describing the resource
id - an arbitrary identifier for the resource
text - text extracted from the snapshot
size - an integer that represents the number of bytes for the page and all its resources
{"format": "json-pages-1.0", "id": "pages", "title": "All Pages"}
{"id": "1db0ef709a", "url": "https://www.example.com/page", "size": 1256, "ts": "2020-10-07T21:22:36Z", "title": "Example Domain", "text": "Example Domain This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission. More information..."}
{"id": "12304e6ba9", "url": "https://www.example.com/another", "size": 1256, "ts": "2020-10-07T21:23:36Z", "title": "Another Page", "text": "Example Domain This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission. More information..."}
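A reader for this file could be sketched as follows. This is a minimal sketch, not the reference implementation. The read_pages name is hypothetical. It also assumes that a first record carrying a format property is a header describing the page list rather than a page, as in the example above.

```python
import json

REQUIRED_PAGE_PROPERTIES = ("url", "ts")


def read_pages(lines):
    """Yield page records from pages.jsonl lines, checking required properties."""
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumption: a leading record with "format" is a header, not a page.
        if number == 1 and "format" in record:
            continue
        missing = [key for key in REQUIRED_PAGE_PROPERTIES if key not in record]
        if missing:
            raise ValueError(f"line {number}: missing required properties {missing}")
        yield record


sample_lines = [
    '{"format": "json-pages-1.0", "id": "pages", "title": "All Pages"}',
    '{"id": "1db0ef709a", "url": "https://www.example.com/page", '
    '"ts": "2020-10-07T21:22:36Z", "title": "Example Domain"}',
]
pages = list(read_pages(sample_lines))
print(pages[0]["url"])
```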
Each entry in the [JSONL] file MAY contain additional properties as long as they do not interfere with the required properties.
Other [JSONL] files MAY be added to the pages/ directory using the same format. A common use case is to include only the main pages in pages.jsonl, while including additional pages, such as those discovered automatically via a crawl, in another file, e.g. extraPages.jsonl.
The datapackage.json file MUST be present at the root of the WACZ. It serves as the manifest for the web archive and is compliant with the [FRICTIONLESS-DATA-PACKAGE] specification. It MUST contain the following properties:
profile: the string data-package
resources: a list of file names, paths, sizes and fixity for all files contained in the WACZ.
wacz_version: the version of WACZ used, for example 1.1.1
{
  "profile": "data-package",
  "wacz_version": "1.1.1",
  "resources": [
    {
      "name": "pages.jsonl",
      "path": "pages/pages.jsonl",
      "hash": "sha256:8a7fc0d302700bed02294404a627ddbbf0e35487565b1c6181c729dff8d2fff6",
      "bytes": 75
    },
    {
      "name": "data.warc",
      "path": "archive/data.warc",
      "hash": "sha256:0e7101316ba5d4b66f86a371ee615fbd20f9d3f32d32563ed2c829db062f7714",
      "bytes": 11469796
    }
  ]
}
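One way to produce the required properties is sketched below. It assumes SHA-256 with a sha256: prefix, as in the example above, and takes the resource name from the last path segment. The resource_entry and build_datapackage names are hypothetical helpers, not part of py-wacz.

```python
import hashlib
import json


def resource_entry(path, data):
    """Describe one file in the WACZ: name, path, fixity hash and size in bytes."""
    return {
        "name": path.rsplit("/", 1)[-1],
        "path": path,
        "hash": "sha256:" + hashlib.sha256(data).hexdigest(),
        "bytes": len(data),
    }


def build_datapackage(files, wacz_version="1.1.1"):
    """Build a manifest with the required properties from a {path: bytes} mapping."""
    return {
        "profile": "data-package",
        "wacz_version": wacz_version,
        "resources": [resource_entry(path, data) for path, data in sorted(files.items())],
    }


manifest = build_datapackage({"pages/pages.jsonl": b'{"format": "json-pages-1.0"}\n'})
print(json.dumps(manifest, indent=2))
```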
The datapackage.json SHOULD include properties that allow rendering applications to present the user with contextual information about the web archive:
title: a string or one sentence description for the collection
description: a longer description of the archive's contents which MUST be Markdown formatted (plain text is valid Markdown)
created: a [RFC3339] datetime for when the WACZ file was created
modified: a [RFC3339] datetime for when the WACZ file was last modified
software: a description of what software was used to create the WACZ file
mainPageUrl: an optional URL of the main or starting page in the collection to be used for initial replay
mainPageDate: an optional ISO-formatted date of the main or starting page in the collection to be used for initial replay
Other properties from the [FRICTIONLESS-DATA-PACKAGE] specification such as licenses, version, organization, contributors, email MAY be used. Custom properties that do not interfere with pre-existing properties MAY also be used.
A datapackage-digest.json file SHOULD be included in the root of the WACZ so that the datapackage.json manifest, and through it the entire contents of the WACZ, can be verified with a hash. If present the following properties MUST be included:
path: the string "datapackage.json"
hash: a cryptographic hash for the datapackage.json file
{
  "path": "datapackage.json",
  "hash": "sha256:ec1f44ab13e2c94b0ddf66e9673d585ba4a77e6f8c9cc30d8665da434557e885"
}
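The digest check can be sketched as below, assuming the hash value uses an algorithm:hexdigest form as in the example above and that the hash is computed over the exact bytes of datapackage.json as stored in the ZIP. The make_digest and verify_digest names are hypothetical.

```python
import hashlib


def make_digest(datapackage_bytes):
    """Build the datapackage-digest.json content for the given manifest bytes."""
    return {
        "path": "datapackage.json",
        "hash": "sha256:" + hashlib.sha256(datapackage_bytes).hexdigest(),
    }


def verify_digest(digest, datapackage_bytes):
    """Return True if the digest matches the manifest bytes."""
    algorithm, _, expected = digest["hash"].partition(":")
    actual = hashlib.new(algorithm, datapackage_bytes).hexdigest()
    return digest["path"] == "datapackage.json" and actual == expected


manifest_bytes = b'{"profile": "data-package", "wacz_version": "1.1.1", "resources": []}'
digest = make_digest(manifest_bytes)
print(verify_digest(digest, manifest_bytes))
```

Because the resources list in the manifest carries a hash for every other file, checking this one digest and then each resource hash covers the whole package.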
For an approach to recording a cryptographic signature in the datapackage-digest.json in order to assert and prove the authorship of a WACZ, please see WACZ Signing and Verification.
Other files and directories MAY be present in a WACZ as long as they do not interfere with specified files and directories that are used by WACZ. Specifically, custom files and directories MUST NOT be added to the existing WACZ directories: archive, indexes and pages. Additional files MUST be listed in the resources section of datapackage.json to ensure conformance with [FRICTIONLESS-DATA-PACKAGE].
The entire directory structure MUST be stored in a standard [ZIP] file. Already compressed files MUST NOT be compressed again to allow for random access.
archive/ files should be stored in ZIP with 'STORE' mode.
indexes/*.cdx.gz files should be stored in ZIP with 'STORE' mode.
Other files (*.jsonl, *.json, *.idx, *.cdx, *.cdxj) can be stored in the ZIP with either 'DEFLATE' or 'STORE' mode.
A ZIP file that follows this Web Archive Collection format spec MUST use the extension .wacz. Such a file can be referred to as a WACZ file or a WACZ.
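The storage rules above could be applied with Python's standard zipfile module roughly as follows. This is a sketch under assumptions: compression_for and write_wacz are hypothetical helpers, and treating every .gz file as already compressed is a simplification of the rules listed above.

```python
import io
import zipfile


def compression_for(path):
    """Pick the ZIP mode: STORE for archive/ and gzip files, DEFLATE otherwise."""
    if path.startswith("archive/") or path.endswith(".gz"):
        return zipfile.ZIP_STORED
    return zipfile.ZIP_DEFLATED


def write_wacz(target, files):
    """Write a {path: bytes} mapping into a ZIP at target (a path or file object)."""
    with zipfile.ZipFile(target, "w") as archive:
        for path, data in files.items():
            archive.writestr(path, data, compress_type=compression_for(path))


# Placeholder contents only; real WARC and CDXJ data would go here.
buffer = io.BytesIO()
write_wacz(buffer, {
    "archive/data.warc.gz": b"placeholder WARC bytes",
    "indexes/index.cdx.gz": b"placeholder index bytes",
    "pages/pages.jsonl": b'{"format": "json-pages-1.0"}\n',
    "datapackage.json": b"{}",
})
print(len(buffer.getvalue()))
```

Keeping the WARC and gzip index entries uncompressed at the ZIP level means a byte offset inside those files maps directly to a byte offset inside the WACZ.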
The [ZIP] file format provides efficient random access, which means archived web pages can be retrieved efficiently even from large web archive collections without requiring the entire WACZ to be transferred. To achieve this, WACZ clients can read portions of the ZIP file on demand using HTTP Range requests [RFC7233].
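The processing model works as follows. Given a ZIP file, a client can quickly:
read datapackage.json
read pages.jsonl, if any
To look up a given URL the client needs to:
look the URL up in the CDXJ index
use the byte offsets from the index entry to retrieve the corresponding portion of the WARC file enclosed in the WACZ
This approach is being used by ReplayWeb.page.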
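To make the on-demand reading concrete, the sketch below wraps a range-fetching function in a seekable file object so that Python's zipfile can read single entries without loading the whole file. It is an illustration only and not how ReplayWeb.page is implemented. RangeReader and fetch_range are hypothetical names, and the remote WACZ is simulated with an in-memory ZIP so the example runs without a network. A real client would send each range as an HTTP Range header.

```python
import io
import json
import zipfile


def range_header(start, end):
    """Format an inclusive byte range as an HTTP Range header value."""
    return f"bytes={start}-{end}"


class RangeReader(io.RawIOBase):
    """Seekable read-only file whose bytes come from a range-fetching callable."""

    def __init__(self, fetch_range, size):
        super().__init__()
        self._fetch = fetch_range  # fetch_range(start, end) -> bytes, end inclusive
        self._size = size          # total size, e.g. taken from Content-Length
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        base = {io.SEEK_SET: 0, io.SEEK_CUR: self._pos, io.SEEK_END: self._size}[whence]
        self._pos = base + offset
        return self._pos

    def readinto(self, buffer):
        count = min(len(buffer), self._size - self._pos)
        if count <= 0:
            return 0
        buffer[:count] = self._fetch(self._pos, self._pos + count - 1)
        self._pos += count
        return count


# Simulated remote WACZ: a large stored WARC placeholder plus a small manifest.
blob = io.BytesIO()
with zipfile.ZipFile(blob, "w") as archive:
    archive.writestr("archive/data.warc", b"\0" * 200_000)
    archive.writestr("datapackage.json", json.dumps({"profile": "data-package"}))
remote = blob.getvalue()
requested = []  # Range header values a real client would have sent


def fetch_range(start, end):
    requested.append(range_header(start, end))
    return remote[start:end + 1]


with zipfile.ZipFile(RangeReader(fetch_range, len(remote))) as wacz:
    manifest = json.loads(wacz.read("datapackage.json"))

print(manifest, len(requested))
```

Only the end of the ZIP (its central directory) and the manifest entry are fetched. The 200,000-byte WARC placeholder is never requested.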
Because they are ZIP files, WACZ files can be hosted on the web as static files. This allows web archives to be easily maintained over time without relying on complex server-side software, apart from widely available, open source, and well tested web server applications. If desirable, WACZ files can be managed and made accessible using HTTP object stores available from cloud hosting providers, and content delivery networks that geographically position web archives closer to their users. However, there are certain considerations to make when publishing WACZ files.
WACZ clients need to know how large an entire WACZ file is in order to download it prior to rendering, or to read it dynamically. To support this, HTTP responses for WACZ files MUST use the Content-Length HTTP header.
Clients that render WACZ files typically need to be able to fetch content from the WACZ file on demand. For example when displaying archived content for a given URL that URL needs to be looked up in the CDXJ index, and the byte offsets from the index entry are then used to retrieve a portion of a given WARC file that is enclosed in the WACZ.
In order for clients to be able to perform this dynamic retrieval, web servers that publish WACZ files MUST support HTTP range requests [RFC7233]. HTTP responses for WACZ files SHOULD include the Accept-Ranges HTTP header.
WACZ files and their clients MAY be served from the same host name. However, it can be useful to view the web archive from a host name that is distinct from the host name that is publishing the WACZ file. For example, this is the case when publishing WACZ files using a cloud provider's HTTP object storage (e.g. s3.amazonaws.com) and making them viewable at another domain (e.g. example.org). It is also the case when WACZ publishers want to allow their web archives to circulate on the web and be viewable in multiple locations.
For security reasons browsers restrict access to files hosted on a different domain than the website that is trying to load them. In order to support loading from different domains, WACZ files SHOULD be made available using the access-control-allow-origin [CORS] HTTP header.
HTTP responses for WACZ files SHOULD be published with the application/wacz media type.
Given these requirements a minimal HTTP response for a WACZ could look like:
HTTP/2 200
Content-Type: application/wacz
Content-Length: 20961755
Accept-Ranges: bytes
Access-Control-Allow-Origin: *
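A publisher could check a response against these requirements with a small helper like the one below. This is a sketch: check_wacz_response is a hypothetical function that takes the response headers as a plain dictionary and reports which of the requirements above are not met. It does not perform any HTTP request itself.

```python
def check_wacz_response(headers):
    """Return a list of hosting problems found in a WACZ response's headers."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    problems = []
    if not normalized.get("content-length", "").isdigit():
        problems.append("Content-Length is missing or not a number (MUST)")
    if normalized.get("accept-ranges", "").lower() != "bytes":
        problems.append("Accept-Ranges: bytes is missing (SHOULD)")
    if "access-control-allow-origin" not in normalized:
        problems.append("Access-Control-Allow-Origin is missing (SHOULD)")
    media_type = normalized.get("content-type", "").split(";")[0].strip().lower()
    if media_type != "application/wacz":
        problems.append("Content-Type is not application/wacz (SHOULD)")
    return problems


minimal_response = {
    "Content-Type": "application/wacz",
    "Content-Length": "20961755",
    "Accept-Ranges": "bytes",
    "Access-Control-Allow-Origin": "*",
}
print(check_wacz_response(minimal_response))
```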