This section defines the terms used in this specification and throughout web archives infrastructure. A link to these terms is included whenever they appear in this specification.
```
├── archive
│   └── data.warc.gz
├── datapackage.json
├── datapackage-digest.json
├── indexes
│   └── index.cdx.gz
└── pages
    └── pages.jsonl
```

## Directories and Files

### archive

The `archive` directory MUST contain one or more files in the [[WARC]] format. The files SHOULD use the `.warc` file extension unless they are GZIP encoded in which case they MUST use the `.warc.gz` file extension.
```
archive
└── data.warc
```

### indexes

The `indexes` directory MUST include one or more indexes for the WARC data stored in `archive`. These index files allow clients to efficiently look up a URL to see if it is contained in the WACZ. Index files MUST contain CDXJ data and MAY be gzip compressed [[PYWB-CDXJ]].
```
indexes
└── index.cdx.gz
```

### pages.jsonl

The `pages/pages.jsonl` file MUST be present and include a list of 'Page' objects as [[JSON-Lines]], where each line MUST contain at least the following properties:

- `url` - a URL for the page
- `ts` - a [[RFC3339]] datetime string

Each entry in the [[JSONL]] file MAY contain the following properties to aid in navigating a web archive collection:

- `title` - a string describing the resource
- `id` - an arbitrary identifier for the resource
- `text` - text extracted from the snapshot
- `size` - an integer that represents the number of bytes for the page and all its resources
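As a rough illustration of the required properties, the following sketch checks the lines of a `pages.jsonl` body. The function name `validate_pages` is hypothetical and not part of this specification. Two assumptions are made: a first line carrying a `format` property is treated as a header rather than a page (as in the example in this section), and `datetime.fromisoformat` is used as an approximation of an [[RFC3339]] check, not a full validator.

```python
import json
from datetime import datetime

REQUIRED = ("url", "ts")


def validate_pages(lines):
    """Return (line_number, problem) tuples for the lines of a pages.jsonl body."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        entry = json.loads(line)
        # Assumption: a first line with a "format" property is a header, not a page.
        if number == 1 and "format" in entry:
            continue
        for key in REQUIRED:
            if key not in entry:
                problems.append((number, f"missing {key}"))
        if "ts" in entry:
            ts = entry["ts"]
            if not isinstance(ts, str):
                problems.append((number, "ts is not a string"))
                continue
            try:
                # Older Python versions reject a trailing "Z", so normalise it.
                datetime.fromisoformat(ts.replace("Z", "+00:00"))
            except ValueError:
                problems.append((number, "ts is not a valid datetime"))
    return problems
```

A caller might pass the result of `open("pages/pages.jsonl", encoding="utf-8")` directly, since the function only iterates over lines.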
```
{"format": "json-pages-1.0", "id": "pages", "title": "All Pages"}
{"id": "1db0ef709a", "url": "https://www.example.com/page", "ts": "2020-10-07T21:22:36Z", "title": "Example Domain"}
{"id": "12304e6ba9", "url": "https://www.example.com/another", "ts": "2020-10-07T21:23:36Z", "title": "Another Page"}
```

Each entry in the [[JSONL]] file MAY contain additional properties as long as they do not interfere with the required properties.

Other [[JSONL]] files using the same format MAY be added to the `pages/` directory. A common use case is to include only the main pages in `pages.jsonl`, while placing additional pages, such as those discovered automatically via a crawl, in another file, e.g. `extraPages.jsonl`.

### datapackage.json

The `datapackage.json` file MUST be present at the root of the WACZ. It serves as the manifest for the web archive and is compliant with the [[FRICTIONLESS-DATA-PACKAGE]] specification. It MUST contain the following properties:

- `profile`: the string `data-package`
- `resources`: a list of file names, paths, sizes and fixity for all files contained in the WACZ.
- `wacz_version`: the version of WACZ used, for example `1.1.1`
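To make the required properties concrete, here is a minimal sketch of how a manifest might be assembled from in-memory file contents. The helper names `resource_entry` and `build_datapackage` are hypothetical. The per-resource property names (`name`, `path`, `hash`, `bytes`) and the `sha256:` prefix are taken from the example in this section; the use of SHA-256 for every file is an assumption.

```python
import hashlib


def resource_entry(path, data):
    """Describe one file of the WACZ as a `resources` entry."""
    return {
        "name": path.rsplit("/", 1)[-1],  # file name without its directory
        "path": path,
        "hash": "sha256:" + hashlib.sha256(data).hexdigest(),
        "bytes": len(data),
    }


def build_datapackage(files, wacz_version="1.1.1"):
    """Build the required part of datapackage.json from a {path: bytes} mapping."""
    return {
        "profile": "data-package",
        "wacz_version": wacz_version,
        "resources": [resource_entry(path, data) for path, data in files.items()],
    }
```

The returned dictionary can be serialised with `json.dumps` and written to `datapackage.json` at the root of the WACZ.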
```
{
  "profile": "data-package",
  "wacz_version": "1.1.1",
  "resources": [
    {
      "name": "pages.jsonl",
      "path": "pages/pages.jsonl",
      "hash": "sha256:8a7fc0d302700bed02294404a627ddbbf0e35487565b1c6181c729dff8d2fff6",
      "bytes": 75
    },
    {
      "name": "data.warc",
      "path": "archive/data.warc",
      "hash": "sha256:0e7101316ba5d4b66f86a371ee615fbd20f9d3f32d32563ed2c829db062f7714",
      "bytes": 11469796
    }
  ]
}
```

The `datapackage.json` SHOULD include properties that allow rendering applications to present the user with contextual information about the web archive:

- `title`: a string or one sentence description for the collection
- `description`: a longer description of the archive's contents which MUST be Markdown formatted (plain text is valid Markdown)
- `created`: a [[RFC3339]] datetime for when the WACZ file was created
- `modified`: a [[RFC3339]] datetime for when the WACZ file was last modified
- `software`: a description of what software was used to create the WACZ file
- `mainPageUrl`: an optional URL of the main or starting page in the collection to be used for initial replay
- `mainPageDate`: an optional ISO-formatted date of the main or starting page in the collection to be used for initial replay

Other properties from the [[FRICTIONLESS-DATA-PACKAGE]] specification such as `licenses`, `version`, `organization`, `contributors`, `email` MAY be used. Custom properties that do not interfere with pre-existing properties MAY also be used.

### datapackage-digest.json

A `datapackage-digest.json` file SHOULD be included in the root of the WACZ so that the `datapackage.json` manifest, and through the hashes it lists the entire contents of the WACZ, can be verified with a single hash. If present, the following properties MUST be included:

* `path`: the string "datapackage.json"
* `hash`: a cryptographic hash for the `datapackage.json` file
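The two digest properties can be produced and checked with a few lines of code. This is only a sketch: the function names `make_digest` and `verify_digest` are hypothetical, and the `algorithm:hexdigest` layout of the `hash` value is inferred from the `sha256:` prefix used in the examples of this section.

```python
import hashlib


def make_digest(datapackage_bytes):
    """Build the content of datapackage-digest.json for the given manifest bytes."""
    return {
        "path": "datapackage.json",
        "hash": "sha256:" + hashlib.sha256(datapackage_bytes).hexdigest(),
    }


def verify_digest(digest, datapackage_bytes):
    """Return True if the digest matches the exact bytes of datapackage.json."""
    # Assumption: the hash value has the form "<algorithm>:<hex digest>".
    algorithm, _, expected = digest["hash"].partition(":")
    actual = hashlib.new(algorithm, datapackage_bytes).hexdigest()
    return digest["path"] == "datapackage.json" and actual == expected
```

Because the hash covers the exact bytes of the manifest, the digest has to be computed from the serialised file as stored in the WACZ, not from a re-serialised copy.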
```
{
  "path": "datapackage.json",
  "hash": "sha256:ec1f44ab13e2c94b0ddf66e9673d585ba4a77e6f8c9cc30d8665da434557e885"
}
```

For an approach to recording a cryptographic signature in the `datapackage-digest.json` in order to assert and prove the authorship of a WACZ, please see [WACZ Signing and Verification](/wacz-auth/latest/).

## Other files and directories

Other files and directories MAY be present in a WACZ as long as they do not interfere with specified files and directories that are used by WACZ. Specifically, custom files and directories MUST NOT be added to the existing WACZ directories, `archive`, `indexes` and `pages`. Additional files MUST be listed in the resources section of `datapackage.json` to ensure conformance with [[FRICTIONLESS-DATA-PACKAGE]].

## Zip Format

The entire directory structure MUST be stored in a standard [[ZIP]] file.

### Zip Compression

Already compressed files MUST NOT be compressed again to allow for random access.

- All `archive/` files should be stored in ZIP with 'STORE' mode.
- All `indexes/*.cdx.gz` files should be stored in ZIP with 'STORE' mode.
- All files (`*.jsonl`, `*.json`, `*.idx`, `*.cdx`, `*.cdxj`) can be stored in the ZIP with either 'DEFLATE' or 'STORE' mode.

### Zip Format File Extension

A ZIP file that follows this Web Archive Collection format spec MUST use the extension `.wacz`. Such a file can be referred to as a WACZ file or a WACZ.
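The compression rules above could be applied with Python's standard `zipfile` module roughly as follows. The helper names `compression_for` and `write_wacz` are hypothetical, and choosing 'DEFLATE' for every file that is not required to be stored is one permitted option, not a requirement of this specification.

```python
import zipfile


def compression_for(path):
    """Pick a ZIP compression mode for a path inside the WACZ."""
    # archive/ files and gzip-compressed files are stored without recompression.
    if path.startswith("archive/") or path.endswith(".gz"):
        return zipfile.ZIP_STORED
    # Everything else may use either mode; this sketch chooses DEFLATE.
    return zipfile.ZIP_DEFLATED


def write_wacz(target, files):
    """Write a {path: bytes} mapping to `target` (a file name or file object)."""
    with zipfile.ZipFile(target, "w") as zf:
        for path, data in files.items():
            zf.writestr(path, data, compress_type=compression_for(path))
```

For example, `write_wacz("example.wacz", files)` would produce a file with the required `.wacz` extension, where `files` maps paths such as `archive/data.warc.gz` to their contents.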
```
HTTP/2 200
Content-Type: application/wacz
Content-Length: 20961755
Accept-Ranges: bytes
Access-Control-Allow-Origin: *
```