WACZ Signing and Verification

Webrecorder Recommendation

More details about this document
This version:
https://specs.webrecorder.net/wacz-auth/0.1.0/
Latest published version:
https://specs.webrecorder.net/wacz-auth/latest/
Editors:
Ilya Kreymer (Webrecorder)
Jack Cushman (Harvard Law Library Innovation Lab (LIL))
Ed Summers (Stanford University)
Additional Documents
Use Cases for Decentralized Web Archives
Repository
Github
Issues
Commits

Abstract

Conformance

As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.

The key word MUST in this document is to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

Introduction

This document is a working draft for a proposal to create signed WACZ packages which allow the package's author to be cryptographically proven.

Motivation

The purpose of this specification is to provide an optional mechanism to make web archives bundled in WACZ more trusted.

The WACZ format provides a way to create portable web archives and distribute them through any online network in a decentralized manner.

To increase trust in web archives, it becomes necessary to guarantee certain properties about who the web archive was created and when.

Proof of Authenticity

Proving web archive authenticity can be difficult. Ideally, proof of authenticity could guarantee that any web server served a particular URL at a particular point in time. Unfortunately, this is not currently possible with existing web standards, as even TLS does not provide "non-repudiation".

There are proposals, such as signed exchanges, which provide new ways for web servers to serve content that could later be verified. However, this requires HTTPS servers to support a new protocol and format.

A web archive can be created by any HTTP client that is capable of recording its end of the HTTP network traffic. The goal of this specification is to provide a way for a client that creates a web archive and stores it in the WACZ format to also add authentication information.

This proposal will provide a mechanism to authenticate:

This approach requires trusting the client, and possible trusted third party 'observer' that signs the web archive. This proposal does not make any guarantees from the perspective of the web server serving the content, as this is not currently possible with HTTP/S.

WACZ Signature File and Format

The WACZ format builds on top of the [FRICTIONLESS-DATA-PACKAGE]. The data package includes a manifest datapackage.json file which contains the hashes of all the files in the data package.

A signed WACZ also contains a datapackage-digest.json, which contains a hash and signature of the datapackage.json

(Note: While this specification is primarily designed for WACZ files, the approach described here may be adapted to any Frictionless Data Package-based specification)

Data Package Digest File

The datapackage-digest.json file, included in the root of the WACZ file, contains the following structure:

{
  "path": "datapackage.json",
  "hash": "sha256:3dd086a0be145d1108bf32a5cac7c4b4c046eb78365792d4bb28e9f43e3c6571",
  "signedData": <Signature Data structure>
}

Signature Data Format

The SignatureData structure is the output of the signing operation.

There can only be one SignatureData object per WACZ file, stored as signedData in the root object present in datapackage-digest.json.

The SignatureData object MUST conform to one of the two signature formats described in this section:

The SignatureData object MUST not contain any property that is not listed in either signature format.

Anonymous Signature

{
    "hash": "<sha256 hash of datapackage.json>",
    "created": "<ISO 8601 Date>",
    "software": "<string>",
    "version": "<string>",
    "signature": "<base64 encoded signature>",
    "publicKey": "<base64 encoded public key (ECDSA)>"
}

With this approach, the WACZ contains just enough to validate that they signature with the publicKey.

To validate authorship of the WACZ, external key management is required, and this signature is otherwise anonymous.

Currently, this approach is used in decentralized tooling, such as the ArchiveWeb.page extension.

Domain-Ownership Identity + Signed Timestamp

{
    "hash": <sha256 hash of datapackage.json>,
    "created": <ISO 8601 Date>,
    "software": <string>,
    "version": <string>,
  
    // signature of 'hash' by domainCert
    "signature": <base64 encoded signature>,
    "domain": <valid hostname>,
    "domainCert": <PEM certificate chain>,
  
    // signature of 'signature' by timestampCert
    "timeSignature": <base64 encoded signature>,
    "timestampCert": <PEM ceriticate chain>,
    
    // optional: cross-signing cert for "signature"
    "crossSignedCert": <PEM certificate chain>
}

This approach allows for the WACZ signature to be created by the same private key as is used to create a TLS certificate for a particular domain.

The creator of the WACZ file is the same as the owner of a particular TLS certificate, which can be explored via Certificate Transparency logs.

This approach also includes an RFC 3161 timestamp server timeSignature of the first signature.

The timeSignature includes the timestamped and is designed to further guarantee that the signature was created close to the specified creation time.

For additional verification, an optional crossSignedCert can be provided which can be used as an alternative to the domainCert, in case the domain certificate has been found to be compromised for any reason. The cross-signed certificate simply provides a way to provide an alternative trust path not tied to the domainCert, if it becomes needed for any reason.

The domainCert, timestampCert, and crossSignedCert should include the full certificate chain to aide in validation.

Signing

To generate the signedData, the creator of the WACZ can perform the following steps.

Anonymous Signing

At minimum, the creator of the WACZ can create the anonymous signature signedData, containing only its public key and signature.

With this approach, the client must distribute its public key out-of-band to make WACZ verification possible.

Domain-Name Identity + Timestamp Signing

To create the signedData in the second example, the client must perform the following:

  1. Generate a ECDSA private key.

  2. Create a Certificate Request signed by the private key.

  3. Receive a TLS certificate for its CSR from a trusted CA, such as LetsEncrypt.

  4. Optionally: use a second CA (such as a private self-signed CA) to generate a second cross-signed certificate as backup.

  5. Sign the hash using its private key to generate the first signature (signature)

  6. Use an RFC 3161 timestamp server to sign the previous signature (timeSignature)

This approach is based on a 'trusted-third party' which securely creates and signs the WACZ files.

The intended workflow for this approach is that the creator of the WACZ has a secure, private connection to the attestation server which is running on the designated domain and can observe and sign the requested hash. Access to the attestation server must be restricted to trusted creator of the WACZ. Establishing this secure connection is beyond the scope of this specification.

Verification

Verification of a signed WACZ can involve the following steps:

  1. Validate the WACZ as a Frictionless Data Package, ensuring the datapackage.json hashes match the contents.

  2. Validate that the hash in datapackage-digest.json matches datapackage.json.

  3. Verify that the signedData property of datapackage-digest.json is a valid SignatureData object.

Anonymous Signature Validation

Given an anonymous signedData with only a publicKey:

  1. Validate the signature of the specified hash with the publicKey.

The WACZ file is thus valid, but publicKey must be validated using external mechanisms.

Domain-Name Identity + Timestamp Validation.

Given a signature data with domain-name signature + timestamp, the validation is as follows:

  1. Read the first certificate in domainCert certificate chain and validate that signature is a valid signature of hash using the public key of the certificate.

  2. Verify that the domain matches the subjectName of the domainCert TLS certificate.

  3. Read the first certificate of timestampCert certificate chain and validate that the timeSignature is a valid RFC 3161 timestamp signature of signature

  4. Validate that the created date is within 10 minutes of the signed timestamp in timeSignature

  5. Verify trust of the domainCert certificate chain by checking trusted root list. (Optionally, check for certificate revocation in Certificate Transparency logs).

  6. Verify trust of the timestampCert certificate chain by checking trusted root list.

  7. Optional: if crossSignedCert is provided, check that it has same public key as domainCert, and check this chain in trusted root list. Or, if domainCert is not trusted, use this trust path instead of domainCert.

Verification of Partially loaded WACZ

The verification process above assumes a fully loaded WACZ file. However, WACZ format supports partial loading of individual HTTP responses (via WARC records) on demand.

For partial/on-demand loading, Step 1 is modified as WARC files are not fully loaded and can not be validated against the hash.

Instead, for each WARC record, verify the hash of the loaded WARC record with the hash specified in the CDXJ index. If using a compressed CDXJ index, also verify the hash of each compressed CDXJ block with the IDX index. (See the CDXJ specification for more details).

Implementations

Implementations of this spec exist in the following tools:

Acknowledgments

The development of this specification was supported by and developed in collaboration with Harvard Law Library Innovation Lab (Harvard LIL).