Webrecorder Recommendation
Copyright © 2022 Webrecorder CC-BY
As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.
The key word MUST in this document is to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, it appears in all capitals, as shown here.
This document is a working draft of a proposal to create signed WACZ packages, which allow a package's authorship to be cryptographically proven.
The purpose of this specification is to provide an optional mechanism to make web archives bundled in WACZ more trustworthy.
The WACZ format provides a way to create portable web archives and distribute them through any online network in a decentralized manner.
To increase trust in web archives, it becomes necessary to guarantee certain properties about who created the web archive and when.
Proving web archive authenticity can be difficult. Ideally, proof of authenticity would guarantee that a given web server served a particular URL at a particular point in time. Unfortunately, this is not currently possible with existing web standards, as even TLS does not provide "non-repudiation".
There are proposals, such as signed exchanges, which provide new ways for web servers to serve content that could later be verified. However, this requires HTTPS servers to support a new protocol and format.
A web archive can be created by any HTTP client that is capable of recording its end of the HTTP network traffic. The goal of this specification is to provide a way for a client that creates a web archive and stores it in the WACZ format to also add authentication information.
This proposal will provide a mechanism to authenticate:
This approach requires trusting the client, and possibly a trusted third-party 'observer' that signs the web archive. This proposal does not make any guarantees from the perspective of the web server serving the content, as this is not currently possible with HTTP/S.
The WACZ format builds on top of the [FRICTIONLESS-DATA-PACKAGE]. The data package includes a manifest datapackage.json file which contains the hashes of all the files in the data package.
A signed WACZ also contains a datapackage-digest.json, which contains a hash and signature of the datapackage.json.
(Note: While this specification is primarily designed for WACZ files, the approach described here may be adapted to any Frictionless Data Package-based specification)
The datapackage-digest.json file, included in the root of the WACZ file, contains the following structure:
{
  "path": "datapackage.json",
  "hash": "sha256:3dd086a0be145d1108bf32a5cac7c4b4c046eb78365792d4bb28e9f43e3c6571",
  "signedData": <Signature Data structure>
}
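The structure above could be assembled roughly as follows. This is a minimal sketch, not a reference implementation: the function name build_digest and the sample manifest are hypothetical, and it assumes the hash is the SHA-256 of the exact bytes of datapackage.json as stored in the WACZ, written with a "sha256:" prefix as in the example.

```python
import hashlib
import json


def build_digest(datapackage_bytes: bytes, signed_data: dict) -> dict:
    """Build the datapackage-digest.json structure for the given manifest bytes.

    Assumption: the hash covers the raw bytes of datapackage.json exactly as
    they are stored in the WACZ, with no re-serialization.
    """
    digest = hashlib.sha256(datapackage_bytes).hexdigest()
    return {
        "path": "datapackage.json",
        "hash": f"sha256:{digest}",
        "signedData": signed_data,
    }


# Hypothetical manifest and placeholder SignatureData, for illustration only.
manifest = json.dumps({"profile": "data-package", "resources": []}).encode("utf-8")
digest_doc = build_digest(manifest, {"hash": "...", "signature": "..."})
print(json.dumps(digest_doc, indent=2))
```

Hashing the stored bytes rather than a re-parsed JSON object avoids mismatches caused by differences in key order or whitespace.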
The SignatureData structure is the output of the signing operation.
There can only be one SignatureData object per WACZ file, stored as signedData in the root object present in datapackage-digest.json.
The SignatureData object MUST conform to one of the two signature formats described in this section:
The SignatureData object MUST NOT contain any property that is not listed in either signature format.
{
  "hash": "<sha256 hash of datapackage.json>",
  "created": "<ISO 8601 Date>",
  "software": "<string>",
  "version": "<string>",
  "signature": "<base64 encoded signature>",
  "publicKey": "<base64 encoded public key (ECDSA)>"
}
With this approach, the WACZ contains just enough information to validate the signature with the publicKey.
To validate authorship of the WACZ, external key management is required, and this signature is otherwise anonymous.
Currently, this approach is used in decentralized tooling, such as the ArchiveWeb.page extension.
{
  "hash": <sha256 hash of datapackage.json>,
  "created": <ISO 8601 Date>,
  "software": <string>,
  "version": <string>,
  // signature of 'hash' by domainCert
  "signature": <base64 encoded signature>,
  "domain": <valid hostname>,
  "domainCert": <PEM certificate chain>,
  // signature of 'signature' by timestampCert
  "timeSignature": <base64 encoded signature>,
  "timestampCert": <PEM certificate chain>,
  // optional: cross-signing cert for "signature"
  "crossSignedCert": <PEM certificate chain>
}
This approach allows for the WACZ signature to be created by the same private key as is used to create a TLS certificate for a particular domain.
The creator of the WACZ file is thus tied to the owner of a particular TLS certificate, whose issuance can be explored via Certificate Transparency logs.
This approach also includes a timeSignature of the first signature, produced by an RFC 3161 timestamp server.
The timeSignature includes the timestamp and is designed to further guarantee that the signature was created close to the specified creation time.
For additional verification, an optional crossSignedCert can be provided as an alternative to the domainCert, in case the domain certificate is found to be compromised. The cross-signed certificate provides an alternative trust path that is not tied to the domainCert, should one become necessary.
The domainCert
, timestampCert
, and crossSignedCert
should include the full
certificate chain to aide in validation.
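The rule that a SignatureData object must match one of the two formats, with no extra properties, could be checked along these lines. This is a sketch under stated assumptions: the name classify_signature_data is hypothetical, and it treats every listed property as required except crossSignedCert, which the text above describes as optional.

```python
# Property sets taken from the two signature formats shown above.
ANONYMOUS_KEYS = {"hash", "created", "software", "version", "signature", "publicKey"}
DOMAIN_REQUIRED = {
    "hash", "created", "software", "version", "signature",
    "domain", "domainCert", "timeSignature", "timestampCert",
}
# Assumption: crossSignedCert is the only optional property.
DOMAIN_OPTIONAL = {"crossSignedCert"}


def classify_signature_data(signed_data: dict) -> str:
    """Return 'anonymous' or 'domain', or raise ValueError for any other shape."""
    keys = set(signed_data)
    if keys == ANONYMOUS_KEYS:
        return "anonymous"
    if DOMAIN_REQUIRED <= keys <= (DOMAIN_REQUIRED | DOMAIN_OPTIONAL):
        return "domain"
    raise ValueError(f"unexpected SignatureData properties: {sorted(keys)}")


# Placeholder values; only the property names matter for this check.
anonymous = dict.fromkeys(ANONYMOUS_KEYS, "placeholder")
print(classify_signature_data(anonymous))
```

Because the two formats use disjoint identity properties (publicKey versus domain and the certificate fields), an object that mixes them is rejected rather than guessed at.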
To generate the signedData, the creator of the WACZ can perform the following steps.
At minimum, the creator of the WACZ can create the anonymous signature signedData, containing only its public key and signature.
With this approach, the client must distribute its public key out-of-band to make WACZ verification possible.
To create the signedData in the second example, the client must perform the following:
1. Generate an ECDSA private key.
2. Create a Certificate Signing Request (CSR) signed by the private key.
3. Receive a TLS certificate for its CSR from a trusted CA, such as LetsEncrypt.
4. Optionally, use a second CA (such as a private self-signed CA) to generate a second, cross-signed certificate as a backup.
5. Sign the hash using its private key to generate the first signature (signature).
6. Use an RFC 3161 timestamp server to sign the previous signature (timeSignature).
This approach is based on a 'trusted third party' which securely creates and signs the WACZ files.
The intended workflow for this approach is that the creator of the WACZ has a secure, private connection to the attestation server, which runs on the designated domain and can observe and sign the requested hash. Access to the attestation server must be restricted to trusted creators of WACZ files. Establishing this secure connection is beyond the scope of this specification.
Verification of a signed WACZ can involve the following steps:
1. Validate the WACZ as a Frictionless Data Package, ensuring the datapackage.json hashes match the contents.
2. Validate that the hash in datapackage-digest.json matches datapackage.json.
3. Verify that the signedData property of datapackage-digest.json is a valid SignatureData object.
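The hash checks in these first steps might look like the sketch below, which reads a fully loaded WACZ as a ZIP file. The function names are hypothetical, and it assumes each entry in the manifest's resources list carries a path and a hash of the form "sha256:<hex>". It only checks that signedData is present as an object; validating its contents is a separate step.

```python
import hashlib
import io
import json
import zipfile


def sha256_tag(data: bytes) -> str:
    """Return a hash string in the 'sha256:<hex>' form used by the examples."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def check_package_hashes(wacz: zipfile.ZipFile) -> list:
    """Return a list of error strings; an empty list means the hashes match."""
    errors = []
    manifest_bytes = wacz.read("datapackage.json")
    manifest = json.loads(manifest_bytes)
    # Each listed resource must hash to the value in the manifest.
    for resource in manifest.get("resources", []):
        if sha256_tag(wacz.read(resource["path"])) != resource["hash"]:
            errors.append(f"resource hash mismatch: {resource['path']}")
    # The digest file must carry the hash of the manifest itself.
    digest = json.loads(wacz.read("datapackage-digest.json"))
    if digest.get("hash") != sha256_tag(manifest_bytes):
        errors.append("datapackage-digest.json hash does not match datapackage.json")
    if not isinstance(digest.get("signedData"), dict):
        errors.append("signedData is missing or not an object")
    return errors


# Build a tiny in-memory package with hypothetical contents to exercise the check.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    page = b"example page index data"
    manifest_bytes = json.dumps(
        {"resources": [{"path": "pages/pages.jsonl", "hash": sha256_tag(page)}]}
    ).encode("utf-8")
    zf.writestr("pages/pages.jsonl", page)
    zf.writestr("datapackage.json", manifest_bytes)
    zf.writestr(
        "datapackage-digest.json",
        json.dumps({"path": "datapackage.json", "hash": sha256_tag(manifest_bytes), "signedData": {}}),
    )

with zipfile.ZipFile(buf) as zf:
    errors = check_package_hashes(zf)
print(errors)
```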
Given an anonymous signedData with only a publicKey:
Validate the signature of the specified hash with the publicKey. The WACZ file is thus valid, but the publicKey must be validated using external mechanisms.
Given a signedData with a domain-name signature and timestamp, validation proceeds as follows:
1. Read the first certificate in the domainCert certificate chain and validate that signature is a valid signature of hash using the public key of the certificate.
2. Verify that the domain matches the subjectName of the domainCert TLS certificate.
3. Read the first certificate of the timestampCert certificate chain and validate that the timeSignature is a valid RFC 3161 timestamp signature of signature.
4. Validate that the created date is within 10 minutes of the signed timestamp in timeSignature.
5. Verify trust of the domainCert certificate chain by checking it against a trusted root list. (Optionally, check for certificate revocation in Certificate Transparency logs.)
6. Verify trust of the timestampCert certificate chain by checking it against a trusted root list.
7. Optional: if crossSignedCert is provided, check that it has the same public key as domainCert, and check this chain against a trusted root list. Alternatively, if domainCert is not trusted, use this trust path instead of domainCert.
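Of the steps above, the comparison between the created date and the signed timestamp is simple enough to sketch with the standard library alone. The names below are hypothetical, and the sketch assumes the signed time has already been extracted from the RFC 3161 token as a timezone-aware datetime; parsing the token and verifying the certificate chains are not shown. Treating exactly 10 minutes as acceptable, and treating a created value without an offset as UTC, are also assumptions.

```python
from datetime import datetime, timedelta, timezone

# The 10-minute window comes from the verification steps above.
MAX_SKEW = timedelta(minutes=10)


def parse_created(value: str) -> datetime:
    """Parse an ISO 8601 'created' value into a timezone-aware datetime."""
    # Older Python versions do not accept a trailing 'Z' in fromisoformat.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: a value without an offset is interpreted as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def created_matches_timestamp(created: str, signed_time: datetime) -> bool:
    """Check that 'created' is within MAX_SKEW of the timestamp server's time."""
    return abs(parse_created(created) - signed_time) <= MAX_SKEW


# Hypothetical values for illustration.
signed_time = datetime(2022, 3, 1, 12, 5, tzinfo=timezone.utc)
print(created_matches_timestamp("2022-03-01T12:00:00Z", signed_time))
```

Using an absolute difference accepts a created date slightly before or slightly after the timestamp, which tolerates modest clock drift on the signing client.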
The verification process above assumes a fully loaded WACZ file. However, the WACZ format supports partial loading of individual HTTP responses (via WARC records) on demand.
For partial/on-demand loading, Step 1 (validating the WACZ as a Frictionless Data Package) is modified, as WARC files are not fully loaded and cannot be validated against the hash in datapackage.json.
Instead, for each WARC record, verify the hash of the loaded WARC record against the hash specified in the CDXJ index. If using a compressed CDXJ index, also verify the hash of each compressed CDXJ block against the IDX index. (See the CDXJ specification for more details.)
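A per-record check for on-demand loading could be sketched as follows. The function name is hypothetical, and the sketch assumes the index stores the digest as a "sha256:<hex>" string computed over the record bytes exactly as they are stored in the WARC file; the CDXJ specification defines the actual field name and encoding.

```python
import hashlib


def record_matches_index(record_bytes: bytes, index_digest: str) -> bool:
    """Compare a loaded WARC record against the digest listed in the index.

    Assumption: index_digest has the form 'sha256:<hex>' and covers the record
    bytes as stored (for example, the compressed bytes if records are stored
    compressed).
    """
    algorithm, _, expected = index_digest.partition(":")
    if algorithm != "sha256":
        raise ValueError(f"unsupported digest algorithm: {algorithm!r}")
    return hashlib.sha256(record_bytes).hexdigest() == expected.lower()


# Hypothetical record bytes and a matching index digest, for illustration only.
record = b"WARC/1.1\r\nWARC-Type: response\r\n\r\nexample payload"
index_digest = "sha256:" + hashlib.sha256(record).hexdigest()
print(record_matches_index(record, index_digest))
```

The same comparison would apply to each compressed CDXJ block when checking it against the IDX index.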
Implementations of this spec exist in the following tools:
The authsign library provides an HTTP-based API for creating and validating a WACZ signedData for the domain-name identity + timestamp approach. This library uses the LetsEncrypt service to generate a domain certificate on-demand, and the FreeTSA timestamping service to generate an RFC 3161 timestamp.
The py-wacz CLI tool can be used to generate and validate WACZ files with domain-name identity + timestamp, either by connecting to an authsign server via API for signing, or by validating directly via the CLI.
The ArchiveWeb.page extension produces signed WACZ files with an anonymous signedData. The publicKey is not yet distributed outside the extension.
The development of this specification was supported by, and carried out in collaboration with, the Harvard Law Library Innovation Lab (Harvard LIL).