Experimental proxy and wrapper boilerplate for safely and efficiently embedding Web Archives (.warc, .warc.gz, .wacz) into web pages.
This implementation:
- Wraps Webrecorder's replayweb.page client-side playback technology.
- Serves, proxies and caches web archive files using NGINX.
- Allows for two-way communication between the embedding website and the embedded archive using post messages.
<!-- Embedding a playback of archive.wacz on https://example.com -->
<iframe
src="https://wacz.example.com/?source=archive.wacz&url=https://what-was-archived.ext/path"
allow="allow-scripts allow-forms allow-same-origin"
>
</iframe>
See also: Live Demo, Blog post
wacz-exhibitor serves an HTML document containing a pre-configured instance of replayweb.page, Webrecorder's client-side web archive playback system, pointing at a proxied version of the requested WARC/WACZ file.
For security reasons, the playback will only start if this HTML document is embedded in a cross-origin <iframe>. This prevents XSS in a context where the <iframe> needs both allow-scripts and allow-same-origin.
We recommend hosting wacz-exhibitor on a subdomain of the embedding website to avoid third-party cookie limitations:
www.example.com -> Has iframes pointing at wacz.example.com
wacz.example.com -> Hosts wacz-exhibitor
wacz-exhibitor pulls and serves the requested archive file in the format required by <replay-web-page> (correct Content-Type, support for range requests, CORS resolution and Content Security Policy).
The requested web archive file can be sourced from either:
- The local /archives/ folder. This is where the server will look first.
- A remote location the server will proxy from, defined in nginx.conf.
Serves an HTML document containing an instance of <replay-web-page>, pointing at a proxied archive file.
Must be embedded in a cross-origin <iframe>, preferably on the same parent domain to avoid third-party cookie limitations.
Methods: GET, HEAD
Name | Required? | Description |
---|---|---|
source | Yes | Filename of the .warc, .warc.gz or .wacz. Can contain a path, but cannot be a URL. The file must either be present in the /archives/ folder or on the remote server defined in nginx.conf. |
url | No | URL of a page within the archive to display. |
ts | No | Timestamp of the page to retrieve. Can be either a YYYYMMDDHHMMSS-formatted string or a millisecond timestamp. |
embed | No | <replay-web-page>'s embed mode. Can be set to replayonly to hide its UI. |
deepLink | No | <replay-web-page>'s deepLink mode. |
noSandbox | No | If set, will remove the sandbox from the <replay-web-page> iframe. May be necessary for certain playbacks; e.g., cross-browser compatible playbacks of PDFs. |
<!-- On https://*.domain.ext: -->
<iframe
src="https://wacz.domain.ext/?source=archive.warc.gz&url=https://what-was-archived.ext/path"
allow="allow-scripts allow-forms allow-same-origin allow-downloads"
>
</iframe>
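The query parameters listed above can also be assembled programmatically. The following is a minimal sketch and not part of wacz-exhibitor: `buildExhibitorUrl` and `toArchiveTimestamp` are hypothetical helper names, the base URL is a placeholder, and the code assumes that YYYYMMDDHHMMSS timestamps are expressed in UTC and that any value is enough to "set" `noSandbox`.

```javascript
// Hypothetical helpers (not part of wacz-exhibitor).

// Formats a Date as a YYYYMMDDHHMMSS string.
// Assumption: archive timestamps are expressed in UTC.
function toArchiveTimestamp(date) {
  // toISOString() yields e.g. "2022-08-16T16:25:27.000Z"
  return date.toISOString().replace(/[-:T]/g, "").slice(0, 14);
}

// Builds the `src` of a wacz-exhibitor <iframe> from the documented parameters.
function buildExhibitorUrl(base, { source, url, ts, embed, deepLink, noSandbox } = {}) {
  if (!source) {
    throw new Error("`source` is required");
  }
  // `source` may contain a path, but cannot be a URL.
  if (/^[a-z][a-z0-9+.-]*:\/\//i.test(source)) {
    throw new Error("`source` must be a filename or path, not a URL");
  }

  const target = new URL(base);
  target.searchParams.set("source", source);
  if (url) target.searchParams.set("url", url);
  if (ts !== undefined) {
    // Accepts a Date, a YYYYMMDDHHMMSS string or a millisecond timestamp.
    const value = ts instanceof Date ? toArchiveTimestamp(ts) : String(ts);
    target.searchParams.set("ts", value);
  }
  if (embed) target.searchParams.set("embed", embed);
  if (deepLink !== undefined) target.searchParams.set("deepLink", String(deepLink));
  // Assumption: the parameter only needs to be present, so the value is arbitrary.
  if (noSandbox) target.searchParams.set("noSandbox", "true");
  return target.toString();
}

// Usage example with placeholder values
const exampleSrc = buildExhibitorUrl("https://wacz.example.com/", {
  source: "archive.wacz",
  url: "https://what-was-archived.ext/path",
  ts: new Date(Date.UTC(2022, 7, 16, 16, 25, 27)),
});
console.log(exampleSrc);
```

Because the values go through `URLSearchParams`, the archived page's URL is percent-encoded in the resulting `src`, which avoids ambiguity when that URL carries its own query string.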
Pulls, caches and serves a given .warc, .warc.gz or .wacz file, with full support for range requests.
Will first look for the given path and file in the local /archives/ folder and, if it is not found there, try to proxy it from the remote server defined in nginx.conf.
This project consists of a single Dockerfile derived from the official NGINX Docker image, which can be deployed on any Docker-compatible machine.
The following example describes the process of deploying wacz-exhibitor on fly.io, a platform-as-a-service provider.
- nginx.conf needs to be edited. See comments starting with EDIT: in the document for instructions.
- Install the flyctl client and sign in, if not already done.
- Initialize and deploy the project by running the flyctl launch command (use flyctl deploy for subsequent deploys).
- wacz-exhibitor is now live and visible on the fly.io dashboard.
- We highly recommend setting up a custom domain and SSL certificate. This can be done directly from the fly.io dashboard. Ideally, the target domain should be a subdomain of the website on which wacz-exhibitor iframes are going to be embedded: for example, www.domain.ext embedding an <iframe> from wacz.domain.ext.
docker build . -t wacz-exhibitor-local
docker run --rm -p 8080:8080 wacz-exhibitor-local
# wacz-exhibitor is now accessible at http://localhost:8080
Shortcut: start-dev.sh
A minimal sandbox is available to test embedding wacz-exhibitor <iframe>s in webpages.
You may edit sandbox/index.html to make it point to a specific web archive file and run the following command to start the sandbox:
# Assuming: wacz-exhibitor is running on port 8080 ...
bash start-sandbox.sh
# The sandbox is now accessible at http://localhost:8000
wacz-exhibitor allows the embedding website to communicate with the embedded archive playback using post messages.
All messages coming from a wacz-exhibitor <iframe> come with a waczExhibitorHref property, helping identify the sender.
This feature can be used to build interactive experiences using web archive files.
wacz-exhibitor will look for the following properties in messages coming from the embedding website and react accordingly:
Property name | Expected value | Description |
---|---|---|
updateUrl | String | If provided, will replace the current url parameter of <replay-web-page>. |
updateTs | Number | If provided, will replace the current ts parameter of <replay-web-page>. |
getCollInfo | Boolean | If provided, will send a post message back with <replay-web-page>'s collInfo object, containing meta information about the currently-loaded archive. |
getInited | Boolean | If provided, will send a post message back with the current value of <replay-web-page>'s inited property, indicating whether or not the service worker is ready. |
overrideElementAttribute | HTMLAttributeOverride | If provided, will look for the element with the specified CSS selector inside <replay-web-page> and, if found, apply the requested HTML attribute to it. If the element is not found, will send a post message back reporting "status": "timed out", along with a copy of the original message's data. |
wacz-exhibitor will forward to the embedding website every post message sent by <replay-web-page>'s service worker.
The most common example is the following, which is sent during navigation within an archive:
{
"waczExhibitorHref": "https://wacz.domain.ext/?source=archive.warc.gz&url=https://what-was-archived.ext/path",
"url": "https://what-was-archived.ext/new-path/",
"view": "pages",
"ts": "20220816162527"
}
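One possible use of such a navigation message is to derive an embed URL that reflects the page currently displayed, for example to offer a "share this view" link. The sketch below is illustrative only: `embedUrlFromNavigation` is a hypothetical helper, and it assumes the message carries the `waczExhibitorHref`, `url` and `ts` fields shown in the example above.

```javascript
// Hypothetical helper (not part of wacz-exhibitor): turns a navigation
// message into an embed URL pointing at the page currently displayed.
function embedUrlFromNavigation(message) {
  if (!message || !message.waczExhibitorHref || !message.url) {
    return null; // Not a navigation message this helper can use
  }

  // Start from the URL of the <iframe> that sent the message...
  const target = new URL(message.waczExhibitorHref);

  // ...and swap in the page (and timestamp, if any) reported by the playback.
  target.searchParams.set("url", message.url);
  if (message.ts) {
    target.searchParams.set("ts", message.ts);
  } else {
    target.searchParams.delete("ts");
  }
  return target.toString();
}

// Usage example, reusing the message shown above
const shareable = embedUrlFromNavigation({
  waczExhibitorHref:
    "https://wacz.domain.ext/?source=archive.warc.gz&url=https://what-was-archived.ext/path",
  url: "https://what-was-archived.ext/new-path/",
  view: "pages",
  ts: "20220816162527",
});
console.log(shareable);
```

The `source` parameter and any other existing parameters are left untouched, so the resulting URL targets the same archive file.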
// Assuming: there's only 1 <iframe class="wacz-exhibitor">
const playback = document.querySelector("iframe.wacz-exhibitor");
window.addEventListener("message", (event) => {
// This message bears data and comes from the `wacz-exhibitor` <iframe>
if (event?.data && event.source === playback.contentWindow) {
console.log(event);
}
});
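The listener above can be made stricter by also checking the origin of the message and the presence of the `waczExhibitorHref` property. The following is a sketch under stated assumptions: `isExhibitorMessage` is a hypothetical helper, and it assumes the `<iframe>`'s `src` still reflects the origin the embedded document is served from (i.e. no redirect to another origin).

```javascript
// Hypothetical helper (not part of wacz-exhibitor): returns true if a
// "message" event appears to come from the given wacz-exhibitor <iframe>.
function isExhibitorMessage(event, iframe) {
  // The message must bear an object payload
  if (!event || !event.data || typeof event.data !== "object") {
    return false;
  }
  // Same check as in the listener above: the sender must be the <iframe>'s window
  if (event.source !== iframe.contentWindow) {
    return false;
  }
  // Assumption: the embedded document is served from the origin found in `src`
  if (event.origin !== new URL(iframe.src).origin) {
    return false;
  }
  // All messages coming from wacz-exhibitor carry this property
  return typeof event.data.waczExhibitorHref === "string";
}

// In a browser, this could be used as follows:
// const playback = document.querySelector("iframe.wacz-exhibitor");
// window.addEventListener("message", (event) => {
//   if (isExhibitorMessage(event, playback)) {
//     console.log(event.data);
//   }
// });
```

The function only reads `src` and `contentWindow` from the element it is given, which makes it usable with any number of wacz-exhibitor iframes on the same page.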
// Assuming: there's only 1 <iframe class="wacz-exhibitor">
const playback = document.querySelector("iframe.wacz-exhibitor");
const playbackOrigin = new URL(playback.src).origin;
playback.contentWindow.postMessage(
{"updateUrl": "https://what-was-archived.ext/new-path"},
playbackOrigin
);
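When several kinds of messages need to be sent, the calls to `postMessage` can be grouped behind a small wrapper. This is only a sketch: `createPlaybackController` and its method names are hypothetical, each method sends exactly one of the properties documented in the table above, and the number format expected by `updateTs` is not specified here, so the value is passed through as a Number without conversion.

```javascript
// Hypothetical wrapper (not part of wacz-exhibitor) around the documented
// message properties. `targetWindow` is the <iframe>'s contentWindow and
// `targetOrigin` the origin wacz-exhibitor is served from.
function createPlaybackController(targetWindow, targetOrigin) {
  const send = (message) => targetWindow.postMessage(message, targetOrigin);

  return {
    // Replaces the current `url` parameter of <replay-web-page>
    goTo: (url) => send({ updateUrl: String(url) }),
    // Replaces the current `ts` parameter of <replay-web-page>
    setTimestamp: (ts) => send({ updateTs: Number(ts) }),
    // Asks for the collInfo object; the reply arrives as a separate message
    requestCollInfo: () => send({ getCollInfo: true }),
    // Asks whether the service worker is ready; the reply arrives as a separate message
    requestInited: () => send({ getInited: true }),
  };
}

// In a browser, this could be used as follows:
// const playback = document.querySelector("iframe.wacz-exhibitor");
// const controller = createPlaybackController(
//   playback.contentWindow,
//   new URL(playback.src).origin
// );
// controller.goTo("https://what-was-archived.ext/new-path");
```

Replies to `requestCollInfo` and `requestInited` are not returned by these methods; they would be received through a "message" event listener on the embedding page.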