High-fidelity, browser-based, single-page web archiving library and CLI.
Use it in the terminal...
scoop "https://lil.law.harvard.edu"
... or in your Node.js project
import { Scoop } from '@harvard-lil/scoop'
const capture = await Scoop.capture('https://lil.law.harvard.edu')
const wacz = await capture.toWACZ()
- About
- Main Features
- Getting Started
- Using Scoop on the command line
- Using Scoop as a JavaScript library
- Development
- FAQ
Scoop is a high-fidelity, browser-based web archiving capture engine from the Harvard Library Innovation Lab, designed for witnessing the web.
Fine-tune this custom web capture software to create robust single-page captures of the internet with accurate and complete provenance information.
With extensive options for asset formats and inclusions, Scoop will create `.warc`, `.warc.gz` or `.wacz` files to be stored by users and replayed using the web archive replay software of their choosing.
Scoop also comes with built-in support for the WACZ Signing and Verification specification, allowing users to cryptographically sign their captures.
More info:
- "Witnessing the web is hard: Why and how we built the Scoop web archiving capture engine 🍨"
  April 13 2023 - lil.law.harvard.edu
- "New Release: High Fidelity Capture Engine for Witnessing the Web 🍨"
  March 28 2023 - blogs.harvard.edu/perma
- High-fidelity, browser-based capture of single web pages with no alterations
- Highly configurable
- Optional attachments:
- Provenance summary
- Screenshot
- Extracted videos with associated subtitles and metadata
- PDF snapshot
- DOM snapshot
- SSL certificates
- Support for `.warc`, `.warc.gz` and `.wacz` output formats
- Support for the WACZ Signing and Verification specification
- Optional preservation of "raw" exchanges in WACZ files for later analysis or reprocessing ("wacz with raw exchanges")
- 💾 Sample WACZ file captured with Scoop.
  Playback software such as replayweb.page can be used to explore this sample capture.
- 📷 Entry points
- 📷 Web Capture
- 📷 Provenance Summary
- 📷 PDF Snapshot
- 📷 Embedded videos as attachments [1] [2]
Scoop requires Node.js 18+.
Other recommended system-level dependencies:
`curl`, `python3` (for the `--capture-video-as-attachment` option).
The resources Scoop needs depend entirely on what is being captured, but complex captures appear to need at least 4 GB of RAM.
This program has been written for UNIX-like systems and is expected to work on Linux, Mac OS, and Windows Subsystem for Linux.
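The Node.js requirement above can be checked programmatically before attempting a capture. The snippet below is a minimal sketch: `meetsNodeRequirement` is a hypothetical helper written for this example and is not part of Scoop.

```javascript
// Hypothetical helper (not part of Scoop): checks a Node.js version string
// against the minimum major version stated in this README.
function meetsNodeRequirement (version, minimumMajor = 18) {
  // `version` is expected to look like "18.16.0" (format of process.versions.node)
  const major = Number.parseInt(String(version).split('.')[0], 10)
  return Number.isInteger(major) && major >= minimumMajor
}

if (!meetsNodeRequirement(process.versions.node)) {
  console.warn(`Scoop requires Node.js 18+, found ${process.versions.node}`)
}
```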
Scoop is available on npmjs.org and can be installed as follows:
# As a CLI
npm install -g @harvard-lil/scoop
# As a library
npm install @harvard-lil/scoop --save
# In both cases, you may need to install Playwright's dependencies:
sudo npx playwright install-deps chromium
Trouble installing the CLI?
- Make sure you are running Node.js 18+ (`node -v`).
- Permission issues are common when installing `npm` packages globally for the first time. See npm's documentation for solutions.
- On certain systems, using `install-deps` without the `chromium` argument might be necessary:
sudo npx playwright install-deps
- npx may be used as an alternative to a global installation:
# In a new folder
npm init
npm install @harvard-lil/scoop
npx scoop "https://example.com"
Here are a few examples of how the `scoop` command can be used to make a customized capture of a web page.
# This will capture a given url using the default settings.
scoop "https://lil.law.harvard.edu"
# Unless specified otherwise, scoop will save the output of the capture as "./archive.wacz".
# We can change this with the `--output` / `-o` option
scoop "https://lil.law.harvard.edu" -o my-collection/lil.wacz
# But what if I want to change the output format itself?
scoop "https://lil.law.harvard.edu" -f warc -o my-collection/lil.warc
# By default, Scoop runs in headless mode.
# I can turn the "headless" flag off to see what happens in Chromium during capture.
scoop "https://lil.law.harvard.edu" --headless false
# Although it comes with "good defaults", scoop is highly configurable ...
scoop "https://lil.law.harvard.edu" --capture-video-as-attachment false --screenshot false --capture-window-x 320 --capture-window-y 480 --capture-timeout 30000 --max-capture-size 100000 --signing-url "https://example.com/sign"
# ... use --help to list the available options, and see what the defaults are.
scoop --help
# Timeout-related options are good dials to turn first when trying to customize "how much" of a page to capture.
scoop "https://lil.law.harvard.edu" --capture-timeout 90000 --load-timeout 60000 --network-idle-timeout 30000
See: Output of `scoop --help`
Usage: scoop [options] <url>
🍨 High-fidelity, browser-based, single-page web archiving library and CLI.
More info: https://github.com/harvard-lil/scoop
Options:
-v, --version Display Scoop and Scoop CLI version.
-o, --output <string> Output path. (default: "./archive.wacz")
-f, --format <string> Output format. (choices: "warc", "warc-gzipped", "wacz", "wacz-with-raw", default: "wacz")
--json-summary-output <string> If set, allows for saving a capture summary as JSON. Must be a path to .json file.
--export-attachments-output <string> If set, allows for exporting attachments (screenshot, certs, ...). Must be a path to an existing directory.
--signing-url <string> Authsign-compatible endpoint for signing WACZ file.
--signing-token <string> Authentication token to --signing-url, if needed.
--screenshot <bool> Add screenshot step to capture? (choices: "true", "false", default: "true")
--pdf-snapshot <bool> Add PDF snapshot step to capture? (choices: "true", "false", default: "false")
--dom-snapshot <bool> Add DOM snapshot step to capture? (choices: "true", "false", default: "false")
--capture-video-as-attachment <bool> Add capture video(s) as attachment(s) step to capture? (choices: "true", "false", default: "true")
--capture-certificates-as-attachment <bool> Add capture certificate(s) as attachment(s) step to capture? (choices: "true", "false", default: "true")
--provenance-summary <bool> Add provenance summary to capture? (choices: "true", "false", default: "true")
--attachments-bypass-limits <bool> If active, attachments will not count towards time and size constraints imposed on capture (--capture-timeout, --max-capture-size). (choices: "true", "false", default: "true")
--capture-timeout <number> Maximum time allocated to capture process before hard cut-off, in ms. (default: 60000)
--load-timeout <number> Max time Scoop will wait for the page to load, in ms. (default: 20000)
--network-idle-timeout <number> Max time Scoop will wait for the in-browser networking tasks to complete, in ms. (default: 20000)
--behaviors-timeout <number> Max time Scoop will wait for the browser behaviors to complete, in ms. (default: 20000)
--capture-video-as-attachment-timeout <number> Max time Scoop will wait for the video capture process to complete, in ms. (default: 30000)
--capture-certificates-as-attachment-timeout <number> Max time Scoop will wait for the certificates capture process to complete, in ms. (default: 10000)
--capture-window-x <number> Width of the browser window Scoop will open to capture, in pixels. (default: 1600)
--capture-window-y <number> Height of the browser window Scoop will open to capture, in pixels. (default: 900)
--max-capture-size <number> Size limit for the capture's exchanges list, in bytes. (default: 209715200)
--max-video-capture-size <number> Size limit for the video attachment, in bytes. Scoop will not capture video attachments larger than this. (default: 209715200)
--auto-scroll <bool> Should Scoop try to scroll through the page? (choices: "true", "false", default: "true")
--auto-play-media <bool> Should Scoop try to autoplay `<audio>` and `<video>` tags? (choices: "true", "false", default: "true")
--grab-secondary-resources <bool> Should Scoop try to download img srcsets and secondary stylesheets? (choices: "true", "false", default: "true")
--run-site-specific-behaviors <bool> Should Scoop run site-specific capture behaviors? (via: browsertrix-behaviors) (choices: "true", "false", default: "true")
--headless <bool> Should Chrome run in headless mode? (choices: "true", "false", default: "true")
--user-agent-suffix <string> If provided, will be appended to Chrome's user agent. (default: "")
--blocklist <string> If set, replaces Scoop's default list of url patterns and IP ranges Scoop should not capture. Comma-separated. Example: "/https?://localhost/,0.0.0.0/8,10.0.0.0".
--intercepter <string> ScoopIntercepter class to be used to intercept network exchanges. (default: "ScoopProxy")
--proxy-host <string> Hostname to be used by Scoop's HTTP proxy. (default: "localhost")
--proxy-port <string> Port to be used by Scoop's HTTP proxy. (default: 9000)
--proxy-verbose <bool> Should Scoop's HTTP proxy output logs to the console? (choices: "true", "false", default: "false")
--public-ip-resolver-endpoint <string> API endpoint to be used to resolve the client's IP address. Used in the context of the provenance summary. (default: "https://icanhazip.com")
--yt-dlp-path <string> Path to the yt-dlp executable. Used for capturing videos. (default: "[library]/executables/yt-dlp")
--crip-path <string> Path to the crip executable. Used for capturing SSL/TLS certificates. (default: "[library]/executables/crip")
--log-level <string> Controls Scoop CLI's verbosity. (choices: "silent", "trace", "debug", "info", "warn", "error", default: "info")
-h, --help Show options list.
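The examples in this README suggest that CLI flags are the kebab-case form of the library's camelCase options (`pdfSnapshot` and `--pdf-snapshot`, `captureWindowX` and `--capture-window-x`). Assuming that convention holds for the options you use, the sketch below builds a `scoop` argument list from an options object. `toCliArgs` is a hypothetical helper, not something Scoop provides.

```javascript
// Hypothetical helper (not part of Scoop): turns a camelCase options object
// into the kebab-case flags listed in `scoop --help`.
// Assumption: every option follows the camelCase <-> kebab-case convention.
function toCliArgs (url, options = {}) {
  const args = [url]
  for (const [key, value] of Object.entries(options)) {
    // Example: captureTimeout becomes --capture-timeout
    const flag = '--' + key.replace(/[A-Z]/g, (c) => '-' + c.toLowerCase())
    // The CLI takes booleans as the strings "true" / "false"
    args.push(flag, String(value))
  }
  return args
}

const args = toCliArgs('https://lil.law.harvard.edu', {
  captureTimeout: 90000,
  loadTimeout: 60000,
  pdfSnapshot: true
})
console.log(['scoop', ...args].join(' '))
```

The resulting array can be handed to `child_process.execFile('scoop', args)`, which avoids shell quoting issues with URLs.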
Scoop can be used as a library in a Node.js project.
Here are a few examples of how to programmatically capture web pages using the `Scoop.capture()` method, which returns an instance of the `Scoop` class.
const capture = await Scoop.capture(url, options)
- List of available options for `Scoop.capture()`
- `Scoop.toWACZ()` method
- `Scoop.toWARC()` method
- `Scoop.fromWACZ()` method (experimental)
- Possible values of the `Scoop.state` property
import fs from 'fs/promises'
import { Scoop } from '@harvard-lil/scoop'
try {
const capture = await Scoop.capture('https://lil.law.harvard.edu')
const wacz = await capture.toWACZ()
await fs.writeFile('archive.wacz', Buffer.from(wacz))
} catch(err) {
// ...
}
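The same pattern extends to several URLs. The sketch below captures pages one after the other and records failures instead of stopping at the first error. `captureMany` is a hypothetical helper; it receives `Scoop` as a parameter so it can be exercised with a stand-in object. Running captures sequentially is a conservative choice made for this example, not a documented requirement.

```javascript
// Hypothetical helper (not part of Scoop): captures each URL in turn and
// returns one result entry per URL, whether the capture succeeded or not.
async function captureMany (Scoop, urls, options = {}) {
  const results = []
  for (const url of urls) {
    try {
      const capture = await Scoop.capture(url, options)
      const wacz = await capture.toWACZ()
      results.push({ url, ok: true, wacz: Buffer.from(wacz) })
    } catch (err) {
      // Keep going: one failed page should not abort the whole batch
      results.push({ url, ok: false, error: err })
    }
  }
  return results
}

// Usage, assuming `Scoop` was imported from '@harvard-lil/scoop':
// const results = await captureMany(Scoop, ['https://lil.law.harvard.edu'])
```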
import fs from 'fs/promises'
import { Scoop } from '@harvard-lil/scoop'
try {
const capture = await Scoop.capture('https://lil.law.harvard.edu', {
screenshot: true,
pdfSnapshot: true,
captureVideoAsAttachment: false,
captureTimeout: 120 * 1000,
loadTimeout: 60 * 1000,
captureWindowX: 320,
captureWindowY: 480
})
const warc = await capture.toWARC()
await fs.writeFile('archive.warc', Buffer.from(warc))
} catch(err) {
// ...
}
import { Scoop } from '@harvard-lil/scoop'
try {
// "options" will be a copy of Scoop's default settings
const options = Scoop.defaults
// It therefore becomes easier to inspect said defaults ...
console.log(options)
// ... and edit existing values
options.pdfSnapshot = true
options.blocklist.push('/https?:\/\/foo/')
const capture = await Scoop.capture('https://lil.law.harvard.edu', options)
// ...
} catch(err) {
// ...
}
import fs from 'fs/promises'
import { Scoop } from '@harvard-lil/scoop'
try {
const capture = await Scoop.capture('https://lil.law.harvard.edu')
const signedWacz = await capture.toWACZ(true, {
url: 'https://example.com/sign',
token: 'some-very-secret-token'
})
await fs.writeFile('archive.wacz', Buffer.from(signedWacz))
} catch(err) {
// ...
}
🚧 Under construction
Browser-based capture means that Scoop uses a browser - Chromium - to visit the web page to capture and collect resources.
Specifically, it uses an HTTP proxy to "intercept" network exchanges as early as possible and preserve them "as is".
flowchart LR
A[Scoop]
B[Playwright]
C[Chromium]
D[Website]
E[HTTP Proxy]
A <--> |Controls| B
B <--> C
C <--> D
A <-.-> |Capture| E <-.-> C
The browser Scoop controls was installed specifically for programmatic access by Playwright, the underlying tool it uses to communicate with it, and is different from the default browser of the machine Scoop is running on. Additionally, Scoop creates a single-use, isolated browsing context for every capture it makes.
Not yet - for security reasons - but we're working on it.
Although Playwright supports loading browser profiles, doing so:
- Breaks context isolation
- May lead to the presence of credentials / tokens in the captured exchanges
Help us design this feature: #118
Yes, and unless specified otherwise.
Namely:
- If the main URL to capture is not a web page (for example: a PDF file), it will be captured using curl.
- Videos captured as attachments are captured outside of the browser using yt-dlp.
- Same goes for certificates, captured as attachments via crip.
- Favicons may be captured out-of-band using curl, if not intercepted during capture.
Exchanges captured in that context still go through Scoop's HTTP proxy, with the exception of crip.
flowchart LR
A[Scoop]
B[curl]
C[Resource]
D[HTTP Proxy]
A <--> |Controls| B
B <--> C
A <-.-> |Capture| D <-.-> B
The `includeRaw` option of `Scoop.toWACZ()` allows for adding a folder named "raw" in the WACZ file, which contains a copy of unprocessed HTTP exchanges coming directly from Scoop's HTTP proxy.
This feature may be used to preserve finer elements that would otherwise be lost, such as ill-formed HTTP headers, and could be relevant in certain contexts such as forensic analysis.
In order to prevent unnecessary use of storage, Scoop only keeps in "/raw" the contents of exchanges it assesses are presented differently in WARCs. In practice, this most often means the bodies of HTTP exchanges are not included in the "/raw" files because the WARCs already contain the same data.
Experimental: WACZ files stored with the `includeRaw` option can be ingested by Scoop for analysis and processing via the `Scoop.fromWACZ()` method.
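As an illustration, the sketch below maps the CLI's `wacz` and `wacz-with-raw` format names onto the library call. It assumes `includeRaw` is the first positional argument of `toWACZ()`, which is how the signing example in this README appears to use it. `exportWacz` is a hypothetical helper, not part of Scoop.

```javascript
// Hypothetical helper (not part of Scoop): exports a capture as WACZ, with or
// without the "raw" folder, using the CLI's format names.
// Assumption: toWACZ()'s first positional argument is `includeRaw`.
async function exportWacz (capture, format = 'wacz') {
  if (format !== 'wacz' && format !== 'wacz-with-raw') {
    throw new TypeError(`Unsupported format: ${format}`)
  }
  const includeRaw = format === 'wacz-with-raw'
  const wacz = await capture.toWACZ(includeRaw)
  return Buffer.from(wacz)
}

// Usage, given a `capture` returned by Scoop.capture():
// const withRaw = await exportWacz(capture, 'wacz-with-raw')
```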
In certain cases, running Scoop in "headful" mode might yield better results.
Passing `--headless false` to the CLI or `{ headless: false }` to the library will instruct Scoop to run Chromium in headful mode.
Simulating a graphical output is necessary when running Scoop in headful mode on a server. The following command can be used for that purpose:
xvfb-run --auto-servernum -- scoop "https://lil.law.harvard.edu" --headless false
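A script that runs both on desktops and on servers may want to add the `xvfb-run` wrapper only when needed. The sketch below takes one possible approach: it assumes that on Linux an empty or unset `DISPLAY` variable means no graphical output is available, which will not be true of every setup. `headfulCommand` is a hypothetical helper.

```javascript
// Hypothetical helper (not part of Scoop): builds the command line for a
// headful capture, wrapping it in xvfb-run when no display seems available.
function headfulCommand (url, { platform = process.platform, display = process.env.DISPLAY } = {}) {
  const command = ['scoop', url, '--headless', 'false']
  // Assumption: on Linux, an empty or unset DISPLAY means no graphical output
  if (platform === 'linux' && !display) {
    return ['xvfb-run', '--auto-servernum', '--', ...command]
  }
  return command
}

console.log(headfulCommand('https://lil.law.harvard.edu').join(' '))
```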
This codebase uses the Standard JS coding style.
- `npm run lint` can be used to check formatting.
- `npm run lint-autofix` can be used to check formatting and automatically edit files accordingly when possible.
- Most IDEs can be configured to automatically check and enforce this coding style.
JSDoc is used for both documentation and loose type checking purposes on this project.
This project uses Node.js' built-in test runner.
npm run test
The following environment variables allow for testing features requiring access to a third-party server.
These are optional, and can be added to a local `.env` file which will be automatically interpreted by the test runner.
Name | Description |
---|---|
`TEST_WACZ_SIGNING_URL` | URL of an authsign-compatible endpoint for signing WACZ files. To run such an endpoint locally, use `npm run dev-signer`, which will overwrite `.env` and set this variable to `http://localhost:5000/sign`; see .services/signer. |
`TEST_WACZ_SIGNING_TOKEN` | If required by the server at `TEST_WACZ_SIGNING_URL`, an authentication token. |
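A test that depends on these variables could read them as sketched below and skip itself when no signing endpoint is configured. `signingConfigFromEnv` is a hypothetical helper, not taken from Scoop's test suite; the object it returns mirrors the `{ url, token }` shape passed to `toWACZ()` in the signing example of this README.

```javascript
// Hypothetical helper (not from Scoop's test suite): reads the optional
// signing variables and returns null when no endpoint is configured.
function signingConfigFromEnv (env = process.env) {
  const url = env.TEST_WACZ_SIGNING_URL
  if (!url) return null
  const config = { url }
  // The token is only needed if the signing server requires one
  if (env.TEST_WACZ_SIGNING_TOKEN) config.token = env.TEST_WACZ_SIGNING_TOKEN
  return config
}

const signingConfig = signingConfigFromEnv()
// With Node.js' built-in test runner, a test could then be declared as:
// test('signs a WACZ', { skip: !signingConfig }, async () => { /* ... */ })
```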
# Runs test suite
npm run test
# Runs linter
npm run lint
# Runs linter and attempts to automatically fix issues
npm run lint-autofix
# Runs a local instance of wacz-signer for test purposes (see "Testing" section)
npm run dev-signer
# Step-by-step NPM publishing helper
npm run publish-util