URL search: CDX server API

Fernando-Melo edited this page Oct 31, 2017 · 51 revisions

API Reference

The CDX-server API provides programmatic access to list, sort, and filter the preserved captures of a given URL.

url

The only required parameter of the CDX-server API is url. For example, http://arquivo.pt/wayback/-cdx?url=publico.pt returns a list of captures for 'publico.pt'.
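
As a minimal sketch of how such a query URL can be assembled, the snippet below uses Python's standard urllib. The endpoint is the one shown in the example above; the helper name build_cdx_url is hypothetical and not part of the API.

```python
from urllib.parse import urlencode

# Endpoint taken from the examples on this page.
CDX_ENDPOINT = "http://arquivo.pt/wayback/-cdx"

def build_cdx_url(url, **params):
    """Return a CDX query URL; 'url' is the only required parameter.

    Extra keyword arguments (for example limit=5) are appended as
    additional query parameters. List values are repeated, which suits
    parameters that may be given several times.
    """
    query = {"url": url}
    query.update(params)
    return CDX_ENDPOINT + "?" + urlencode(query, doseq=True)

print(build_cdx_url("publico.pt"))
# -> http://arquivo.pt/wayback/-cdx?url=publico.pt
```

Note that urlencode percent-encodes reserved characters, so a value such as http://www.sapo.pt/ appears escaped in the resulting query string.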

from / to

Setting from= or to= will restrict the results to the given date/time range (inclusive).

Timestamps may have up to 14 digits (YYYYMMDDhhmmss). Shorter values are padded to the lower bound for from= and to the upper bound for to=.

For example, http://arquivo.pt/wayback/-cdx?url=sapo.pt&from=2014&to=2014 will return results of sapo.pt that have a timestamp between 20140101000000 and 20141231235959.
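
The padding rule can be illustrated with a short client-side sketch. The function names pad_lower and pad_upper are hypothetical, and clamping the day to the real last day of the month is an assumption; the server's exact behaviour for partial months is not described on this page.

```python
import calendar

# Assumption: partial timestamps stop at a field boundary
# (year, month, day, hour, minute or second).
VALID_LENGTHS = (4, 6, 8, 10, 12, 14)

def _check(ts):
    if not ts.isdigit() or len(ts) not in VALID_LENGTHS:
        raise ValueError(f"unsupported timestamp: {ts!r}")

def pad_lower(ts):
    """Pad a partial timestamp to the earliest instant it covers."""
    _check(ts)
    return ts + "00000101000000"[len(ts):]

def pad_upper(ts):
    """Pad a partial timestamp to the latest instant it covers."""
    _check(ts)
    padded = ts + "99991231235959"[len(ts):]
    if len(ts) < 8:
        # Day was not given: replace the placeholder 31 with the
        # actual last day of that month.
        year, month = int(padded[:4]), int(padded[4:6])
        last_day = calendar.monthrange(year, month)[1]
        padded = padded[:6] + f"{last_day:02d}" + padded[8:]
    return padded

print(pad_lower("2014"), pad_upper("2014"))
# -> 20140101000000 20141231235959
```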

matchType

The cdx-server supports the following matchType values:

exact -- default setting, will return captures that match the url exactly

prefix -- return captures that begin with a specified path, eg: http://sapo.pt/noticias/*

host -- return captures that belong to the given host (the path segment is ignored if specified)

domain -- return captures for the current host and all subdomains, eg. *.example.com

Instead of specifying a separate matchType parameter, wildcards may be used in the url itself: a trailing * (as in the prefix example) or a leading *. (as in the domain example).
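
One plausible reading of the wildcard forms in the prefix and domain examples above is sketched below. The mapping is an assumption inferred from those two examples, not confirmed server behaviour, and split_wildcard is a hypothetical helper.

```python
def split_wildcard(url):
    """Map a wildcard url to an assumed (url, matchType) pair."""
    if url.startswith("*."):
        # Leading "*." as in *.example.com: host plus all subdomains.
        return url[2:], "domain"
    if url.endswith("*"):
        # Trailing "*" as in http://sapo.pt/noticias/*: path prefix.
        return url[:-1], "prefix"
    # No wildcard: the default setting.
    return url, "exact"

print(split_wildcard("http://sapo.pt/noticias/*"))
print(split_wildcard("*.example.com"))
```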

limit

Setting limit= will limit the number of index lines returned. The limit must be a positive integer. If no limit is provided, all matching lines are returned, which may be slow. For example, http://arquivo.pt/wayback/-cdx?url=http://www.sapo.pt/noticias/&matchType=prefix&limit=1500 will show the first 1500 results.

sort

The sort param can be set as follows:

reverse -- will sort the matching captures in reverse order. It is only recommended for exact queries, as reversing a large match may be very slow.

closest -- setting this option also requires setting closest=<timestamp>, where <timestamp> is a specific timestamp to sort by. This option will only work correctly for exact queries and is useful for sorting captures based on time distance from a certain timestamp.
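
To make "time distance" concrete, here is a client-side sketch of the ordering that sort=closest is described as producing. It only illustrates the idea and is not the server's implementation; the capture timestamps are made up.

```python
from datetime import datetime

def parse_ts(ts):
    """Parse a full 14-digit timestamp (YYYYMMDDhhmmss)."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")

def sort_by_closest(timestamps, closest):
    """Order timestamps by absolute distance from 'closest'."""
    target = parse_ts(closest)
    return sorted(timestamps, key=lambda ts: abs(parse_ts(ts) - target))

# Invented capture timestamps, for illustration only.
captures = ["20100101000000", "20140615120000", "20170301000000"]
print(sort_by_closest(captures, "20140101000000"))
# -> ['20140615120000', '20170301000000', '20100101000000']
```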

output (JSON output)

Setting output=json will return each line as a proper JSON dictionary. (The default format is text, which returns the native format of the underlying CDX index and may not be consistent.) Using output=json is recommended for extensive analysis.
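
Since each line is a separate JSON dictionary, a response can be parsed line by line. The sketch below assumes that layout; the sample records are invented for illustration and are not real server output.

```python
import json

# Invented sample in the one-dictionary-per-line layout described above.
SAMPLE = (
    '{"urlkey": "pt,publico)/", "timestamp": "20140101000000", '
    '"url": "http://publico.pt/", "mime": "text/html", "status": "200"}\n'
    '{"urlkey": "pt,publico)/", "timestamp": "20150101000000", '
    '"url": "http://publico.pt/", "mime": "text/html", "status": "301"}\n'
)

def parse_cdx_json(text):
    """Parse one JSON dictionary per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

records = parse_cdx_json(SAMPLE)
for record in records:
    print(record["timestamp"], record["status"])
```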

filter

The filter param can be specified multiple times to filter by specific fields in the cdx index. Field names correspond to the fields returned in the JSON output. Filters can be specified as follows:

  • ...coll-cdx?url=example.com/*&filter==mime:text/html&filter=!=status:200 Return captures from example.com/* where mime is text/html and the HTTP status is not 200.
  • ...coll-cdx?url=example.com&matchType=domain&filter=~url:.*\.php$ Return captures from the domain example.com whose URL ends in .php.

The ! modifier before =status indicates negation. The = and ~ modifiers are optional and specify exact and regular-expression matches, respectively. The default (no modifier) tests whether the query string is contained in the field value. Negation and the exact/regex modifiers may be combined, e.g. filter=!~text/.*

The formal syntax, as used in the examples on this page, is: filter=[!][=|~]<fieldname>:<expression> with the following modifiers:

| modifier(s) | example | description |
|---|---|---|
| (no modifier) | filter=mime:html | field "mime" contains string "html" |
| = | filter==mime:text/html | exact match: field "mime" is "text/html" |
| ~ | filter=~mime:.*/html$ | regex match: expression matches beginning of field "mime" (cf. re.match) |
| ! | filter=!mime:html | field "mime" does not contain string "html" |
| != | filter=!=mime:text/html | field "mime" is not "text/html" |
| !~ | filter=!~mime:.*/html | expression does not match beginning of field "mime" |
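
The modifier semantics listed above can be mimicked client-side, which may help when checking a filter before sending it. This is a sketch under stated assumptions, not the server's code: it assumes the modifiers precede the field name (as in the examples) and that a field name is always present. parse_filter and matches are hypothetical names, and the record is invented.

```python
import re

def parse_filter(expr):
    """Split the value of filter= into (negate, mode, field, pattern)."""
    negate = expr.startswith("!")
    if negate:
        expr = expr[1:]
    mode = "contains"
    if expr.startswith("="):
        mode, expr = "exact", expr[1:]
    elif expr.startswith("~"):
        mode, expr = "regex", expr[1:]
    # Split on the first colon only, so patterns may contain colons.
    field, _, pattern = expr.partition(":")
    return negate, mode, field, pattern

def matches(record, expr):
    """Return True if the record passes the filter expression."""
    negate, mode, field, pattern = parse_filter(expr)
    value = record.get(field, "")
    if mode == "exact":
        hit = value == pattern
    elif mode == "regex":
        # re.match anchors at the beginning of the field value.
        hit = re.match(pattern, value) is not None
    else:
        hit = pattern in value
    return hit != negate

# Invented record, for illustration only.
record = {"mime": "text/html", "status": "200",
          "url": "http://example.com/index.php"}
print(matches(record, "=mime:text/html"), matches(record, "!=status:200"))
# -> True False
```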

fl

The fl param can be used to specify which fields to include in the output. The standard available fields are usually: urlkey, timestamp, url, mime, status, digest, length, offset, filename

If a minimal cdx index is used, the mime and status fields may not be available. Additional fields may be introduced in the future, especially in the CDX JSON format.

Fields can be comma delimited, for example fl=urlkey,timestamp will include only the urlkey and timestamp fields in the output.
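
The effect of fl can be pictured as a projection over each record. The sketch below applies the same selection client-side to an invented record; select_fields is a hypothetical helper, and silently skipping missing fields is an assumption made to cover minimal indexes that lack mime and status.

```python
def select_fields(record, fl):
    """Keep only the comma-delimited fields named in fl."""
    names = fl.split(",")
    # Skip names the record does not carry (e.g. a minimal cdx index).
    return {name: record[name] for name in names if name in record}

# Invented record, for illustration only.
record = {"urlkey": "pt,publico)/", "timestamp": "20140101000000",
          "url": "http://publico.pt/"}
print(select_fields(record, "urlkey,timestamp"))
# -> {'urlkey': 'pt,publico)/', 'timestamp': '20140101000000'}
```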