Daniel Gomes edited this page Jan 24, 2025 · 84 revisions

APIs specific to Arquivo.pt that enable the full exploration of its functions

APIs based on international standards to enable interoperability among web archives and code reuse

API usage limits

Each API has the following maximum usage limits. A client that exceeds them will receive an "HTTP 429 Too Many Requests" error and must decrease its access rate:

IP addresses that exceed these limits will be permanently blocked from using Arquivo.pt.
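One way a client might react to an "HTTP 429" response is to pause and retry with a growing delay. The sketch below is only an illustration: `fetch_with_backoff`, its parameters, and the delay values are hypothetical, and the `fetch` callable stands in for whatever HTTP client you use. It is not an official Arquivo.pt client, and it does not replace staying within the published limits.

```python
import time
from typing import Callable, Tuple


def fetch_with_backoff(
    fetch: Callable[[str], Tuple[int, bytes]],
    url: str,
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, bytes]:
    """Call fetch(url) and retry with a doubling delay while it returns 429.

    `fetch` is a caller-supplied function returning (status_code, body).
    Any status other than 429 is returned to the caller unchanged.
    """
    delay = base_delay
    for attempt in range(max_retries + 1):
        status, body = fetch(url)
        if status != 429:
            return status, body
        if attempt == max_retries:
            break
        # Slow down before the next attempt, doubling the pause each time.
        sleep(delay)
        delay *= 2
    raise RuntimeError(f"still rate-limited after {max_retries} retries: {url}")
```

Passing `fetch` and `sleep` as arguments keeps the retry logic independent of any particular HTTP library and lets it be exercised without network access.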

Learn more about APIs

Bulk download of web-archived resources

If you need to download a large number of web-archived resources, such as all the URLs archived from a large website over time, we suggest the following methodology:

  1. Analyse the Arquivo.pt collections and choose those most likely to contain relevant web-archived data for your use case. If in doubt, contact us.

  2. Download the CDXJ index files (what is CDXJ?) of the Arquivo.pt collections you selected to process. To find them, look up "column A: Collection ID" and the corresponding CDXJ index files in "column H: Collection CDXJ File".

  3. Create a list of selected URLs to be downloaded, extracted from the CDXJ index files. For example, the following Linux grep command selects the HTML pages that were successfully archived (status: 200, mime: text/html) and counts them with wc:

> grep '"status": "200"' EAWP5.cdxj | grep '"mime": "text/html' | wc

  4. Download the web-archived resources for the list of selected URLs from Arquivo.pt, either by using the above APIs or by building links that access the web-archived resources directly. These links are available under Technical details in the Options top-right menu when accessing a web-archived page. For instance, for the URL http://publico.pt/ with timestamp 20120201160355 extracted from the CDXJ index file, build the following links to download the:
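As a rough illustration of the URL selection step in the methodology above, the same filter as the grep example can be written in Python so that the timestamp and URL of each matching capture are kept for later download. This sketch assumes each CDXJ line has the layout `<key> <timestamp> <JSON object>` and that the JSON object carries `url`, `mime`, and `status` fields; the function name `select_html_captures` is hypothetical.

```python
import json
from typing import Iterable, Iterator, Tuple


def select_html_captures(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (timestamp, url) for captures with status 200 and an HTML MIME type."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumed layout: "<key> <timestamp> <JSON object>"; the JSON part
        # may itself contain spaces, so split at most twice.
        parts = line.split(" ", 2)
        if len(parts) != 3:
            continue
        _key, timestamp, payload = parts
        try:
            fields = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip lines whose third field is not valid JSON
        if not isinstance(fields, dict):
            continue
        url = fields.get("url")
        mime = str(fields.get("mime", ""))
        # Same criteria as the grep example: status 200, MIME starting with text/html.
        if url and fields.get("status") == "200" and mime.startswith("text/html"):
            yield timestamp, url
```

Because the function accepts any iterable of lines, it can be fed an open file object for an index such as EAWP5.cdxj, which avoids loading a large index into memory at once.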

Endpoints usage limits

If a client exceeds the following maximum usage limits, it will receive an "HTTP 429 Too Many Requests" error and must decrease its access rate:

IP addresses that exceed these limits will be permanently blocked from using Arquivo.pt.
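For bulk downloads, it may be simpler to pace requests in advance than to react to errors. The following minimal sketch spaces calls at least a fixed interval apart. The `Throttle` class and the `max_per_minute` parameter are hypothetical; set the value from the limits that apply to the endpoint you are using, since no particular figure is assumed here.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Space successive calls at least 60 / max_per_minute seconds apart."""

    def __init__(
        self,
        max_per_minute: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if max_per_minute <= 0:
            raise ValueError("max_per_minute must be positive")
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: Optional[float] = None

    def wait(self) -> None:
        """Block until the next request is allowed, then reserve the next slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
```

Calling `wait()` immediately before each download keeps the request rate under the configured ceiling regardless of how quickly the individual responses arrive.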

Learn more about bulk download

Contact us

If you have any trouble using our APIs, please contact us so that we can try to help you.

Short link to this page: arquivo.pt/api