APIs
Each API has the following usage limits (thresholds). If you start receiving the HTTP response status 429 (Too Many Requests), check whether you are exceeding these limits:
- Arquivo.pt API (Full-text & URL search): 250 requests per minute
- Image Search API v1.1 (beta version): 400 requests per minute
- CDX-server API (URL search): 250 requests per minute
- Memento API (URL search): 400 requests per minute
- Training module on Automatic processing of information preserved from the Web (module C)
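One way to stay under these thresholds is to pace requests on the client side and to wait before retrying after a 429 response. The following is a minimal sketch, not an official Arquivo.pt client: the `Throttle` class, the `backoff_delay` helper, the dictionary keys and the backoff parameters are all hypothetical names and values chosen for illustration. Only the per-minute limits come from the list above.

```python
import time

# Requests per minute, copied from the limits listed above.
# The dictionary keys are illustrative labels, not official API names.
LIMITS_PER_MINUTE = {
    "fulltext_and_url_search": 250,
    "image_search": 400,
    "cdx_server": 250,
    "memento": 400,
}


class Throttle:
    """Spaces calls evenly so that a per-minute limit is not exceeded."""

    def __init__(self, requests_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / requests_per_minute  # seconds between requests
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self):
        """Block until the next request may be sent."""
        now = self._clock()
        if now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.interval


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` after a 429 response.

    Exponential backoff with a cap; the base and cap values are assumptions.
    """
    return min(cap, base * 2 ** attempt)
```

A caller would create one `Throttle` per API, for example `Throttle(LIMITS_PER_MINUTE["cdx_server"])`, call `wait()` before each request, and sleep for `backoff_delay(attempt)` seconds whenever a 429 is returned.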
If you need to download a large amount of web-archived resources, such as all the URLs archived from a large website over time, we suggest the following methodology:
- Download the CDXJ index files (what is CDXJ?) of the Arquivo.pt collections you selected to process. To find them, look up "column A: Collection ID" and the corresponding CDXJ index files in "column H: Collection CDXJ File";
- Create a list of selected URLs to be downloaded, extracted from the CDXJ index files (e.g. using the Linux grep command);
- Download the web-archived resources for the list of selected URLs from Arquivo.pt by using the above APIs or by building links that directly access the web-archived resources. These links are available under Technical details in the Options top-right menu when accessing a web-archived page. For instance, for the URL http://publico.pt/ with timestamp 20120201160355 extracted from the CDXJ index file, build the following links to download the:
  - original file of the web-archived page (loses replay quality because the original internal links are not rewritten to reference web-archived images or stylesheets): https://arquivo.pt/noFrame/replay/20120201160355id_/http://publico.pt/
  - web-archived page without the Arquivo.pt UI frame (internal links are rewritten to reference web-archived resources): https://arquivo.pt/noFrame/replay/20120201160355/http://publico.pt/
These download endpoints have the following usage limits:
- original file of the web-archived page, or web-archived page without the Arquivo.pt UI frame (endpoint https://arquivo.pt/noFrame/replay/): 4437 requests per minute
- text extracted from the web-archived page (endpoint https://arquivo.pt/textextracted): 250 requests per minute
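The second and third steps of the methodology above (selecting URLs from a CDXJ index and building direct links) can be sketched as follows. This is an illustration under stated assumptions: it assumes each CDXJ line has the form `<key> <timestamp> <JSON object>` and that the JSON object carries a `url` field, which should be checked against the actual index files. The function names and the sample line are hypothetical; only the two link patterns and the example URL and timestamp come from the text above.

```python
import json

# Link prefix taken from the examples above.
REPLAY_BASE = "https://arquivo.pt/noFrame/replay/"


def parse_cdxj_line(line):
    """Return (timestamp, url) from one CDXJ line.

    Assumes the layout '<key> <timestamp> <JSON>' with a 'url' JSON field.
    """
    _key, timestamp, payload = line.strip().split(" ", 2)
    fields = json.loads(payload)
    return timestamp, fields["url"]


def build_links(timestamp, url):
    """Build the two direct-access links described above."""
    return {
        # Original file: internal links are not rewritten.
        "original": f"{REPLAY_BASE}{timestamp}id_/{url}",
        # Page without the Arquivo.pt UI frame: internal links are rewritten.
        "rewritten": f"{REPLAY_BASE}{timestamp}/{url}",
    }


def select_links(lines, needle):
    """Yield links for every CDXJ line containing `needle` (a grep-like filter)."""
    for line in lines:
        if not line.strip() or needle not in line:
            continue
        timestamp, url = parse_cdxj_line(line)
        yield build_links(timestamp, url)


# Hypothetical sample line built from the example URL and timestamp above.
SAMPLE = 'pt,publico)/ 20120201160355 {"url": "http://publico.pt/", "mime": "text/html", "status": "200"}'

for links in select_links([SAMPLE], "publico"):
    print(links["original"])
    print(links["rewritten"])
```

In practice `lines` would be an open CDXJ file read line by line, so the index never has to fit in memory, and the resulting links would be fetched at a rate below the limits listed above.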
If you have any trouble using our APIs, please contact us so that we can try to help you.
Short link to this page: arquivo.pt/api