APIs
Each API has the following usage limits (thresholds). If you start receiving the HTTP response status 429 Too Many Requests, check whether you are exceeding them:
- Arquivo.pt API (Full-text & URL search): 250 requests per minute
- Image Search API v1.1: 400 requests per minute
- CDX-server API (URL search): 250 requests per minute
- Memento API (URL search): 400 requests per minute
- Training module on Automatic processing of information preserved from the Web (module C)
- Tutorial in Python about how to explore the Arquivo.pt API
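One way a client could stay under the limits listed above is to pace its requests and back off when it sees a 429. The following is a minimal sketch, not an official client: the dictionary keys, the function names, and the retry parameters (`max_retries`, `base_delay`) are all assumptions made for illustration; only the per-minute numbers come from the list above.

```python
import time
import urllib.error
import urllib.request

# Limits copied from the list above (requests per minute).
# The dictionary keys are hypothetical labels, not official API names.
LIMITS_PER_MINUTE = {
    "textsearch": 250,
    "imagesearch": 400,
    "cdx": 250,
    "memento": 400,
}


def min_interval(requests_per_minute):
    """Seconds to wait between requests to stay at or under the limit."""
    return 60.0 / requests_per_minute


def http_get(url):
    """Return (status, body) for a GET request; HTTP errors are returned, not raised."""
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        return err.code, b""


def get_with_backoff(url, get=http_get, sleep=time.sleep,
                     max_retries=5, base_delay=1.0):
    """Retry on HTTP 429, doubling the wait before each new attempt.

    max_retries and base_delay are illustrative defaults, not values
    recommended by Arquivo.pt.
    """
    delay = base_delay
    for attempt in range(max_retries + 1):
        status, body = get(url)
        if status != 429:
            return status, body
        if attempt < max_retries:
            sleep(delay)
            delay *= 2
    # Still rate-limited after all retries: hand the 429 back to the caller.
    return status, body
```

A caller would typically sleep for `min_interval(LIMITS_PER_MINUTE["cdx"])` seconds between consecutive calls to the same API and use `get_with_backoff` for each request, so that an occasional 429 slows the client down instead of aborting the run.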
If you need to download a large amount of web-archived resources, such as all the URLs archived from a large website over time, we suggest the following methodology:
- Analyse the Arquivo.pt collections so that you can choose those most likely to contain interesting web-archived data for your use case. If you have any doubt, contact us.
- Download the CDXJ index files (what is CDXJ?) of the Arquivo.pt collections you selected to process. To find them, look up "column A: Collection ID" and the corresponding CDXJ index files in "column H: Collection CDXJ File".
- Create a list of selected URLs to be downloaded, extracted from the CDXJ index files. For example, use the Linux grep command to get the HTML pages that were successfully archived (status: 200, mime: text/html):
> cat EAWP5.cdxj | grep '"status": "200"' | grep '"mime": "text/html' | wc
- Download the web-archived resources for the list of selected URLs from Arquivo.pt, either by using the APIs above or by building links that access the web-archived resources directly. These links are listed under "Technical details" in the Options menu at the top right of a web-archived page. For instance, for the URL http://publico.pt/ with timestamp 20120201160355 extracted from the CDXJ index file, build the following links to download the:
- original file of the web-archived page (loses replay quality because the original internal links are not rewritten to reference web-archived images or stylesheets); note the suffix `id_` appended after the timestamp: https://arquivo.pt/noFrame/replay/20120201160355id_/http://publico.pt/
- web-archived page without the Arquivo.pt UI frame (internal links are rewritten to reference web-archived resources): https://arquivo.pt/noFrame/replay/20120201160355/http://publico.pt/
- Usage limit for both kinds of link, the original file and the page without the Arquivo.pt UI frame (endpoint https://arquivo.pt/noFrame/replay/): 4437 requests/minute.
- If the client exceeds this limit, it will receive an error "HTTP 429 Too many requests" and should decrease its download rate.
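The filtering and link-building steps above can also be sketched in Python. This is only an illustration under stated assumptions: it assumes each CDXJ line has the layout `urlkey timestamp {json}` and that the JSON part carries a `url` field alongside the `status` and `mime` fields used in the grep example. The two sample lines are made up for the demonstration (only the first mirrors the publico.pt example), and the function names are hypothetical.

```python
import json

REPLAY_BASE = "https://arquivo.pt/noFrame/replay/"


def parse_cdxj_line(line):
    """Split one CDXJ line into (timestamp, fields).

    Assumes the layout 'urlkey timestamp {json}'.
    """
    _urlkey, timestamp, payload = line.strip().split(" ", 2)
    return timestamp, json.loads(payload)


def select_html_ok(lines):
    """Yield (timestamp, url) for HTML pages archived with status 200."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        timestamp, fields = parse_cdxj_line(line)
        is_ok = fields.get("status") == "200"
        is_html = fields.get("mime", "").startswith("text/html")
        # The "url" key is an assumption about the index format.
        if is_ok and is_html and "url" in fields:
            yield timestamp, fields["url"]


def replay_links(timestamp, url):
    """Build both download links described in the list above."""
    return {
        # Original file: note the id_ suffix after the timestamp.
        "original": f"{REPLAY_BASE}{timestamp}id_/{url}",
        # Page without the UI frame, with internal links rewritten.
        "rewritten": f"{REPLAY_BASE}{timestamp}/{url}",
    }


# Made-up sample lines for illustration only.
sample = [
    'pt,publico)/ 20120201160355 {"url": "http://publico.pt/", '
    '"mime": "text/html", "status": "200"}',
    'pt,publico)/missing 20120201160400 {"url": "http://publico.pt/missing", '
    '"mime": "text/html", "status": "404"}',
]

for timestamp, url in select_html_ok(sample):
    print(replay_links(timestamp, url)["original"])
```

For a real index file, `sample` would be replaced by an open file handle, for example `open("EAWP5.cdxj", encoding="utf-8")`, and the resulting links would be fetched at a rate below the limit stated above.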
If you have any trouble using our APIs, please contact us so that we can try to help you.
Short link to this page: arquivo.pt/api