APIs
Each API has a maximum usage limit, listed below. A client that exceeds a limit receives an "HTTP 429 Too Many Requests" error and must decrease its request rate:
- Arquivo.pt API (Full-text & URL search): 250 requests/minute from the same IP address
- Image Search API v1.1: 400 requests/minute from the same IP address
- CDX-server API (URL search): 250 requests/minute from the same IP address
- Memento API (URL search): 400 requests/minute from the same IP address
IP addresses that exceed these limits will be permanently blocked from using Arquivo.pt.
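A client can react to an HTTP 429 response by pausing and retrying more slowly. The following is a minimal sketch, not an official Arquivo.pt client: `fetch_with_backoff`, its parameters and the delay values are hypothetical, and `fetch` stands for any function of yours that performs one request and returns the status code and body.

```python
import time
from typing import Callable, Tuple


def fetch_with_backoff(
    fetch: Callable[[], Tuple[int, bytes]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, bytes]:
    """Call fetch() until it stops returning HTTP 429.

    The wait doubles after every 429 (1 s, 2 s, 4 s, ... by default).
    Any other status code is returned to the caller unchanged.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        status, body = fetch()
        if status != 429:
            return status, body
        if attempt < max_attempts:
            sleep(delay)  # back off before the next attempt
            delay *= 2
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```

Because exceeding the limits can lead to a block, backing off after a 429 is only a safety net; keeping the request rate below the documented limit in the first place is the safer design.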
- Training module on Automatic processing of information preserved from the Web (module C)
- Tutorial in Python about how to explore the Arquivo.pt API
If you need to download a large amount of web-archived resources, such as all the URLs archived from a large website over time, we suggest the following methodology:
- Analyse the Arquivo.pt collections so that you may choose those which may contain the most interesting web-archived data for your use case. If you have any doubt, contact us.
- Download the CDXJ index files (what is CDXJ?) of the Arquivo.pt collections you selected to process. For this purpose, check "column A: Collection ID" and the corresponding CDXJ index files in "column H: Collection CDXJ File".
- Create a list of selected URLs to be downloaded, extracted from the CDXJ index files. For example, use the Linux grep command to count the HTML pages that were successfully archived (status:200, mime:text/html):
> cat EAWP5.cdxj | grep '"status": "200"' | grep '"mime": "text/html' | wc -l
- Download the web-archived resources for the list of selected URLs from Arquivo.pt by using the above APIs or by building links that directly access the web-archived resources. These links are shown under "Technical details" in the "Options" menu at the top right of a web-archived page. For instance, for the URL http://publico.pt/ with timestamp 20120201160355 extracted from the CDXJ index file, build the following links to download the:
- original file of the web-archived page (loses replay quality because the original internal links are not rewritten to reference web-archived images or stylesheets); notice the suffix "id_" appended after the timestamp: https://arquivo.pt/noFrame/replay/20120201160355id_/http://publico.pt/
- web-archived page without the Arquivo.pt UI frame (internal links are rewritten to reference web-archived resources): https://arquivo.pt/noFrame/replay/20120201160355/http://publico.pt/
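The selection and link-building steps above can be sketched in Python. This is an illustration under stated assumptions, not the official tooling: it assumes each CDXJ line has the layout `<urlkey> <timestamp> <JSON object>` and that the JSON object carries a "url" field next to the "status" and "mime" fields used in the grep example. The function names `select_html_captures` and `build_replay_links` are hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator, Tuple

REPLAY_BASE = "https://arquivo.pt/noFrame/replay"


def select_html_captures(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (timestamp, url) for captures with status 200 and mime text/html."""
    for line in lines:
        # Assumed CDXJ layout: "<urlkey> <timestamp> <JSON object>"
        parts = line.strip().split(" ", 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        _urlkey, timestamp, raw_json = parts
        try:
            fields = json.loads(raw_json)
        except json.JSONDecodeError:
            continue
        if not isinstance(fields, dict):
            continue
        is_ok = fields.get("status") == "200"
        is_html = str(fields.get("mime", "")).startswith("text/html")
        url = fields.get("url")  # assumed field name
        if is_ok and is_html and url:
            yield timestamp, url


def build_replay_links(timestamp: str, url: str) -> Dict[str, str]:
    """Build the two download links described above for one capture."""
    if not timestamp.isdigit():
        raise ValueError(f"timestamp must contain only digits: {timestamp!r}")
    return {
        # "id_" suffix: original file, internal links not rewritten
        "original": f"{REPLAY_BASE}/{timestamp}id_/{url}",
        # no suffix: page without the UI frame, internal links rewritten
        "rewritten": f"{REPLAY_BASE}/{timestamp}/{url}",
    }
```

To process a downloaded index, open it with `open("EAWP5.cdxj", encoding="utf-8")`, pass the file object to `select_html_captures`, and call `build_replay_links` on each pair it yields.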
If a client exceeds the following maximum usage limits, it will receive an "HTTP 429 Too Many Requests" error and must decrease its request rate:
- https://arquivo.pt/wayback: 200 requests/minute from the same IP address
- https://arquivo.pt/noFrame/replay: 200 requests/minute from the same IP address
- https://arquivo.pt/noFrame/patching/record: 200 requests/minute from the same IP address
- https://arquivo.pt/save/now/record: 200 requests/minute from the same IP address
IP addresses that exceed these limits will be permanently blocked from using Arquivo.pt.
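One way to stay under these limits during a bulk download is to space requests evenly on the client side. The sketch below is a hypothetical helper, not part of any Arquivo.pt library; the `Throttle` class and the chosen rate of 150 requests/minute (an arbitrary margin below the 200 requests/minute limit) are assumptions.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum interval between calls to wait()."""

    def __init__(
        self,
        max_per_minute: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.interval = 60.0 / max_per_minute  # seconds between requests
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: Optional[float] = None

    def wait(self) -> None:
        """Block until the next request is allowed, then reserve a slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.interval


# Hypothetical usage: call throttle.wait() before every request.
throttle = Throttle(max_per_minute=150)
```

Even spacing is stricter than the limit requires, since it forbids short bursts, but it is simple to reason about. Note that the limit is counted per IP address, so several processes running behind the same address need to share one budget.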
If you have any trouble using our APIs, please contact us so that we can try to help you.
Short link to this page: arquivo.pt/api