CDX Server requirements
As part of making the CDX-Server the default index engine for OpenWayback, we need to clean up and formally define the API for the CDX-Server. This document is meant as a workspace for defining those APIs.
The CDX-Server API, as it is today, is characterized by a relatively close link to how the underlying CDX format is implemented. Functionality varies depending on whether you are using traditional flat CDX files or compressed zipnum clusters. One of the benefits of having a CDX Server is that it separates the API from the underlying implementation. This would make it relatively easy to implement indexes based on other technologies in the future. As a consequence, we should avoid implementing features just because they are easy to do with a certain format if there is no real need for them. The same feature might be hard to implement on other technologies.
The API should also try to avoid giving the user conflicting options. For example, in the current API it is possible to indicate match type both with a parameter and with a wildcard. It is then possible to set matchType=prefix and at the same time use a wildcard indicating matchType=domain.
The following is a list of use-cases seen from the perspective of a user. Many of the use-cases are described as expectations of the OpenWayback GUI, but they are meant to help in understanding the CDX-Server's role. For each use-case we need to understand what functionality the CDX-Server is required to support. CDX-Server functionality with no supporting use-case should not be implemented in OpenWayback 3.0.0.
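One way to avoid the conflict is to resolve the match type in a single place and reject contradictory input. The following is a minimal sketch, not the existing implementation. The function name `resolve_match_type` is hypothetical, and the wildcard conventions (a leading `*.` meaning domain, a trailing `*` meaning prefix) are assumptions made for illustration.

```python
def resolve_match_type(url, match_type=None):
    """Return (url_without_wildcard, match_type).

    Assumed conventions (illustrative only):
      "*.example.com"    implies matchType=domain
      "example.com/foo*" implies matchType=prefix
    """
    implied = None
    if url.startswith("*."):
        url, implied = url[2:], "domain"
    elif url.endswith("*"):
        url, implied = url[:-1], "prefix"

    # Refuse requests where the parameter and the wildcard disagree.
    if match_type and implied and match_type != implied:
        raise ValueError(
            "matchType=%s conflicts with wildcard implying %s" % (match_type, implied)
        )
    return url, match_type or implied or "exact"
```

With this approach, a request combining `matchType=prefix` with `*.example.com` fails early instead of silently picking one of the two options.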
This is a work in progress. Edits and comments are highly appreciated.
This case could be a user referencing a document from a thesis. It is important that the capture referenced is exactly the one the user used when writing the thesis. In this case the user should get the capture that exactly matches both the url and timestamp.
The digest also needs to be considered to actually guarantee that the user gets the same version. In addition, you need to know that all embeds are also the same version the user originally requested. Achieving all this might be hard or impossible.
Similar to the above, but it might be allowed to return a capture close in time if the requested capture is missing, i.e. the requirement for getting the same version is slightly loosened.
The user is looking at a page and wants to follow a link by clicking it. The user then expects to be brought to the closest-in-time capture of the new page.
Similar to above, but user is not involved. This is for loading embedded images and so on.
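The closest-in-time lookup shared by the preceding use-cases can be sketched as follows. This is an illustration under assumptions: timestamps are taken to be 14-digit strings of the form YYYYMMDDhhmmss, and `closest_capture` is a hypothetical helper, not part of any existing API.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"  # assumed 14-digit capture timestamp


def closest_capture(timestamps, requested):
    """Return the capture timestamp nearest in time to `requested`.

    `timestamps` is a non-empty list of 14-digit timestamp strings
    for a single URL.
    """
    target = datetime.strptime(requested, TIMESTAMP_FORMAT)
    return min(
        timestamps,
        key=lambda ts: abs(datetime.strptime(ts, TIMESTAMP_FORMAT) - target),
    )
```

An exact match has a distance of zero, so the same function also covers the strict case whenever the requested capture exists.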
5. User requests/searches for an exact url without any timestamp, expecting to get a summary of captures for the url over time
The summary of captures might be presented in different ways, for example a list or a calendar.
7. User searches with a truncated path expecting the results to show up as matching paths regardless of time
8. User searches with a truncated path expecting the results to show up as matching paths regardless of time and subdomain
Requires the ability to request a date range.
This requires consulting the digest of the captures for a page. This could be done in the CDX-Server if only the captures with a change are needed. Otherwise it is probably best solved by the consumer of the CDX-Server API, for example OpenWayback.
11. User requests/searches for an exact url with a partial timestamp, expecting to get a summary of captures for the url over time
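Returning only the captures with a change amounts to dropping each capture whose digest equals that of the capture before it. A minimal sketch, assuming captures arrive as (timestamp, digest) pairs sorted by timestamp; the function name is hypothetical:

```python
def captures_with_changes(captures):
    """Keep only captures whose digest differs from the previous capture.

    `captures` is an iterable of (timestamp, digest) pairs for one URL,
    sorted by timestamp.
    """
    changed = []
    previous_digest = None
    for timestamp, digest in captures:
        if digest != previous_digest:
            changed.append((timestamp, digest))
            previous_digest = digest
    return changed
```

Only adjacent captures are compared, so a page that changes and later reverts to earlier content is reported again at the point where it reverts.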
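A partial timestamp can be treated as a prefix and expanded into a range. The sketch below assumes 14-digit timestamp strings that are compared lexicographically; the helper names are hypothetical. The padded bounds are not necessarily valid dates, which is acceptable here because they are only used for string comparison.

```python
def timestamp_range(partial):
    """Expand a partial timestamp such as "2014" or "201403" into
    inclusive lower and upper bounds for 14-digit timestamp strings."""
    if not partial.isdigit() or len(partial) > 14:
        raise ValueError("expected at most 14 digits, got %r" % partial)
    # Pad with the smallest and largest digit to cover every completion.
    return partial.ljust(14, "0"), partial.ljust(14, "9")


def captures_in_period(timestamps, partial):
    """Return the timestamps that start with the partial timestamp."""
    lower, upper = timestamp_range(partial)
    return [ts for ts in timestamps if lower <= ts <= upper]
```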
Possibly add a go to "random page" feature. This could potentially require a lot of searching through the CDX-files since they are first sorted on url and then on timestamp. If the requirement is loosened to get random page regardless of time, then it is simple.
Used by the calendar view.
The ability to get big portions of the CDX data for use by processing tools such as MapReduce. The data needs to be returned in chunks. It is preferable if the chunks can be requested in parallel from different processing nodes.
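One possible shape for chunked access is to address chunks by page number, so that each processing node can fetch its pages without coordinating with the others. The sketch below uses an in-memory list in place of a real index; `num_pages`, `fetch_page` and `fetch_all_parallel` are hypothetical names, not existing CDX-Server calls.

```python
from concurrent.futures import ThreadPoolExecutor


def num_pages(total_records, page_size):
    """Number of chunks needed to cover all records."""
    return (total_records + page_size - 1) // page_size


def fetch_page(records, page, page_size):
    """Return one chunk; pages are numbered from 0 and are independent."""
    start = page * page_size
    return records[start:start + page_size]


def fetch_all_parallel(records, page_size, workers=4):
    """Fetch every page concurrently, as separate nodes might do."""
    pages = range(num_pages(len(records), page_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in page order regardless of completion order.
        return list(pool.map(lambda p: fetch_page(records, p, page_size), pages))
```

Because a page depends only on its number and the page size, a client can first ask for the page count and then hand out page numbers to workers.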
The current CDX Server seems to strip away the scheme part of the url (i.e. http://example.com -> example.com) when looking up a url. Is there a need to sometimes be more strict? Let's say you have http://example.com/foo.html and ftp://example.com/foo.html with different content. Is this a real world problem?
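The effect of stripping the scheme (http://, ftp://) can be illustrated with a simplified lookup key. This sketch is not the canonicalization actually used by the CDX Server; `lookup_key` and its `strict` flag are hypothetical and only show how the two example urls collide unless the scheme is retained.

```python
from urllib.parse import urlsplit


def lookup_key(url, strict=False):
    """Build a simplified lookup key for a url.

    With strict=False the scheme is dropped, so http:// and ftp://
    variants of the same host and path share one key.
    With strict=True the scheme is kept as part of the key.
    """
    parts = urlsplit(url)
    key = parts.netloc.lower() + parts.path
    if parts.query:
        key += "?" + parts.query
    if strict:
        key = parts.scheme.lower() + "://" + key
    return key
```

With the default setting, http://example.com/foo.html and ftp://example.com/foo.html both map to example.com/foo.html; with `strict=True` they remain distinct.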
The following are not use cases for a Wayback machine, but for a system which provides access to raw (W)ARC files for export to researchers. Is this an appropriate use for the CDX server?
a. Identify the (W)ARC files which contain a particular domain/subdomain/full url (additionally: specify date range), returning a count of the relevant domains etc for each (W)ARC file (to be used to determine (W)ARC files for export)
b. Given a (W)ARC file identifier, list the URLs it holds which match a set of criteria (domain/subdomain/date etc) (to be used to export (W)ARC file extracts)
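Both export use cases reduce to grouping index records by the (W)ARC file that holds them. A minimal sketch, assuming each index record is available as a dict with hypothetical keys `host`, `url` and `filename`:

```python
from collections import Counter


def in_domain(host, domain):
    """True if host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def count_by_warc(records, domain):
    """Use case a: count matching records per (W)ARC file."""
    counts = Counter()
    for record in records:
        if in_domain(record["host"], domain):
            counts[record["filename"]] += 1
    return counts


def urls_in_warc(records, filename, domain):
    """Use case b: list matching urls held by one (W)ARC file."""
    return [
        record["url"]
        for record in records
        if record["filename"] == filename and in_domain(record["host"], domain)
    ]
```

A date range filter could be added in the same way by checking a timestamp field on each record.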
The Library of Congress has implemented a CDX server and SURT-ordered a CDX file using the following script: grep -v -P '^(dns|filedesc)' final_index.cdx | java -jar ia-hadoop-tools-1.0-SNAPSHOT-jar-with-dependencies.jar cdx-convert > surt.cdx
Copyright © 2005-2022 [tonazol](http://netpreserve.org/). CC-BY. https://github.com/iipc/openwayback.wiki.git