Collection API
This is a proposal for an API to publish a collection of documents. It is meant to be consumed by hoover-search and implemented by hoover-snoop and future collection publishing tools, so it should not impose snoop-specific implementation details. One design goal is that the API can be implemented as static files served by, e.g., nginx.
The API entry point must return a JSON document describing the collection:
GET https://example.com/foo/
{
"name": "The collection named Foo!",
"description": "some long text here, perhaps markdown?",
"feed": "latest",
"data_urls": "documents/{id}/data.json"
}
The feed URL must return a paginated feed of documents:
GET https://example.com/foo/latest
{
"documents": [
{"id": "12742", "version": "2016-10-30T15:04:05.801Z"},
{"id": "12741", "version": "2016-10-30T15:04:04.320Z"}
],
"next": "latest?from=2016-10-30T15:03:01.983Z"
}
If next is missing, this is the last page of the feed.
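As a rough illustration of how a consumer might walk the feed, here is a minimal sketch. The function name `iter_feed` and the injected `fetch_json` callable are hypothetical, and the sketch assumes that `feed` is resolved relative to the collection URL and that `next` is resolved relative to the URL of the page that contains it. The third document (`12740`) is made-up sample data.

```python
from urllib.parse import urljoin


def iter_feed(collection_url, feed, fetch_json):
    """Yield every feed entry, following `next` links until one is missing."""
    url = urljoin(collection_url, feed)
    while url is not None:
        page = fetch_json(url)
        yield from page["documents"]
        next_ref = page.get("next")
        # Assumption: `next` is relative to the URL of the current page.
        url = urljoin(url, next_ref) if next_ref else None


# In-memory stand-in for HTTP GET + JSON decoding (sample data only).
pages = {
    "https://example.com/foo/latest": {
        "documents": [
            {"id": "12742", "version": "2016-10-30T15:04:05.801Z"},
            {"id": "12741", "version": "2016-10-30T15:04:04.320Z"},
        ],
        "next": "latest?from=2016-10-30T15:03:01.983Z",
    },
    "https://example.com/foo/latest?from=2016-10-30T15:03:01.983Z": {
        "documents": [
            {"id": "12740", "version": "2016-10-30T15:03:01.983Z"},
        ],
        # No "next" key: this is the last page.
    },
}

entries = list(iter_feed("https://example.com/foo/", "latest", pages.__getitem__))
```

Passing the fetcher in as a parameter keeps the sketch independent of any particular HTTP library; a real client would supply a function that performs the GET request and parses the JSON body.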
We can construct a document's URL based on its id (12742) and the data_urls template (documents/{id}/data.json):
GET https://example.com/foo/documents/12742/data.json
{
"id": "12742",
"version": "2016-10-30T15:04:05.801Z",
"content": {
"title": "Some interesting document",
"text": "the content of said document",
"... other fields ...": "..."
},
"views": [
{"title": "Download", "url": "somedoc.pdf"}
]
}
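The URL construction above could be sketched as follows. The helper name `document_url` is hypothetical, and the sketch assumes that the expanded template is resolved relative to the collection URL (which matches the example request) and that ids are already URL-safe.

```python
from urllib.parse import urljoin


def document_url(collection_url, data_urls, doc_id):
    """Expand the data_urls template for one document id."""
    # Substitute the id into the template, then resolve the result
    # against the collection's entry point URL.
    return urljoin(collection_url, data_urls.replace("{id}", doc_id))


url = document_url("https://example.com/foo/", "documents/{id}/data.json", "12742")
print(url)  # https://example.com/foo/documents/12742/data.json
```

Using `str.replace` rather than `str.format` avoids errors if a template ever contains other literal braces.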
For a given (id, version) pair, the content should always be the same.
To optimize indexing and help mitigate a race condition in snoop, the feed response may contain the content of each document.
When configuring a new collection, one must provide a name and a URL:
./manage.py addcollection foo https://example.com/foo/
hoover-search will poll the feed URL, say every 10 minutes, to index new and updated documents. It will send the content of each document to elasticsearch, along with the id and version. The feed is expected to list the most recently changed documents first, so once the indexer encounters an (id, version) pair that is already in elasticsearch, it will assume all subsequent documents have been indexed, and stop.
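The stopping rule can be sketched as below. This is only an illustration under stated assumptions: the function name `index_new_documents` is hypothetical, the `indexed` set stands in for a lookup against elasticsearch, and `fetch_document` and `index_document` are placeholder callables for fetching a document's data and sending it to the index.

```python
def index_new_documents(feed_entries, indexed, fetch_document, index_document):
    """Index feed entries until an already-indexed (id, version) pair is seen.

    Returns the number of documents indexed in this run.
    """
    count = 0
    for entry in feed_entries:
        key = (entry["id"], entry["version"])
        if key in indexed:
            # Everything after this point is assumed to be indexed already.
            break
        index_document(fetch_document(entry["id"]))
        indexed.add(key)
        count += 1
    return count


# Sample data: documents 2 and 1 were indexed by a previous run.
feed = [
    {"id": "3", "version": "c"},
    {"id": "2", "version": "b"},
    {"id": "1", "version": "a"},
]
indexed = {("2", "b"), ("1", "a")}
sent = []
count = index_new_documents(feed, indexed, lambda doc_id: {"id": doc_id}, sent.append)
```

One consequence of this rule is worth noting: if a run is interrupted after indexing the newest entries, the next run will stop at the first of them and may skip older entries that were never indexed. A real indexer would likely need some way to detect or recover from that case.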
For the current URL scheme of hoover-snoop, the metadata URL for collection foo can be http://localhost:8000/foo/:
{
"name": "The collection named Foo!",
"description": "some long text here, perhaps markdown?",
"feed": "latest",
"data_urls": "{id}/json"
}
We still need to serve original documents, OCRed files, and .msg emails as .eml; these go in the views list:
{
"id": "1234",
"version": "2016-10-30T15:04:05.801Z",
"content": "...",
"views": [
{"slug": "raw", "name": "Raw", "url": "1234/raw/email_807321.msg"},
{"slug": "ocr-s1", "name": "OCR source 1", "url": "1234/ocr/s1"},
{"slug": "ocr-s2", "name": "OCR source 2", "url": "1234/ocr/s2"},
{"slug": "eml", "name": "As .eml", "url": "1234/eml/email_807321.eml"}
]
}