Alex Morega edited this page Dec 20, 2016 · 2 revisions

This is a proposal for an API to publish a collection of documents. It's meant to be consumed by hoover-search, and implemented by hoover-snoop and future collection publishing tools, so we don't want to impose snoop-specific implementation details. A design goal is that the API can be implemented as static files served by e.g. nginx.

Spec

The API entry point must return a JSON document describing the collection:

GET https://example.com/foo/
{
  "name": "The collection named Foo!",
  "description": "some long text here, perhaps markdown?",
  "feed": "latest",
  "data_urls": "documents/{id}/data.json"
}

The feed URL must return a paginated feed of documents:

GET https://example.com/foo/latest
{
  "documents": [
    {"id": "12742", "version": "2016-10-30T15:04:05.801Z"},
    {"id": "12741", "version": "2016-10-30T15:04:04.320Z"}
  ],
  "next": "latest?from=2016-10-30T15:03:01.983Z"
}

If next is missing, this is the last page of the feed.
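Consuming the feed can be sketched as follows (a minimal sketch in Python; parse_feed_page is a hypothetical helper name, and relative next links are assumed to resolve against the collection entry-point URL):

```python
import json
from urllib.parse import urljoin

def parse_feed_page(collection_url, raw_json):
    """Parse one feed page; return the (id, version) pairs and the
    absolute URL of the next page, or None on the last page."""
    page = json.loads(raw_json)
    docs = [(d["id"], d["version"]) for d in page["documents"]]
    next_path = page.get("next")
    next_url = urljoin(collection_url, next_path) if next_path else None
    return docs, next_url
```

A consumer would fetch the feed URL, call this on each response, and keep following next_url until it comes back as None.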

We can construct a document's URL based on its id (12742) and the data_urls template (documents/{id}/data.json):

GET https://example.com/foo/documents/12742/data.json
{
  "id": "12742",
  "version": "2016-10-30T15:04:05.801Z",
  "content": {
    "title": "Some interesting document",
    "text": "the content of said document",
    "... other fields ...": "..."
  },
  "views": [
    {"slug": "download", "name": "Download", "url": "somedoc.pdf"}
  ]
}

For a given (id, version) pair, the content should always be the same.
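The URL construction above can be sketched as follows (assuming the template uses only the {id} placeholder and that relative URLs resolve against the collection entry-point URL):

```python
from urllib.parse import urljoin

def document_url(collection_url, data_urls_template, doc_id):
    """Expand the data_urls template for a document id, resolving
    the result relative to the collection entry-point URL."""
    return urljoin(collection_url, data_urls_template.replace("{id}", doc_id))
```

For example, expanding documents/{id}/data.json with id 12742 against https://example.com/foo/ yields the data URL shown above.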

To optimize indexing and help mitigate a race condition in snoop, the feed response may contain the content of each document.
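For instance, a feed entry with inlined content might look like this (a hypothetical shape, reusing the content field from the document JSON above):

```json
{
  "documents": [
    {
      "id": "12742",
      "version": "2016-10-30T15:04:05.801Z",
      "content": {
        "title": "Some interesting document",
        "text": "the content of said document"
      }
    }
  ],
  "next": "latest?from=2016-10-30T15:03:01.983Z"
}
```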

Implementation in hoover-search

When configuring a new collection, one must provide a name and a URL:

./manage.py addcollection foo https://example.com/foo/

hoover-search will poll the feed URL, say every 10 minutes, to index new/updated documents. It will send the content of each document to elasticsearch, along with the id and version. Once the indexer encounters an (id, version) pair that is already in elasticsearch, it will assume all the subsequent documents have been indexed, and stop.
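The stop-on-known-version strategy can be sketched as follows (a minimal sketch; indexed_versions stands in for a lookup against elasticsearch and is hypothetical):

```python
def docs_to_index(feed_entries, indexed_versions):
    """Walk feed entries (newest first) and collect documents to index,
    stopping at the first (id, version) pair already in the index."""
    out = []
    for doc_id, version in feed_entries:
        if indexed_versions.get(doc_id) == version:
            break  # everything after this point is assumed already indexed
        out.append((doc_id, version))
    return out
```

Because the feed is ordered newest-first, hitting a known (id, version) pair means the rest of the feed was seen on a previous poll.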

Implementation in hoover-snoop

For the current URL scheme of hoover-snoop, the metadata URL for collection foo can be http://localhost:8000/foo/:

{
  "name": "The collection named Foo!",
  "description": "some long text here, perhaps markdown?",
  "feed": "latest",
  "data_urls": "{id}/json"
}

We still need to serve original documents, OCRed files, and .msg emails as .eml; these go in the views list:

{
  "id": "1234",
  "version": "2016-10-30T15:04:05.801Z",
  "content": "...",
  "views": [
    {"slug": "raw", "name": "Raw", "url": "1234/raw/email_807321.msg"},
    {"slug": "ocr-s1", "name": "OCR source 1", "url": "1234/ocr/s1"},
    {"slug": "ocr-s2", "name": "OCR source 2", "url": "1234/ocr/s2"},
    {"slug": "eml", "name": "As .eml", "url": "1234/eml/email_807321.eml"}
  ]
}