An API wrapper to the Elasticsearch index of web archival collections and a web UI to explore those indexes. A part of the story-indexer stack. Maintained as a separate repository for future legibility. This exposes a FastAPI-based API server and a Streamlit-based search UI (for quick testing). Both are managed as internal services as part of the Media Cloud Online News Archive.
The API service expects the following ES index schema, where the `title` and `snippet` fields must have `fielddata` enabled (if they have the type `text`). This is currently defined in the `story-indexer` stack, but is replicated here for convenience (and might be out of date).
{
  "properties": {
    "original_url": {"type": "keyword"},
    "url": {"type": "keyword"},
    "normalized_url": {"type": "keyword"},
    "canonical_domain": {"type": "keyword"},
    "publication_date": {"type": "date", "ignore_malformed": true},
    "language": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
    "full_language": {"type": "keyword"},
    "text_extraction": {"type": "keyword"},
    "article_title": {
      "type": "text",
      "fields": {"keyword": {"type": "keyword"}}
    },
    "normalized_article_title": {
      "type": "text",
      "fields": {"keyword": {"type": "keyword"}}
    },
    "text_content": {"type": "text", "fields": {"keyword": {"type": "keyword"}}}
  }
}
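As a rough illustration of how the `keyword` sub-fields in this schema can be used, the sketch below picks the field path suitable for exact-match queries and aggregations. It works on an abridged copy of the mapping and does not talk to Elasticsearch; `aggregatable_field` is a hypothetical helper written for this example, not a function from this repository.

```python
import json

# Abridged copy of the mapping above; only three fields are reproduced here.
MAPPING = json.loads("""
{
  "properties": {
    "canonical_domain": {"type": "keyword"},
    "language": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
    "publication_date": {"type": "date", "ignore_malformed": true}
  }
}
""")


def aggregatable_field(mapping: dict, name: str) -> str:
    """Return the keyword field path for `name` (hypothetical helper)."""
    spec = mapping["properties"][name]
    if spec["type"] == "keyword":
        # Plain keyword fields can be used directly.
        return name
    # Otherwise look for a keyword sub-field such as "language.keyword".
    for sub_name, sub_spec in spec.get("fields", {}).items():
        if sub_spec.get("type") == "keyword":
            return f"{name}.{sub_name}"
    raise ValueError(f"{name!r} has no keyword representation")


print(aggregatable_field(MAPPING, "canonical_domain"))
print(aggregatable_field(MAPPING, "language"))
```

Fields without a keyword representation (here, the date field) raise a `ValueError` in this sketch.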
Configuration is set through environment variables named after the upper-cased config parameters. Environment variables that accept a list (e.g., `ESHOSTS` and `INDEXES`) can use commas or spaces as separators. For testing, configuration can also be supplied via a config file following the syntax of the provided `config.yml.sample`.
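The list syntax described above could be parsed along the following lines. This is a minimal sketch, not the repository's actual parsing code: `env_list` is a hypothetical helper, and the host and index values are made-up examples.

```python
import os
import re


def env_list(name: str, default: str = "") -> list[str]:
    """Split a list-valued environment variable on commas and/or whitespace."""
    raw = os.environ.get(name, default)
    # Drop empty items produced by leading, trailing, or repeated separators.
    return [item for item in re.split(r"[,\s]+", raw) if item]


# Example values are assumptions made for illustration only.
os.environ["ESHOSTS"] = "http://es1:9200, http://es2:9200"
os.environ["INDEXES"] = "index_a index_b"

print(env_list("ESHOSTS"))
print(env_list("INDEXES"))
```

Accepting both separators in one regular expression means mixed input such as `a, b c` is handled without special cases.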
Then run the API and UI services using Docker Compose:
$ docker compose up
Access the interactive API documentation and the collection index explorer in a web browser.
Deployments are now configured to be automatically built and released via GitHub Actions.
- Change the version number stored in `ApiVersion.v1` in `api.py`
- Add a small note to the version history below indicating what changed
- Commit and tag the repo with the same number
- Push the tag to GitHub to trigger the build and release
- Once it is done, the labeled image will be ready at https://hub.docker.com/r/mcsystems/news-search-api
- v1.4.2 - Fix topdomains aggregation
- v1.4.1 - Bugfix correcting missed date conversion in client.py
- v1.4.0 - New endpoints for sub-aggregations, including a small refactor of how aggregation queries are constructed, configurable timeout for elasticsearch
- v1.3.9 - Overview query includes 'keyword' field for domain aggregator
- v1.3.8 - Bugfix for 'expanded' results
- v1.3.7 - Increased default time out on top-terms, better tests
- v1.3.6 - Major refactor and cleanup, behavior unchanged
- v1.3.5 - Use mc-manage for deployment record
- v1.3.4 - Add airtable deployment update script
- v1.3.3 - Remove 'link' header
- v1.3.2 - Enhancement to GitHub Actions and introduction of an independent deployment script
- v1.3.1 - Bugfix for 1.3.0
- v1.3.0 - Change to return aliases as well as indexes as legal values in Collections, and update article endpoint to work in the ILM context
- v1.2.0 - Change related to ID update in backend ES, including refurbishing the article endpoint and tests
- v1.1.0 - Change to return `None` when data is missing (including publication date), update dependencies
- v1.0.0 - First official release
Append the suffix `a` for a dev/alpha release and `b` for a staging/beta release. For example:
- v1.3.2b - Version 1.3.2 beta release
- v1.3.2a - Version 1.3.2 alpha release
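The tag convention above can be checked mechanically before pushing a tag. The sketch below is one possible way to do that, assuming tags always have the form `vMAJOR.MINOR.PATCH` with an optional `a` or `b` suffix; `parse_tag` and the stage names are hypothetical and not part of this repository or its GitHub Actions workflow.

```python
import re

# v<major>.<minor>.<patch> followed by an optional "a" (alpha) or "b" (beta).
TAG_PATTERN = re.compile(r"v(\d+)\.(\d+)\.(\d+)([ab]?)")
STAGES = {"": "release", "a": "alpha", "b": "beta"}


def parse_tag(tag: str) -> tuple[tuple[int, int, int], str]:
    """Return ((major, minor, patch), stage) or raise ValueError."""
    match = TAG_PATTERN.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a valid release tag: {tag!r}")
    major, minor, patch, suffix = match.groups()
    return (int(major), int(minor), int(patch)), STAGES[suffix]


print(parse_tag("v1.3.2b"))
print(parse_tag("v1.4.2"))
```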