Internal API server that offers search access to the Media Cloud Online News Archive (in Elasticsearch).

mediacloud/news-search-api

Web Archive Search Index API and UI

An API wrapper around the Elasticsearch index of web archival collections, plus a web UI for exploring those indexes. It is part of the story-indexer stack but is maintained as a separate repository for future legibility. The repository exposes a FastAPI-based API server and a Streamlit-based search UI (for quick testing). Both are managed as internal services of the Media Cloud Online News Archive.

ES Index

The API service expects the following ES index schema. The title and snippet fields must have fielddata enabled if they are of type text. The schema is currently defined in the story-indexer stack and is replicated here for convenience, so this copy might be out of date.

{
    "properties": {
        "original_url": {"type": "keyword"},
        "url": {"type": "keyword"},
        "normalized_url": {"type": "keyword"},
        "canonical_domain": {"type": "keyword"},
        "publication_date": {"type": "date", "ignore_malformed": true},
        "language": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
        "full_language": {"type": "keyword"},
        "text_extraction": {"type": "keyword"},
        "article_title": {
            "type": "text",
            "fields": {"keyword": {"type": "keyword"}}
        },
        "normalized_article_title": {
            "type": "text",
            "fields": {"keyword": {"type": "keyword"}}
        },
        "text_content": {"type": "text", "fields": {"keyword": {"type": "keyword"}}}
    }
}
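
The fielddata requirement can be checked mechanically before pointing the API at an index. The following is a minimal sketch, not code from this repository: the helper name `text_fields_without_fielddata` and the `example_snippet` field are hypothetical, and it only inspects top-level fields of a mapping shaped like the one above. It lists fields of type text that do not set `"fielddata": true`; whether a given field actually needs it depends on how the API queries that field.

```python
import json
from typing import List


def text_fields_without_fielddata(mapping: dict) -> List[str]:
    """Return the top-level `text` fields that do not set fielddata to true."""
    missing = []
    for name, spec in mapping.get("properties", {}).items():
        if spec.get("type") == "text" and spec.get("fielddata") is not True:
            missing.append(name)
    return sorted(missing)


# A trimmed copy of the schema above, plus one hypothetical field
# (example_snippet) that has fielddata switched on.
mapping = json.loads("""
{
    "properties": {
        "url": {"type": "keyword"},
        "article_title": {
            "type": "text",
            "fields": {"keyword": {"type": "keyword"}}
        },
        "example_snippet": {"type": "text", "fielddata": true}
    }
}
""")

print(text_fields_without_fielddata(mapping))
```

Keyword fields are skipped because fielddata is a setting of text fields only.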

Run Services

Configuration is set through environment variables whose names are the upper-case forms of the config parameters. Environment variables that accept a list (e.g., ESHOSTS and INDEXES) can use commas or spaces as separators. For testing, configuration can also be supplied via a config file in the syntax of the provided config.yml.sample.
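
As an illustration of the list syntax, here is a minimal sketch of how a comma- or space-separated value could be split. It is not the repository's own parsing code: the helper name `env_list` and the host and index values are made up for the example, and only the variable names ESHOSTS and INDEXES come from the text above.

```python
import os
import re
from typing import List


def env_list(name: str, default: str = "") -> List[str]:
    """Read an environment variable and split it on commas and/or whitespace."""
    raw = os.environ.get(name, default)
    # Filtering drops the empty strings produced by leading or trailing separators.
    return [item for item in re.split(r"[,\s]+", raw) if item]


# Hypothetical values, for illustration only.
os.environ["ESHOSTS"] = "http://es1:9200, http://es2:9200"
os.environ["INDEXES"] = "index_a index_b"

print(env_list("ESHOSTS"))
print(env_list("INDEXES"))
```

Treating any run of commas and whitespace as one separator means `a,b`, `a, b`, and `a b` all yield the same list.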

Then run the API and UI services using Docker Compose:

$ docker compose up

Once the services are up, the interactive API documentation and the collection index explorer can be opened in a web browser.

Building and Releasing

Deployments are now configured to be automatically built and released via GitHub Actions.

  1. Change the version number stored in ApiVersion.v1 in api.py
  2. Add a small note to the version history below indicating what changed
  3. Commit and tag the repo with the same number
  4. Push the tag to GitHub to trigger the build and release
  5. Once it is done, the labeled image will be ready at https://hub.docker.com/r/mcsystems/news-search-api

Version History

  • v1.4.2 - Fix topdomains aggregation
  • v1.4.1 - Bugfix correcting missed date conversion in client.py
  • v1.4.0 - New endpoints for sub-aggregations, including a small refactor of how aggregation queries are constructed; configurable timeout for Elasticsearch
  • v1.3.9 - Overview query includes 'keyword' field for domain aggregator
  • v1.3.8 - Bugfix for 'expanded' results
  • v1.3.7 - Increased default timeout on top-terms, better tests
  • v1.3.6 - Major refactor and cleanup, behavior unchanged
  • v1.3.5 - Use mc-manage for deployment record
  • v1.3.4 - Add airtable deployment update script
  • v1.3.3 - Remove 'link' header
  • v1.3.2 - Enhancements to GitHub Actions; introduces an independent deployment script
  • v1.3.1 - Bugfix for 1.3.0
  • v1.3.0 - Change to return aliases as well as indexes as legal values in Collections, and update article endpoint to work in the ILM context
  • v1.2.0 - Change related to ID update in backend ES, including refurbishing the article endpoint and tests
  • v1.1.0 - Change to return None when data is missing (including publication date), update dependencies
  • v1.0.0 - First official release

Tags for dev and staging releases

Append the suffix a for a dev/alpha release and b for a staging/beta release, e.g.:

  • v1.3.2b - Version 1.3.2 beta release
  • v1.3.2a - Version 1.3.2 alpha release
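
The tag convention above can be expressed as a small parser. This is a sketch under assumptions rather than part of the release tooling: the names `parse_tag` and `ReleaseTag` are hypothetical, and labelling an unsuffixed tag as "production" is an assumption, since the text above only names the a and b suffixes.

```python
import re
from typing import NamedTuple

# vMAJOR.MINOR.PATCH with an optional a (alpha) or b (beta) suffix.
TAG_PATTERN = re.compile(r"v(\d+)\.(\d+)\.(\d+)([ab]?)")

# "production" for unsuffixed tags is an assumed label, not taken from the README.
CHANNELS = {"": "production", "a": "dev/alpha", "b": "staging/beta"}


class ReleaseTag(NamedTuple):
    version: str
    channel: str


def parse_tag(tag: str) -> ReleaseTag:
    """Split a release tag into its version number and release channel."""
    match = TAG_PATTERN.fullmatch(tag)
    if match is None:
        raise ValueError(f"unrecognized release tag: {tag!r}")
    major, minor, patch, suffix = match.groups()
    return ReleaseTag(f"{major}.{minor}.{patch}", CHANNELS[suffix])


for tag in ("v1.3.2", "v1.3.2a", "v1.3.2b"):
    print(tag, "->", parse_tag(tag))
```

Using `fullmatch` rejects tags with trailing characters, such as `v1.3.2rc1`, instead of silently accepting a prefix.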
