Elasticsearch is a document-oriented data store with a powerful search API. In short: you throw JSON data at Elasticsearch, then later you can search that data. By default, Elasticsearch writes are eventually consistent--if you write some data and then immediately ask Elasticsearch for it back, you might not get it, or it might not look like what you previously wrote.
The TTA Hub uses Elasticsearch to index Activity Reports and provide full-text search capabilities. In staging and production, this functionality is provided by Cloud.gov. For local development, a single Elasticsearch node is included in the docker-compose
environment. Elasticsearch is a secondary datastore that should be regarded as ephemeral--over time, all Activity Report data that is written to Postgres should also be written to Elasticsearch and available for searching, but it may also occasionally go out-of-sync with Postgres or need to be completely re-indexed.
The application communicates with the Elasticsearch cluster in a manner similar to how it communicates with Postgres. Client configuration details (endpoints, access keys, etc.) are provided via the VCAP_SERVICES
environment variable in the cloud.gov environment. Application code creates an Elasticsearch client, configures it appropriately, then uses it to submit and query for data.
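As a sketch of that flow, the function below parses a VCAP_SERVICES-style JSON blob and pulls out the connection details for the client. The service label ("aws-elasticsearch") and credential field name ("uri") are illustrative assumptions, not necessarily the exact shape cloud.gov provides:

```javascript
// Sketch only: extract an Elasticsearch client config from VCAP_SERVICES.
// The service label and credential keys below are assumptions for
// illustration; the real binding may use different names.
function getEsConfig(vcapServicesJson) {
  const services = JSON.parse(vcapServicesJson);
  // cloud.gov groups bound services by label; first instance is assumed here
  const credentials = services['aws-elasticsearch'][0].credentials;
  return { node: credentials.uri };
}
```

The returned object can be passed directly to the @elastic/elasticsearch Client constructor, which accepts a node URL.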
The Elasticsearch code uses Sequelize hooks to know when to write data to Elasticsearch. As Activity Reports are saved (or destroyed), these custom hooks schedule Worker jobs to propagate the changes from Postgres to Elasticsearch.
Only the Worker (background task queue) writes to Elasticsearch. The reasons for this are:
- A failed Elasticsearch write should not interrupt the user's day. If we fail to write to the application's primary data store (Postgres), the user should know their data has not actually been saved. But Elasticsearch is a secondary data store, and absolutely not the user's problem.
- Elasticsearch writes will be eventually consistent anyway. It is not guaranteed that, immediately after a write, a request for the same data will return what was written. So introducing an additional delay to the write for worker processing is not a big deal.
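The hook-to-worker pattern described above can be sketched as follows. The hook does no Elasticsearch I/O itself; it only enqueues a job. All identifiers here (job name, queue API shape, index name) are hypothetical, not the Hub's real ones:

```javascript
// Illustrative sketch of the hook-to-worker pattern; names are assumptions.
// The Sequelize hook only enqueues a job with the minimal data the worker
// needs -- a failed Elasticsearch write never blocks the user's save.
function makeIndexHook(queue) {
  return async function afterSaveHook(report) {
    await queue.add('indexActivityReport', { id: report.id });
  };
}

// The worker, running in a separate process, performs the actual write.
function makeIndexProcessor(esClient) {
  return async function process(job) {
    const report = { id: job.data.id }; // real worker would re-read from Postgres
    await esClient.index({
      index: 'activity-reports',
      id: String(report.id),
      body: report,
    });
  };
}
```

Because the worker re-reads the report from Postgres before indexing, a delayed or retried job still writes the latest state rather than a stale snapshot.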
You can skip telling Elasticsearch about the shape of your data and let it infer a schema (dynamic mapping) from what you send it. In practice, though, you will want to configure mappings that instruct Elasticsearch how certain data fields should be stored. Mappings are used to answer questions like:
- Does the text in this field need to be full text searchable (like the "Comments" field on a feedback form) or can it be restricted to exact matches only (like the "Department" field on a feedback form)?
- What format is used by the application to represent dates and times?
Mappings are configured in application code in lib/elasticsearch/mappings.js.
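A minimal mapping answering both questions might look like the object below. The field names are hypothetical, not the Hub's actual schema; the type names (text, keyword, date) and the format option are standard Elasticsearch mapping vocabulary:

```javascript
// Illustrative mapping sketch; field names are assumptions.
// "text" fields are analyzed for full-text search, "keyword" fields
// match exactly, and "date" fields declare the format the app writes.
const exampleMappings = {
  properties: {
    comments: { type: 'text' },          // full-text searchable
    department: { type: 'keyword' },     // exact matches and aggregations only
    startDate: { type: 'date', format: 'yyyy-MM-dd' },
  },
};
```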
If your data needs to be transformed or normalized before storage, Elasticsearch provides a feature called Ingest Pipelines that can be used to do this processing. Example uses of pipelines are:
- Stripping HTML tags from fields containing rich-formatted text (you likely don't want user input matching against raw HTML tags)
- Indexing text content inside common document formats (.pdf, .docx, etc.) using the Ingest Attachment Processor plugin
Pipelines are configured in application code in lib/elasticsearch/pipelines.js.
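As an example of the first use case, a pipeline that strips HTML before indexing can be defined with Elasticsearch's built-in html_strip processor. The pipeline description and field name below are illustrative:

```javascript
// Illustrative ingest pipeline definition; the field name is an assumption.
// The html_strip processor removes HTML tags from the field's value
// before the document is indexed.
const stripHtmlPipeline = {
  description: 'Strip HTML tags from rich-text fields before indexing',
  processors: [
    { html_strip: { field: 'comments' } },
  ],
};
```

A definition like this would be registered with the cluster (e.g. via the client's ingest.putPipeline API) and then referenced by name when indexing documents.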
Elasticsearch in cloud.gov is AWS OpenSearch (previously AWS Elasticsearch) under the hood. "OpenSearch" is AWS's fork of the Elasticsearch product. Newer versions of official Elastic clients have added code to detect when they are communicating with forked Elasticsearch servers and refuse to run. For now, pinning @elastic/elasticsearch
to version 7.13.0 (the last version without this check) works. In the future, we may want to evaluate any official clients published by AWS / OpenSearch.
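Note that the pin must be an exact version: a semver range like ^7.13.0 would let npm install a later 7.x client that includes the product check. In package.json, the pin looks like this:

```json
{
  "dependencies": {
    "@elastic/elasticsearch": "7.13.0"
  }
}
```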