
Use Elasticsearch fields feature #285

Open
orangejulius opened this issue May 7, 2018 · 1 comment

orangejulius commented May 7, 2018

Currently, we take the "name" field (and all its language variants) and run it through Elasticsearch analysis twice: once with the "name" analyzers, and once, in the "phrase.*" fields, with the phrase analyzers. This is done so that we can have specialized analysis for all the different use cases of autocomplete and regular search.

It works, but is a bit of a pain to manage.
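For illustration, the current layout amounts to two unrelated top-level fields that happen to hold the same text. A simplified sketch (the doc type and analyzer names here are placeholders, and the real schema builds these fields with dynamic templates):

{
  "mappings": {
    "doc": {
      "properties": {
        "name": {
          "properties": {
            "default": { "type": "string", "analyzer": "nameAnalyzer" }
          }
        },
        "phrase": {
          "properties": {
            "default": { "type": "string", "analyzer": "phraseAnalyzer" }
          }
        }
      }
    }
  }
}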

Even worse, in order to save space and improve the performance of our indices, we exclude the phrase.* fields from the _source object:

excludes : ['shape','phrase']

This causes all sorts of things to break:

  • You can't fetch a document from Elasticsearch and then re-index it without setting the phrase.* fields again from the contents of the name.* fields. Forgetting to do this will usually prevent the document from showing up in forward geocoding queries
  • You can't use tools like Elasticdump
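For reference, that exclusion lives in the _source settings of the mapping, roughly like this (again, the doc type name is a placeholder):

{
  "mappings": {
    "doc": {
      "_source": {
        "excludes": ["shape", "phrase"]
      }
    }
  }
}

Because the excluded fields are gone from _source, anything that round-trips documents through Elasticsearch silently drops them.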

It turns out Elasticsearch has a feature that supports exactly this use case: fields.

This allows one field to be indexed in multiple ways, without the confusion of multiple fields that do not have any inherent relationship in Elasticsearch. While there would have to be some cosmetic changes to all our queries, it looks like the change is not a big deal overall.

Here's an example of how fields can work in an Elasticsearch index mapping, from the docs above:

{
  "mappings": {
    "my_type": {
      "properties": {
        "city": {
          "type": "string",
          "fields": {
            "raw": { 
              "type":  "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}
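Sub-fields are then addressed with dot notation in queries. Adapted from the same docs, a request might match on the analyzed city field while sorting on the not_analyzed city.raw sub-field:

{
  "query": {
    "match": {
      "city": "new york"
    }
  },
  "sort": {
    "city.raw": "asc"
  }
}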

This would be a great first issue for someone new to Pelias, but with some Elasticsearch experience, and we'd be glad to help you get started.

Handling breaking changes

Fixing this will likely require a breaking change to our schema design. When one is required, we try to land a change in the Pelias API codebase well in advance that can handle either the old or the new schema. This helps avoid problems for people who are transitioning from older to newer builds.

While it might not be possible or practical to do so in this case, we will want to consider how to make the transition across this breaking change as smooth as possible.


orangejulius commented Sep 3, 2018

Here's a full writeup of the changes that are likely required:

Elasticsearch Multi-Field Refactor

Background

To facilitate both search and autocomplete, we break down all our text data into two separate full-text fields, called name and phrase, which are roughly defined as follows:

name: input text is broken into many tokens made of progressive prefixes of the input. For example, main st is converted into the array [m, ma, mai, main, s, st]. This is necessary for autocomplete.

phrase: input text is broken into an ordered list of words (roughly, since we split on more than just whitespace). For example, main st becomes [0:main, 1:st]. This is used for both autocomplete and search.
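As a rough sketch, the name behavior corresponds to an edge n-gram token filter. The analyzer and filter names below are illustrative, not the actual Pelias analyzer definitions:

{
  "settings": {
    "analysis": {
      "filter": {
        "prefixGram": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10
        }
      },
      "analyzer": {
        "nameAnalyzer": {
          "tokenizer": "whitespace",
          "filter": ["lowercase", "prefixGram"]
        }
      }
    }
  }
}

Running main st through an analyzer like this yields exactly the [m, ma, mai, main, s, st] expansion described above.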

We do this by actually having two separate fields in our Elasticsearch schema. From Elasticsearch's perspective, there is no relation between these two fields; only our application code in the Pelias API understands the relationship.

However, although we didn't know it at the time, Elasticsearch has a feature that allows one input text to be analyzed in multiple ways.

Using this feature would be a pure refactoring from a functionality standpoint (all queries should return exactly the same results as before), but it should lead to increased readability in our code, probably some minor disk usage savings, and possibly performance increases in Elasticsearch.

Pelias Components Involved

Pelias schema library

This is where our code for managing Elasticsearch's schema lives. It's a set of Node.js scripts with lots of unit and integration tests for the behavior of the Elasticsearch schema. The switch to the fields feature will happen here.

Pelias API

This is where all of our core logic for the geocoder lives. It's a fairly large, completely stateless Node.js Express app.

Most of the significant logic changes will live here, or in a subset of the code that we extracted into the pelias-query module. At least four different query types will have to be updated (two for search, one for structured search, and one for autocomplete).

Fortunately most of our query code is nicely parameterized. We have a templating system that allows us to extract much of the logic into configuration variables. It might be the case that a prototype is as simple as changing two lines that control the name of the schema fields used for the name and phrase functionality.
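For illustration, a prototype might amount to repointing the phrase queries at a sub-field of name. The variable key and the name.default.phrase sub-field name below are assumptions, not the actual configuration:

before: { "phrase:field": "phrase.default" }
after:  { "phrase:field": "name.default.phrase" }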

Pelias Model

This is a library included in all of our different importers that allows us to easily create new records in a format that lines up nicely with our Elasticsearch schema. The changes here should be limited to essentially removing all references to the phrase field in the main Document definition, since all the existing code does is copy one text string into two places, which is exactly what this change makes unnecessary.

Acceptance tests (to verify functionality)

No code changes will be required here, but we have great acceptance-tests that verify pretty much all Pelias functionality. We have a new and growing set of tests for a small city (Portland), and our most tried-and-true global acceptance-tests, which are essentially the ultimate authority on whether any change makes it into Pelias.

The global acceptance-tests require a full planet build, which is a high barrier for most Pelias contributors. It's especially painful for testing schema changes like this one (which require a re-build, and can take a day or two). To bridge the gap we are working on building out a collection of test suites for areas of different sizes, so code can be tested on progressively larger builds.

How to get started

Start with pelias/schema

You should be able to clone the pelias/schema repository, and follow the usage guide to re-create the Pelias schema on Elasticsearch with behavior no different than when using the Docker image.

Once that's verified again (perhaps by doing an import), you can move on to changing the schema.

Because we have multiple name and phrase sub-fields for different purposes (alternative names, variations in formatting, different languages), we use the Elasticsearch dynamic template feature to configure them all at once.

I think the change to make will be to remove the phrase dynamic template, and add usage of the fields feature to the name template.
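In sketch form, the resulting name template might look something like this. The template and analyzer names are placeholders, and the real mapping will need whatever settings the current name and phrase templates carry:

{
  "dynamic_templates": [
    {
      "nameGram": {
        "path_match": "name.*",
        "match_mapping_type": "string",
        "mapping": {
          "type": "string",
          "analyzer": "nameAnalyzer",
          "fields": {
            "phrase": {
              "type": "string",
              "analyzer": "phraseAnalyzer"
            }
          }
        }
      }
    }
  ]
}

With a mapping like this, every name.* field would automatically carry a phrase sub-field, addressable in queries as, for example, name.default.phrase.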

Importers

It would be worth re-running an importer without any changes, because it might just work well enough to test. I'd suggest running the OpenAddresses importer via Docker with pelias import oa.

If not, you'll have to run it yourself, as with pelias/schema. The instructions in the readme, especially the example configuration of pelias.json, will be helpful.

As mentioned earlier, much logic common to all importers is stored in the pelias/model library. If changes are required there (such as removing any references to the phrase field), you'll have to use npm link to "point" your OpenAddresses importer at a local copy of the model library with any required changes.

pelias/api

Once data is successfully imported, try starting the API and running some simple address queries. http://localhost:3100/v1/search?text=777 NE MLK Blvd, Portland, OR should work with OpenAddresses imported.

pelias/query changes, if required

If changes to the pelias/query module turn out to be required, you can "point" your API's copy of pelias/query at a checkout of the pelias/query repo with your changes by using npm link.

acceptance-tests

We have a suite of several hundred acceptance tests for Portland that can be run once all data is re-imported with the schema changes, to validate that it was indeed a pure refactor. The acceptance tests can be run from pelias/docker with pelias test run.
