
Use Elasticsearch fields feature #285

Open
orangejulius opened this issue May 7, 2018 · 1 comment

orangejulius commented May 7, 2018

Currently, we take the "name" field (and all its language variants) and run it through Elasticsearch analysis twice: once with the "name" analyzers, and once, in the "phrase.*" fields, with the phrase analyzers. This is done so that we can have specialized analysis for all the different use cases of autocomplete and regular search.

It works, but is a bit of a pain to manage.
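For illustration, the current layout amounts to two unrelated top-level fields that happen to hold the same text. A simplified sketch (the doc type and analyzer names here are placeholders, and the real schema builds these fields with dynamic templates):

{
  "mappings": {
    "doc": {
      "properties": {
        "name": {
          "properties": {
            "default": { "type": "string", "analyzer": "nameAnalyzer" }
          }
        },
        "phrase": {
          "properties": {
            "default": { "type": "string", "analyzer": "phraseAnalyzer" }
          }
        }
      }
    }
  }
}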

Even worse, in order to save space and improve the performance of our indices, we exclude the phrase.* fields from the _source object:

excludes : ['shape','phrase']

This causes all sorts of things to break:

  • You can't fetch a document from Elasticsearch and then re-index it without setting the phrase.* fields again from the contents of the name.* fields. Forgetting to do this will usually prevent the document from showing up in forward geocoding queries
  • You can't use tools like Elasticdump
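For reference, that exclusion lives in the _source settings of the mapping, roughly like this (again, the doc type name is a placeholder):

{
  "mappings": {
    "doc": {
      "_source": {
        "excludes": ["shape", "phrase"]
      }
    }
  }
}

Because the excluded fields are gone from _source, anything that round-trips documents through Elasticsearch silently drops them.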

It turns out Elasticsearch has a feature that supports exactly this use case: fields.

This allows one field to be indexed in multiple ways, without the confusion of multiple fields that do not have any inherent relationship in Elasticsearch. While there would have to be some cosmetic changes to all our queries, it looks like the change is not a big deal overall.

Here's an example of how fields can work in an Elasticsearch index mapping, from the docs above:

{
  "mappings": {
    "my_type": {
      "properties": {
        "city": {
          "type": "string",
          "fields": {
            "raw": { 
              "type":  "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}
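Sub-fields are then addressed with dot notation in queries. Adapted from the same docs, a request might match on the analyzed city field while sorting on the not_analyzed city.raw sub-field:

{
  "query": {
    "match": {
      "city": "new york"
    }
  },
  "sort": {
    "city.raw": "asc"
  }
}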

This would be a great first issue for someone new to Pelias, but with some Elasticsearch experience, and we'd be glad to help you get started.

Handling breaking changes

Fixing this will likely require a breaking change to our schema design. When one is required, we try to land a change in the Pelias API codebase well in advance that can handle either the old or the new schema. This helps avoid problems for people who are transitioning from older to newer builds.

While it might not be possible or practical to do so in this case, we will want to consider how to make the transition across this breaking change as smooth as possible.


orangejulius commented Sep 3, 2018

Here's a full writeup of the changes that are likely required:

Elasticsearch Multi-Field Refactor

Background

To facilitate both search and autocomplete, we break down all our text data into two separate full-text fields, called name and phrase, which are roughly defined as follows:

name: input text is broken into many tokens made of progressive prefixes of the input. For example, main st is converted into the array [m, ma, mai, main, s, st]. This is necessary for autocomplete.

phrase: input text is broken into an ordered list of words (roughly, since we split on more than just whitespace). For example, main st becomes [0:main, 1:st]. This is used for both autocomplete and search.
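As a rough sketch, the name behavior corresponds to an edge n-gram token filter. The analyzer and filter names below are illustrative, not the actual Pelias analyzer definitions:

{
  "settings": {
    "analysis": {
      "filter": {
        "prefixGram": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10
        }
      },
      "analyzer": {
        "nameAnalyzer": {
          "tokenizer": "whitespace",
          "filter": ["lowercase", "prefixGram"]
        }
      }
    }
  }
}

Running main st through an analyzer like this yields exactly the [m, ma, mai, main, s, st] expansion described above.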

We do this by actually having two separate fields in our Elasticsearch schema. From Elasticsearch's perspective, there is no relation between these two fields; only our application code in the Pelias API understands the relationship.

However, although we didn't know it at the time, Elasticsearch has a feature that allows one input text to be analyzed in multiple ways.

Using this feature would be a pure refactoring from a functionality standpoint (all queries should return exactly the same results as before), but it should lead to increased readability in our code, probably some minor disk usage savings, and possibly performance increases in Elasticsearch.

Pelias Components Involved

Pelias schema library

This is where our code for managing Elasticsearch's schema lives. It's a set of Node.js scripts with lots of unit and integration tests for the behavior of the Elasticsearch schema. The switch to the fields feature will happen here.

Pelias API

This is where all of our core logic for the geocoder lives. It's a fairly large, completely stateless Node.js Express app.

Most of the significant logic changes will live here, or in a subset of the code that we extracted into the pelias-query module. At least four different query types will have to be updated (two for search, one for structured search, and one for autocomplete).

Fortunately most of our query code is nicely parameterized. We have a templating system that allows us to extract much of the logic into configuration variables. It might be the case that a prototype is as simple as changing two lines that control the name of the schema fields used for the name and phrase functionality.
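For illustration, a prototype might amount to repointing the phrase queries at a sub-field of name. The variable key and the name.default.phrase sub-field name below are assumptions, not the actual configuration:

before: { "phrase:field": "phrase.default" }
after:  { "phrase:field": "name.default.phrase" }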

Pelias Model

This is a library included in all of our different importers that allows us to easily create new records in a format that lines up nicely with our Elasticsearch schema. The changes here should be limited to essentially removing all references to the phrase field in the main Document definition, since all the existing code does is copy one text string into two places, which is exactly what this change makes unnecessary.

Acceptance tests (to verify functionality)

No code changes will be required here, but we have great acceptance-tests that verify pretty much all Pelias functionality. We have a new and growing set of tests for a small city (Portland), and our most tried-and-true global acceptance-tests, which are essentially the ultimate authority on whether any change makes it into Pelias.

The global acceptance-tests require a full planet build, which is a high barrier for most Pelias contributors. It's especially painful for testing schema changes like this one (which require a re-build, and can take a day or two). To bridge the gap we are working on building out a collection of test suites for areas of different sizes, so code can be tested on progressively larger builds.

How to get started

Start with pelias/schema

You should be able to clone the pelias/schema repository, and follow the usage guide to re-create the Pelias schema on Elasticsearch with behavior no different than when using the Docker image.

Once that's verified again (perhaps by doing an import), you can move on to changing the schema.

Because we have multiple name and phrase sub-fields for different purposes (alternative names, variations in formatting, different languages), we use the Elasticsearch dynamic template feature to configure them all at once.

I think the change to make will be to remove the phrase dynamic template, and add usage of the fields feature to the name template.
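In sketch form, the resulting name template might look something like this. The template and analyzer names are placeholders, and the real mapping will need whatever settings the current name and phrase templates carry:

{
  "dynamic_templates": [
    {
      "nameGram": {
        "path_match": "name.*",
        "match_mapping_type": "string",
        "mapping": {
          "type": "string",
          "analyzer": "nameAnalyzer",
          "fields": {
            "phrase": {
              "type": "string",
              "analyzer": "phraseAnalyzer"
            }
          }
        }
      }
    }
  ]
}

With a mapping like this, every name.* field would automatically carry a phrase sub-field, addressable in queries as, for example, name.default.phrase.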

Importers

It would be worth re-running an importer without any changes, because it might just work well enough to test. I'd suggest running the OpenAddresses importer via Docker with pelias import oa.

If not, you'll have to run it yourself, as with pelias/schema. The instructions in the readme, especially the example configuration of pelias.json, will be helpful.

As mentioned earlier, much logic common to all importers is stored in the pelias/model library. If changes are required there (such as removing any references to the phrase field), you'll have to use npm link to "point" your OpenAddresses importer at a local copy of the model library with any required changes.

pelias/api

Once data is successfully imported, try starting the API and running some simple address queries. http://localhost:3100/v1/search?text=777 NE MLK Blvd, Portland, OR should work with OpenAddresses imported.

pelias/query changes, if required

If changes to the pelias/query module turn out to be required, you can "point" your API's copy of pelias/query at a checkout of the pelias/query repo with your changes by using npm link.

acceptance-tests

We have a suite of several hundred acceptance tests for Portland that can be run once all data is re-imported with the schema changes, to validate that it was indeed a pure refactor. The acceptance tests can be run from pelias/docker with pelias test run.
