Use a single Elasticsearch Mapping Type #293

orangejulius · 2018-05-18T15:46:39Z

We currently set up many Elasticsearch mapping types. They are all identical, and thus there's no benefit to having more than one.

In Elasticsearch 6, multiple mapping types will not be supported, so we should get on top of things and remove them when we can.

There's also the immediate benefit that our Elasticsearch mapping will be much shorter and easier to work with. This not only makes Elasticsearch itself have to do less work to keep track of our types, but take a look at the 5000 lines of duplicated mapping type configuration we get to remove in our tests below!

This PR is built on #282 and must be merged after pelias/model#95

Connects pelias/pelias#719

missinglink · 2018-06-22T10:22:44Z

I don't see any reason we can't merge this, we just need to first check:

no queries are relying on this functionality (and consider the effect on queries written in forks)
we run a full build and test that the quality and performance hasn't changed

The same goes for the strict mapping, but in that case, I'm not sure it would provide any real benefit? I'll add my comments on that PR.

As guessed in #99, there _are_ differences between setting `"index": "not_analyzed"` for a field, and merely setting the analyzer to `keyword`. They are detailed in the Elasticsearch 2.4 [String datatype](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/string.html#string-params) documentation, although it's a little bit confusing. In Elasticsearch 5+, there are _two_ different types of string datatypes: - [`text`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/text.html) and - [`keyword`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/keyword.html). These documentation pages make the difference much more clear. In short, in Elasticsearch 2.4, setting `"index": "not_analyzed"` gives the following changes, all of which we'd like for these literal fields: - Analysis is skipped all together, the raw value is added to the index directly (this is pretty much equivalent to setting `analyzer: keyword`) - [norms](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/norms.html) are disabled for the field, saving some disk space - [doc_values](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/doc-values.html) are _en_abled. The last one is most interesting. In short, doc_values take up a little disk space but allow us to very efficiently perform aggregations. Pelias doesn't generally perform aggregations today. However, after we begin using a [single mapping type](#293), we will have no way for the [pelias dashboard](https://github.com/pelias/dashboard) or any of our own analysis scripts to provide document counts for different sources or layers. The dasbhoard currently uses an API to get the count of various mapping types, which won't be supported going forward. While minor, we needed a solution to this, and the only other one is fielddata which is extremely expensive in terms of memory usage. In my testing, for the Portland metro Docker project, disk usage went from 451MB to 473MB, or about a 5% increase. If we wanted to trim that down a bit, we could consider disabling `doc_values` for the `parent.*_id` fields. We don't have an immediate need for `doc_values` on those fields, although it might be interesting for analysis. Summary ------ While not technically required for [Elasticsearch 5 support](pelias/pelias#461), this PR does bring us more in line with the best practices of ES5. It also sets us up for [Elasticsearch 6](pelias/pelias#719) where the `string` datatype we use now is completely removed. Fixes #99

As guessed in pelias#99, there _are_ differences between setting `"index": "not_analyzed"` for a field, and merely setting the analyzer to `keyword`. They are detailed in the Elasticsearch 2.4 [String datatype](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/string.html#string-params) documentation, although it's a little bit confusing. In Elasticsearch 5+, there are _two_ different types of string datatypes: - [`text`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/text.html) and - [`keyword`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/keyword.html). These documentation pages make the difference much more clear. In short, in Elasticsearch 2.4, setting `"index": "not_analyzed"` gives the following changes, all of which we'd like for these literal fields: - Analysis is skipped all together, the raw value is added to the index directly (this is pretty much equivalent to setting `analyzer: keyword`) - [norms](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/norms.html) are disabled for the field, saving some disk space - [doc_values](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/doc-values.html) are _en_abled. The last one is most interesting. In short, doc_values take up a little disk space but allow us to very efficiently perform aggregations. Pelias doesn't generally perform aggregations today. However, after we begin using a [single mapping type](pelias#293), we will have no way for the [pelias dashboard](https://github.com/pelias/dashboard) or any of our own analysis scripts to provide document counts for different sources or layers. The dasbhoard currently uses an API to get the count of various mapping types, which won't be supported going forward. While minor, we needed a solution to this, and the only other one is fielddata which is extremely expensive in terms of memory usage. In my testing, for the Portland metro Docker project, disk usage went from 451MB to 473MB, or about a 5% increase. If we wanted to trim that down a bit, we could consider disabling `doc_values` for the `parent.*_id` fields. We don't have an immediate need for `doc_values` on those fields, although it might be interesting for analysis. Summary ------ While not technically required for [Elasticsearch 5 support](pelias/pelias#461), this PR does bring us more in line with the best practices of ES5. It also sets us up for [Elasticsearch 6](pelias/pelias#719) where the `string` datatype we use now is completely removed. Fixes pelias#99

As guessed in #99, there _are_ differences between setting `"index": "not_analyzed"` for a field, and merely setting the analyzer to `keyword`. They are detailed in the Elasticsearch 2.4 [String datatype](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/string.html#string-params) documentation, although it's a little bit confusing. In Elasticsearch 5+, there are _two_ different types of string datatypes: - [`text`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/text.html) and - [`keyword`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/keyword.html). These documentation pages make the difference much more clear. In short, in Elasticsearch 2.4, setting `"index": "not_analyzed"` gives the following changes, all of which we'd like for these literal fields: - Analysis is skipped all together, the raw value is added to the index directly (this is pretty much equivalent to setting `analyzer: keyword`) - [norms](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/norms.html) are disabled for the field, saving some disk space - [doc_values](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/doc-values.html) are _en_abled. The last one is most interesting. In short, doc_values take up a little disk space but allow us to very efficiently perform aggregations. Pelias doesn't generally perform aggregations today. However, after we begin using a [single mapping type](#293), we will have no way for the [pelias dashboard](https://github.com/pelias/dashboard) or any of our own analysis scripts to provide document counts for different sources or layers. The dasbhoard currently uses an API to get the count of various mapping types, which won't be supported going forward. While minor, we needed a solution to this, and the only other one is fielddata which is extremely expensive in terms of memory usage. This PR disables doc_values for all fields except `source` and `layer`, which gives us about a 4% disk space savings. Merely changing the literal field to use `not_analyzed` _increases_ disk space goes up around 3%, so this is roughly a 7% win! Summary ------ While not technically required for [Elasticsearch 5 support](pelias/pelias#461), this PR does bring us more in line with the best practices of ES5. It also sets us up for [Elasticsearch 6](pelias/pelias#719) where the `string` datatype we use now is completely removed. Fixes #99

This project previously depended on Elasticsearch types in order to provide a count of the number of records in each layer. Elasticsearch types will be going away in Elasticsearch 6, and are already deprecated in ES5. Instead of relying on types, the count of records per layer is now created using a [terms aggregation](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html) similar to the source and layers [autodetection code](pelias/api#1316) recently added to pelias/api. This is one of the last few roadblocks for dropping Pelias's use of types completely and merging PRs like pelias/schema#293 that will drastically simplify our schema. Connects pelias/pelias#461 Connects pelias/pelias#719

orangejulius · 2019-08-30T12:38:52Z

I just updated this PR. When it was originally written, Travis didn't run the integration tests, which now pass.

There's now 7000 lines of duplicate mapping configuration that will be removed :)

It's good to merge at any time, however we should probably wait a month or so after both pelias/model#95 and pelias/model#121 are merged first.

orangejulius · 2019-09-20T16:39:27Z

pelias/model#95 was merged a week ago. I think we should merge this around October 15th!

this can help in updating that massive file

Currently, we define a unique Elasticsearch mapping type for each layer. There are 20 different layers in our standard datsets now, but all the mapping types are identical. This currently is non-optimal, but not really a big deal. However, in Elasticsearch 6, multiple mapping types are no longer supported. So we might as well get ahead of things now and do it. One immediate benefit, this change removes 4370 duplicate mapping type definition lines from the expected schema fixture. The time saved in updating that file alone when we make future changes will be huge! :) Connects pelias/pelias#719

orangejulius · 2019-10-04T20:35:47Z

Merged this a bit earlier than planned: it will be nice to have ES6 compatible schemas and Elasticsearch builds being completed ASAP, and anyone who runs into compatibility issues will only do so as they are about to start a new build. They will be able to update their importers and be ok.

ES6 here we come!

orangejulius mentioned this pull request May 19, 2018

Remove _type field pelias/model#95

Merged

orangejulius mentioned this pull request Jun 10, 2018

Elasticsearch 6 Support pelias/pelias#719

Closed

23 tasks

orangejulius mentioned this pull request Sep 21, 2018

Elasticsearch 6.3+ support (FYI) #316

Closed

orangejulius mentioned this pull request Jul 3, 2019

feat(layers): Use aggregation to count records pelias/dashboard#14

Merged

orangejulius force-pushed the use-a-single-type branch 3 times, most recently from a6887ae to 9ef11be Compare August 30, 2019 12:38

orangejulius force-pushed the use-a-single-type branch from 9ef11be to 756ff0b Compare August 30, 2019 12:52

orangejulius force-pushed the use-a-single-type branch 2 times, most recently from 4d36398 to fab8365 Compare September 13, 2019 16:48

orangejulius added 2 commits September 25, 2019 10:27

Add commented out code to write expected schema to fixture

3fe7a53

this can help in updating that massive file

orangejulius force-pushed the use-a-single-type branch from fab8365 to 30225e0 Compare September 25, 2019 14:27

orangejulius merged commit e31f8e2 into master Oct 4, 2019

orangejulius deleted the use-a-single-type branch October 4, 2019 20:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use a single Elasticsearch Mapping Type #293

Use a single Elasticsearch Mapping Type #293

orangejulius commented May 18, 2018

missinglink commented Jun 22, 2018 •

edited

Loading

orangejulius commented Aug 30, 2019

orangejulius commented Sep 20, 2019

orangejulius commented Oct 4, 2019

Use a single Elasticsearch Mapping Type #293

Use a single Elasticsearch Mapping Type #293

Conversation

orangejulius commented May 18, 2018

missinglink commented Jun 22, 2018 • edited Loading

orangejulius commented Aug 30, 2019

orangejulius commented Sep 20, 2019

orangejulius commented Oct 4, 2019

missinglink commented Jun 22, 2018 •

edited

Loading