-
-
Notifications
You must be signed in to change notification settings - Fork 76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use a single Elasticsearch Mapping Type #293
Conversation
I don't see any reason we can't merge this, we just need to first check:
The same goes for the strict mapping, but in that case, I'm not sure it would provide any real benefit? I'll add my comments on that PR. |
As guessed in #99, there _are_ differences between setting `"index": "not_analyzed"` for a field, and merely setting the analyzer to `keyword`. They are detailed in the Elasticsearch 2.4 [String datatype](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/string.html#string-params) documentation, although it's a little bit confusing. In Elasticsearch 5+, there are _two_ different types of string datatypes: - [`text`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/text.html) and - [`keyword`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/keyword.html). These documentation pages make the difference much more clear. In short, in Elasticsearch 2.4, setting `"index": "not_analyzed"` gives the following changes, all of which we'd like for these literal fields: - Analysis is skipped all together, the raw value is added to the index directly (this is pretty much equivalent to setting `analyzer: keyword`) - [norms](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/norms.html) are disabled for the field, saving some disk space - [doc_values](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/doc-values.html) are _en_abled. The last one is most interesting. In short, doc_values take up a little disk space but allow us to very efficiently perform aggregations. Pelias doesn't generally perform aggregations today. However, after we begin using a [single mapping type](#293), we will have no way for the [pelias dashboard](https://github.com/pelias/dashboard) or any of our own analysis scripts to provide document counts for different sources or layers. The dasbhoard currently uses an API to get the count of various mapping types, which won't be supported going forward. While minor, we needed a solution to this, and the only other one is fielddata which is extremely expensive in terms of memory usage. In my testing, for the Portland metro Docker project, disk usage went from 451MB to 473MB, or about a 5% increase. If we wanted to trim that down a bit, we could consider disabling `doc_values` for the `parent.*_id` fields. We don't have an immediate need for `doc_values` on those fields, although it might be interesting for analysis. Summary ------ While not technically required for [Elasticsearch 5 support](pelias/pelias#461), this PR does bring us more in line with the best practices of ES5. It also sets us up for [Elasticsearch 6](pelias/pelias#719) where the `string` datatype we use now is completely removed. Fixes #99
As guessed in pelias#99, there _are_ differences between setting `"index": "not_analyzed"` for a field, and merely setting the analyzer to `keyword`. They are detailed in the Elasticsearch 2.4 [String datatype](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/string.html#string-params) documentation, although it's a little bit confusing. In Elasticsearch 5+, there are _two_ different types of string datatypes: - [`text`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/text.html) and - [`keyword`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/keyword.html). These documentation pages make the difference much more clear. In short, in Elasticsearch 2.4, setting `"index": "not_analyzed"` gives the following changes, all of which we'd like for these literal fields: - Analysis is skipped all together, the raw value is added to the index directly (this is pretty much equivalent to setting `analyzer: keyword`) - [norms](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/norms.html) are disabled for the field, saving some disk space - [doc_values](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/doc-values.html) are _en_abled. The last one is most interesting. In short, doc_values take up a little disk space but allow us to very efficiently perform aggregations. Pelias doesn't generally perform aggregations today. However, after we begin using a [single mapping type](pelias#293), we will have no way for the [pelias dashboard](https://github.com/pelias/dashboard) or any of our own analysis scripts to provide document counts for different sources or layers. The dasbhoard currently uses an API to get the count of various mapping types, which won't be supported going forward. While minor, we needed a solution to this, and the only other one is fielddata which is extremely expensive in terms of memory usage. In my testing, for the Portland metro Docker project, disk usage went from 451MB to 473MB, or about a 5% increase. If we wanted to trim that down a bit, we could consider disabling `doc_values` for the `parent.*_id` fields. We don't have an immediate need for `doc_values` on those fields, although it might be interesting for analysis. Summary ------ While not technically required for [Elasticsearch 5 support](pelias/pelias#461), this PR does bring us more in line with the best practices of ES5. It also sets us up for [Elasticsearch 6](pelias/pelias#719) where the `string` datatype we use now is completely removed. Fixes pelias#99
As guessed in #99, there _are_ differences between setting `"index": "not_analyzed"` for a field, and merely setting the analyzer to `keyword`. They are detailed in the Elasticsearch 2.4 [String datatype](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/string.html#string-params) documentation, although it's a little bit confusing. In Elasticsearch 5+, there are _two_ different types of string datatypes: - [`text`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/text.html) and - [`keyword`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/keyword.html). These documentation pages make the difference much more clear. In short, in Elasticsearch 2.4, setting `"index": "not_analyzed"` gives the following changes, all of which we'd like for these literal fields: - Analysis is skipped all together, the raw value is added to the index directly (this is pretty much equivalent to setting `analyzer: keyword`) - [norms](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/norms.html) are disabled for the field, saving some disk space - [doc_values](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/doc-values.html) are _en_abled. The last one is most interesting. In short, doc_values take up a little disk space but allow us to very efficiently perform aggregations. Pelias doesn't generally perform aggregations today. However, after we begin using a [single mapping type](#293), we will have no way for the [pelias dashboard](https://github.com/pelias/dashboard) or any of our own analysis scripts to provide document counts for different sources or layers. The dasbhoard currently uses an API to get the count of various mapping types, which won't be supported going forward. While minor, we needed a solution to this, and the only other one is fielddata which is extremely expensive in terms of memory usage. This PR disables doc_values for all fields except `source` and `layer`, which gives us about a 4% disk space savings. Merely changing the literal field to use `not_analyzed` _increases_ disk space goes up around 3%, so this is roughly a 7% win! Summary ------ While not technically required for [Elasticsearch 5 support](pelias/pelias#461), this PR does bring us more in line with the best practices of ES5. It also sets us up for [Elasticsearch 6](pelias/pelias#719) where the `string` datatype we use now is completely removed. Fixes #99
As guessed in #99, there _are_ differences between setting `"index": "not_analyzed"` for a field, and merely setting the analyzer to `keyword`. They are detailed in the Elasticsearch 2.4 [String datatype](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/string.html#string-params) documentation, although it's a little bit confusing. In Elasticsearch 5+, there are _two_ different types of string datatypes: - [`text`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/text.html) and - [`keyword`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/keyword.html). These documentation pages make the difference much more clear. In short, in Elasticsearch 2.4, setting `"index": "not_analyzed"` gives the following changes, all of which we'd like for these literal fields: - Analysis is skipped all together, the raw value is added to the index directly (this is pretty much equivalent to setting `analyzer: keyword`) - [norms](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/norms.html) are disabled for the field, saving some disk space - [doc_values](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/doc-values.html) are _en_abled. The last one is most interesting. In short, doc_values take up a little disk space but allow us to very efficiently perform aggregations. Pelias doesn't generally perform aggregations today. However, after we begin using a [single mapping type](#293), we will have no way for the [pelias dashboard](https://github.com/pelias/dashboard) or any of our own analysis scripts to provide document counts for different sources or layers. The dasbhoard currently uses an API to get the count of various mapping types, which won't be supported going forward. While minor, we needed a solution to this, and the only other one is fielddata which is extremely expensive in terms of memory usage. This PR disables doc_values for all fields except `source` and `layer`, which gives us about a 4% disk space savings. Merely changing the literal field to use `not_analyzed` _increases_ disk space goes up around 3%, so this is roughly a 7% win! Summary ------ While not technically required for [Elasticsearch 5 support](pelias/pelias#461), this PR does bring us more in line with the best practices of ES5. It also sets us up for [Elasticsearch 6](pelias/pelias#719) where the `string` datatype we use now is completely removed. Fixes #99
As guessed in #99, there _are_ differences between setting `"index": "not_analyzed"` for a field, and merely setting the analyzer to `keyword`. They are detailed in the Elasticsearch 2.4 [String datatype](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/string.html#string-params) documentation, although it's a little bit confusing. In Elasticsearch 5+, there are _two_ different types of string datatypes: - [`text`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/text.html) and - [`keyword`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/keyword.html). These documentation pages make the difference much more clear. In short, in Elasticsearch 2.4, setting `"index": "not_analyzed"` gives the following changes, all of which we'd like for these literal fields: - Analysis is skipped all together, the raw value is added to the index directly (this is pretty much equivalent to setting `analyzer: keyword`) - [norms](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/norms.html) are disabled for the field, saving some disk space - [doc_values](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/doc-values.html) are _en_abled. The last one is most interesting. In short, doc_values take up a little disk space but allow us to very efficiently perform aggregations. Pelias doesn't generally perform aggregations today. However, after we begin using a [single mapping type](#293), we will have no way for the [pelias dashboard](https://github.com/pelias/dashboard) or any of our own analysis scripts to provide document counts for different sources or layers. The dasbhoard currently uses an API to get the count of various mapping types, which won't be supported going forward. While minor, we needed a solution to this, and the only other one is fielddata which is extremely expensive in terms of memory usage. This PR disables doc_values for all fields except `source` and `layer`, which gives us about a 4% disk space savings. Merely changing the literal field to use `not_analyzed` _increases_ disk space goes up around 3%, so this is roughly a 7% win! Summary ------ While not technically required for [Elasticsearch 5 support](pelias/pelias#461), this PR does bring us more in line with the best practices of ES5. It also sets us up for [Elasticsearch 6](pelias/pelias#719) where the `string` datatype we use now is completely removed. Fixes #99
As guessed in #99, there _are_ differences between setting `"index": "not_analyzed"` for a field, and merely setting the analyzer to `keyword`. They are detailed in the Elasticsearch 2.4 [String datatype](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/string.html#string-params) documentation, although it's a little bit confusing. In Elasticsearch 5+, there are _two_ different types of string datatypes: - [`text`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/text.html) and - [`keyword`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/keyword.html). These documentation pages make the difference much more clear. In short, in Elasticsearch 2.4, setting `"index": "not_analyzed"` gives the following changes, all of which we'd like for these literal fields: - Analysis is skipped all together, the raw value is added to the index directly (this is pretty much equivalent to setting `analyzer: keyword`) - [norms](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/norms.html) are disabled for the field, saving some disk space - [doc_values](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/doc-values.html) are _en_abled. The last one is most interesting. In short, doc_values take up a little disk space but allow us to very efficiently perform aggregations. Pelias doesn't generally perform aggregations today. However, after we begin using a [single mapping type](#293), we will have no way for the [pelias dashboard](https://github.com/pelias/dashboard) or any of our own analysis scripts to provide document counts for different sources or layers. The dasbhoard currently uses an API to get the count of various mapping types, which won't be supported going forward. While minor, we needed a solution to this, and the only other one is fielddata which is extremely expensive in terms of memory usage. This PR disables doc_values for all fields except `source` and `layer`, which gives us about a 4% disk space savings. Merely changing the literal field to use `not_analyzed` _increases_ disk space goes up around 3%, so this is roughly a 7% win! Summary ------ While not technically required for [Elasticsearch 5 support](pelias/pelias#461), this PR does bring us more in line with the best practices of ES5. It also sets us up for [Elasticsearch 6](pelias/pelias#719) where the `string` datatype we use now is completely removed. Fixes #99
This project previously depended on Elasticsearch types in order to provide a count of the number of records in each layer. Elasticsearch types will be going away in Elasticsearch 6, and are already deprecated in ES5. Instead of relying on types, the count of records per layer is now created using a [terms aggregation](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html) similar to the source and layers [autodetection code](pelias/api#1316) recently added to pelias/api. This is one of the last few roadblocks for dropping Pelias's use of types completely and merging PRs like pelias/schema#293 that will drastically simplify our schema. Connects pelias/pelias#461 Connects pelias/pelias#719
This project previously depended on Elasticsearch types in order to provide a count of the number of records in each layer. Elasticsearch types will be going away in Elasticsearch 6, and are already deprecated in ES5. Instead of relying on types, the count of records per layer is now created using a [terms aggregation](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html) similar to the source and layers [autodetection code](pelias/api#1316) recently added to pelias/api. This is one of the last few roadblocks for dropping Pelias's use of types completely and merging PRs like pelias/schema#293 that will drastically simplify our schema. Connects pelias/pelias#461 Connects pelias/pelias#719
a6887ae
to
9ef11be
Compare
I just updated this PR. When it was originally written, Travis didn't run the integration tests, which now pass. There's now 7000 lines of duplicate mapping configuration that will be removed :) It's good to merge at any time, however we should probably wait a month or so after both pelias/model#95 and pelias/model#121 are merged first. |
9ef11be
to
756ff0b
Compare
4d36398
to
fab8365
Compare
pelias/model#95 was merged a week ago. I think we should merge this around October 15th! |
this can help in updating that massive file
Currently, we define a unique Elasticsearch mapping type for each layer. There are 20 different layers in our standard datsets now, but all the mapping types are identical. This currently is non-optimal, but not really a big deal. However, in Elasticsearch 6, multiple mapping types are no longer supported. So we might as well get ahead of things now and do it. One immediate benefit, this change removes 4370 duplicate mapping type definition lines from the expected schema fixture. The time saved in updating that file alone when we make future changes will be huge! :) Connects pelias/pelias#719
fab8365
to
30225e0
Compare
Merged this a bit earlier than planned: it will be nice to have ES6 compatible schemas and Elasticsearch builds being completed ASAP, and anyone who runs into compatibility issues will only do so as they are about to start a new build. They will be able to update their importers and be ok. ES6 here we come! |
We currently set up many Elasticsearch mapping types. They are all identical, and thus there's no benefit to having more than one.
In Elasticsearch 6, multiple mapping types will not be supported, so we should get on top of things and remove them when we can.
There's also the immediate benefit that our Elasticsearch mapping will be much shorter and easier to work with. This not only makes Elasticsearch itself have to do less work to keep track of our types, but take a look at the 5000 lines of duplicated mapping type configuration we get to remove in our tests below!
This PR is built on #282 and must be merged after pelias/model#95
Connects pelias/pelias#719