Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a single Elasticsearch Mapping Type #293

Merged
merged 2 commits into from
Oct 4, 2019
Merged

Conversation

orangejulius
Copy link
Member

We currently set up many Elasticsearch mapping types. They are all identical, and thus there's no benefit to having more than one.

In Elasticsearch 6, multiple mapping types will not be supported, so we should get on top of things and remove them when we can.

There's also the immediate benefit that our Elasticsearch mapping will be much shorter and easier to work with. This not only makes Elasticsearch itself have to do less work to keep track of our types, but take a look at the 5000 lines of duplicated mapping type configuration we get to remove in our tests below!

This PR is built on #282 and must be merged after pelias/model#95

Connects pelias/pelias#719

@missinglink
Copy link
Member

missinglink commented Jun 22, 2018

I don't see any reason we can't merge this, we just need to first check:

  • no queries are relying on this functionality (and consider the effect on queries written in forks)
  • we run a full build and test that the quality and performance hasn't changed

The same goes for the strict mapping, but in that case, I'm not sure it would provide any real benefit? I'll add my comments on that PR.

orangejulius added a commit that referenced this pull request Oct 25, 2018
As guessed in #99, there _are_
differences between setting `"index": "not_analyzed"` for a field, and
merely setting the analyzer to `keyword`.

They are detailed in the Elasticsearch 2.4 [String datatype](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/string.html#string-params)
documentation, although it's a little bit confusing.

In Elasticsearch 5+, there are _two_ different types of string
datatypes:

- [`text`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/text.html) and
- [`keyword`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/keyword.html).

These documentation pages make the difference much more clear. In short,
in Elasticsearch 2.4, setting `"index": "not_analyzed"` gives the
following changes, all of which we'd like for these literal fields:

- Analysis is skipped all together, the raw value is added to the index
directly (this is pretty much equivalent to setting `analyzer: keyword`)
- [norms](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/norms.html) are disabled for the field, saving some disk space
- [doc_values](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/doc-values.html) are _en_abled.

The last one is most interesting. In short, doc_values take up a little
disk space but allow us to very efficiently perform aggregations. Pelias
doesn't generally perform aggregations today. However, after we begin
using a [single mapping type](#293), we will have no way for the [pelias dashboard](https://github.com/pelias/dashboard) or any of our own analysis scripts to provide document counts for different sources or layers. The dasbhoard currently uses an API to get the count of various mapping types, which won't be supported going forward.

While minor, we needed a solution to this, and the only other one is
fielddata which is extremely expensive in terms of memory usage.

In my testing, for the Portland metro Docker project, disk usage went
from 451MB to 473MB, or about a 5% increase.

If we wanted to trim that down a bit, we could consider disabling
`doc_values` for the `parent.*_id` fields. We don't have an immediate
need for `doc_values` on those fields, although it might be interesting
for analysis.

Summary
------

While not technically required for [Elasticsearch 5 support](pelias/pelias#461), this PR does bring us more in line with the best practices of ES5.

It also sets us up for [Elasticsearch 6](pelias/pelias#719) where the `string`
datatype we use now is completely removed.

Fixes #99
JWileczek pushed a commit to JWileczek/schema that referenced this pull request Oct 26, 2018
As guessed in pelias#99, there _are_
differences between setting `"index": "not_analyzed"` for a field, and
merely setting the analyzer to `keyword`.

They are detailed in the Elasticsearch 2.4 [String datatype](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/string.html#string-params)
documentation, although it's a little bit confusing.

In Elasticsearch 5+, there are _two_ different types of string
datatypes:

- [`text`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/text.html) and
- [`keyword`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/keyword.html).

These documentation pages make the difference much more clear. In short,
in Elasticsearch 2.4, setting `"index": "not_analyzed"` gives the
following changes, all of which we'd like for these literal fields:

- Analysis is skipped all together, the raw value is added to the index
directly (this is pretty much equivalent to setting `analyzer: keyword`)
- [norms](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/norms.html) are disabled for the field, saving some disk space
- [doc_values](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/doc-values.html) are _en_abled.

The last one is most interesting. In short, doc_values take up a little
disk space but allow us to very efficiently perform aggregations. Pelias
doesn't generally perform aggregations today. However, after we begin
using a [single mapping type](pelias#293), we will have no way for the [pelias dashboard](https://github.com/pelias/dashboard) or any of our own analysis scripts to provide document counts for different sources or layers. The dasbhoard currently uses an API to get the count of various mapping types, which won't be supported going forward.

While minor, we needed a solution to this, and the only other one is
fielddata which is extremely expensive in terms of memory usage.

In my testing, for the Portland metro Docker project, disk usage went
from 451MB to 473MB, or about a 5% increase.

If we wanted to trim that down a bit, we could consider disabling
`doc_values` for the `parent.*_id` fields. We don't have an immediate
need for `doc_values` on those fields, although it might be interesting
for analysis.

Summary
------

While not technically required for [Elasticsearch 5 support](pelias/pelias#461), this PR does bring us more in line with the best practices of ES5.

It also sets us up for [Elasticsearch 6](pelias/pelias#719) where the `string`
datatype we use now is completely removed.

Fixes pelias#99
orangejulius added a commit that referenced this pull request Nov 2, 2018
As guessed in #99, there _are_
differences between setting `"index": "not_analyzed"` for a field, and
merely setting the analyzer to `keyword`.

They are detailed in the Elasticsearch 2.4 [String datatype](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/string.html#string-params)
documentation, although it's a little bit confusing.

In Elasticsearch 5+, there are _two_ different types of string
datatypes:

- [`text`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/text.html) and
- [`keyword`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/keyword.html).

These documentation pages make the difference much more clear. In short,
in Elasticsearch 2.4, setting `"index": "not_analyzed"` gives the
following changes, all of which we'd like for these literal fields:

- Analysis is skipped all together, the raw value is added to the index
directly (this is pretty much equivalent to setting `analyzer: keyword`)
- [norms](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/norms.html) are disabled for the field, saving some disk space
- [doc_values](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/doc-values.html) are _en_abled.

The last one is most interesting. In short, doc_values take up a little
disk space but allow us to very efficiently perform aggregations. Pelias
doesn't generally perform aggregations today. However, after we begin
using a [single mapping type](#293), we will have no way for the [pelias dashboard](https://github.com/pelias/dashboard) or any of our own analysis scripts to provide document counts for different sources or layers. The dasbhoard currently uses an API to get the count of various mapping types, which won't be supported going forward.

While minor, we needed a solution to this, and the only other one is
fielddata which is extremely expensive in terms of memory usage.

This PR disables doc_values for all fields except `source` and `layer`,
which gives us about a 4% disk space savings. Merely changing the literal
field to use `not_analyzed` _increases_ disk space goes up around 3%, so
this is roughly a 7% win!

Summary
------

While not technically required for [Elasticsearch 5 support](pelias/pelias#461), this PR does bring us more in line with the best practices of ES5.

It also sets us up for [Elasticsearch 6](pelias/pelias#719) where the `string`
datatype we use now is completely removed.

Fixes #99
orangejulius added a commit that referenced this pull request Nov 2, 2018
As guessed in #99, there _are_
differences between setting `"index": "not_analyzed"` for a field, and
merely setting the analyzer to `keyword`.

They are detailed in the Elasticsearch 2.4 [String datatype](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/string.html#string-params)
documentation, although it's a little bit confusing.

In Elasticsearch 5+, there are _two_ different types of string
datatypes:

- [`text`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/text.html) and
- [`keyword`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/keyword.html).

These documentation pages make the difference much more clear. In short,
in Elasticsearch 2.4, setting `"index": "not_analyzed"` gives the
following changes, all of which we'd like for these literal fields:

- Analysis is skipped all together, the raw value is added to the index
directly (this is pretty much equivalent to setting `analyzer: keyword`)
- [norms](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/norms.html) are disabled for the field, saving some disk space
- [doc_values](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/doc-values.html) are _en_abled.

The last one is most interesting. In short, doc_values take up a little
disk space but allow us to very efficiently perform aggregations. Pelias
doesn't generally perform aggregations today. However, after we begin
using a [single mapping type](#293), we will have no way for the [pelias dashboard](https://github.com/pelias/dashboard) or any of our own analysis scripts to provide document counts for different sources or layers. The dasbhoard currently uses an API to get the count of various mapping types, which won't be supported going forward.

While minor, we needed a solution to this, and the only other one is
fielddata which is extremely expensive in terms of memory usage.

This PR disables doc_values for all fields except `source` and `layer`,
which gives us about a 4% disk space savings. Merely changing the literal
field to use `not_analyzed` _increases_ disk space goes up around 3%, so
this is roughly a 7% win!

Summary
------

While not technically required for [Elasticsearch 5 support](pelias/pelias#461), this PR does bring us more in line with the best practices of ES5.

It also sets us up for [Elasticsearch 6](pelias/pelias#719) where the `string`
datatype we use now is completely removed.

Fixes #99
orangejulius added a commit that referenced this pull request Nov 3, 2018
As guessed in #99, there _are_
differences between setting `"index": "not_analyzed"` for a field, and
merely setting the analyzer to `keyword`.

They are detailed in the Elasticsearch 2.4 [String datatype](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/string.html#string-params)
documentation, although it's a little bit confusing.

In Elasticsearch 5+, there are _two_ different types of string
datatypes:

- [`text`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/text.html) and
- [`keyword`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/keyword.html).

These documentation pages make the difference much more clear. In short,
in Elasticsearch 2.4, setting `"index": "not_analyzed"` gives the
following changes, all of which we'd like for these literal fields:

- Analysis is skipped all together, the raw value is added to the index
directly (this is pretty much equivalent to setting `analyzer: keyword`)
- [norms](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/norms.html) are disabled for the field, saving some disk space
- [doc_values](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/doc-values.html) are _en_abled.

The last one is most interesting. In short, doc_values take up a little
disk space but allow us to very efficiently perform aggregations. Pelias
doesn't generally perform aggregations today. However, after we begin
using a [single mapping type](#293), we will have no way for the [pelias dashboard](https://github.com/pelias/dashboard) or any of our own analysis scripts to provide document counts for different sources or layers. The dasbhoard currently uses an API to get the count of various mapping types, which won't be supported going forward.

While minor, we needed a solution to this, and the only other one is
fielddata which is extremely expensive in terms of memory usage.

This PR disables doc_values for all fields except `source` and `layer`,
which gives us about a 4% disk space savings. Merely changing the literal
field to use `not_analyzed` _increases_ disk space goes up around 3%, so
this is roughly a 7% win!

Summary
------

While not technically required for [Elasticsearch 5 support](pelias/pelias#461), this PR does bring us more in line with the best practices of ES5.

It also sets us up for [Elasticsearch 6](pelias/pelias#719) where the `string`
datatype we use now is completely removed.

Fixes #99
orangejulius added a commit that referenced this pull request Nov 3, 2018
As guessed in #99, there _are_
differences between setting `"index": "not_analyzed"` for a field, and
merely setting the analyzer to `keyword`.

They are detailed in the Elasticsearch 2.4 [String datatype](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/string.html#string-params)
documentation, although it's a little bit confusing.

In Elasticsearch 5+, there are _two_ different types of string
datatypes:

- [`text`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/text.html) and
- [`keyword`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/keyword.html).

These documentation pages make the difference much more clear. In short,
in Elasticsearch 2.4, setting `"index": "not_analyzed"` gives the
following changes, all of which we'd like for these literal fields:

- Analysis is skipped all together, the raw value is added to the index
directly (this is pretty much equivalent to setting `analyzer: keyword`)
- [norms](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/norms.html) are disabled for the field, saving some disk space
- [doc_values](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/doc-values.html) are _en_abled.

The last one is most interesting. In short, doc_values take up a little
disk space but allow us to very efficiently perform aggregations. Pelias
doesn't generally perform aggregations today. However, after we begin
using a [single mapping type](#293), we will have no way for the [pelias dashboard](https://github.com/pelias/dashboard) or any of our own analysis scripts to provide document counts for different sources or layers. The dasbhoard currently uses an API to get the count of various mapping types, which won't be supported going forward.

While minor, we needed a solution to this, and the only other one is
fielddata which is extremely expensive in terms of memory usage.

This PR disables doc_values for all fields except `source` and `layer`,
which gives us about a 4% disk space savings. Merely changing the literal
field to use `not_analyzed` _increases_ disk space goes up around 3%, so
this is roughly a 7% win!

Summary
------

While not technically required for [Elasticsearch 5 support](pelias/pelias#461), this PR does bring us more in line with the best practices of ES5.

It also sets us up for [Elasticsearch 6](pelias/pelias#719) where the `string`
datatype we use now is completely removed.

Fixes #99
orangejulius added a commit to pelias/dashboard that referenced this pull request Jul 3, 2019
This project previously depended on Elasticsearch types in order to
provide a count of the number of records in each layer.

Elasticsearch types will be going away in Elasticsearch 6, and are
already deprecated in ES5.

Instead of relying on types, the count of records per layer is now
created using a [terms aggregation](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html)
similar to the source and layers [autodetection
code](pelias/api#1316) recently added to
pelias/api.

This is one of the last few roadblocks for dropping Pelias's use of
types completely and merging PRs like
pelias/schema#293 that will drastically simplify
our schema.

Connects pelias/pelias#461
Connects pelias/pelias#719
orangejulius added a commit to pelias/dashboard that referenced this pull request Jul 3, 2019
This project previously depended on Elasticsearch types in order to
provide a count of the number of records in each layer.

Elasticsearch types will be going away in Elasticsearch 6, and are
already deprecated in ES5.

Instead of relying on types, the count of records per layer is now
created using a [terms aggregation](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html)
similar to the source and layers [autodetection
code](pelias/api#1316) recently added to
pelias/api.

This is one of the last few roadblocks for dropping Pelias's use of
types completely and merging PRs like
pelias/schema#293 that will drastically simplify
our schema.

Connects pelias/pelias#461
Connects pelias/pelias#719
@orangejulius orangejulius force-pushed the use-a-single-type branch 3 times, most recently from a6887ae to 9ef11be Compare August 30, 2019 12:38
@orangejulius
Copy link
Member Author

I just updated this PR. When it was originally written, Travis didn't run the integration tests, which now pass.

There's now 7000 lines of duplicate mapping configuration that will be removed :)

It's good to merge at any time, however we should probably wait a month or so after both pelias/model#95 and pelias/model#121 are merged first.

@orangejulius orangejulius force-pushed the use-a-single-type branch 2 times, most recently from 4d36398 to fab8365 Compare September 13, 2019 16:48
@orangejulius
Copy link
Member Author

pelias/model#95 was merged a week ago. I think we should merge this around October 15th!

this can help in updating that massive file
Currently, we define a unique Elasticsearch mapping type for each layer.
There are 20 different layers in our standard datsets now, but all the
mapping types are identical.

This currently is non-optimal, but not really a big deal. However, in
Elasticsearch 6, multiple mapping types are no longer supported. So we
might as well get ahead of things now and do it.

One immediate benefit, this change removes 4370 duplicate mapping type
definition lines from the expected schema fixture. The time saved in
updating that file alone when we make future changes will be huge! :)

Connects pelias/pelias#719
@orangejulius orangejulius merged commit e31f8e2 into master Oct 4, 2019
@orangejulius orangejulius deleted the use-a-single-type branch October 4, 2019 20:33
@orangejulius
Copy link
Member Author

Merged this a bit earlier than planned: it will be nice to have ES6 compatible schemas and Elasticsearch builds being completed ASAP, and anyone who runs into compatibility issues will only do so as they are about to start a new build. They will be able to update their importers and be ok.

ES6 here we come!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants