Introduce the new convention for multi-fields text indexing to the README. #140

Merged (3 commits), Oct 24, 2018
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -19,6 +19,8 @@ All notable changes to this project will be documented in this file based on the
* Remove `*.timezone.offset.sec` fields as too specific for ECS at the moment. #134
* Make the following fields keyword: device.vendor, file.path, file.target_path, http.response.body, network.name, organization.name, url.href, url.path, url.query, user_agent.original
* Rename `url.host.name` to `url.hostname` to better align with industry convention.
* Make the following fields keyword: device.vendor, file.path, file.target_path, http.response.body, network.name, organization.name, url.href, url.path, url.query, user_agent.original. #137
* Only two fields using `text` indexing at this time are `message` and `error.message`.

### Bugfixes

58 changes: 34 additions & 24 deletions README.md
@@ -458,40 +458,50 @@ Contributions of additional uses cases on top of ECS are welcome.

### Multi-fields text indexing

ElasticSearch can index text multiple ways:
Elasticsearch can index text multiple ways:

* [text](https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html) indexing allows for full text search, or searching arbitrary words that
* [text](https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html)
indexing allows for full text search, or searching arbitrary words that
are part of the field.
* [keyword](https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html) indexing allows for much faster
[exact match](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html)
and [prefix search](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-prefix-query.html),
* [keyword](https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html)
indexing allows for much faster
[exact match filtering](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html),
[prefix search](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-prefix-query.html),
and allows for [aggregations](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html)
(what Kibana visualizations are built on).

In some cases, only one type of indexing makes sense for a field.
By default, unless your index mapping or index template specifies otherwise
Member:

Instead of writing this here, could we link to the ECS docs? We should if possible not explain how Elasticsearch works in ECS.

(as the ECS index template does),
Elasticsearch indexes text fields as `text` at the canonical field name,
and indexes them a second time as `keyword`, nested in a multi-field.

However there are cases where both types of indexing can be useful, and we want
to index both ways.
As an example, log messages can sometimes be short enough that it makes sense
to sort them by frequency (that's an aggregation). They can also be long and
varied enough that full text search can be useful on them.
Default Elasticsearch convention:

Whenever both types of indexing are helpful, we use multi-fields indexing. The
convention used is the following:
* Canonical field: `myfield` is `text`
* Multi-field: `myfield.keyword` is `keyword`
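
The default convention above can be sketched as a mapping fragment, written here as a Python dict for readability. `myfield` is the placeholder from the list, and `default_string_mapping` is a hypothetical helper name. The `ignore_above: 256` limit reflects what Elasticsearch dynamic mapping applies to strings, and should be treated as an assumption about the version in use.

```python
# Sketch: the shape Elasticsearch's default dynamic mapping gives a string
# value. The helper name is hypothetical; `myfield` is a placeholder.
def default_string_mapping() -> dict:
    return {
        "type": "text",  # canonical field: full text search
        "fields": {
            # nested multi-field, addressed as `myfield.keyword`
            "keyword": {"type": "keyword", "ignore_above": 256},
        },
    }


mapping = {"properties": {"myfield": default_string_mapping()}}
```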

* `foo`: `text` indexing.
The top level of the field (its plain name) is used for full text search.
* `foo.raw`: `keyword` indexing.
The nested field has suffix `.raw` and is what you will use for aggregations.
* Performance tip: when filtering your stream in Kibana (or elsewhere), if you
are filtering for an exact match or doing a prefix search,
both `text` and `keyword` field can be used, but doing so on the `keyword`
field (named `.raw`) will be much faster and less memory intensive.
For monitoring use cases, `keyword` indexing is needed almost exclusively, with
full text search on very few fields. Given this premise, ECS defaults
all text indexing to `keyword` at the top level (with very few exceptions).
Any use case that requires full text search indexing on additional fields
can simply add a [multi-field](https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html)
for full text search. Doing so does not conflict with ECS,
as the canonical field name will remain `keyword` indexed.

**Keyword only fields**
ECS multi-field convention for text:

The fields that only make sense as type `keyword` are not named `foo.raw`, the
plain field (`foo`) will be of type `keyword`, with no nested field.
* Canonical field: `myfield` is `keyword`
* Multi-field: `myfield.text` is `text`
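
A minimal sketch of the ECS convention above, again as Python dicts. The helper name `ecs_string_field` and its `full_text` parameter are hypothetical, chosen only for illustration; the actual ECS index template may set additional options that are not shown here.

```python
# Sketch: ECS convention for string fields. The canonical field is `keyword`;
# a `text` multi-field is added only when full text search is needed.
# `ecs_string_field` and `full_text` are hypothetical names.
def ecs_string_field(full_text: bool = False) -> dict:
    field = {"type": "keyword"}
    if full_text:
        # nested multi-field, addressed as `myfield.text`
        field["fields"] = {"text": {"type": "text"}}
    return field


mapping = {
    "properties": {
        "myfield": ecs_string_field(full_text=True),
        "otherfield": ecs_string_field(),  # keyword only, no multi-field
    }
}
```

Because the canonical field stays `keyword` in both cases, adding the `text` multi-field later does not change how existing exact match filters and aggregations behave.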

#### Exceptions
Member:

I don't think we need to document this.

Contributor Author:

Good catch, forgot to drop that old text.

Contributor Author (@webmat, Oct 22, 2018):

Sorry, misread that. I thought we still had a reference to .raw fields.

I think it's worthwhile to document the break from this new convention. These are widely used fields and they will behave exactly the reverse of this new convention we're introducing. I think it's helpful to make sure it's clear.

Or alternately, I would make them follow the new convention, but declare them right away as a multi-field. E.g. message is keyword and message.text is text. This would avoid the uncomfortable explanation of explaining two exceptions, and would let people do fast exact match filtering based on the message field without having to reintroduce message.keyword ;-)

Member:

As mentioned before, these are not exceptions for me.


The only exceptions to this convention are fields `message` and `error.message`,
which are indexed for full text search only, with no multi-field.
These two fields don't follow the new convention because they are deemed too big
Member:

I think the purpose of message is to be indexed, so even if we did not have it as text today, I think it should be text.

Contributor Author:

Not sure what you mean by this. message is indexed as text right now.

Are you mixing this up with my comment here #138 (review) 😄 ?

of a breaking change with these two widely used fields in Beats.

Any future field that will be indexed for full text search in ECS will however
Member:

Let's really skip the Exceptions part here. I think we should discuss what to do with other text-purpose fields when we encounter them, and not get ahead of ourselves in the docs.

follow the multi-field convention where `text` indexing is nested in the multi-field.

### IDs are keywords not integers

58 changes: 34 additions & 24 deletions docs/implementing.md
@@ -26,40 +26,50 @@

### Multi-fields text indexing

ElasticSearch can index text multiple ways:
Elasticsearch can index text multiple ways:

* [text](https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html) indexing allows for full text search, or searching arbitrary words that
* [text](https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html)
indexing allows for full text search, or searching arbitrary words that
are part of the field.
* [keyword](https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html) indexing allows for much faster
[exact match](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html)
and [prefix search](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-prefix-query.html),
* [keyword](https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html)
indexing allows for much faster
[exact match filtering](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html),
[prefix search](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-prefix-query.html),
and allows for [aggregations](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html)
(what Kibana visualizations are built on).
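
To make the difference between the two kinds of indexing concrete, here is a hedged sketch of the request bodies each one serves, built as Python dicts. The helper names are hypothetical; the field names `message` (`text`) and `url.path` (`keyword`) are taken from the CHANGELOG entries in this change.

```python
# Sketch: query bodies for each indexing type. Helper names are hypothetical.

def full_text_query(field: str, words: str) -> dict:
    """Match query: analyzed search, intended for `text` fields."""
    return {"query": {"match": {field: words}}}


def exact_match_filter(field: str, value: str) -> dict:
    """Term query in filter context: exact match on a `keyword` field."""
    return {"query": {"bool": {"filter": [{"term": {field: value}}]}}}


def prefix_search(field: str, prefix: str) -> dict:
    """Prefix query on a `keyword` field."""
    return {"query": {"prefix": {field: {"value": prefix}}}}


def top_values(field: str, size: int = 10) -> dict:
    """Terms aggregation: most frequent values of a `keyword` field."""
    return {"size": 0, "aggs": {"top_values": {"terms": {"field": field, "size": size}}}}


search_message = full_text_query("message", "connection refused")
filter_path = exact_match_filter("url.path", "/login")
paths_under_api = prefix_search("url.path", "/api/")
busiest_paths = top_values("url.path", size=5)
```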

In some cases, only one type of indexing makes sense for a field.
By default, unless your index mapping or index template specifies otherwise
(as the ECS index template does),
Elasticsearch indexes text fields as `text` at the canonical field name,
and indexes them a second time as `keyword`, nested in a multi-field.

However there are cases where both types of indexing can be useful, and we want
to index both ways.
As an example, log messages can sometimes be short enough that it makes sense
to sort them by frequency (that's an aggregation). They can also be long and
varied enough that full text search can be useful on them.
Default Elasticsearch convention:

Whenever both types of indexing are helpful, we use multi-fields indexing. The
convention used is the following:
* Canonical field: `myfield` is `text`
* Multi-field: `myfield.keyword` is `keyword`

* `foo`: `text` indexing.
The top level of the field (its plain name) is used for full text search.
* `foo.raw`: `keyword` indexing.
The nested field has suffix `.raw` and is what you will use for aggregations.
* Performance tip: when filtering your stream in Kibana (or elsewhere), if you
are filtering for an exact match or doing a prefix search,
both `text` and `keyword` field can be used, but doing so on the `keyword`
field (named `.raw`) will be much faster and less memory intensive.
For monitoring use cases, `keyword` indexing is needed almost exclusively, with
full text search on very few fields. Given this premise, ECS defaults
all text indexing to `keyword` at the top level (with very few exceptions).
Any use case that requires full text search indexing on additional fields
can simply add a [multi-field](https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html)
for full text search. Doing so does not conflict with ECS,
as the canonical field name will remain `keyword` indexed.

**Keyword only fields**
ECS multi-field convention for text:

The fields that only make sense as type `keyword` are not named `foo.raw`, the
plain field (`foo`) will be of type `keyword`, with no nested field.
* Canonical field: `myfield` is `keyword`
* Multi-field: `myfield.text` is `text`

#### Exceptions

The only exceptions to this convention are fields `message` and `error.message`,
which are indexed for full text search only, with no multi-field.
These two fields don't follow the new convention because they are deemed too big
of a breaking change with these two widely used fields in Beats.

Any future field that will be indexed for full text search in ECS will however
follow the multi-field convention where `text` indexing is nested in the multi-field.
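
As an illustration of how the two exceptions sit next to a regular ECS field, the sketch below builds a nested `properties` mapping from dotted field names. `add_field` is a hypothetical helper, not part of ECS or any Elasticsearch client, and `url.path` is used only as an example of a keyword field named in the CHANGELOG.

```python
# Sketch: place a leaf mapping under a dotted field name, creating the
# intermediate object levels as needed. `add_field` is a hypothetical helper.
def add_field(properties: dict, dotted_name: str, leaf: dict) -> None:
    *parents, last = dotted_name.split(".")
    node = properties
    for part in parents:
        node = node.setdefault(part, {"properties": {}})["properties"]
    node[last] = leaf


properties: dict = {}
# The two exceptions: full text search only, no multi-field.
add_field(properties, "message", {"type": "text"})
add_field(properties, "error.message", {"type": "text"})
# A field following the ECS convention: keyword at the canonical name.
add_field(properties, "url.path", {"type": "keyword"})
```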

### IDs are keywords not integers
