Better out-of-the-box mappings for logs, metrics and synthetics #64978

ruflin · 2020-11-12T09:33:20Z

One of the problems we have today with the default templates is that ip addresses and message fields are not mapped correct. Auto detection of ip addresses would be great: #64400 But in the meantime, we could also match on the naming convention that all *.ip fields are of type ip address.

During implementation I was not sure if I should use match: ip for path_match: *.ip. The main difference I would expect that the top level is not matched. Are there others?

Do we have fields which are named ip or message which do not adhere to these rules?

This PR is not complete but is to share the idea, also it was not tested. If we decide to move forward, also the other base templates need updating and it might be useful to add the same logic to all datasets created by Kibana.

One of the problems we have today with the default templates is that ip addresses and message fields are not mapped correct. Auto detection of ip addresses would be great: elastic#64400 But in the meantime, we could also match on the naming convention that all `*.ip` fields are of type ip address. During implementation I was not sure if I should use `match: ip` for `path_match: *.ip`. The main difference I would expect that the top level is not matched. Are there others? Do we have fields which are named `ip` or `message` which do not adhere to these rules? This PR is not complete but is to share the idea. If we decide to move forward, also the other base templates need updating and it might be useful to add the same logic to all datasets created by Kibana.

ruflin · 2020-11-12T09:34:10Z

@dakrone @exekias @axw @jpountz @jalvz Would be great to get your thoughts on the above. @webmat This might also be an interesting option for ECS.

jpountz · 2020-11-12T10:11:25Z

The main difference I would expect that the top level is not matched. Are there others?

This is the only difference indeed. So maybe we should use match rather than path_match then? And with your change we could also remove the message field from the mappings, which could be nice for the logging datasets that don't have a message field?

x-pack/plugin/core/src/main/resources/logs-mappings.json

Co-authored-by: Adrien Grand <jpountz@gmail.com>

webmat · 2020-11-12T14:09:55Z

In ECS all of the .ip fields are indeed of type ip. We also strive for consistency, so I don't expect any .ip leaf fields getting into ECS that would not be of type ip. Whenever there's a doubt (could be a socket / hostname or IP), we have the dual .address => .ip fields. So this will be fine from the ECS fields POV.

If you want to catch 100% of the ECS ip fields, you'd need to also catch network.forwarded_ip and dns.resolved_ip, but this may be out of scope for the intent here.

So I'm good with this approach as is.

axw · 2020-11-16T02:06:23Z

These look fine for the base logs template, but I don't think we could extend this to the datasets created for APM.

In APM we have some object fields called message related to messaging (message queue/bus) events. The specific fields we index at the moment are span.message.queue.name and span.message.age.ms.

jpountz · 2020-11-16T14:35:27Z

@axw The proposal uses a combination of match: message and match_mapping_type: string so it would only be considered for fields that both have message as a name and are represented as strings in the JSON document. So it wouldn't be an issue for objects whose name is message.

axw · 2020-11-17T03:26:35Z

Thanks @jpountz, I missed that bit.

This would be fine to include for all templates for APM then, and I think it's reasonable to assume all string-type fields called *.message should be mapped as text.

ruflin · 2020-11-17T08:47:59Z

@jpountz I'm torn if we should remove message or not by just using match. From a pure technical reason it seems the final mapping would be identical. But if a user looks at the template / mapping the user always see message and can think about how to use it. At the same time, the simpler the template the better ...

@webmat I'm wondering if the above is also a convention we should put into ECS itself. The same field name MUST have always the same type no matter where it is in the hierarchy. message and ip are the two obvious ones here, but it might also apply to id for example and could be used for "groups" that are usable in different places like user.

exekias · 2020-11-17T09:23:59Z

I have checked beats codebase and we have a few instances where *.ip is not of type ip. A few of them are mapped as keywords, but I would consider this a bug, others are alias, which may be more legit:

https://github.com/elastic/beats/blob/af4007eaef6e142bb0d46370918db4ce7135f316/x-pack/filebeat/module/suricata/eve/_meta/fields.yml#L92-L94
https://github.com/elastic/beats/blob/af4007eaef6e142bb0d46370918db4ce7135f316/filebeat/module/system/auth/_meta/fields.yml#L50-L53
https://github.com/elastic/beats/blob/af4007eaef6e142bb0d46370918db4ce7135f316/filebeat/module/haproxy/_meta/fields.yml#L90-L97

Are you trying to solve getting host.ip mapped correctly or do you have some other field in mind?

I wonder if we can make this decision for the user based on field names, at the very least we would need to do a couple of changes in the modules to make things consistent.

To some extent I wonder if we want to unify how type hinting should be done. It seems things are moving towards #61939, where we use path matching as a workaround.

jpountz · 2020-11-17T12:27:01Z

@jpountz I'm torn if we should remove message or not by just using match. From a pure technical reason it seems the final mapping would be identical. But if a user looks at the template / mapping the user always see message and can think about how to use it. At the same time, the simpler the template the better ...

It looks like we have some logging datasets that are completely structured and don't have a message field, so maybe we should just have it in the dynamic template so that it would only be introduced when relevant?

ruflin · 2020-11-17T15:44:35Z

I'm not trying to fix the host.ip problem here as even if we have this mapping in, someone could still use host as a keyword and break things. But much more try to solve the problem around all the other ip addresses which for example show up in the add_*_metadata processors so we don't have to map everything.

@exekias Thanks for checking the Beats fields on this. The alias fields should not be an issue as these do not exist as actual data. Also my assumption is, an actual mapping would overwrite the dynamic mapping in case both match.

For #61939 I think these two are complementary. #61939 is used for any ingestion where the fields are actually known and probably in most cases goes to a specific dataset. This change here is to get the base mappings correct in most cases without complicating the base template too much.

webmat · 2020-11-18T19:28:16Z

a convention we should put into ECS itself

@ruflin It's a convention we've implicitly been following indeed. It could make sense to capture this guidance explicitly as well.

ruflin · 2020-11-20T11:55:14Z

I updated the PR with all the above conversations:

disable norms on message:
- Is this a breaking change? @jpountz @exekias
- Should we add this to ECS too? @webmat
Use match and remove message

Should we move forward with this? If yes, how are these things tested? And we should also add it to the other templates? Would it be possible that the ES team takes over from here @dakrone ?

webmat · 2020-11-20T15:38:57Z

Should we add this to ECS too?

Yes I think it's a good idea. It's in line with how we try to keep field names consistent. Identifying those that come with an obvious associated type and extending the consistency this way makes a lot of sense both for ECS fields, and to help users following these conventions in custom fields.

I've opened this issue to track this elastic/ecs#1144

webmat · 2020-11-20T15:47:37Z

With the advent of wildcard, I wonder if we shouldn't think about moving the message fields to wildcard instead of text as the main data type, however.

Some message types like the multi-paragraph Windows event messages obviously work well with text. But there's also so many message types that are short and/or structured and would benefit more from wildcard. If we had the courage make this breaking change at the next major, then we'd have 100% string fields in ECS be aggregatable & offer fast exact matches when filtering. And we'd have all fields that benefit from an analyzer have a .text multi-field.

Right now ECS has many keyword fields with a .text multi-field. But message and error.message are the only two who don't behave like the rest of the string fields, and use type text on the canonical field instead of in a multi-field.

I definitely see why we might want to avoid this breaking change as well, so I could see this going both ways. But this issue seemed like a good time to bring this up 😄

ruflin · 2020-11-23T10:17:00Z

@webmat Thanks for the ECS follow up. I subscribed where it is head. For the message part, lets open a separate thread. As the change here should not be breaking one, we can decouple the two.

jpountz · 2020-11-24T14:02:51Z

Some message types like the multi-paragraph Windows event messages obviously work well with text. But there's also so many message types that are short and/or structured and would benefit more from wildcard.

Yes, the more I'm thinking of this, the more I believe that we won't be able to come up with a good default for all datasets, some datasets will benefit from having the message field mapped as wildcard but other datasets like many application logs should probably keep using text.

If we had the courage make this breaking change at the next major

For the aspect about changing the semantics of the message field, we've had a couple discussions about having a variant of wildcard that behaves like text (similarly to how wildcard behaves exactly like keyword today). This is something that could help going through the backward break I believe?

jpountz

+1 to move forward with this

jpountz · 2020-11-24T14:04:13Z

x-pack/plugin/core/src/main/resources/logs-mappings.json

+          },
+          "match_ip": {
+            "match_mapping_type": "string",
+            "match":   "ip",


keep a single space after the colon?

jpountz · 2020-11-24T14:04:15Z

x-pack/plugin/core/src/main/resources/logs-mappings.json

+          },
+          "match_message": {
+            "match_mapping_type": "string",
+            "match":   "message",


keep a single space after the colon?

ruflin · 2020-11-25T08:46:56Z

@jpountz Is there a good way to automatically test these changes?

jpountz

See Github actions on this PR, we have tests that run automatically on all PRs. I just made suggestions that should hopefully address the test failures we are seeing.

x-pack/plugin/core/src/main/resources/logs-mappings.json

Co-authored-by: Adrien Grand <jpountz@gmail.com>

ruflin · 2020-11-26T07:27:20Z

The tests I wanted to are something like:

Set template
Add doc with field foo.ip
Check if mapping is correct

The same for message to confirm the change works as expected. I found https://github.com/elastic/elasticsearch/pull/57629/files#diff-1e7ab3525fa67e621593216ed026d39eb002b44fe440e0cdb33b577f40631b00, might this be the right place to add something like this?

dakrone

LGTM, thanks for shepherding this Adrien. I left a few more comments about the application order. I think it's best to go from most general -> most specific in terms of mappings so that when a user changes a more specific mapping it doesn't get accidentally bypassed by a general mapping.

dakrone · 2021-04-22T22:31:30Z

x-pack/plugin/core/src/main/resources/logs-template.json

    "logs-mappings",
+    "data-streams-mappings",
    "logs-settings"


I think we should apply the data-streams-mappings settings first, so that any changes made to the logs-mappings component template always take precedence over the generic data stream mappings.

Suggested change

"logs-mappings",

"data-streams-mappings",

"logs-settings"

"data-streams-mappings",

"logs-mappings",

"logs-settings"

What do you think?

Actually this would break tests because then the dynamic template that maps strings as keywords would take precedence over the dynamic template that maps message fields as match_only_text. In order to change the order, we would also need to configure unmatch:message on the default dynamic template that maps strings as keywords. What is your preference?

dakrone · 2021-04-22T22:31:50Z

x-pack/plugin/core/src/main/resources/metrics-template.json

    "metrics-mappings",
+    "data-streams-mappings",


Same here with:

Suggested change

"metrics-mappings",

"data-streams-mappings",

"data-streams-mappings",

"metrics-mappings",

dakrone · 2021-04-22T22:32:05Z

x-pack/plugin/core/src/main/resources/synthetics-template.json

    "synthetics-mappings",
+    "data-streams-mappings",


And same here with:

Suggested change

"synthetics-mappings",

"data-streams-mappings",

"data-streams-mappings",

"synthetics-mappings",

ruflin

I quite like how you use components here to simplify the templates.

For the message field, I'm worried that if a metrics event will contain a message, it will be mapped as keyword but should be text. This field is so basic, I would keep it the default template. If not used, does this cause lots of overhead?

ruflin · 2021-04-23T11:26:39Z

x-pack/plugin/core/src/main/resources/data-streams-mappings.json

+        "host": {
+          "type": "object"
+        },
+        "observer": {


I think we could leave this one out.

ruflin · 2021-04-23T11:27:12Z

x-pack/plugin/core/src/main/resources/logs-mappings.json

-              "type": "keyword"
-            },
-            "match_mapping_type": "string"
+              "type": "match_only_text"


Is this a breaking change?

What aspect are you concerned with? I expect this to be transparent for the vast majority of our users, and I started discussions with the ECS team to make similar changes on ECS.

Read up on it by now a bit more. What happens if a user runs a non "supported" query across logs-* and some message fields are text and some are match_only_text. Is an error returned or the non supported fields are just skipped?

The only unsupported queries are span queries. If a span query is run, then all shards that have the field mapped as match_only_text will be ignored from the search response and reported as "failed".

jpountz · 2021-04-24T13:06:48Z

so that when a user changes a more specific mapping it doesn't get accidentally bypassed by a general mapping.

@dakrone Can you give more details about the use-case you have in mind? How could a user be affected by this? My understanding was that users couldn't inherit from these templates and would need to copy them anyway if they wanted to make changes?

For the message field, I'm worried that if a metrics event will contain a message, it will be mapped as keyword but should be text. This field is so basic, I would keep it the default template. If not used, does this cause lots of overhead?

@ruflin At this stage it doesn't matter much in terms of overhead, but if we put message in the base template I worry that it would set a precedent that things should be folded into the base template if possible while I'd rather like to make sure that everything that ends up in a template is actually relevant to the considered index pattern. When I talked with @exekias, he seemed confident that it would be a bug if there was a message field in a metrics dataset.

ruflin · 2021-04-26T08:55:26Z

@exekias I remember we used error.message or similar in the past. Maybe an option here is to do the same trick as we do with ip. Everything that is called message must be text.

@jpountz Don't block on this, we can always follow up.

exekias · 2021-04-26T09:13:30Z

Yes, error.message is a thing, I think it's fine to have a rule to map *message as text. I wonder how expected will that be to our users, but at least for our datasets I don't think that would cause any issue.

dakrone · 2021-04-26T16:48:27Z

so that when a user changes a more specific mapping it doesn't get accidentally bypassed by a general mapping.

@dakrone Can you give more details about the use-case you have in mind? How could a user be affected by this? My understanding was that users couldn't inherit from these templates and would need to copy them anyway if they wanted to make changes?

It is certainly recommended that a user copy them rather than make changes directly.

That being said, I think in terms of our defaults, we should always go from most general to most specific. First, because we should set a good example of a recommended pattern for component template usage, and second, so that in the case that a user does edit the template directly, we are closer to the desired and expected effect.

Do you agree? Do you have a different reason for giving the more generic template higher precedence?

jpountz · 2021-04-27T07:28:28Z

Do you agree? Do you have a different reason for giving the more generic template higher precedence?

This is a good question.

One problem we have is that the first component template wins when it comes to dynamic templates (because dynamic templates are appended and the first one that matches wins), but the last component template wins when it comes to properties (because properties get overridden).

This makes me wonder if we should split properties and dynamic templates into different component templates so that we could order them that way:

specialized dynamic templates
generic dynamic templates
generic properties
specialized properties

What do you think?

dakrone · 2021-04-30T20:08:33Z

Sorry for the delay on this,

This makes me wonder if we should split properties and dynamic templates into different component templates so that we could order them that way ... What do you think?

This sounds reasonable to me, though I do think we start growing in complexity the more we add. In that regard, maybe it's okay to leave it the way it is, with two component templates rather than four.

I think I'm okay either way, regardless, this still LGTM.

jpountz · 2021-05-03T04:35:04Z

Right, this complexity is the reason why I didn't jump on this solution. My gut feeling is that we won't often need to override properties given how they are supposed to be more-or-less standardized via ECS, so a simpler path forward might be to put the most specific templates first and avoid defining the same field across multiple templates.

jpountz · 2021-05-03T09:55:19Z

I kept templates from the most specific to the most generic for now and applied remaining feedback:

observer object field removed,
moved message to the base template due to the error.message field.

jpountz · 2021-05-04T13:30:43Z

@elasticmachine run elasticsearch-ci/2

ruflin · 2021-05-04T18:49:43Z

@jpountz Thanks for pushing this over the line.

This template was added in elastic#64978, however, there can be some test failures if we try to remove built-in templates. It was missing from the list and now needs to be added back.

This template was added in #64978, however, there can be some test failures if we try to remove built-in templates. It was missing from the list and now needs to be added back.

This template was added in elastic#64978, however, there can be some test failures if we try to remove built-in templates. It was missing from the list and now needs to be added back.

…73672) This template was added in #64978, however, there can be some test failures if we try to remove built-in templates. It was missing from the list and now needs to be added back.

ruflin self-assigned this Nov 12, 2020

jpountz reviewed Nov 12, 2020

View reviewed changes

x-pack/plugin/core/src/main/resources/logs-mappings.json Outdated Show resolved Hide resolved

Update x-pack/plugin/core/src/main/resources/logs-mappings.json

dae84d6

Co-authored-by: Adrien Grand <jpountz@gmail.com>

update mapping with discussion

90ca4b1

ruflin mentioned this pull request Nov 20, 2020

Detect ip addresses as part of dynamic field mappings #64400

Open

webmat mentioned this pull request Nov 20, 2020

Establish field name => data type conventions for leaf fields elastic/ecs#1144

Open

jpountz approved these changes Nov 24, 2020

View reviewed changes

jpountz reviewed Nov 25, 2020

View reviewed changes

x-pack/plugin/core/src/main/resources/logs-mappings.json Outdated Show resolved Hide resolved

x-pack/plugin/core/src/main/resources/logs-mappings.json Outdated Show resolved Hide resolved

ruflin and others added 2 commits November 26, 2020 08:20

Update x-pack/plugin/core/src/main/resources/logs-mappings.json

35d0727

Co-authored-by: Adrien Grand <jpountz@gmail.com>

Update x-pack/plugin/core/src/main/resources/logs-mappings.json

ce72eef

Co-authored-by: Adrien Grand <jpountz@gmail.com>

Put the most specific dynamic templates first.

2e130a5

dakrone approved these changes Apr 22, 2021

View reviewed changes

ruflin commented Apr 23, 2021

View reviewed changes

jpountz added 2 commits May 3, 2021 11:34

Move message dynamic template to the shared mapping template.

f7376e5

Remove observer field.

42c9f58

Merge branch 'master' into path-match-ip-message

89eb06d

jpountz merged commit 2bd1b33 into elastic:master May 4, 2021

jpountz mentioned this pull request May 4, 2021

Better out-of-the-box mappings for logs, metrics and synthetics #72703

Merged

jpountz added v7.14.0 v8.0.0 >enhancement labels May 4, 2021

ruflin deleted the path-match-ip-message branch May 4, 2021 18:49

tsg mentioned this pull request May 28, 2021

[Security Solution][Alerts] Use dynamic mappings to create smaller templates & mappings elastic/kibana#100884

Open

dakrone mentioned this pull request Jun 1, 2021

Add data-streams-mappings to isXPackTemplate method #73633

Merged

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

Better out-of-the-box mappings for logs, metrics and synthetics #64978

Better out-of-the-box mappings for logs, metrics and synthetics #64978

Conversation

ruflin commented Nov 12, 2020

ruflin commented Nov 12, 2020

jpountz commented Nov 12, 2020 • edited Loading

webmat commented Nov 12, 2020

axw commented Nov 16, 2020

jpountz commented Nov 16, 2020

axw commented Nov 17, 2020

ruflin commented Nov 17, 2020

exekias commented Nov 17, 2020

jpountz commented Nov 17, 2020

ruflin commented Nov 17, 2020

webmat commented Nov 18, 2020

ruflin commented Nov 20, 2020

webmat commented Nov 20, 2020

webmat commented Nov 20, 2020 • edited Loading

ruflin commented Nov 23, 2020

jpountz commented Nov 24, 2020

jpountz left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ruflin commented Nov 25, 2020

jpountz left a comment

Choose a reason for hiding this comment

ruflin commented Nov 26, 2020

dakrone left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ruflin left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jpountz commented Apr 24, 2021

ruflin commented Apr 26, 2021

exekias commented Apr 26, 2021

dakrone commented Apr 26, 2021

jpountz commented Apr 27, 2021 • edited Loading

dakrone commented Apr 30, 2021

jpountz commented May 3, 2021

jpountz commented May 3, 2021

jpountz commented May 4, 2021

ruflin commented May 4, 2021

jpountz commented Nov 12, 2020 •

edited

Loading

webmat commented Nov 20, 2020 •

edited

Loading

jpountz commented Apr 27, 2021 •

edited

Loading