Better out-of-the-box mappings for logs, metrics and synthetics #64978
One of the problems we have today with the default templates is that IP addresses and message fields are not mapped correctly. Auto-detection of IP addresses would be great: elastic#64400. In the meantime, we could also match on the naming convention that all `*.ip` fields are of type `ip`.

During implementation I was not sure whether I should use `match: ip` or `path_match: *.ip`. The main difference I would expect is that `path_match: *.ip` does not match a top-level `ip` field. Are there others? Do we have fields named `ip` or `message` which do not adhere to these rules?

This PR is not complete and has not been tested; it is meant to share the idea. If we decide to move forward, the other base templates also need updating, and it might be useful to add the same logic to all datasets created by Kibana.
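To make the difference between the two options concrete, here is a minimal Python sketch. It is only a model, not Elasticsearch code: wildcard matching is approximated with `fnmatch.fnmatchcase`, and the helper names `matches_by_name` and `matches_by_path` are hypothetical. It assumes that `match` is tested against the final field name, while `path_match` is tested against the full dotted path.

```python
from fnmatch import fnmatchcase


def matches_by_name(pattern: str, full_path: str) -> bool:
    """Model of `match`: compare the pattern to the last path segment only."""
    field_name = full_path.rsplit(".", 1)[-1]
    return fnmatchcase(field_name, pattern)


def matches_by_path(pattern: str, full_path: str) -> bool:
    """Model of `path_match`: compare the pattern to the full dotted path."""
    return fnmatchcase(full_path, pattern)


# Hypothetical field paths, used only for illustration.
paths = ["ip", "source.ip", "host.ip", "source.ip_range"]

by_name = [p for p in paths if matches_by_name("ip", p)]
by_path = [p for p in paths if matches_by_path("*.ip", p)]
```

In this model `by_name` contains the top-level `ip` field as well as the nested ones, while `by_path` contains only the nested `*.ip` fields, which is the difference raised in the description.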
This is the only difference indeed. So maybe we should use
Co-authored-by: Adrien Grand <jpountz@gmail.com>
In ECS all of the If you want to catch 100% of the ECS So I'm good with this approach as is.
These look fine for the base logs template, but I don't think we could extend this to the datasets created for APM. In APM we have some object fields called
@axw The proposal uses a combination of
Thanks @jpountz, I missed that bit. This would be fine to include for all templates for APM then, and I think it's reasonable to assume all string-type fields called
@jpountz I'm torn if we should remove
@webmat I'm wondering if the above is also a convention we should put into ECS itself. The same field name MUST always have the same type no matter where it is in the hierarchy.
I have checked the Beats codebase and we have a few instances where https://github.com/elastic/beats/blob/af4007eaef6e142bb0d46370918db4ce7135f316/x-pack/filebeat/module/suricata/eve/_meta/fields.yml#L92-L94 Are you trying to solve getting I wonder if we can make this decision for the user based on field names; at the very least we would need to make a couple of changes in the modules to make things consistent. To some extent I wonder if we want to unify how type hinting should be done. It seems things are moving towards #61939, where we use path matching as a workaround.
It looks like we have some logging datasets that are completely structured and don't have a
I'm not trying to fix the
@exekias Thanks for checking the Beats fields on this. The alias fields should not be an issue as these do not exist as actual data. Also, my assumption is that an actual mapping would override the dynamic mapping in case both match. As for #61939, I think the two are complementary. #61939 is used for any ingestion where the fields are actually known and, in most cases, probably goes to a specific dataset. The change here is to get the base mappings correct in most cases without complicating the base template too much.
@ruflin It's a convention we've implicitly been following indeed. It could make sense to capture this guidance explicitly as well.
I updated the PR with all the above conversations:
Should we move forward with this? If yes, how are these things tested? And should we also add it to the other templates? Would it be possible for the ES team to take over from here, @dakrone?
Yes, I think it's a good idea. It's in line with how we try to keep field names consistent. Identifying those that come with an obvious associated type and extending the consistency this way makes a lot of sense, both for ECS fields and to help users following these conventions in custom fields. I've opened an issue to track this: elastic/ecs#1144
With the advent of Some message types like the multi-paragraph Windows event messages obviously work well with Right now ECS has many I definitely see why we might want to avoid this breaking change as well, so I could see this going both ways. But this issue seemed like a good time to bring this up 😄
@webmat Thanks for the ECS follow-up. I subscribed to see where it is headed. For the message part, let's open a separate thread. As the change here should not be a breaking one, we can decouple the two.
Yes, the more I think about this, the more I believe that we won't be able to come up with a good default for all datasets; some datasets will benefit from having the message field mapped as
For the aspect about changing the semantics of the
+1 to move forward with this
},
"match_ip": {
  "match_mapping_type": "string",
  "match": "ip",
keep a single space after the colon?
},
"match_message": {
  "match_mapping_type": "string",
  "match": "message",
keep a single space after the colon?
@jpountz Is there a good way to automatically test these changes?
See the GitHub Actions on this PR; we have tests that run automatically on all PRs. I just made suggestions that should hopefully address the test failures we are seeing.
The tests I wanted to add are something like:
The same for message, to confirm the change works as expected. I found https://github.com/elastic/elasticsearch/pull/57629/files#diff-1e7ab3525fa67e621593216ed026d39eb002b44fe440e0cdb33b577f40631b00. Might this be the right place to add something like this?
LGTM, thanks for shepherding this Adrien. I left a few more comments about the application order. I think it's best to go from most general -> most specific in terms of mappings so that when a user changes a more specific mapping it doesn't get accidentally bypassed by a general mapping.
"logs-mappings",
"data-streams-mappings",
"logs-settings"
I think we should apply the `data-streams-mappings` settings first, so that any changes made to the `logs-mappings` component template always take precedence over the generic data stream mappings.
- "logs-mappings",
- "data-streams-mappings",
- "logs-settings"
+ "data-streams-mappings",
+ "logs-mappings",
+ "logs-settings"
What do you think?
Actually this would break tests because then the dynamic template that maps strings as keywords would take precedence over the dynamic template that maps message fields as `match_only_text`. In order to change the order, we would also need to configure `unmatch: message` on the default dynamic template that maps strings as keywords. What is your preference?
"metrics-mappings", | ||
"data-streams-mappings", |
Same here with:
- "metrics-mappings",
- "data-streams-mappings",
+ "data-streams-mappings",
+ "metrics-mappings",
"synthetics-mappings",
"data-streams-mappings",
And same here with:
- "synthetics-mappings",
- "data-streams-mappings",
+ "data-streams-mappings",
+ "synthetics-mappings",
I quite like how you use components here to simplify the templates.
For the message field, I'm worried that if a metrics event contains a message, it will be mapped as keyword when it should be text. This field is so basic that I would keep it in the default template. If not used, does this cause a lot of overhead?
"host": {
  "type": "object"
},
"observer": {
I think we could leave this one out.
"type": "keyword"
},
"match_mapping_type": "string"
"type": "match_only_text"
Is this a breaking change?
What aspect are you concerned with? I expect this to be transparent for the vast majority of our users, and I started discussions with the ECS team to make similar changes on ECS.
I have read up on it a bit more by now. What happens if a user runs an unsupported query across `logs-*` where some message fields are `text` and some are `match_only_text`? Is an error returned, or are the unsupported fields just skipped?
The only unsupported queries are `span` queries. If a span query is run, then all shards that have the field mapped as `match_only_text` will be excluded from the search response and reported as "failed".
@dakrone Can you give more details about the use-case you have in mind? How could a user be affected by this? My understanding was that users couldn't inherit from these templates and would need to copy them anyway if they wanted to make changes?
@ruflin At this stage it doesn't matter much in terms of overhead, but if we put
Yes,
It is certainly recommended that a user copy them rather than make changes directly. That being said, I think in terms of our defaults, we should always go from most general to most specific. First, because we should set a good example of a recommended pattern for component template usage, and second, so that in the case that a user does edit the template directly, we are closer to the desired and expected effect. Do you agree? Do you have a different reason for giving the more generic template higher precedence?
This is a good question. One problem we have is that the first component template wins when it comes to dynamic templates (because dynamic templates are appended and the first one that matches wins), but the last component template wins when it comes to properties (because properties get overridden). This makes me wonder if we should split properties and dynamic templates into different component templates so that we could order them that way:
What do you think?
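The precedence asymmetry described in this comment can be illustrated with a small Python model. It is a sketch under assumptions, not Elasticsearch's merge logic: component templates are reduced to flat dictionaries, the `compose` helper is hypothetical, and the real merge of mappings is more involved. It only models the two rules stated above: dynamic templates are appended in order, while properties from later components override earlier ones.

```python
def compose(component_templates):
    """Merge components in order: append dynamic templates, override properties."""
    dynamic_templates = []
    properties = {}
    for component in component_templates:
        # Appended in order, so earlier components are tried first at match time.
        dynamic_templates.extend(component.get("dynamic_templates", []))
        # Overridden in order, so later components win for the same field.
        properties.update(component.get("properties", {}))
    return {"dynamic_templates": dynamic_templates, "properties": properties}


# Hypothetical, heavily simplified components for illustration.
specific = {
    "dynamic_templates": ["match_message"],
    "properties": {"message": "match_only_text"},
}
generic = {
    "dynamic_templates": ["strings_as_keyword"],
    "properties": {"message": "keyword"},
}

specific_first = compose([specific, generic])
```

With the specific component listed first, its dynamic template is tried first, but the generic component's explicit `message` property is the one that survives. That is why one ordering cannot favor the specific template for both dynamic templates and properties when the same field is defined in several components.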
Sorry for the delay on this,
This sounds reasonable to me, though I do think we start growing in complexity the more we add. In that regard, maybe it's okay to leave it the way it is, with two component templates rather than four. I think I'm okay either way; regardless, this still LGTM.
Right, this complexity is the reason why I didn't jump on this solution. My gut feeling is that we won't often need to override properties given how they are supposed to be more-or-less standardized via ECS, so a simpler path forward might be to put the most specific templates first and avoid defining the same field across multiple templates.
I kept templates from the most specific to the most generic for now and applied remaining feedback:
@elasticmachine run elasticsearch-ci/2
@jpountz Thanks for pushing this over the line.
This template was added in elastic#64978, however, there can be some test failures if we try to remove built-in templates. It was missing from the list and now needs to be added back.