Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Discuss] Default mappings for dynamic fields #178

Open
ruflin opened this issue May 31, 2021 · 9 comments
Open

[Discuss] Default mappings for dynamic fields #178

ruflin opened this issue May 31, 2021 · 9 comments
Assignees
Labels
Team:Ecosystem Label for the Packages Ecosystem team Team:Integrations Label for the Integrations team

Comments

@ruflin
Copy link
Member

ruflin commented May 31, 2021

Today all the templates are setup with the following dynamic templates

            "dynamic_templates" : [
              {
                "strings_as_keyword" : {
                  "mapping" : {
                    "ignore_above" : 1024,
                    "type" : "keyword"
                  },
                  "match_mapping_type" : "string"
                }
              }
            ],

By default all the unknown fields are mapped by Elasticsearch and all matching string to keyword. This issue is to discuss what our defaults should be moving forward.

Mapping all fields by default comes at a storage cost. In recent releases Elasticsearch has introduced runtime fields and dynamic: runtime as an option. An other option is to disable indexing completely and only allow runtime fields on query time.

Related issue: #39

@ruflin
Copy link
Member Author

ruflin commented May 31, 2021

@jpountz This also ties into what we should use for the ES base templates. Would be great to get your thoughts on this.
@tsg @simitt @urso @andrewkroh Pinging you as you also have quite a bit of history on this.

@tsg
Copy link

tsg commented Jun 1, 2021

I think we'd want the "custom" fields (anything outside of ECS) to be either runtime or index: false. This would mean that we'd need to again specify explicitly all ECS keyword fields that we want to be indexed on write.

On the alerts Alerts side (elastic/kibana#100884) we're having the advantage that all custom fields are not top-level, but nested under alert.original_event

Could we do something similar, in the sense that all custom fields must go under a given prefix? Probably not, right?

@ruflin
Copy link
Member Author

ruflin commented Jun 1, 2021

Do we really need to index all the ECS fields? In case of a package, I would expect the fields that should be really index are specified. To cover better for ECS fields there might be the middle ground, that we defined all the ECS prefixes like user.* to be mapped as a runtime field. Everything else is not indexed. If users add fields inside the ECS prefixes, these will also be index.

Having all ECS fields in each mapping even as runtime fields I would assume would mean lots of data in the cluster state.

@ruflin
Copy link
Member Author

ruflin commented May 16, 2022

I would like to revive this discussion. The Elasticsearch templates for logs-*-*, metrics-*-* etc. still map by default to keyword, ip etc. Same is true for the templates created by Fleet.

In general, I think we should stop indexing all fields by default. The question is, should we go for runtime by default or index false? What are the pros and cons from an Elasticsearch perspective? @jpountz

If we make the change, it should apply in Elasticsearch and Fleet (@joshdover ). Would we consider this a breaking change?

@ruflin ruflin self-assigned this May 16, 2022
@jpountz
Copy link

jpountz commented May 16, 2022

@ruflin In short,

  • runtime fields would be super slow to search and slow to aggregate
  • fields with index: false would be slow to search (still much faster than runtime fields) and fast to aggregate
  • fields with default options would be fast to search and fast to aggregate

In terms of disk usage, fields with index: false are in-between runtime fields and fields with default options.

Would we consider this a breaking change?

I would not consider this a breaking change. We are free to change the performance characteristics of mappings produced by our integrations. The same queries that work with today's mappings would still work after this change.

@jpountz
Copy link

jpountz commented May 16, 2022

My intuition is that index: false (doc-value-only field) is a good compromise, because it should still perform fast enough so that users could leverage these fields in dashboards, while reducing index-time CPU usage and disk usage compared to default mappings.

Maybe it would help to use a concrete example. Would you only make this change to dynamically-mapped changes, which I expect to be application-specific fields, or also to some of the fields that are populated by Agent like host.name, etc.?

@ruflin
Copy link
Member Author

ruflin commented May 16, 2022

I had a follow up discussion on this with @jpountz . Some general thoughts:

  • index: false This is especially useful for metrics / numbers.
  • runtime true: This works for non numbers.

If we would expose the dynamic_mapping to packages, we could start experiment with some of the packages and later make a more fundamental change in ES itself.

I would not consider this a breaking change. We are free to change the performance characteristics of mappings produced by our integrations. The same queries that work with today's mappings would still work after this change.

It sounds like before we do the change, there should be a mechanism in Fleet for users to potentially overwrite the mapping to have the old performance back. @joshdover My understand is that this is possible with the @custom template? So if we set foo to a runtime field by default but a user wants to have it as a keyword indexed, the @custom template could be used?

For the examples:

  • The system.cpu dataset ships a lot of fields by default but I challenge if we should index all of them by default.
  • json logs: There can be lots of fields in json logs we don't have mappings for as we don't know in advance what fields are in. Mapping these as default to runtime would save storage.

@jpountz
Copy link

jpountz commented May 17, 2022

Some general thoughts:
index: false This is especially useful for metrics / numbers.
runtime true: This works for non numbers.

For clarity, index:false also works with keywords with a similar trade-off: little or no impact on aggregations, only slower filtering (to a much lesser extent than runtime fields). The point about metrics and index:false is that metric fields are likely a good place to get started changing our defaults to be more space-efficient and less CPU-intensive at index time, since metric fields are less likely used for filtering than dimensions. This will likely be the recommendation for TSDB.

@ruflin ruflin added the Team:Ecosystem Label for the Packages Ecosystem team label May 23, 2022
@ruflin ruflin assigned joshdover and unassigned ruflin Jun 24, 2022
@ruflin
Copy link
Member Author

ruflin commented Jun 24, 2022

@joshdover Assigned to you for now but likely best driven forward by someone in the ecosystem or Fleet team.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Team:Ecosystem Label for the Packages Ecosystem team Team:Integrations Label for the Integrations team
Projects
None yet
Development

No branches or pull requests

4 participants