Remove event.duration and event.ingested from metric events #4894
Comments
Looking into package-spec, it seems runtime fields are not supported yet on specific fields: elastic/package-spec#39 Would be great to see this moving forward. |
If we drop doc values for this field and synthetic source is enabled, then this can't be a runtime field, because there is no source from which the field can be generated at query time. So using runtime fields isn't an option here. |
At the moment we haven't dropped doc values and we are not using synthetic source yet. In today's scenario we can use runtime fields, but will that bring us any benefits? Moving forward, as we adopt TSDB and synthetic source, what should we do with the field in the following two scenarios:
@martijnvg Can you answer for scenarios a and b independently of each other, and dig into whether runtime fields bring us any benefits today? It is important to note that with integrations we can ship such changes back to 7.x releases, bringing improvements there too, where synthetic source etc. did not exist yet. |
This could be an option, though today it would have to be for an entire integration package and every agent pushing data into those data streams. If we improve the granularity of data stream assets we could make this more granular, such as with elastic/kibana#121118. There may also be a way we could add a tag to the incoming events that the ingest pipeline reads to skip adding these fields. If we went this route, I'd prefer we have a single tag to control this that would both strip
Yeah we could drop the indexing from these fields to save some storage. It seems this would still be compatible with TSDB+synth source too since we'll need to keep the doc_values for these for synthetic source. I like this because it avoids changing this multiple times across releases and resulting in a disjointed experience when looking at data over time that was ingested across different versions.
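A minimal sketch of what "drop the indexing but keep doc_values" could look like as a mapping. It assumes the field types `long` for event.duration and `date` for event.ingested (as in ECS) and uses the standard `index` and `doc_values` mapping parameters; the helper names are made up for illustration, and this is not necessarily the exact mapping the integrations would ship.

```python
import json


def doc_values_only(field_type: str) -> dict:
    """Return a field mapping with no index structure but with doc values.

    The field stays aggregatable (and usable by synthetic source) through
    doc values, while the inverted index / points structure is skipped.
    """
    return {"type": field_type, "index": False, "doc_values": True}


def build_mappings() -> dict:
    """Build a mappings body for the two fields discussed in this issue.

    Field types are assumptions taken from ECS, not from this thread.
    """
    return {
        "properties": {
            "event": {
                "properties": {
                    "duration": doc_values_only("long"),
                    "ingested": doc_values_only("date"),
                }
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_mappings(), indent=2))
```

Queries on such fields should still work but can be slow, since they have to scan doc values instead of using a dedicated search structure.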
I'm not sure what you're asking here, Ruflin. I don't see how we'll ever be able to support a field in a TSDB+synth source index that meets criteria (a). But I'm also not sure why we'd need that. We have to store the field somewhere, and having the single copy in doc_values should be minimal enough from a storage perspective. It's actually more desirable from a runtime field perspective, since accessing doc_values in runtime fields is faster than accessing _source. |
This was a question that I had: can runtime field use doc-value only fields (i.e. |
Runtime fields should always use doc values if possible. _source is slow
and should be avoided unless you have a really good excuse.
I really wanted this to be obvious in the docs. But I guess I never put
enough effort into that. Could you point to why you thought you had to use
_source? I'm sure I've made a mistake somewhere.
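As an illustration of a runtime field that reads doc values rather than _source, here is a small sketch that builds such a mapping. The runtime field name `event.duration_ms` and the nanoseconds-to-milliseconds conversion are hypothetical choices for the example (event.duration is in nanoseconds per ECS); only the `doc[...]` access pattern is the point.

```python
import json

# Hypothetical runtime field name; only event.duration comes from this thread.
RUNTIME_FIELD = "event.duration_ms"


def duration_ms_runtime_mapping() -> dict:
    """Build a runtime-field mapping whose Painless script reads doc values.

    The script uses doc['event.duration'] and never touches params._source,
    so it does not depend on the original document source being stored.
    """
    script = (
        "if (doc['event.duration'].size() > 0) {"
        " emit(doc['event.duration'].value / 1000000L);"
        " }"
    )
    return {
        "runtime": {
            RUNTIME_FIELD: {"type": "long", "script": {"source": script}}
        }
    }


if __name__ == "__main__":
    print(json.dumps(duration_ms_runtime_mapping(), indent=2))
```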
On Fri, Dec 23, 2022, 10:09 AM Andres Rodriguez wrote:
it's actually more desirable from a runtime field perspective since
accessing doc_values in runtime fields is faster than from _source
This was a question that I had: can runtime field use doc-value only
fields (i.e. index:false) when using synthetic source or is _source
always needed for runtime fields?
|
Thanks @nik9000 , probably a misconception from my side, I see this sentence in the docs that actually clarifies it, I had previously missed it:
|
Should have been more specific on this. What I meant by this is that
I was thinking even broader that it would be disabled for all data streams.
Where would this tag be added? On the edge or centrally? |
@nik9000 Where I think the conflicting part is, that for the data streams that have
But this returns an error as runtime fields don't work in combination with synthetic source. Now before we use TSDB / Synthetic source, it seems using a runtime field would create the minimal footprint for the
|
IMO this should be tagged on the edge so a single agent could be debugged. The ingest pipeline would read this tag and make the necessary changes to the document before indexing. |
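The tag-driven behaviour described above could look roughly like the following local simulation. The tag name `drop_optional_event_fields` and its meaning (strip the fields when the tag is present) are assumptions for illustration; the thread does not fix either, and a real implementation would live in an ingest pipeline rather than in Python.

```python
from copy import deepcopy

# Hypothetical tag set on the edge (by the agent); not defined in this thread.
DROP_TAG = "drop_optional_event_fields"

# The two fields this issue proposes to make optional, relative to "event".
OPTIONAL_EVENT_FIELDS = ("duration", "ingested")


def apply_tag(doc: dict) -> dict:
    """Simulate the pipeline step: strip optional fields if the tag is set.

    Returns a modified copy; the input document is left untouched.
    """
    result = deepcopy(doc)
    tags = result.get("tags") or []
    if DROP_TAG in tags:
        event = result.get("event") or {}
        for name in OPTIONAL_EVENT_FIELDS:
            event.pop(name, None)
    return result


if __name__ == "__main__":
    tagged = {
        "tags": [DROP_TAG],
        "event": {"duration": 1200000, "ingested": "2022-12-23T10:09:00Z", "dataset": "system.memory"},
    }
    print(apply_tag(tagged))
```

Because the tag travels with each event, a single agent can be switched independently of the others, which matches the debugging use case mentioned above.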
We are now in the middle of adopting TSDB for metric data streams, but no decision has been reached on this topic. I suggest that whatever decision we make, we focus our change on TSDB data streams. @joshdover We recently had some discussions around making the final pipeline optional. As event.ingested is added in the final pipeline, this would already help. @martijnvg For TSDB, is the ideal mapping for
|
Pinging @elastic/fleet (Team:Fleet) |
@ruflin With this configuration we store |
Yes, I need to open an issue to discuss this, but maybe we can make some progress here. Would this be something we want integration devs to specify / opt out of, or do users need control over this? If it's only required for security integrations, we can make this a package-level setting. |
I think we could go one step further and make it a Fleet wide setting. If you have a SIEM use case, you might want it on all of your logs, for observability likely not. |
@martijnvg What do you mean by "not as points"? And will the above work with TSDB / synthetic source? Does it mean event duration would be indexed, so filtering on it would still work, just not querying? If we pick doc_values only, would it still be aggregatable but slow on filtering? What are the pros / cons of these two options? |
@ruflin So I'm wondering whether this duration field is important at all? If not, then why do we ingest it in the first place? After all, the title of this issue is about removing the event.duration field.
It means that there is no data structure for efficiently querying this field.
The above configuration will work with TSDB and synthetic source. The duration field is indexed with only doc values enabled. This means the field is aggregatable and queryable, but querying can be slow because there is no dedicated data structure just for querying. If this field will be used rarely then I would suggest to set |
Removing the fields is also part of the discussion but if we have to keep it, I wanted to know the ideal "storage" solution that creates the minimal overhead. @martijnvg What you are describing above is basically that
If we have |
Ok, I understand now.
In that case the field is accepted during indexing but not stored at all. Meaning that it can no longer be retrieved or used. I think this is not something that you're looking for. In that case the configuration that you initially mentioned ( |
For |
For In parallel we should also tackle the problem on the shipper side (beats) to make |
IIUC, |
I cannot get a sense from the links how
From experience with customers, this is applicable to both metrics and logging. Potentially, in order to save space, if the index is guaranteed to contain time-ordered data, then we can simply use an alias between |
This is for me a compelling reason to make it a setting as proposed in elastic/elasticsearch#100324 because based on the above, this is a key field for functionality in our stack. If it is a setting, Elasticsearch then can tune how the field is persisted and accessed. |
The arrival timestamp is a field that's inherently difficult to compress. There's definitely more that we can do, like disabling indexing and storing it as a gauge in TSDB so that the more efficient doc_value encoding is enabled for that field. However, especially in the context of elastic/elasticsearch#91775, we'll be at a distinct disadvantage when storing both the event and the arrival timestamp for each data point when other metric stores only store the event timestamp, which is usually much easier to compress due to the more predictable intervals between data points. Therefore, I don't think relying on
Have we considered relying on the Having said that, we're also discussing to drop
When you say "time-ordered", do you mean ordered by arrival time or event time? In TSDB, metrics are ordered by the event timestamp. But data can come in late even in TSDB. Usually, data points from the same time series are coming in order but even for that, there's no guarantee. |
I am answering this with the lens of the current sets of features which "search directly after ingest". We initially tried For "time-ordered", I mean arrival time. In the case of transforms, we use From our experience, customers have late arriving data -- even with highly optimised ingest, there will be some event which seldom/sometimes/often occurs to delay data, even if temporarily. Some use cases are fine to use event_time to identify recent data. For these use cases we tend to have a query delay (we process data between the time we last checked and Some use cases require all newly arrived data to be processed as close to the time it arrived as possible. We use The downside of not having Note: in case it's not clear, I have a preference to keep |
Hi! We just realized that we haven't looked into this issue in a while. We're sorry! We're labeling this issue as |
In elastic/beats#31574 and elastic/elasticsearch#85649 it is discussed that event.duration takes up a significant amount of disk space (~16%) even though the field is not used in most scenarios. Ideally we stop shipping the field where it is not needed and make it opt-in.
On the integrations side, there are several ways to deal with the field:
@nik9000 @jpountz If we make the field a runtime field, will this already bring benefits?
Below I run the disk stats on a smaller set of disk metrics to get an overview of what storage is used:
metrics-system.memory-default disk usage
POST metrics-system.memory-default/_disk_usage?run_expensive_tasks=true
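To rank fields by their share of storage from a response to the request above, something like the following sketch could be used. It assumes the response has the shape `{index: {"all_fields": {"total_in_bytes": ...}, "fields": {name: {"total_in_bytes": ...}}}}`; the sample numbers in the usage part are made up for illustration and are not results from this issue.

```python
def field_shares(response: dict, index: str) -> list[tuple[str, float]]:
    """Return (field, share of total bytes) pairs, largest share first.

    Assumes the analyze-disk-usage response shape described in the lead-in.
    """
    stats = response[index]
    total = stats["all_fields"]["total_in_bytes"]
    if total == 0:
        return []
    shares = [
        (name, field["total_in_bytes"] / total)
        for name, field in stats["fields"].items()
    ]
    return sorted(shares, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up numbers, only to show how the function is called.
    sample = {
        "metrics-system.memory-default": {
            "all_fields": {"total_in_bytes": 1000},
            "fields": {
                "@timestamp": {"total_in_bytes": 400},
                "event.duration": {"total_in_bytes": 350},
                "event.ingested": {"total_in_bytes": 250},
            },
        }
    }
    for name, share in field_shares(sample, "metrics-system.memory-default"):
        print(f"{name}: {share:.1%}")
```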
Another field that stands out for me is event.ingested. This is currently added by the [final pipeline](https://github.com/elastic/kibana/blob/7.14/x-pack/plugins/fleet/server/constants/fleet_es_assets.ts#L38) but, as discussed in #4462, should be optional. In the case of event.ingested, there should be a way to disable (or enable) the final pipeline. (@amitkanfer @joshdover)