[RFC] Add host metric fields to ECS (stage 2) #1028
Conversation
This may have come up in discussion previously; apologies if so. The initial adoption and usage of these fields is focused on the "big three" public cloud providers. Are there plans for any other metricsets to adopt these fields?
I was going through SNMP polling with a client and thinking it would be nice to have interfaces broken out. On any host with more than one interface, rolling them all up into a single pair of metrics is less useful than splitting them out. Things like discards and errors are definitely helpful for network troubleshooting. (CPUs and physical disks also seem like they would benefit from separation: one pegged CPU or disk out of four could indicate a problem that would be masked by an average across all of them.)
@dainperkins Yes, definitely: sometimes detailed metrics per interface are useful, and sometimes an aggregated value is good enough. We are making this list of host fields across different kinds of VMs, mainly so the Metrics UI has a centralized location to display metrics from all hosts. If a user sees a problem on one host, they can then get more granular metrics, such as data points for network performance per interface.
@ebeahan We also have
@kaiyan-sheng My point is more that if everything is fine, aggregated device metrics are fine, if somewhat irrelevant from a performance perspective, where individual device performance can be specifically related to overall performance issues (as opposed to pooled resources like memory, and to some degree CPU, though I have seen performance issues with a specific CPU/core at 99% in a multi-CPU server, SAP IIRC, that were completely hidden by aggregate analysis). Scalar values like ingress.bytes are even worse than utilization percentages, as they provide no easy-to-read context in terms of utilization.

As soon as there's an issue, however, viewing aggregated metrics across multiple devices that are not a pooled resource (interfaces, disks, less so CPU in most cases) often hides the problem, particularly with multiple NICs (management and production, inside and outside, physical and virtual interfaces, etc.) or multiple disks (where performance is often tied to a specific device in the group, e.g. OS volume vs. data volume).

I haven't looked at the major cloud provider metric sets, but from a bare metal / VM perspective, gathering detailed per-device stats from a given system will ultimately be much more useful for troubleshooting, and likely easier to implement outside of certain cloud models (thinking of SNMP, WMI, statistics-gathering tools, and anyone unable to use Beats on their appliances). Those stats can then be rolled up into visualizations that provide aggregations, as opposed to building aggregations into host fields.

From an ECS perspective, I suggest that building out the interface fields (which has been started, and could quickly be further built out for both hosts and e.g. observers), and adding a disk or volume field set, would be the place to put discrete metrics, which could then be aggregated as needed. Usually ECS has preferred to skip field sets for pre-aggregated metrics, with some exceptions for information provided by a standard input type (e.g. NetFlow can send average throughput, etc.).

@MikePaquette any thoughts on the fields as proposed?
Thanks for the thoughts @dainperkins, we definitely need to consider this for the future. I like the current focus of this RFC: establishing the common baseline that can be gathered across various sources; a lowest common denominator, if you like. I definitely agree there's also value in standardizing more metrics, even if they're not available everywhere, though. @kaiyan-sheng I suggest you capture these ideas in the "Concerns" section, to be kept in mind for a subsequent RFC. I want to keep the scope of this RFC fixed, so we can continue to move forward.
This is looking great, @kaiyan-sheng!
As part of advancing this proposal, we'll also want to capture experimental field definitions for the proposed fields. These YAML files will allow anyone who wishes to test out these fields to generate artifacts using the ECS tooling's --include flag.
The definitions can be added in a single host.yml file. Here's a starting-point example using the first two fields:
- name: host
  fields:
    - name: cpu.usage
      type: scaled_float
      scaling_factor: 1000
      level: extended
      short: Percent CPU used with scaling factor of 1000.
      description: >
        Percent CPU used with scaling factor of 1000. This value is normalized
        by the number of CPU cores and it ranges from 0 to 1.
        For example: for a two-core host, this value should be the average of
        the two cores, between 0 and 1.
    - name: network.ingress.bytes
      type: long
      level: extended
      short: The number of bytes received on all network interfaces.
      description: >
        The number of bytes received (gauge) on all network interfaces by the
        host in a given period of time.
Examples from other RFCs: 0001 and 0007.
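For reference, the other proposed fields would follow the same pattern. Here is a minimal sketch of two more entries, continuing the fields list above and assuming the same gauge semantics; the description wording is paraphrased, not final:

    - name: network.egress.bytes
      type: long
      level: extended
      short: The number of bytes sent on all network interfaces.
      description: >
        The number of bytes (gauge) sent out on all network interfaces by the
        host in a given period of time.
    - name: disk.read.bytes
      type: long
      level: extended
      short: The total number of bytes read by all disks.
      description: >
        The total number of bytes (gauge) read successfully by the host in a
        given period of time.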
Since we will have both example source documents and field definitions, it may make sense to create two subdirectories underneath rfcs/text/0005? Perhaps rfcs/text/0005/fields and rfcs/text/0005/examples?
Thank you so much @kaiyan-sheng for writing this up! At my workplace, we use SNMP to collect a lot of data, and I was about to do nearly the same thing when I discovered that you've already got something good started. I have a few concerns. The description of
This suggests that the field could be populated by summing all of the SNMP
What I'm arriving at is this claim:
Thanks for chiming in @andrewthad, you make a very good point.
We skirted the subject when we discussed the monotonic counters. We've decided to stick to rates, but it's true that the current definitions are still a bit loose.
In the current wording, it sounds like the periodicity of the collection could change, and the number may change accordingly (e.g. collecting every second, the value is 300 kB; collecting every 10 seconds, the value is 3000 kB).
I initially understood the intent to be that these fields are meant to record rates per second. Was my understanding correct?
If these fields are meant to capture rates per second, please adjust their descriptions accordingly.
If the value of the metrics were meant to be dependent on the collection period, then I agree we're missing a piece. We should clarify.
I’m not sure what the author intended, but to me it seems more useful to have a byte rate rather than a total and a duration. In Kibana, it is nearly impossible to work with fields that need to be multiplied or divided by other fields. If it is going to be interpreted as a rate rather than as a total, I have a different concern: the name of the field is misleading and inconsistent with the way other ECS fields are named. Everywhere else, fields suffixed with the word “bytes” refer to a total, not a rate. The suffix “bytes_per_second” is clearer, and it paves the way for extensions like “client.bytes_per_second”.
Thank you @andrewthad and @webmat! I agree, a rate would definitely be more helpful here. Our intention here is to actually rely on
Thanks for bringing that up! There are some benefits to storing the delta since the previous fetch, as compared to calculating a per-second rate. Especially when aggregating different time series, the current definition for these metrics would perform better. We are currently discussing the possibility of formalizing this type of metric a little more and adding better support for it in Kibana; happy to report back when we have an update on this.
It's not clear to me which direction this is going, based on your two comments @kaiyan-sheng and @exekias 😂 Are these fields meant to capture a per-second rate?
@webmat Sorry for the confusion! We had a discussion earlier and decided we would like to keep these fields as gauges for now. Rates would be useful, but in some cases we need the original gauge value to aggregate.
Ok, whenever the decision is reached, please make sure to update the RFC to clearly state how these fields will work, and how users should populate them in custom data sources.

When I think of CPU usage, a gauge is a no-brainer. There's no time component necessary. The CPU usage was a certain percentage at the time of collection, and that's the end of it.

When I think of IO, however, it's a bit trickier. What I understand from your answers is that the intent is for these IO fields to contain a gauge of the "total since the last metric collection". For example, total bytes/packets in the last 10s. Am I understanding this correctly?

If that's the intent, I think this can work. But then the metrics as described in this RFC are no longer self-contained: to interpret a field containing "23kb", we need to know what the metric collection frequency is. This is a value that can differ per host, or can change over time, etc. So if that's where this is going, perhaps we should add a field to represent the collection interval?
Thank you for the comment @webmat! Yes, this value will be meaningless without a collection frequency. In the Beats situation, we have
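To make the gauge semantics concrete, here is a hypothetical event sketch. The period field shown is an assumption modeled on the Beats collection-interval metadata discussed above (in milliseconds); it is not part of this proposal, and the host name and values are invented:

host:
  name: web-01                 # hypothetical host
  network:
    ingress:
      bytes: 23552             # gauge: bytes received since the previous collection
metricset:
  period: 10000                # assumed collection interval, in milliseconds

With both pieces present, an approximate per-second rate can be derived downstream: 23552 bytes over 10 s ≈ 2355 bytes/second.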
Ah, I was not familiar with the
I think it would make more sense to add both of those to ECS as is, rather than define a purpose-specific
Does anyone see issues with adding
I don't think we need the period to make these metrics useful. We don't really use this info when graphing the metrics, as the collection period gets diluted by the bucket size (graphing period). All this said, I agree it would make sense to add period to ECS, but as a separate effort. WDYT?
For
I'm fine with separating these two tasks, yeah. Could you adjust the RFC to mention that the IO metrics are dependent on the collection interval? Here are two things I think would make the proposal clear:
The above would close one of the loops raised by @andrewthad a while ago. Andrew's other point was about field naming. We've established that these aren't going to be per-second rates. I'm not sure we need to change the names in this proposal, however. The fields in this proposal are similar enough to the existing
Two questions as I go through SNMP for troubleshooting the home network (that's what PTO is for, right?).
@dainperkins Thanks for the review.
Fields defined in this RFC are aggregated values. For example:
Memory would be very helpful, but in this case some public cloud providers do not report memory utilization. That's why we did not include it in the list.
Done! Thank you @webmat! Please let me know if the new commit works for you!
Agree with keeping the names.
Responding to #1028 (comment), I too gather metrics over SNMP, and my take is that the SNMP use case will need different fields for almost all of these. Fields like:
I've been using fields named like this in some internal applications. I believe that these are all possible future extensions to ECS.
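For illustration only, a per-interface field set along these lines might be sketched as follows. The counter names here are hypothetical and are not taken from the elided list above; interface.name follows the existing ECS interface fields:

- name: interface
  fields:
    - name: name
      type: keyword
      level: extended
      short: Interface name.
      description: >
        Interface name as reported by the system, e.g. eth0.
    - name: in.errors            # hypothetical
      type: long
      level: extended
      short: Inbound packets with errors on this interface.
      description: >
        The number of inbound packets that contained errors, for this
        interface, in a given period of time.
    - name: in.discards          # hypothetical
      type: long
      level: extended
      short: Inbound packets discarded on this interface.
      description: >
        The number of inbound packets that were discarded, for this
        interface, in a given period of time.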
@andrewthad the interface fields are eventually meant to hold interface stats (e.g. standard SNMP-type metrics). I'll work on adding the rest of the fields (from the standard IF-MIB).
Thanks @dainperkins and @andrewthad! My thought is to keep this RFC limited to the 7 new fields; the rest of the fields can be added in a separate PR.
Alright, the proposal looks good to me. I'm good to merge at stage 2.
@exekias can you confirm everything's good on your end as well? Have there been no recent developments around this? I'll wait for your final review before merging.
One thing I'd like to cover in the stage 3 PR is how these host metric events are identified. In other words, if one wants to publish these host metrics from a custom source, what do they have to do for it to be picked up by Observability?
@webmat Thank you for the review.
My understanding is: once we have these fields in ECS, I will make all the changes in Metricbeat to match the new field names. After that, the Observability Metrics UI in Kibana will start adopting these new fields. If one publishes events with these host metrics, they should be able to see them in the Metrics UI.
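As a sketch, a minimal custom event carrying the seven proposed fields might look like this (the host name and values are invented for illustration; all IO values are gauges covering the collection period):

host:
  name: custom-host-01   # hypothetical host
  cpu:
    usage: 0.42          # normalized by core count, ranges 0..1
  network:
    ingress:
      bytes: 102400      # received since the previous collection
      packets: 180
    egress:
      bytes: 51200       # sent since the previous collection
      packets: 95
  disk:
    read:
      bytes: 2048        # read since the previous collection
    write:
      bytes: 4096        # written since the previous collection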
This is looking good from my side!
Alright, thanks for your involvement, everyone!
This PR is stage 2 of the RFC for adding host metric fields into ECS.
Preview of the RFC