
[RFC] Add host metric fields to ECS (stage 2) #1028

Merged (8 commits into elastic:master) on Dec 1, 2020

Conversation

@kaiyan-sheng (Contributor) commented Oct 15, 2020

This PR is stage 2 of the RFC for adding host metric fields into ECS.

Preview of the RFC

@kaiyan-sheng self-assigned this Oct 15, 2020
@ebeahan (Member) commented Oct 16, 2020

This may have come up in discussion previously; apologies if so.

The initial adoption and usage of these fields is focused on the "big three" public cloud providers:

* aws ec2 metricset
* googlecloud compute metricset
* azure compute_vm metricset

Are there plans for any other metricsets to adopt these host.* fields as well over time? Or is the focus primarily on public cloud infrastructure?

@dainperkins (Contributor) commented:

Was going thru SNMP polling with a client - thinking it would be nice to have interfaces broken out on any host with more than one; rolling them all up into a single pair of metrics is less useful than splitting them out. Things like discards and errors are definitely helpful in terms of network troubleshooting. (CPUs and physical disks also seem like they would benefit from separation - one pegged CPU/disk out of four could indicate a problem that would be masked by an average across them.)

@kaiyan-sheng (Contributor, Author) commented:

> Was going thru SNMP polling with a client - thinking it would be nice to have interfaces broken out on any host with more than one; rolling them all up into a single pair of metrics is less useful than splitting them out. Things like discards and errors are definitely helpful in terms of network troubleshooting. (CPUs and physical disks also seem like they would benefit from separation - one pegged CPU/disk out of four could indicate a problem that would be masked by an average across them.)

@dainperkins Yes, detailed per-interface metrics are definitely sometimes useful, and sometimes an aggregated value is good enough. We are building this list of host fields across different kinds of VMs mainly to give the Metrics UI a centralized set of fields for displaying metrics from all hosts. If a user sees a problem on one host, they can then drill into more granular metrics, such as per-interface network performance data points.

@kaiyan-sheng (Contributor, Author) commented:

> This may have come up in discussion previously; apologies if so.
>
> The initial adoption and usage of these fields is focused on the "big three" public cloud providers:
>
> * aws ec2 metricset
> * googlecloud compute metricset
> * azure compute_vm metricset
>
> Are there plans for any other metricsets to adopt these host.* fields as well over time? Or is the focus primarily on public cloud infrastructure?

@ebeahan The system module has also adopted these host.* fields, on top of the three public cloud providers. In the future, we will require/suggest that any new host/VM-related integrations adopt these fields.

@dainperkins (Contributor) commented:

@kaiyan-sheng my point is more that aggregated device metrics are fine when everything is fine, but somewhat irrelevant from a performance perspective, where an individual device's performance can be specifically related to overall performance issues (as opposed to pooled resources like memory, and to some degree CPU - though I have seen performance issues with a specific CPU/core at 99% in a multi-CPU server - I think it was SAP, IIRC - that were completely hidden by aggregate stat analysis).

Scalar values like ingress.bytes, as opposed to utilization percentages, are even worse, as they provide no easy-to-read context in terms of utilization.

As soon as there's an issue, however, viewing aggregated metrics across multiple devices that are not a pooled resource [interfaces, disks, less so CPU in most cases] often hides the problem - particularly in cases of multiple NICs (management + production, inside & outside, physical/virtual interfaces, etc.) or multiple disks (where performance is often tied to a specific device in the group [OS volume vs. data volume]).

I haven't looked at the major cloud provider metricsets, but certainly from a bare-metal/VM perspective, gathering detailed per-device stats from a given system will ultimately be much more useful for troubleshooting, and likely easier to implement outside of certain cloud models (thinking of SNMP, WMI, statistic-gathering tools, and anyone unable to use Beats on their appliances). Those stats can then be rolled up into visualizations that provide aggregations, as opposed to building aggregations into host fields.

From an ECS perspective, I suggest that building out the interface fields (which has already begun, and could quickly be extended for both hosts and e.g. observers) and adding a disk or volume field set would be the place to put discrete metrics, which could then be aggregated as needed. ECS has usually preferred to skip field sets for pre-aggregated metrics, with some exceptions for information provided by a standard input type (e.g. NetFlow can send average throughput, etc.).

@MikePaquette any thoughts on the fields as proposed?

@webmat (Contributor) commented Oct 28, 2020

Thanks for the thoughts @dainperkins, we definitely need to consider this for the future.

I like the current focus of this RFC: establishing the common baseline that can be gathered across various sources; a lowest common denominator, if you like.

I definitely agree there's also value in standardizing more metrics, even if they're not available everywhere. I suggest, @kaiyan-sheng, that you capture these ideas in the "Concerns" section, to be kept in mind for a subsequent RFC.

I want to keep the scope of this RFC fixed, so we can continue to move forward.

@ebeahan (Member) left a comment:

This is looking great, @kaiyan-sheng!

As part of advancing this proposal, we'll also want to capture experimental field definitions for the proposed fields. These YAML files will allow anyone who wishes to test out these fields to generate artifacts using the ECS tooling's --include flag.

The definitions can be added in a single host.yml file. Here's a starting-point example using the first two fields:

- name: host
  fields:
    - name: cpu.usage
      type: scaled_float
      scaling_factor: 1000
      level: extended
      short: Percent CPU used with scaling factor of 1000.
      description: >
        Percent CPU used with scaling factor of 1000. This value is normalized by
        the number of CPU cores and it ranges from 0 to 1.

        For example: For a two core host, this value should be the average of the
        two cores, between 0 and 1.

    - name: network.ingress.bytes
      type: long
      level: extended
      short: The number of bytes received on all network interfaces.
      description: >
        The number of bytes received (gauge) on all network interfaces by the
        host in a given period of time.
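
For anyone wanting to try these out, generating artifacts with the experimental definitions would look roughly like this (a sketch assuming the ECS repo's scripts/generator.py entry point and that the definitions live under rfcs/text/0005/fields; the exact path may differ):

    python scripts/generator.py --include rfcs/text/0005/fields/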

Examples from other RFCs: 0001 and 0007.

Since we will have both example source documents and field definitions, it may make sense to create two subdirectories underneath rfcs/text/0005: perhaps rfcs/text/0005/fields and rfcs/text/0005/examples?

@andrewthad (Contributor) commented:

Thank you so much @kaiyan-sheng for writing this up! At my workplace, we use SNMP to collect a lot of data, and I was about to do nearly the same thing when I discovered that you've already got something good started. I have a few concerns. The description of network.egress.bytes is:

> The number of bytes received (gauge) on all network interfaces by the host in a given period of time.

This suggests that the field could be populated by summing all of the SNMP ifInOctets values (ignoring the fact that aggregate links can cause some bytes to be counted twice). In SNMP, ifInOctets and its 64-bit variant are of type COUNTER. That is, they are monotonically increasing counters, so to get anything useful from them, you have to poll periodically and then subtract the earlier value from the later one. Tracking the interval is important, and typically users want this information as bits per second, not as bytes over an arbitrary interval.

What I'm arriving at is this claim: network.ingress.bytes and the related fields are meaningless when event.duration is not set. (Perhaps there is some other field used to communicate a timespan, but I am not aware of one.) The documentation should clarify that event.duration should be set if any of these fields are used. But it would be good to hear if others feel the same way. The issue of duration didn't come up in #950, so maybe there are reasons that capturing the duration is not important, or maybe I've overlooked a way that this is supposed to be done.
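
To make the counter-delta point concrete, here is a minimal sketch (illustrative only, not from the RFC) of turning two polls of a 64-bit SNMP counter such as ifHCInOctets into the bits-per-second figure described above:

# Sketch: derive a rate from two polls of a monotonically increasing
# 64-bit SNMP COUNTER value (e.g. ifHCInOctets).
COUNTER64_MAX = 2**64

def bits_per_second(prev_octets, curr_octets, interval_seconds):
    # The modulo handles a single counter wrap between polls; a device
    # reboot would also reset the counter and is indistinguishable here.
    delta_bytes = (curr_octets - prev_octets) % COUNTER64_MAX
    return delta_bytes * 8 / interval_seconds

# e.g. two polls 60 s apart:
bits_per_second(10_000, 70_000, 60)  # -> 8000.0 bits/s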

@webmat (Contributor) left a comment:

Thanks for chiming in @andrewthad, you make a very good point.

We skirted the subject when we discussed the monotonic counters. We've decided to stick to rates, but it's true that the current definitions are still a bit loose.

In the current wording, it sounds like the periodicity of collection could change, and the number would change accordingly (e.g. collect every second and the value is 300 kb, vs. collect every 10 seconds and the value is 3000 kb).

I initially understood the intent to be that these fields record rates per second. Was my understanding correct?

If these fields are meant to capture rates per second, please adjust their descriptions accordingly.

If the value of the metrics were meant to be dependent on the collection period, then I agree we're missing a piece. We should clarify.

@andrewthad (Contributor) commented Nov 9, 2020

I'm not sure what the author intended, but to me it seems more useful to have a byte rate rather than a total and a duration. In Kibana, it is nearly impossible to work with fields that need to be multiplied or divided by other fields.

If it is going to be interpreted as a rate rather than as a total, I have a different concern: the name of the field is misleading and inconsistent with the way other ECS fields are named. Everywhere else, fields suffixed with the word "bytes" refer to a total, not a rate. The suffix "bytes_per_second" is clearer, and it paves the way for extensions like "client.bytes_per_second".

@kaiyan-sheng (Contributor, Author) commented:

Thank you @andrewthad and @webmat! I agree, a rate would definitely be more helpful here. Our intention is actually to rely on the collection period for metrics collected from different modules/metricsets. Calculating a rate from these values can be done in Kibana, but I think it might be more beneficial to report a rate directly instead. We are actually already calculating rates in the aws module for EC2 instances. @exekias WDYT?

@exekias commented Nov 11, 2020

Thanks for bringing that up! There are some benefits to storing the delta since the previous fetch, as compared to calculating a per-second rate. Especially when aggregating different time series, the current definition of these metrics performs better.

We are currently discussing the possibility of formalizing this type of metric a little more and adding better support for it in Kibana; happy to report back when we have an update on this.
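
As a toy illustration of why deltas aggregate more easily than per-second rates (hypothetical numbers, not from any module): when several series fall into the same histogram bucket, deltas can simply be summed, while rates must first be re-weighted by each series' collection interval:

# Two hosts whose samples land in the same 60 s bucket.
host_a_deltas = [3_000, 3_200]  # bytes since last fetch, polled every 30 s
host_b_deltas = [12_500]        # bytes since last fetch, polled every 60 s
bucket_total = sum(host_a_deltas) + sum(host_b_deltas)  # exact: 18_700 bytes

# With per-second rates, recovering the same total requires knowing
# each series' (possibly different) collection interval:
host_a_rates = [100.0, 106.7]   # bytes/s
host_b_rates = [208.3]          # bytes/s
approx_total = sum(r * 30 for r in host_a_rates) + sum(r * 60 for r in host_b_rates)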

@webmat changed the title from "[RFC] Add host metric fields to ECS" to "[RFC] Add host metric fields to ECS (stage 2)" on Nov 11, 2020
@webmat (Contributor) commented Nov 11, 2020

It's not clear to me which direction this is going, based on your two comments, @kaiyan-sheng and @exekias 😂

Are these fields meant to capture a rate/s?

@kaiyan-sheng (Contributor, Author) commented:

> Are these fields meant to capture a rate/s?

@webmat Sorry for the confusion! We had a discussion earlier and decided we would like to keep these fields as gauges for now. Rates would be useful, but in some cases we need the original gauge values to aggregate.

@webmat (Contributor) commented Nov 18, 2020

OK, whenever the decision is reached, please make sure to update the RFC to clearly state how these fields will work and how users should populate them in custom data sources.

When I think of CPU usage, a gauge is a no-brainer. There's no time component necessary. The CPU usage was a certain percentage at the time of collection, and that's the end of it.

When I think of IO, however, it's a bit trickier. What I understand from your answers is that the intent is for these IO fields to contain a gauge of the "total since the last metric collection". For example, total bytes/packets in the last 10s. Am I understanding this correctly?

If that's the intent, I think this can work. But then the metrics as described in this RFC are no longer self-contained: to interpret a field containing "23kb", we need to know what the metric collection frequency is. This value can differ per host, or change over time, etc. So if that's where this is going, perhaps we should add a field to represent the collection interval?

@kaiyan-sheng (Contributor, Author) commented:

> If that's the intent, I think this can work. But then the metrics as described in this RFC are no longer self-contained: to interpret a field containing "23kb", we need to know what the metric collection frequency is. This value can differ per host, or change over time, etc. So if that's where this is going, perhaps we should add a field to represent the collection interval?

Thank you for the comment @webmat! Yes, this value is meaningless without a collection frequency. On the Beats side, we have metricset.period to record the frequency. Do you mean we should also add a field under host.*, like host.period, to represent this?
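
For illustration (field values hypothetical), a consumer could combine one of the proposed gauges with metricset.period, which Metricbeat reports in milliseconds, to derive a rate:

# Hypothetical Metricbeat-style event pairing a proposed host.* gauge
# with metricset.period (collection period, in milliseconds).
event = {
    "host": {"network": {"ingress": {"bytes": 230_000}}},  # since last collection
    "metricset": {"period": 10_000},  # 10 s collection period
}

period_seconds = event["metricset"]["period"] / 1000
rate = event["host"]["network"]["ingress"]["bytes"] / period_seconds
print(f"{rate:.0f} bytes/s")  # -> 23000 bytes/s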

@webmat (Contributor) commented Nov 23, 2020

Ah, I was not familiar with the metricset fields. I've looked at a few Metricbeat events, and it looks like there's both metricset.period and metricset.name.

I think it would make more sense to add both of those to ECS as is, rather than define a purpose-specific host.period.

Does anyone see issues with adding metricset.period and metricset.name to ECS? @exekias @ruflin

@webmat added the 1.8.0 label Nov 23, 2020
@exekias commented Nov 24, 2020

I don't think we need the period to make these metrics useful. We don't really use this info when graphing the metrics, as the collection period gets diluted by the bucket size (graphing period).

All this said, I agree it would make sense to add period to ECS, but as a separate effort. WDYT?

@ruflin (Contributor) commented Nov 24, 2020

metricset.name is a field replaced by event.dataset / data_stream.dataset. I don't expect that we will still use it long term.

@webmat (Contributor) commented Nov 24, 2020

I'm fine with separating these two tasks, yeah.

Could you adjust the RFC to mention that the IO metrics are dependent on the collection interval? Here are two things I think would make the proposal clear:

  • In the field definitions, we should replace "in a given period of time", which is vague, with something like "since the last metric collection".
  • In the Fields section, we should mention that interpreting the IO metrics also requires the metric collection period, but that adding this field to ECS will be done separately.

The above would close one of the loops raised by @andrewthad a while ago.

Andrew's other point was about field naming. We've established that these aren't going to be per-second rates. I'm not sure we need to change the names in this proposal, however; the fields in this proposal are similar enough to the existing .bytes and .packets fields we already have, IMO.

@dainperkins (Contributor) commented Nov 24, 2020

Two questions as I go through SNMP while troubleshooting the home network (that's what PTO is for, right?):

  1. For more discrete metrics (e.g. per disk/volume, cpu, or interface), would we expect individual metrics to be written to a specific field set (interface, cpu, disk/volume), with averages going into host?
  2. Should we also consider overall memory utilization for these (e.g. memory.free/used)?

@kaiyan-sheng (Contributor, Author) commented:

@dainperkins Thanks for the review.

> 1. For more discrete metrics (e.g. per disk/volume, cpu, or interface), would we expect individual metrics to be written to a specific field set (interface, cpu, disk/volume), with averages going into host?

Fields defined in this RFC are aggregated values. For example, host.disk.read.bytes is bytes read across all disks/volumes, and host.network.ingress.bytes is the ingress value across all network interfaces.

> 2. Should we also consider overall memory utilization for these (e.g. memory.free/used)?

Memory would be very helpful, but some public cloud providers do not report memory utilization. That's why we did not include it in the list.
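
As a sketch of that aggregation (illustrative only; the per-interface structure shown here is hypothetical and not part of this RFC), per-device readings would be rolled up into the proposed host-level totals like so:

# Hypothetical per-interface readings rolled up into the proposed
# host-level aggregate field host.network.ingress.bytes.
interfaces = [
    {"name": "eth0", "ingress_bytes": 120_000},
    {"name": "eth1", "ingress_bytes": 4_500},
]
host_network_ingress_bytes = sum(i["ingress_bytes"] for i in interfaces)  # 124_500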

@kaiyan-sheng (Contributor, Author) commented:

> • In the field definitions, we should replace "in a given period of time", which is vague, with something like "since the last metric collection".
> • In the Fields section, we should mention that interpreting the IO metrics also requires the metric collection period, but that adding this field to ECS will be done separately.

Done! Thank you @webmat! Please let me know if the new commit works for you!

> Andrew's other point was about field naming. We've established that these aren't going to be per-second rates. I'm not sure we need to change the names in this proposal, however; the fields in this proposal are similar enough to the existing .bytes and .packets fields we already have, IMO.

Agree with keeping the names.

@andrewthad (Contributor) commented:

Responding to #1028 (comment): I too gather metrics over SNMP, and my take is that the SNMP use case will need different fields for almost all of these. Fields like:

  • component.cpu.usage (component is often referred to as a "plane" in switches)
  • component.memory.usage
  • component.name
  • host.interface.name
  • host.interface.bytes (ingress plus egress)
  • host.interface.ingress.bytes
  • host.interface.egress.bytes

I've been using fields named like this in some internal applications. I believe that these are all possible future extensions to ECS.

@dainperkins (Contributor) commented:

@andrewthad the interface fields are eventually meant to hold interface stats (e.g. standard SNMP-type metrics). I'll work on adding the rest of the fields (from the standard IF-MIB).

@kaiyan-sheng (Contributor, Author) commented:

Thanks @dainperkins and @andrewthad! My thought is to keep this RFC limited to the 7 new fields; the rest can be added in a separate PR.

@webmat previously approved these changes Nov 30, 2020
@webmat (Contributor) left a comment:

Alright, the proposal looks good to me. I'm good to merge at stage 2.

@exekias can you confirm everything's good on your end as well, and that there have been no recent developments around this? I'll wait for your final review before merging.

One thing I'd like to cover in the stage 3 PR is how these host metric events are identified. In other words, if someone wants to publish these host metrics from a custom source, what do they have to do for them to be picked up by Observability?

@webmat requested a review from exekias on November 30, 2020
@kaiyan-sheng (Contributor, Author) commented:

@webmat Thank you for the review.

> One thing I'd like to cover in the stage 3 PR is how these host metric events are identified. In other words, if someone wants to publish these host metrics from a custom source, what do they have to do for them to be picked up by Observability?

My understanding is: once these fields are in ECS, I will make all the changes in Metricbeat to match the new field names. After that, the Observability Metrics UI in Kibana will start adopting the new fields. If one publishes events with these host metrics, they should be able to see them in the Metrics UI.

@exekias commented Dec 1, 2020

> @exekias can you confirm everything's good on your end as well, and that there have been no recent developments around this? I'll wait for your final review before merging.

This is looking good from my side!

@webmat removed the request for review from exekias on December 1, 2020
@webmat (Contributor) left a comment:

Alright, thanks for your involvement, everyone!

@webmat merged commit c2f141b into elastic:master on Dec 1, 2020
@kaiyan-sheng deleted the add_host_fields branch on December 1, 2020
@ebeahan removed the 1.8.0 label on Dec 14, 2020