
[RFC] Add host metric fields to ECS (stage 2) #1028

Merged (8 commits into elastic:master) on Dec 1, 2020

Conversation

@kaiyan-sheng (Contributor) commented Oct 15, 2020

This PR is stage 2 of the RFC for adding host metric fields into ECS.

Preview of the RFC

@kaiyan-sheng self-assigned this Oct 15, 2020
@ebeahan (Member) commented Oct 16, 2020

This may have come up in discussion previously; apologies if so.

The initial adoption and usage of these fields is focused on the "big three" public cloud providers:

* aws ec2 metricset
* googlecloud compute metricset
* azure compute_vm metricset

Are there plans for any other metricsets to adopt these host.* fields as well over time? Or is the focus primarily on public cloud infrastructure?

@dainperkins (Contributor) commented:

Was going thru SNMP polling with a client - thinking it would be nice to have interfaces broken out on any host with more than one; rolling them all up into a single pair of metrics is less useful than splitting them out. Things like discards and errors are definitely helpful in terms of network troubleshooting. (CPUs and physical disks also seem like they would benefit from separation - one pegged CPU/disk out of four could indicate a problem that would be masked by an average across them.)

@kaiyan-sheng (Contributor, Author) commented:

> Was going thru SNMP polling with a client - thinking it would be nice to have interfaces broken out on any host with more than one; rolling them all up into a single pair of metrics is less useful than splitting them out. Things like discards and errors are definitely helpful in terms of network troubleshooting. (CPUs and physical disks also seem like they would benefit from separation - one pegged CPU/disk out of four could indicate a problem that would be masked by an average across them.)

@dainperkins Yes, detailed per-interface metrics are definitely sometimes useful, and sometimes an aggregated value is good enough. We are building this list of host fields across different kinds of VMs mainly to give the Metrics UI a centralized set of fields for displaying metrics from all hosts. If a user sees a problem on one host, they can then drill into more granular metrics, such as per-interface network performance data points.

@kaiyan-sheng (Contributor, Author) commented:

> This may have come up in discussion previously; apologies if so.
>
> The initial adoption and usage of these fields is focused on the "big three" public cloud providers:
>
> * aws ec2 metricset
> * googlecloud compute metricset
> * azure compute_vm metricset
>
> Are there plans for any other metricsets to adopt these host.* fields as well over time? Or is the focus primarily on public cloud infrastructure?

@ebeahan The system module has also adopted these host.* fields, on top of the three public cloud providers. In the future, we will require/suggest that any new host/VM-related integrations adopt these fields.

@dainperkins (Contributor) commented:

@kaiyan-sheng my point is more that aggregated device metrics are fine when everything is fine, but somewhat irrelevant from a performance perspective, where an individual device's performance can be specifically related to overall performance issues (as opposed to pooled resources like memory, and to some degree CPU - though I have seen performance issues with a specific CPU/core at 99% in a multi-CPU server - I think it was SAP, IIRC - that were completely hidden by aggregate stat analysis).

Scalar values like ingress.bytes, as opposed to utilization percentages, are even worse, as they provide no easy-to-read context in terms of utilization.

As soon as there's an issue, however, viewing aggregated metrics across multiple devices that are not a pooled resource [interfaces, disks, less so CPU in most cases] often hides the problem - particularly in cases of multiple NICs (management + production, inside & outside, physical/virtual interfaces, etc.) or multiple disks (where performance is often tied to a specific device in the group [OS volume vs. data volume]).

I haven't looked at the major cloud provider metricsets, but certainly from a bare-metal/VM perspective, gathering detailed per-device stats from a given system will ultimately be much more useful for troubleshooting, and likely easier to implement outside of certain cloud models (thinking of SNMP, WMI, statistic-gathering tools, and anyone unable to use Beats on their appliances). Those stats can then be rolled up into visualizations that provide aggregations, as opposed to building aggregations into host fields.

From an ECS perspective, I suggest that building out the interface fields (which has already begun, and could quickly be extended for both hosts and e.g. observers) and adding a disk or volume field set would be the place to put discrete metrics, which could then be aggregated as needed. ECS has usually preferred to skip field sets for pre-aggregated metrics, with some exceptions for information provided by a standard input type (e.g. NetFlow can send average throughput, etc.).

@MikePaquette any thoughts on the fields as proposed?

@webmat (Contributor) commented Oct 28, 2020

Thanks for the thoughts @dainperkins, we definitely need to consider this for the future.

I like the current focus of this RFC: establishing the common baseline that can be gathered across various sources; a lowest common denominator, if you like.

I definitely agree there's also value in standardizing more metrics, even if they're not available everywhere. I suggest, @kaiyan-sheng, that you capture these ideas in the "Concerns" section, to be kept in mind for a subsequent RFC.

I want to keep the scope of this RFC fixed, so we can continue to move forward.

@ebeahan (Member) left a comment:

This is looking great, @kaiyan-sheng!

As part of advancing this proposal, we'll also want to capture experimental field definitions for the proposed fields. These YAML files will allow anyone who wishes to test out these fields to generate artifacts using the ECS tooling's --include flag.

The definitions can be added in a single host.yml file. Here's a starting-point example using the first two fields:

- name: host
  fields:
    - name: cpu.usage
      type: scaled_float
      scaling_factor: 1000
      level: extended
      short: Percent CPU used with scaling factor of 1000.
      description: >
        Percent CPU used with scaling factor of 1000. This value is normalized by
        the number of CPU cores and it ranges from 0 to 1.

        For example: For a two core host, this value should be the average of the
        two cores, between 0 and 1.

    - name: network.ingress.bytes
      type: long
      level: extended
      short: The number of bytes received on all network interfaces.
      description: >
        The number of bytes received (gauge) on all network interfaces by the
        host in a given period of time.
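
For anyone wanting to try these out, generating artifacts with the experimental definitions would look roughly like this (a sketch assuming the ECS repo's scripts/generator.py entry point and that the definitions live under rfcs/text/0005/fields; the exact path may differ):

    python scripts/generator.py --include rfcs/text/0005/fields/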

Examples from other RFCs: 0001 and 0007.

Since we will have both example source documents and field definitions, it may make sense to create two subdirectories underneath rfcs/text/0005: perhaps rfcs/text/0005/fields and rfcs/text/0005/examples?

@andrewthad (Contributor) commented:

Thank you so much @kaiyan-sheng for writing this up! At my workplace, we use SNMP to collect a lot of data, and I was about to do nearly the same thing when I discovered that you've already got something good started. I have a few concerns. The description of network.egress.bytes is:

> The number of bytes received (gauge) on all network interfaces by the host in a given period of time.

This suggests that the field could be populated by summing all of the SNMP ifInOctets values (ignoring the fact that aggregate links can cause some bytes to be counted twice). In SNMP, ifInOctets and its 64-bit variant are of type COUNTER. That is, they are monotonically increasing counters, so to get anything useful from them, you have to poll periodically and then subtract the earlier value from the later one. Tracking the interval is important, and typically users want this information as bits per second, not as bytes over an arbitrary interval.

What I'm arriving at is this claim: network.ingress.bytes and the related fields are meaningless when event.duration is not set. (Perhaps there is some other field used to communicate a timespan, but I am not aware of one.) The documentation should clarify that event.duration should be set if any of these fields are used. But it would be good to hear if others feel the same way. The issue of duration didn't come up in #950, so maybe there are reasons that capturing the duration is not important, or maybe I've overlooked a way that this is supposed to be done.
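
To make the counter-delta point concrete, here is a minimal sketch (illustrative only, not from the RFC) of turning two polls of a 64-bit SNMP counter such as ifHCInOctets into the bits-per-second figure described above:

# Sketch: derive a rate from two polls of a monotonically increasing
# 64-bit SNMP COUNTER value (e.g. ifHCInOctets).
COUNTER64_MAX = 2**64

def bits_per_second(prev_octets, curr_octets, interval_seconds):
    # The modulo handles a single counter wrap between polls; a device
    # reboot would also reset the counter and is indistinguishable here.
    delta_bytes = (curr_octets - prev_octets) % COUNTER64_MAX
    return delta_bytes * 8 / interval_seconds

# e.g. two polls 60 s apart:
bits_per_second(10_000, 70_000, 60)  # -> 8000.0 bits/s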

@webmat (Contributor) left a comment:

Thanks for chiming in @andrewthad, you make a very good point.

We skirted the subject when we discussed the monotonic counters. We've decided to stick to rates, but it's true that the current definitions are still a bit loose.

In the current wording, it sounds like the periodicity of collection could change, and the number would change accordingly (e.g. collect every second and the value is 300 kb, vs. collect every 10 seconds and the value is 3000 kb).

I initially understood the intent to be that these fields record rates per second. Was my understanding correct?

If these fields are meant to capture rates per second, please adjust their descriptions accordingly.

If the value of the metrics were meant to be dependent on the collection period, then I agree we're missing a piece. We should clarify.

@andrewthad (Contributor) commented Nov 9, 2020

I'm not sure what the author intended, but to me it seems more useful to have a byte rate rather than a total and a duration. In Kibana, it is nearly impossible to work with fields that need to be multiplied or divided by other fields.

If it is going to be interpreted as a rate rather than as a total, I have a different concern: the name of the field is misleading and inconsistent with the way other ECS fields are named. Everywhere else, fields suffixed with the word "bytes" refer to a total, not a rate. The suffix "bytes_per_second" is clearer, and it paves the way for extensions like "client.bytes_per_second".

@kaiyan-sheng (Contributor, Author) commented:

Thank you @andrewthad and @webmat! I agree, a rate would definitely be more helpful here. Our intention is actually to rely on the collection period for metrics collected from different modules/metricsets. Calculating a rate from these values can be done in Kibana, but I think it might be more beneficial to report a rate directly instead. We are actually already calculating rates in the aws module for EC2 instances. @exekias WDYT?

@exekias commented Nov 11, 2020

Thanks for bringing that up! There are some benefits to storing the delta since the previous fetch, as compared to calculating a per-second rate. Especially when aggregating different time series, the current definition of these metrics performs better.

We are currently discussing the possibility of formalizing this type of metric a little more and adding better support for it in Kibana; happy to report back when we have an update on this.
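
As a toy illustration of why deltas aggregate more easily than per-second rates (hypothetical numbers, not from any module): when several series fall into the same histogram bucket, deltas can simply be summed, while rates must first be re-weighted by each series' collection interval:

# Two hosts whose samples land in the same 60 s bucket.
host_a_deltas = [3_000, 3_200]  # bytes since last fetch, polled every 30 s
host_b_deltas = [12_500]        # bytes since last fetch, polled every 60 s
bucket_total = sum(host_a_deltas) + sum(host_b_deltas)  # exact: 18_700 bytes

# With per-second rates, recovering the same total requires knowing
# each series' (possibly different) collection interval:
host_a_rates = [100.0, 106.7]   # bytes/s
host_b_rates = [208.3]          # bytes/s
approx_total = sum(r * 30 for r in host_a_rates) + sum(r * 60 for r in host_b_rates)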

@webmat changed the title from "[RFC] Add host metric fields to ECS" to "[RFC] Add host metric fields to ECS (stage 2)" on Nov 11, 2020
@webmat (Contributor) commented Nov 11, 2020

It's not clear to me which direction this is going, based on your two comments, @kaiyan-sheng and @exekias 😂

Are these fields meant to capture a rate/s?

@kaiyan-sheng (Contributor, Author) commented:

> Are these fields meant to capture a rate/s?

@webmat Sorry for the confusion! We had a discussion earlier and decided we would like to keep these fields as gauges for now. Rates would be useful, but in some cases we need the original gauge values to aggregate.

@webmat (Contributor) commented Nov 18, 2020

OK, whenever the decision is reached, please make sure to update the RFC to clearly state how these fields will work and how users should populate them in custom data sources.

When I think of CPU usage, a gauge is a no-brainer. There's no time component necessary. The CPU usage was a certain percentage at the time of collection, and that's the end of it.

When I think of IO, however, it's a bit trickier. What I understand from your answers is that the intent is for these IO fields to contain a gauge of the "total since the last metric collection". For example, total bytes/packets in the last 10s. Am I understanding this correctly?

If that's the intent, I think this can work. But then the metrics as described in this RFC are no longer self-contained: to interpret a field containing "23kb", we need to know what the metric collection frequency is. This value can differ per host, or change over time, etc. So if that's where this is going, perhaps we should add a field to represent the collection interval?

@kaiyan-sheng (Contributor, Author) commented:

> If that's the intent, I think this can work. But then the metrics as described in this RFC are no longer self-contained: to interpret a field containing "23kb", we need to know what the metric collection frequency is. This value can differ per host, or change over time, etc. So if that's where this is going, perhaps we should add a field to represent the collection interval?

Thank you for the comment @webmat! Yes, this value is meaningless without a collection frequency. On the Beats side, we have metricset.period to record the frequency. Do you mean we should also add a field under host.*, like host.period, to represent this?
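
For illustration (field values hypothetical), a consumer could combine one of the proposed gauges with metricset.period, which Metricbeat reports in milliseconds, to derive a rate:

# Hypothetical Metricbeat-style event pairing a proposed host.* gauge
# with metricset.period (collection period, in milliseconds).
event = {
    "host": {"network": {"ingress": {"bytes": 230_000}}},  # since last collection
    "metricset": {"period": 10_000},  # 10 s collection period
}

period_seconds = event["metricset"]["period"] / 1000
rate = event["host"]["network"]["ingress"]["bytes"] / period_seconds
print(f"{rate:.0f} bytes/s")  # -> 23000 bytes/s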

@webmat (Contributor) commented Nov 23, 2020

Ah, I was not familiar with the metricset fields. I've looked at a few Metricbeat events, and it looks like there's both metricset.period and metricset.name.

I think it would make more sense to add both of those to ECS as is, rather than define a purpose-specific host.period.

Does anyone see issues with adding metricset.period and metricset.name to ECS? @exekias @ruflin

@webmat added the 1.8.0 label Nov 23, 2020
@exekias commented Nov 24, 2020

I don't think we need the period to make these metrics useful. We don't really use this info when graphing the metrics, as the collection period gets diluted by the bucket size (graphing period).

All this said, I agree it would make sense to add period to ECS, but as a separate effort. WDYT?

@ruflin (Contributor) commented Nov 24, 2020

metricset.name is a field replaced by event.dataset / data_stream.dataset. I don't expect that we will still use it long term.

@webmat (Contributor) commented Nov 24, 2020

I'm fine with separating these two tasks, yeah.

Could you adjust the RFC to mention that the IO metrics are dependent on the collection interval? Here are two things I think would make the proposal clear:

  • In the field definitions, we should replace "in a given period of time", which is vague, with something like "since the last metric collection".
  • In the Fields section, we should mention that interpreting the IO metrics also requires the metric collection period, but that adding this field to ECS will be done separately.

The above would close one of the loops raised by @andrewthad a while ago.

Andrew's other point was about field naming. We've established that these aren't going to be per-second rates. I'm not sure we need to change the names in this proposal, however; the fields in this proposal are similar enough to the existing .bytes and .packets fields we already have, IMO.

@dainperkins (Contributor) commented Nov 24, 2020

Two questions as I go through SNMP while troubleshooting the home network (that's what PTO is for, right?):

  1. For more discrete metrics (e.g. per disk/volume, cpu, or interface), would we expect individual metrics to be written to a specific field set (interface, cpu, disk/volume), with averages going into host?
  2. Should we also consider overall memory utilization for these (e.g. memory.free/used)?

@kaiyan-sheng (Contributor, Author) commented:

@dainperkins Thanks for the review.

> 1. For more discrete metrics (e.g. per disk/volume, cpu, or interface), would we expect individual metrics to be written to a specific field set (interface, cpu, disk/volume), with averages going into host?

Fields defined in this RFC are aggregated values. For example, host.disk.read.bytes is bytes read across all disks/volumes, and host.network.ingress.bytes is the ingress value across all network interfaces.

> 2. Should we also consider overall memory utilization for these (e.g. memory.free/used)?

Memory would be very helpful, but some public cloud providers do not report memory utilization. That's why we did not include it in the list.
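
As a sketch of that aggregation (illustrative only; the per-interface structure shown here is hypothetical and not part of this RFC), per-device readings would be rolled up into the proposed host-level totals like so:

# Hypothetical per-interface readings rolled up into the proposed
# host-level aggregate field host.network.ingress.bytes.
interfaces = [
    {"name": "eth0", "ingress_bytes": 120_000},
    {"name": "eth1", "ingress_bytes": 4_500},
]
host_network_ingress_bytes = sum(i["ingress_bytes"] for i in interfaces)  # 124_500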

@kaiyan-sheng (Contributor, Author) commented:

> • In the field definitions, we should replace "in a given period of time", which is vague, with something like "since the last metric collection".
> • In the Fields section, we should mention that interpreting the IO metrics also requires the metric collection period, but that adding this field to ECS will be done separately.

Done! Thank you @webmat! Please let me know if the new commit works for you!

> Andrew's other point was about field naming. We've established that these aren't going to be per-second rates. I'm not sure we need to change the names in this proposal, however; the fields in this proposal are similar enough to the existing .bytes and .packets fields we already have, IMO.

Agree with keeping the names.

@andrewthad (Contributor) commented:

Responding to #1028 (comment): I too gather metrics over SNMP, and my take is that the SNMP use case will need different fields for almost all of these. Fields like:

  • component.cpu.usage (component is often referred to as a "plane" in switches)
  • component.memory.usage
  • component.name
  • host.interface.name
  • host.interface.bytes (ingress plus egress)
  • host.interface.ingress.bytes
  • host.interface.egress.bytes

I've been using fields named like this in some internal applications. I believe that these are all possible future extensions to ECS.

@dainperkins (Contributor) commented:

@andrewthad the interface fields are eventually meant to hold interface stats (e.g. standard SNMP-type metrics). I'll work on adding the rest of the fields (from the standard IF-MIB).

@kaiyan-sheng (Contributor, Author) commented:

Thanks @dainperkins and @andrewthad! My thought is to keep this RFC limited to the 7 new fields; the rest can be added in a separate PR.

@webmat previously approved these changes Nov 30, 2020
@webmat (Contributor) left a comment:

Alright, the proposal looks good to me. I'm good to merge at stage 2.

@exekias can you confirm everything's good on your end as well, and that there have been no recent developments around this? I'll wait for your final review before merging.

One thing I'd like to cover in the stage 3 PR is how these host metric events are identified. In other words, if someone wants to publish these host metrics from a custom source, what do they have to do for them to be picked up by Observability?

@webmat requested a review from exekias on November 30, 2020
@kaiyan-sheng (Contributor, Author) commented:

@webmat Thank you for the review.

> One thing I'd like to cover in the stage 3 PR is how these host metric events are identified. In other words, if someone wants to publish these host metrics from a custom source, what do they have to do for them to be picked up by Observability?

My understanding is: once these fields are in ECS, I will make all the changes in Metricbeat to match the new field names. After that, the Observability Metrics UI in Kibana will start adopting the new fields. If one publishes events with these host metrics, they should be able to see them in the Metrics UI.

@exekias commented Dec 1, 2020

> @exekias can you confirm everything's good on your end as well, and that there have been no recent developments around this? I'll wait for your final review before merging.

This is looking good from my side!

@webmat removed the request for review from exekias on December 1, 2020
@webmat (Contributor) left a comment:

Alright, thanks for your involvement, everyone!

@webmat merged commit c2f141b into elastic:master on Dec 1, 2020
@kaiyan-sheng deleted the add_host_fields branch on December 1, 2020
@ebeahan removed the 1.8.0 label on Dec 14, 2020