Schema for metrics #474
Great write-up @axw 😃
Could you elaborate on why the metrics would contain "runtime details" like names and versions?
Do you mean …? I also just want to mention that in the cases where the metric that an APM agent wants to collect exists across all languages and is counted in the same manner, we should strongly consider a shared namespace for it instead of having the same metric repeated under each language namespace. E.g. in the Node.js agent we want to count the number of user and system CPU ticks for the process. This is a prime candidate for standardization and would be …
I just meant that the field sets (e.g. …) …
No, I meant rather that we should pick one format and be consistent. We're currently using "read.count" for one metric, and "ticks" for another metric. I tend to think we should use ".ticks".
👍 In this specific example, I'd say those metric fields belong in the process field set.
Thanks for this great writeup @axw. I think it's a good time to challenge the Metricbeat naming conventions and improve them. I would really like to have elastic/elasticsearch#31244, but I agree we should not be blocked by it, and that is a reason some fields in ECS already today have …

For the migration: we have the field alias feature, so I would directly start with the names we want in the future and potentially alias the old fields (if possible).

To move this effort forward we could start with a PR for one specific environment like Node.js or Golang (or both). Having specific fields to discuss will indirectly also drive the discussion about building / updating the convention. @axw @watson willing to open PRs against https://github.com/elastic/ecs/tree/ecs-metrics?
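The field alias idea mentioned here could be sketched roughly as follows. This is an illustrative Elasticsearch mapping expressed as a Python dict; the field names and types are made up for the example and are not the actual ECS mapping:

```python
# Sketch (illustrative field names): an Elasticsearch mapping that keeps the
# old Metricbeat name queryable as an alias of the new ECS-style field.
# An `alias` field is resolved at query time; no data is duplicated.
mapping = {
    "mappings": {
        "properties": {
            # New ECS-style field that documents are actually indexed into.
            "host.cpu.usage": {"type": "scaled_float", "scaling_factor": 1000},
            # Old name kept as a query-time alias pointing at the new field.
            "system.cpu.usage": {"type": "alias", "path": "host.cpu.usage"},
        }
    }
}
```

Queries and aggregations against `system.cpu.usage` would then transparently hit `host.cpu.usage`, which is what makes a rename like this feasible without breaking existing dashboards.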
For Go it's a little tricky, since we've got "golang" metrics coming from Metricbeat already. There's a PR ongoing for the Node.js agent which we can extract the new fields from: https://github.com/elastic/apm-agent-nodejs/pull/1021/files. I'll take a look at doing that later to get things started, unless @watson or @Qard particularly want to :)
Got'ya, it's just that you wrote …
If you could take it, it would be great 👍
Thanks for this detailed writeup, @axw! I love the attention given to improving the consistency of the naming of the various metrics.

One thing I'd like to challenge is the creation of metric namespaces per language. In ECS, I would much prefer having one place to nest runtime metrics, which we fill as best we can depending on the runtime/language. The goal here is not to mix/consume the metrics of different runtimes together (although that opens up the possibility). In most cases you'd of course filter for JVM stuff for a JVM-centric dashboard, for example (e.g. …).

This may sound like a 180-degree change from the comment I made in April. But back then, I didn't expect this to be proposed for addition to ECS right away. Right now I'm seeing a proposal to add what's essentially an unbounded amount of field sets to ECS, for the metrics of all the runtimes. I'd much rather have the common part of these metric sets added to ECS in one common place, like …

One thing we have steered away from in the Elastic Common Schema is including brand or technology names (e.g. …), so it's …
If we do that, we're kind of back to square one on elastic/beats#11836. What we're trying to do here is define a common dictionary of metrics, many of which will be runtime-specific, e.g. "number of goroutines". Even "event loop delay", while it may be common to a subset of languages, is definitely not common to all, so it probably doesn't belong directly on "runtime".

If it were just APM implementing these metrics I'd be inclined to agree; we could just document the runtime-specific metrics separately. In fact, that's what we started out doing. But Metricbeat will start gathering some of these metrics too.

@webmat Is the issue you have with "[adding] what's essentially an unbounded amount of field sets to ECS, for the metrics of all the runtimes" about the effect it'll have on the readability of the docs? (If so, is it a matter of reorganising things somehow, to be nested under a common "runtime" namespace?) Or something else?
I think we should stick to the plan for now to have one prefix for each runtime, and we will figure out the generalisation later on. I still think we should do the same for Kubernetes etc. But there is the problem that @webmat mentioned about having, in the end, too many fields in the ECS template. I think it's time to come up with a solution here, for example introducing different parts of ECS and potentially having one template for each, so someone can use only the basic fields if they want, or also get the metrics. In any case, I think this discussion should not block any progress on the metrics branch, as having the fields already there will make the discussion more concrete.
To summarize: we'll stage a proposal in the ECS metrics branch while we sort out the metrics for each runtime. For now we're going with individual namespaces for each runtime, because the metrics are going to be quite different between runtimes.
We've started a publicly viewable spreadsheet here |
One additional interesting note here that came out of #477: Prometheus adds the units to its metric names: https://prometheus.io/docs/practices/naming/#metric-names. Perhaps, to prevent future conflicts between objects and keywords, we could also use an underscore notation like …
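To make the object-vs-keyword conflict concrete: in an Elasticsearch mapping, a dotted unit suffix such as `system.memory.used.bytes` forces `system.memory.used` to be an object, so it can no longer also hold a value, whereas an underscore suffix keeps the unit on the leaf. A small illustrative check (the helper name is made up for this sketch):

```python
def mapping_conflicts(fields):
    """Return field names that are both a leaf and a parent of another
    field, which a single Elasticsearch mapping cannot represent
    (a field cannot be both an object and a concrete value)."""
    field_set = set(fields)
    parents = set()
    for name in field_set:
        parts = name.split(".")
        for i in range(1, len(parts)):
            parents.add(".".join(parts[:i]))
    return sorted(field_set & parents)

# Dotted unit suffix collides with an existing leaf field:
print(mapping_conflicts(["system.memory.used", "system.memory.used.bytes"]))
# -> ['system.memory.used']

# Underscore notation keeps `used_bytes` as a sibling leaf instead:
print(mapping_conflicts(["system.memory.used", "system.memory.used_bytes"]))
# -> []
```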
@ruflin pointed out that we have more JVM metrics recorded for Logstash via its Node Stats API: https://www.elastic.co/guide/en/logstash/current/node-stats-api.html#jvm-stats. All very similar to what the APM Java agent is reporting, or will report in the future. |
Elasticsearch has a new histogram field in the works. This could become relevant for some of the metrics: elastic/elasticsearch#48580
There is now a proposed convention for system metrics from otel: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/semantic_conventions/system-metrics.md |
I think it is really time we move forward here and come to a conclusion. We now also use some system metrics in the upcoming host view: elastic/kibana#138173. I know this issue was initially opened in the context of APM and system metrics, but to reduce scope I propose to scale this down to only a minimal set of system metrics. What @neptunian uses in elastic/kibana#138173 seems like a good initial set. @lalit-satapathy This would also be helpful for standardising on some host metrics from OTel.
Hi @ruflin! I think that introducing a minimal set of system metrics into ECS would be nice. While not exactly the same, there's currently a stage 1 RFC for adding cgroup metrics into ECS. |
I opened a proposal PR for host metrics here: #2129 It is still in draft mode but I keep adding content and will convert it to a full RFC soonish. |
@ruflin Any updates to this? I'm working on a couple of projects for infra (network/file/system) monitoring and it would be very helpful for customers to refer to ECS metric field best practices instead of looking up system metrics over here: https://docs.elastic.co/en/integrations/system. |
@ChrsMark Can you chime in here? I think there are some moving parts here now related to the merger with SemConv.
Hey! There is a WG taking place these weeks in Otel SemConv: open-telemetry/opentelemetry-specification#3556 |
In elastic/beats#11836 we identified the need to maintain some kind of central definition of metrics that are implemented by both Beats and Elastic APM. The APM agents now produce a subset of the `system.cpu.*`, `system.memory.*`, and `system.process.*` metrics defined by Metricbeat. The APM agents are now starting to define language/runtime-specific metrics, and Metricbeat will also start producing at least some of these.

We would like to propose extending ECS to cover metrics.
Organisation
I propose we introduce new field sets for:

- `jvm.*`
- `golang.*`
- `nodejs.*`
- `dotnet.*`
- `python.*`
- `ruby.*`

These field sets may contain a mixture of metrics (e.g. heap usage, GC pause timings) and runtime details (e.g. JVM implementation name and version).
Existing system metrics should be renamed to fit into the existing ECS field sets, e.g.

- `system.cpu.*` -> `host.cpu.*`
- `system.memory.*` -> `host.memory.*`
- `system.process.*` -> `process.*`
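A minimal sketch of the proposed renames expressed as a prefix mapping (the mapping comes from this issue; the helper function is illustrative, not part of any shipped tooling):

```python
# Proposed prefix renames from this issue (old Metricbeat name -> ECS name).
RENAMES = {
    "system.cpu.": "host.cpu.",
    "system.memory.": "host.memory.",
    "system.process.": "process.",
}

def rename_field(name: str) -> str:
    """Map an old Metricbeat-style field name onto the proposed ECS name;
    names outside the renamed prefixes are returned unchanged."""
    for old, new in RENAMES.items():
        if name.startswith(old):
            return new + name[len(old):]
    return name

print(rename_field("system.cpu.user.pct"))       # -> host.cpu.user.pct
print(rename_field("system.process.memory.rss")) # -> process.memory.rss
```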
Naming conventions
Ideally, we would extend the conventions documentation to cover metric naming. Specifically, we should ensure the consistent use of units in metric names, so that metrics are discoverable and self-documenting, both of which are important in Metrics Explorer.
One challenge here is that some existing metrics have inconsistent naming. For example:

- `system.diskio.read.count`, `system.process.cpu.user.ticks` (we should either pick `.reads` or `.ticks.count`)
- `system.process.cpu.total.value` (`.value` isn't meaningful, and there's no unit)
- `system.memory.total` vs. `system.memory.used.bytes` (both are bytes, but only one says so in the name)

We should review https://github.com/elastic/beats/blob/master/docs/devguide/event-conventions.asciidoc#standardised-names, revising it for inclusion in the ECS conventions.
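A rule like "every metric name ends in a unit" could even be checked mechanically. A sketch, where the unit vocabulary is an assumption made up for the example, not an agreed ECS list:

```python
# Illustrative unit vocabulary (an assumption for this sketch, not ECS policy).
KNOWN_UNITS = {"bytes", "pct", "ticks", "count", "ms", "sec"}

def missing_unit(name: str) -> bool:
    """True if the metric name's last dotted segment doesn't name a unit."""
    return name.rsplit(".", 1)[-1] not in KNOWN_UNITS

for name in ["system.memory.total",
             "system.memory.used.bytes",
             "system.process.cpu.total.value"]:
    if missing_unit(name):
        print(f"{name}: no unit in name")
```

Run over the examples above, this flags `system.memory.total` and `system.process.cpu.total.value` while accepting `system.memory.used.bytes`, which is exactly the inconsistency being described.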
Key points:

- all metrics must specify a unit. This means finding an alternative to the following rule from the Beats conventions, since "value" is not a unit: …
- units are not necessarily SI; they may be domain-specific, e.g. "objects" and "mallocs" are fine
Alternatively to including units in names, we could wait for elastic/elasticsearch#31244 or elastic/elasticsearch#33267. However, there are existing fields in ECS that include units (`network.bytes`, `http.request.bytes`, ...), so it may be best not to wait, and to do that in a future revision across the board.

Open questions:

- … the `host.*` and `process.*` field sets above.

Metrics implementation guide
The definition of metrics is sometimes subtle and platform-specific; without providing a guide, it is easy for inconsistencies to arise between implementations. We should provide a detailed guide to calculating these metrics, either in ECS or in a companion document linked from ECS.
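As an example of the kind of subtlety such a guide would need to pin down, here is a hypothetical sketch of one possible definition of process CPU usage from tick deltas. The function and parameter names are made up; whether to normalise by CPU count is exactly the sort of choice implementations would have to agree on:

```python
def cpu_pct(used_delta_ticks: int, elapsed_ticks: int, ncpus: int,
            normalized: bool = True) -> float:
    """Fraction of CPU used over a sampling interval (illustrative sketch).

    used_delta_ticks: user + system ticks consumed since the last sample.
    elapsed_ticks:    wall-clock ticks elapsed since the last sample.
    normalized=True divides by the CPU count so the value stays in [0, 1]
    on multi-core hosts; without it, the value can exceed 1.0.
    """
    if elapsed_ticks <= 0:
        return 0.0
    pct = used_delta_ticks / elapsed_ticks
    return pct / ncpus if normalized else pct

# 150 ticks of CPU consumed over a 100-tick interval on a 2-core host:
print(cpu_pct(150, 100, 2))                    # 0.75 (normalized)
print(cpu_pct(150, 100, 2, normalized=False))  # 1.5
```

If one agent reports the normalized value and another the raw one, the "same" metric silently disagrees by a factor of the core count, which is the kind of inconsistency the guide is meant to prevent.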