Guidance needed: process vs system vs container vs k8s vs runtime metrics #1161
Comments
cc @open-telemetry/semconv-system-approvers @open-telemetry/semconv-container-approvers
I can try to address some of the problems.
I don't think we've written the expectation anywhere. There was a discussion about this in a PR from almost a year ago that was adding a platform-specific process metric, where it seemed like …
That's true, but in the case of …
Right now there are three …
Those are metrics that I would probably only expect to see on a process resource. However, if I'm understanding correctly, the preferred state would be, instead of …
That's true afaik, but I think the reason there is this separation in the first place is that there's the …
Thank you for the context! First, I'm not trying to push in any specific direction. I'd be happy with any outcome that would minimize confusion and duplication. If we look into process/system metrics:
Would we be able to unify them?
- It should be fine to start using resource attributes as attributes on metrics - today we just imply them, but still, without a pid attribute (or …
- Would we still need to provide a cheaper cross-machine CPU time metric as a convenience, in case someone doesn't want per-process metrics?
- There would be metrics that won't have pid as an attribute, e.g. the number of active|started|stopped processes - they'd happily stay under …
- Some metrics could have a required `process.pid` attribute if they don't make sense across the machine (see the sketch after this list).

What problems would it solve: …
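A minimal sketch of what the unified shape could look like, assuming the OpenTelemetry Python API plus a psutil callback (both illustrative; the attribute placement is the point, and the metric name here is hypothetical, not an agreed convention):

```python
# Sketch only: per-process CPU time as one metric with `process.pid` as a plain
# metric attribute; the machine-wide view is the same metric with pid dropped.
import psutil
from opentelemetry import metrics
from opentelemetry.metrics import Observation

meter = metrics.get_meter("unification-sketch")

def observe_cpu_time(options):
    # One observation per process; aggregating with `process.pid` removed
    # yields the cross-machine CPU time, so no second metric is needed.
    for proc in psutil.process_iter(["pid", "cpu_times"]):
        times = proc.info["cpu_times"]
        if times is None:  # access denied / zombie process
            continue
        yield Observation(
            times.user + times.system,
            {"process.pid": proc.info["pid"]},
        )

meter.create_observable_counter(
    "process.cpu.time",  # hypothetical unified name
    callbacks=[observe_cpu_time],
    unit="s",
    description="CPU time per process; drop process.pid for the machine total",
)
```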
I'm sure there could be some problems with this approach too. Happy to hear any feedback.
If I remember correctly, that's the main reasoning the System Metrics WG has arrived at so far.
An equivalent Node vs Pod example would mean reporting something like …
What would happen if users decide to switch from one option to the other? It's still not clear to me what the options would look like, but I guess that could end up being more complicated for users compared to the current distinction? Also, what would be the cardinality impact and query load impact from this?
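A back-of-envelope illustration of the cardinality question, with made-up per-host numbers (none of these figures come from the thread):

```python
# Rough series-count arithmetic: putting pid on every metric point multiplies
# cardinality by the process count, regardless of whether pid lives on the
# resource or on the metric attributes.
hosts = 100
processes_per_host = 300   # assumption
cpu_states = 8             # user, system, idle, iowait, ...

system_level = hosts * cpu_states                      # system-level CPU metric
per_process = hosts * processes_per_host * cpu_states  # unified, pid attribute
print(system_level, per_process)  # 800 vs 240000 time series
```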
I disagree on this point. They are both reported from inside the system, but some are about the entire system itself and some are about each individual process. They are describing distinct entities.
Might be misunderstanding this one, but there are a few process-specific metrics that do not apply to runtimes. I also think it's untenable to create semantic conventions for every possible runtime; there should be a generic system-level option. There's lots of precedent for monitoring processes directly from the system, as it can be a good fallback.
Is this to say that these metrics would all be reported under the …
I would say this is actually a very important decision that we should expect users to make.
I think there is a third boundary there.
And I think the current semantic conventions map to these boundaries pretty directly. I think any direction forward should absolutely keep …
These could be moved into either their own shared namespace or individual namespaces, and then have different meanings when reported under a …
It's not documented in the semantic conventions - at least I don't see any mention of it on the …

We should be able to document the attributes applicable to each metric regardless of the unification. By documenting specific attributes we'd also make the cardinality of the metric clear. So, if we explicitly document the attributes we expect to have on these metrics, we could also explain that it does not matter how these attributes are populated (as resource attributes or as regular metric attributes). With this, the attachment to a resource no longer applies. E.g. … - now the system-level metric is the same metric, just without the pid.
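A sketch of that point using the Python SDK (the wiring is illustrative; the names follow the existing `process.*` conventions): the same `process.pid` can arrive either via the resource or on each measurement.

```python
import os
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.resources import Resource

# Option 1: pid attached once, as a resource attribute.
provider = MeterProvider(resource=Resource.create({"process.pid": os.getpid()}))
meter = provider.get_meter("attributes-sketch")
cpu_time = meter.create_counter("process.cpu.time", unit="s")
cpu_time.add(0.25)  # pid is implied by the resource

# Option 2: pid attached explicitly on the measurement.
cpu_time.add(0.25, {"process.pid": os.getpid()})
# Either way, the documented attribute set - and thus the cardinality - is
# the same; only where the attribute is populated differs.
```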
I don't understand the boundary between runtime and process from a semantic conventions standpoint. E.g. if I'm designing .NET runtime metrics, should I report … or maybe both, so that someone could build cross-language dashboards and alerts?
The current path to success seems to look like: …
To decide what you need, you have to …
I agree that some of this is inevitable, but as a user I would not like the lack of clarity and the absence of a simple default experience I can start with.
It definitely should be. The intention is for all metrics in the process metrics document to be reported under this process resource. I can make that change, assuming there is a way to do that with the semconv tooling.
In my eyes they are completely different, but given what we have actually written today, I can see it's not very clear. The resource attributes and metrics in the …

This much isn't clear from the current docs generated from the semconv YAML; I don't know if it used to be with the handwritten docs. Is there a way to make this clearer using tooling in a way we aren't currently, or should I write something manually somewhere to make it clearer?
I think with the above clarifications, which are currently missing from the semconv docs, the experience is much more straightforward:
I don't see how container and runtime metrics are intertwined with these decisions. They seem separate. If the user is using particular runtimes or containers, then they should use special instrumentation for those, but the instrumentation for …
You can just list the attributes that should be reported on a metric. There is no way to say that a metric should be reported under a specific resource, and it would not be precise enough anyway. I.e. if someone specified …

To build a dashboard we'd need at least …

There is no separation between resource and regular attributes in the semantic conventions. Also, if someone wants to report the metric and add attributes explicitly on each measurement instead of using resources, this would be totally fine. I think having those specified would be a great improvement.
I think this should also be mentioned in semconv - that OTel language SIGs are not expected to have process/system instrumentation libraries. But we have plenty of them already:
As instrumentation libraries, they leave it up to the user to configure resources. Tagging @open-telemetry/javascript-maintainers @open-telemetry/dotnet-contrib-maintainers @open-telemetry/…
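For what "leaving resource configuration to the user" looks like in practice, recent versions of the Python SDK ship a detector the user can opt into; other SDKs have equivalents. A sketch:

```python
# User-side resource configuration: the instrumentation library reports the
# metrics, and the user decides whether process.* attributes are attached.
from opentelemetry.sdk.resources import ProcessResourceDetector, get_aggregated_resources

resource = get_aggregated_resources([ProcessResourceDetector()])
print(resource.attributes)  # process.runtime.*, plus process.pid in newer SDKs
```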
What I'm offering seems similar:
So you start from a safe (hopefully documented) default and you add details. The process vs runtime split still concerns me - we're doing cpu/memory duplication by design and forcing users to build different alerts/dashboards for each runtime whether they care about the differences or not. I'd prefer the default to be: …
They have a certain level of duplication (cpu, memory); the key difference is where you observe these metrics from. As a user I might be able to record both, and effectively I'd need to pick one or the other to build my alerts/dashboards/etc. on.
Thanks for this, I was definitely incorrect when I said: …

I guess this probably works out most of the time, because the metrics are reported under whatever resource is instrumented, so they are probably typically reported under some manner of application resource that makes it obvious what those metrics are for, even though they aren't specifically under a …

There is still a difference between these …

So I think they are different, but there probably is still a way for there to just be a …
In this scenario, the meaning of …
Given that these namespaces probably need to continue to exist, since each has certain metrics that won't be shared, it is probably easier in the long run to keep duplicate-named metrics in each namespace: in some scenarios they do mean something quite different depending on the context a particular metric point is reported for.
That's kind of disappointing, actually. I think I understand why, but it is too bad for the …

I notice the instrumentation examples you provided don't add any identifying attribute like …
The example I gave above on the difference between memory usage reported by the Go runtime vs by the OS for the process is one counterexample in favor of keeping these things separate. Duplication in names doesn't always imply duplication of the exact same value. Sometimes it does; on Linux, a container runtime reading metrics from …

Unfortunately I don't have enough expertise in all the runtimes and their metrics to say whether there are more counterexamples. If this memory-usage counterexample is the only one, or if there are very few, then maybe the unification would be fine and we could deal with the prickly differences one by one.

For what it's worth, we discussed this in the System Semantic Conventions meeting today. We generally agreed it is still worth keeping the metrics in namespaces …
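A Python analogue of that Go counterexample (illustrative only; `tracemalloc` stands in for the runtime's own accounting, psutil for the OS view of the process):

```python
import tracemalloc
import psutil

tracemalloc.start()
data = [bytes(1024) for _ in range(100_000)]  # ~100 MiB of Python allocations

runtime_bytes, _ = tracemalloc.get_traced_memory()  # what the runtime tracks
os_rss = psutil.Process().memory_info().rss         # what the OS observes

print(f"runtime-tracked: {runtime_bytes / 2**20:.1f} MiB")
print(f"OS-reported RSS: {os_rss / 2**20:.1f} MiB")
# RSS is larger and would not shrink immediately even if `data` were freed:
# same-named memory metrics, genuinely different measurements.
```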
I'd welcome additional feedback from the other @open-telemetry/semconv-system-approvers folks.
In this case, I think the namespace is key to easily identifying similar metrics that have been computed differently because of their source. Some signals even have the same suffix (e.g. …
@open-telemetry/semconv-system-approvers is there any conclusion on this that would result in changing the existing model? Otherwise we can close this if there is no majority in favor of these changes.
We have multiple layers of metrics:

- `process`, which reports OS-level metrics per process as observed by the OS itself
- `system`, metrics that report OS metrics from the OS perspective
- `container`, metrics that are reported by the container runtime about the container

Plus we have attributes in all of these namespaces that have something in common:

- `*.cpu.state` metric attributes (#840)

Problems:

- `system.linux.memory.available` (add system.linux.memory.slab to system metrics #1078): it's not clear if we'd expect to have OS-specific metrics in each of the namespaces (`container.linux.memory.*`, `system.linux.*`, `process.linux.memory.*`) - see add system.linux.memory.slab to system metrics #1078 (comment)
- Is `system.cpu.time` a sum of all `process.cpu.time` on the same machine?
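On that last question, a quick illustrative experiment with psutil (not from the issue) shows why the relationship needs to be spelled out: the two numbers rarely match, because of kernel threads, already-exited processes, and access-denied processes.

```python
import psutil

# Machine-wide busy CPU seconds, everything except idle.
system = psutil.cpu_times()
system_busy = sum(t for name, t in system._asdict().items() if name != "idle")

# Sum of per-process CPU seconds for processes still alive and readable.
process_total = 0.0
for proc in psutil.process_iter(["cpu_times"]):
    times = proc.info["cpu_times"]
    if times is not None:
        process_total += times.user + times.system

print(f"system busy CPU seconds:    {system_busy:.0f}")
print(f"sum of process CPU seconds: {process_total:.0f}")
```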