-
Notifications
You must be signed in to change notification settings - Fork 889
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add process.uptime
and system.uptime
metrics to semantic conventions
#2824
Add process.uptime
and system.uptime
metrics to semantic conventions
#2824
Conversation
60189f8
to
d09d87a
Compare
| `process.threads` | UpDownCounter | {threads} | Process threads count. | | | ||
| `process.open_file_descriptors` | UpDownCounter | {count} | Number of file descriptors in use by the process. | | | ||
| `process.context_switches` | Counter | {count} | Number of times the process has been context switched. | `type` SHOULD be one of: `involuntary`, `voluntary` | | ||
| `process.uptime` | Counter | s | Number of seconds that the process has been running. | | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a counter or gauge?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it should be a gauge, the value represents the uptime of the system at the given time of recording.
Any further aggregation or calculations hold no additional statistical meaning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, gauge is more appropriate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should be debating whether this is a Counter or an UpDownCounter. I will put my rationale in the main thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we agree on "what is uptime"? I suspect we are not on the same page 🤣
https://en.wikipedia.org/wiki/Uptime#Using_uptime
If let's say have a Chrome browser running 10 tabs (which might give us 11 processes), do we expect the uptime to be added across the 11 processes, and what does that mean?
If the browser is closed, then reopened, what does the uptime mean?
The Linux uptime command seems to be focusing on "how long has it been since the operating system started" https://en.wikipedia.org/wiki/Uptime#Linux, and the uptime would reset if the system restarted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we take the semantic here https://en.wikipedia.org/wiki/Uptime#Records
A Cisco router has been reported to have been running continuously for 21 years.
then making it a counter sounds like wrong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few minor comments. Thanks for raising this PR!
| `process.threads` | UpDownCounter | {threads} | Process threads count. | | | ||
| `process.open_file_descriptors` | UpDownCounter | {count} | Number of file descriptors in use by the process. | | | ||
| `process.context_switches` | Counter | {count} | Number of times the process has been context switched. | `type` SHOULD be one of: `involuntary`, `voluntary` | | ||
| `process.uptime` | Counter | s | Number of seconds that the process has been running. | | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, gauge is more appropriate
@@ -29,6 +30,14 @@ instruments not explicitly defined in the specification. | |||
|
|||
## Metric Instruments | |||
|
|||
### `system.` - General system metrics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A question I asked myself - would it make sense to have a namespace for system metadata like this? Perhaps system.info.*
. It would mean you could group other system information together like system.info.boottime
, system.info.uptime
and any others. Personally I'm not sure, but it's something to think about if we are looking to add other system metadata to the semconv.
Counter is appropriate because the value is monotonic. We are explicitly interested in detecting resets via this metric, which suggests that UpDownCounter is appropriate. Note the expression As for statistics, the sum (thus, the rate) can meaningfully be aggregated. Consider 10 processes running at a point in time--the This works for spatial aggregation: the This also works for temporal aggregation: if a process has been started and stopped and restarted over a period of time, we can divide Note that I'm writing |
@jmacd you put a good case forward for it being an UpDownCounter, and after reading your points I think I agree. It seems that an Asynchronous UpDownCounter would result in a meaningful aggregations for these metrics, as per your examples. Note that it must be asynchronous because the absolute value is reported, and the API doesn't allow synchronous counters to report absolute values. These implementation details are not referenced in the semantic convention but it's worthwhile mentioning here. |
d09d87a
to
facafef
Compare
I have a few complaints, mostly based on naming.
|
78e91f5
to
ad4ce5e
Compare
I see your point. I agree wrt. to the Would your preference Josh be to name the system metric |
I think there are three things we should have clear distinction. Imagine a server started at 10AM, running till 11AM then put to sleep/hibernate, woke up at 1PM and run till it was shutdown at 2PM. Then the server started again at 3PM and run till now (4PM). I guess the |
Thanks Reiley for putting it out clearly. Yes, you could say these are three different pieces of information. Where this pull request started is this proposal in contrib repo: open-telemetry/opentelemetry-collector-contrib#14130. It talks specifically about |
@astencel-sumo thanks! I think we should clarify it in the spec to avoid confusion and misinterpretation. Wikipedia seems to have different explanation, e.g. https://en.wikipedia.org/wiki/Uptime#Determining_system_uptime If it is 5 hours, it looks like a good fit for Counter; if it is 1 hour, it looks like a good fit for Gauge. |
ad4ce5e
to
c4bfb93
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The semantic and the wording "has been running" are murky. I think we should be crystal clear about #2824 (comment).
This PR was marked stale due to lack of activity. It will be closed in 7 days. |
c4bfb93
to
97994e9
Compare
I don't currently have an answer on how to resolve this. Given that this PR does not block my work and that I have other more important PRs that I want to focus on, I'm going to close this PR. For anybody interested in continuing this work, feel free to use this work in any way. |
Fixes open-telemetry/semantic-conventions#648
Changes
Adds new metrics:
process.uptime
andsystem.uptime
to the semantic conventions.Related issue: open-telemetry/opentelemetry-collector-contrib#14130