Cumulative data points should set StartTimeUnixNano per timeseries #4184
This is important for compatibility between Prometheus and OpenTelemetry going forward. @ArthurSens it might be helpful for folks here to describe what issues this causes when pushing OTLP to Prometheus today.
See also #2232
Is this really a specification issue? I read the spec as saying that implementations should set StartTime to match when the unbroken series of data points started. To me this is an implementation issue.
@bboreham I believe it is a spec issue (my initial analysis of the spec is here). The spec says:
Which just means that the start time is repeated, not that it matches when the series of data points started.
I am referring to this text in the spec:
EDIT: After discussion with @jesusvazquez, I wish to point to this part:
I agree the spec could be more explicit, but using a time hours or days earlier as the "known start time" seems perverse to me.
Current behavior
As mentioned by the Metrics Data Model, the goal of StartTimeUnixNano is to help measure the increase rate of unbroken data points and to identify resets and gaps.
The current implementation of most OTel SDKs sets the value of StartTimeUnixNano to the timestamp representing the start of the instrumented process. This works pretty well when all known data points start being pushed/exposed right at the beginning of the process lifetime, but not so well if previously unknown data points appear some time after the process starts.
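To make the current behavior concrete, here is a minimal sketch (hypothetical code, not an actual SDK API) of how a cumulative data point ends up carrying the process start time for every series:

```go
package main

import (
	"fmt"
	"time"
)

// processStart is captured once, when the process starts; today most OTel
// SDKs reuse this single timestamp for every cumulative series.
var processStart = time.Now()

// numberDataPoint mirrors the relevant fields of an OTLP cumulative point.
type numberDataPoint struct {
	startTimeUnixNano int64
	timeUnixNano      int64
	value             float64
}

// newCumulativePoint builds a point whose start time is always the process
// start, even if this series first appeared long after the process began.
func newCumulativePoint(value float64) numberDataPoint {
	return numberDataPoint{
		startTimeUnixNano: processStart.UnixNano(),
		timeUnixNano:      time.Now().UnixNano(),
		value:             value,
	}
}

func main() {
	p := newCumulativePoint(30) // e.g. 30 requests with status code 404
	fmt.Printf("start=%d now=%d value=%v\n", p.startTimeUnixNano, p.timeUnixNano, p.value)
}
```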
Problem statement
Let me try to explain with an example:
Let's say an application starts at a time T = 0, some HTTP requests are happening, and they are all successful (i.e. status code 200). After 30s (T+30), the first request with status code 404 happens, and it continues to happen every 1s (therefore 1 request per second is the increase rate).
The increase rate can be measured with a formula like the one worked out below. 1 minute after T, HTTP requests with status code 404 would have happened 30 times, but let's see how different the measurement would be if we use T or T+30 as the start time:
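A minimal sketch of the computation, assuming the standard rate formula for a cumulative counter that starts at zero:

rate = value / (TimeUnixNano - StartTimeUnixNano)

Using T as the start time: rate = 30 / 60s = 0.5 requests/second, half the real rate.
Using T+30 as the start time: rate = 30 / 30s = 1 request/second, the correct rate.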
As mentioned in Current behavior, the measurement works well when the series is initialized alongside the process, but not so well when the series is initialized after.
Requested change
The requested change is that StartTimeUnixNano is set separately per time series.
But of course, this comes with performance drawbacks. For that to be possible, SDKs will need to store in memory some sort of time series ID + StartTimeUnixNano mapping for all time series ever created (see the sketch below). This can be a huge drawback for processes wishing to expose high-cardinality metrics.
I believe the appropriate approach is to allow users to configure the SDK, where they can opt in to either the current behavior or the behavior I'm suggesting here.
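As a rough illustration, here is a minimal sketch (hypothetical, not a proposal for an actual SDK API) of the per-series bookkeeping this would require; the unbounded map is exactly the memory cost mentioned above:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// startTimeTracker records the first time each series is observed and reuses
// that timestamp as the series' StartTimeUnixNano from then on.
type startTimeTracker struct {
	mu     sync.Mutex
	starts map[string]int64 // series ID -> StartTimeUnixNano; never evicted
}

func newStartTimeTracker() *startTimeTracker {
	return &startTimeTracker{starts: make(map[string]int64)}
}

// startTimeFor returns the recorded start time for seriesID, creating an
// entry on first sight. Memory grows with the number of series ever created,
// which is the high-cardinality drawback described above.
func (t *startTimeTracker) startTimeFor(seriesID string) int64 {
	t.mu.Lock()
	defer t.mu.Unlock()
	if ts, ok := t.starts[seriesID]; ok {
		return ts
	}
	ts := time.Now().UnixNano()
	t.starts[seriesID] = ts
	return ts
}

func main() {
	t := newStartTimeTracker()
	// The 404 series first appears at T+30; its start time is recorded then,
	// not at process start.
	fmt.Println(t.startTimeFor(`http_requests_total{code="404"}`))
}
```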
Additional context
The discussion comes from an issue in OpenTelemetry-Collector-Contrib, where we're implementing support for the Prometheus/OpenMetrics created timestamp: open-telemetry/opentelemetry-collector-contrib#32521