-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[exporterhelper] fix missed metric aggregations #9048
[exporterhelper] fix missed metric aggregations #9048
Conversation
This PR ensures that context cancellation in the exporter doesn't interfere with metric aggregation. The OTel SDK currently returns if there's an error in the context used in `Add` (see https://github.com/open-telemetry/opentelemetry-go/blob/6cee2b4a4c76b581115d0d0ca150ad8b2e683db6/sdk/metric/instrument.go#L241-L243). This means that if there's a cancelled context in an export, the metrics are not recorded. Signed-off-by: Alex Boten <aboten@lightstep.com>
…g this when 1.22 is out Signed-off-by: Alex Boten <aboten@lightstep.com>
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #9048 +/- ##
=======================================
Coverage 91.57% 91.57%
=======================================
Files 316 316
Lines 17147 17147
=======================================
Hits 15702 15702
Misses 1150 1150
Partials 295 295 ☔ View full report in Codecov by Sentry. |
@@ -154,7 +154,7 @@ func (or *ObsReport) StartTracesOp(ctx context.Context) context.Context { | |||
// EndTracesOp completes the export operation that was started with StartTracesOp. | |||
func (or *ObsReport) EndTracesOp(ctx context.Context, numSpans int, err error) { | |||
numSent, numFailedToSend := toNumItems(numSpans, err) | |||
or.recordMetrics(ctx, component.DataTypeTraces, numSent, numFailedToSend) | |||
or.recordMetrics(context.Background(), component.DataTypeTraces, numSent, numFailedToSend) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it true that we'd never need the original context inside recordMetrics
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
go 1.21 has a WithoutCancel func I would suggest we move to in the future (whenever 1.22 is out). Thischange removes any context (including baggage), which likely you'd want to keep
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we go forward with this can we add a TODO or an issue to track that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah i was planning on filing an issue to track this change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added an issue #9049
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest use of context.TODO()
here. I understand this solves a short-term problem, but I'd like to ask a bigger question about why we would bypass metric events on canceled contexts. It seems like a recipe for lost time and user confusion, as happened here. If we could convince the Otel-Go SDK to modify this behavior, perhaps restore the context.TODO()
to the original context.
As you suggested, an alternate SDK with an extension for interpreting context will lose access to baggage, something we could fix in the long run. We have a few internal uses where exporterhelper metrics such as these are being extended from context with header values from the gRPC metadata. (Using this.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmacd i followed @fatsheep9146's suggestion and implemented withoutCancel temporarily in this code
I also have met this problem when I implemented the grpc instrumentation library, here is the comment open-telemetry/opentelemetry-go-contrib#4356 (comment) and @pellared from sig go sdk have created an issue talked about this behavior in go sdk sig meeting open-telemetry/opentelemetry-go#4671 The conclusion is go sdk will keep the behavior. And suggest user to use |
Signed-off-by: Alex Boten <aboten@lightstep.com>
thanks @fatsheep9146, copied the implementation of withoutCancel in this PR |
Signed-off-by: Alex Boten <aboten@lightstep.com>
I am thinking about "reviving" the issue. Especially in the context of Logs API+SDK where I do not want the same behavior for Logs SDK. I want to be inline with e.g. https://pkg.go.dev/log/slog#Handler which says:
Changing the metrics SDK behavior would be a breaking change, but I think it would help 90% of the use cases and I doubt it will cause any harm to anyone. Even if so I think it may be an acceptable breaking behavioral change. Please let me know what you think. I think that OTel Go "consumers" point of view is very important here. Because of the current behavior we also have to deal with issues/PRs like this one: open-telemetry/opentelemetry-go-contrib#4634. EDIT: I recreated open-telemetry/opentelemetry-go#4671. Feel free to review and leave comments. I also added the PR to the agenda for the next Go SIG meeting. |
Personally, I can't imagine a user who expects to skip metric events because of context cancellation. Has the group heard any user stories about that? Mainly, I agree with the comment quoted above:
The way I think about it, context cancellation is a mechanism to stop blocking, avoid waiting, or skip useless work. So, I expect a user to inspect the context and avoid instrumented code, not to avoid instrumentation events. It is true that the aggregator could acquire a mutex, and that could briefly block the application. However, so does a call to In my implementation, the context is last used and processed in the synchronous instrument path, before an aggregator sees the event. If the SDK wants to inspect the baggage or other contextual information--to me--it probably wants to be done at a higher level, so before the aggregators, so that context-related behavior is aggregator-independent. |
This PR ensures that context cancellation in the exporter doesn't interfere with metric aggregation. The OTel SDK currently returns if there's an error in the context used in `Add` (see https://github.com/open-telemetry/opentelemetry-go/blob/6cee2b4a4c76b581115d0d0ca150ad8b2e683db6/sdk/metric/instrument.go#L241-L243). This means that if there's a cancelled context in an export, the metrics are not recorded. --------- Signed-off-by: Alex Boten <aboten@lightstep.com>
This PR ensures that context cancellation in the exporter doesn't interfere with metric aggregation. The OTel SDK currently returns if there's an error in the context used in
Add
(see https://github.com/open-telemetry/opentelemetry-go/blob/6cee2b4a4c76b581115d0d0ca150ad8b2e683db6/sdk/metric/instrument.go#L241-L243). This means that if there's a cancelled context in an export, the metrics are not recorded.