Use Context to stop tracing #530
OpenTelemetry Ruby does this and I really like the idea.
See the Python solution, which I think is rather elegant: open-telemetry/opentelemetry-python#181 (updated in open-telemetry/opentelemetry-python#395 for OTEP 66). You introduce a conventional context variable "suppress_instrumentation" that all instrumentations check and that the span processors set to true while exporting.
Why do the instrumentations check? Would you not just have the tracer check the context and return a non-reporting span?
That would also work, but personally I prefer the more explicit, no-magic approach here. Also, the instrumentation overhead can be further reduced if the instrumentation checks manually and shortcuts, instead of doing all the usual instrumentation work against a no-op Span. Checking Span.IsRecording would probably work in most cases though.
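For illustration, a minimal Python sketch of that shortcut using the opentelemetry-api is_recording() check (the handler function itself is hypothetical):

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def handle_request(method: str, url: str) -> str:
    with tracer.start_as_current_span("handle_request") as span:
        # Only pay for attribute collection when the span is actually recording;
        # when no SDK is configured, this branch is skipped entirely.
        if span.is_recording():
            span.set_attribute("http.method", method)
            span.set_attribute("http.url", url)
        return "handled"

handle_request("GET", "https://example.com/")
```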
I think that is the primary use for the IsRecording check.
I agree with @dyladan. (A) The tracer is already managing context (activeSpan, withActiveSpan). (B) Fewer places is better.
@pauldraper The tracer will stop "managing" context, but it will still need to access the context when creating a new span; see #527.
Hi, another opentelemetry-python contributor chiming in, thanks @Oberon00!
I think having the tracer check the flag is a good idea. I don't think it's too much magic, and it comes with the benefit of not requiring instrumentation authors to implement this pattern. We can control this behavior at the SDK level, which enables stronger guarantees of behavioral consistency.
It doesn't look like that's the intention of IsRecording, at least from what I see in the spec:
I haven't seen examples of these expensive computations yet, but it seems unrelated to whether the span itself should be exported. I would argue for a separate context variable (suppressInstrumentation, to borrow our current Python convention). The additional pro/con of a suppressInstrumentation flag separate from the Span itself is that both metrics and spans can share that value. But maybe it would be better if we had those as two separate flags, since most likely you will only want to suppress one or the other, typically in the exporter. So maybe two separate flags, along these lines:
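A rough sketch of what two separate flags could look like in Python; the key names here are illustrative only and not defined by any spec:

```python
from opentelemetry import context

# Hypothetical, illustrative context keys (not part of the specification):
SUPPRESS_TRACE_INSTRUMENTATION = "suppress_trace_instrumentation"
SUPPRESS_METRIC_INSTRUMENTATION = "suppress_metric_instrumentation"

def tracing_suppressed() -> bool:
    # Instrumentations (or the tracer itself) consult this before creating spans.
    return bool(context.get_value(SUPPRESS_TRACE_INSTRUMENTATION))

def metrics_suppressed() -> bool:
    # Metric instrumentations consult this before recording measurements.
    return bool(context.get_value(SUPPRESS_METRIC_INSTRUMENTATION))
```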
You have to look at everything the instrumentation does for each span. E.g. the sum of all the things that the Python WSGI instrumentation does per request is relatively expensive.
@toumorokoshi I was saying that is the primary use of IsRecording. In my opinion the "correct" implementation on the instrumentation side is to create spans as usual and leave the check to the SDK.
This avoids the need for every instrumentation to check this flag. Because there will be many instrumentations, we cannot guarantee that none of them will forget. The flag is set on the context by span processors before they call exporters, or optionally in the exporter; I have no strong opinion on this difference. If an instrumentation wants to disable the instrumentation of lower-level protocols, it may also set the flag: e.g. a database driver which uses HTTP to talk to the database may want to disable the HTTP instrumentation in favor of instrumenting the protocol it has built on top of it. If starting the span itself requires one of these expensive operations, then nothing prevents the instrumentation from also checking the context flag.
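A sketch of the tracer-side check being described here; the key name, the helper function, and the use of NonRecordingSpan are illustrative assumptions, not actual SDK code:

```python
from opentelemetry import context, trace

SUPPRESS_INSTRUMENTATION = "suppress_instrumentation"  # illustrative key name

def start_span_with_suppression(tracer: trace.Tracer, name: str, ctx=None):
    """Sketch of what an SDK tracer could do internally before creating a span."""
    ctx = ctx if ctx is not None else context.get_current()
    if context.get_value(SUPPRESS_INSTRUMENTATION, context=ctx):
        # Keep the parent's trace identity so propagation still works, but never
        # hand this span to the span processors / exporters.
        parent = trace.get_current_span(ctx)
        return trace.NonRecordingSpan(parent.get_span_context())
    return tracer.start_span(name, context=ctx)
```

The span processor (or exporter) would set the flag on the context around its export call, so any instrumented client it uses hits the suppressed branch above.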
I think I understand. You're referring to adding a check, per integration, to see if the IsRecording flag is set. If it's false, then we avoid adding attributes altogether, which would skip a non-trivial amount of code that collects and adds attributes. I think if the spec called out explicitly that the resulting span would actually be missing attributes that would have been added in an IsRecording scenario, that would have been a little clearer to me. Thank you for clarifying.
Makes sense. As a clarification, is it just a non-recording span, or is it a non-recording, no-op span that ensures the span will never reach the exporter in the first place? Although it makes sense to me intuitively, I don't see a call-out in the specification for a no-op or default span object that would not emit spans to span processors. So maybe that's something that should be added as well.
yes
A non-recording span doesn't record any attributes at all
A no-op span is one step further than a non-recording span. A non-recording span still propagates trace context correctly so that traces aren't broken; it just isn't recorded or sent to the backend. A no-op span can't be propagated, because it has no trace context or trace state. In this instance, a no-op span would be equally effective and possibly an even better choice.
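To make that distinction concrete, a small sketch using the opentelemetry-api types (the IDs below are made up):

```python
from opentelemetry import trace
from opentelemetry.trace import NonRecordingSpan, SpanContext, TraceFlags

# A non-recording span keeps a real trace identity, so child spans and
# propagators still see valid trace/span IDs; it is just never exported.
parent_ctx = SpanContext(
    trace_id=0x1AD96E8F7C3F4BF39F0D8E2A6B5C4D3E,  # made-up IDs
    span_id=0x1F2E3D4C5B6A7980,
    is_remote=False,
    trace_flags=TraceFlags(TraceFlags.DEFAULT),
)
non_recording = NonRecordingSpan(parent_ctx)
assert not non_recording.is_recording()
assert non_recording.get_span_context().is_valid

# The no-op INVALID_SPAN has no usable trace identity at all.
assert not trace.INVALID_SPAN.get_span_context().is_valid
```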
@toumorokoshi Your link is actually perfect to illustrate why we could additionally use the "suppress_instrumentation" context variable: in https://github.com/open-telemetry/opentelemetry-python/blob/v0.6.0/ext/opentelemetry-ext-wsgi/src/opentelemetry/ext/wsgi/__init__.py#L121, attributes are collected to be passed to start_span (so the sampler has more info and knows e.g. the URL), so the is_recording property could only be queried afterwards.
Ah! That's a very good point. I guess between the two, I would probably opt to just use the context variable, which could then be wrapped in a utility method on the trace module itself (a rough sketch of such a helper follows).
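A sketch of what such a wrapper might look like; the helper name and key are hypothetical, not an actual opentelemetry-python API:

```python
from contextlib import contextmanager
from opentelemetry import context

_SUPPRESS_KEY = "suppress_instrumentation"  # hypothetical key name

@contextmanager
def suppress_instrumentation():
    """Mark the current context so instrumentations (or the tracer) skip tracing."""
    token = context.attach(context.set_value(_SUPPRESS_KEY, True))
    try:
        yield
    finally:
        context.detach(token)

# Example: an exporter wrapping its outgoing HTTP call so that the HTTP
# instrumentation underneath it does not create new spans:
#
#   with suppress_instrumentation():
#       session.post(collector_endpoint, data=payload)
```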
@pauldraper will a PR be filed on this? I like this approach and would love to see it in the spec so I can implement it.
@trask I wonder, how does auto-instr-java deal with that? It has a gRPC instrumentation; would that not currently cause a feedback loop with the OTLP exporter?
In auto-instr-java, we load the exporter (and its dependencies) inside a separate class loader. This is primarily done to avoid version conflicts, but it has the nice side effect that we can skip instrumentation of those classes (by looking at what classloader they're in) and avoid the feedback loop.
One interesting caveat to the above: open-telemetry/opentelemetry-java-instrumentation#375 (comment)
For a reference point, I've run into 3) myself. I find spans from e.g. the netty instrumentation noisy when they're being wrapped by a client library, for example the AWS SDK. Others may not, though, meaning some configurability would be nice. But it creates the problem that
Being able to pause new spans for a stack frame would help with this. |
@dyladan was there a decision on the right way to move forward with suppressing instrumentations? I see the proposed solution of making a key available through the API was closed.
This new instrumentationDisabled flag should also be propagated from process to process. Servers/Consumers could use it to automatically disable instrumentation for their spans. My use case: multiple services use a common Redis library with added instrumentation. Service A periodically sends KeepAlive messages to other services, and I want to disable tracing for them on both the Publisher and Subscriber sides. On the Publisher side this is quite easy; I can tell the Redis library to skip instrumentation. On the Subscriber side this is not so easy, however: the Redis library would have to examine messages to check whether they are KeepAlives or not. One potential solution is to not send TraceParent/State from the Publisher for KeepAlives, and create a new span in the Subscriber only if they are received. However, this has the downside that spans are not created for messages sent from services which are not instrumented yet. Because of this, it is better to explicitly send an instrumentationDisabled flag.
Nope. The PR was closed because there wasn't agreement and as far as I know nobody ever followed up. I stopped working on it because I have to admit I was feeling a little defeated.
This introduces serious trust concerns. Without somehow signing the request to ensure that it is trusted, this would be ripe for abuse. I would argue that paying the cost of inspecting the message to see that it is a KeepAlive on the server side is a much simpler solution and less prone to errors and abuse.
@sirzooro If you need propagation of the flag, you are probably looking for the existing sampled flag, not this feature.
Thanks, this is exactly what I was looking for.
The need for a mechanism like this was brought up by @tsloughter in the spec call again today. At this point most if not all SIGs have some mechanism like the context solution that my PR proposed. At least JS, Ruby, Python, and .NET have all implemented something similar. I'm not sure which of those implemented it in the API vs the SDK, but there are definitely instrumentations in the wild using it in multiple languages. In JS it was implemented as a context key which the SDK recognizes and provides a helper function to set and unset. IMO this is a poor situation because it tightly couples the instrumentation with the SDK. I would prefer to implement it in the API but did not want to do that without specification. The proposal was a simple key which suppressed tracing by indicating that a non-recording span should be returned. This prevents exporter loops and allows higher-level module instrumentations like axios to suppress lower-level ones like http. Other signals did not need to be suppressed because they don't have these problems (incrementing a metric on export does not cause another export like in tracing, and axios/http would have separate metric streams). I would like to propose that we move forward with that original proposal. It solves the problems which need to be solved, is already implemented in many places, and does not preclude a more advanced per-signal, or any other, mechanism from being specified in the future. I think this is one place where we have let the desire for perfection get in the way of pragmatism.
@dyladan by "original proposal" do you mean an existing PR? I was going to make one today and planned it to be a function like
@tsloughter I meant this: #1653
Catching up a bit on the old conversation from #1653, it looks like the original proposal was to suppress all signals with this feature. Personally, I think that's a good thing; it is what .NET's implementation currently does. It can be necessary to suppress other signals as well, since they can also generate undesirable loops: an HTTP client metric incremented on export will be exported on the next cycle, which would recur indefinitely. The same kind of thing applies to logs. Tangentially, it can complicate things for us if/when we want to generate telemetry about the SDK itself, but I think this is a different problem to solve.
Is there a clean way to have a feature flag which stops exporting traces to the backend? (I'm using GCP and OpenTelemetry with Python, and want to disable tracing while developing locally but enable it in production.)
@dyladan I have a need to suppress some noisy traces (that effectively poll a message queue) in our app. Can you elaborate on how this can be done in Ruby?
This is relevant for opentelemetry-cpp. The OTLP gRPC exporter uses gRPC, and the gRPC C++ library can have instrumentation that uses opentelemetry-cpp. cc @open-telemetry/cpp-maintainers
There are circumstances for instrumented actions where:
(1) has come up in #173
(2) has been a problem for opentelemetry-js (open-telemetry/opentelemetry-js#332) HTTP-based exporters, since the Node.js stdlib is instrumented globally. The solution in that case was adding a special HTTP header
x-opentelemetry-outgoing-request
that the http instrumentation ignores. In essence it uses HTTP headers as a poor man's context API. (This may not scale to other APIs.) It would be nice to cut off all automatic tracing "below this current scope" by setting a context key. The default tracer implementation would create no-op spans if the context disables spans.
EDIT: Case (1) may call for sampling, rather than full disabling.