API for minimizing no-tracer / non-sampled overhead #31

Closed
dkuebric opened this issue Jan 18, 2016 · 7 comments
Comments

@dkuebric
Contributor

As discussed in #27, a no-op tracer is a good default for the API, allowing instrumentation to sit in a project even when most of that project's users are not actively tracing.

However, if instrumentation paths will be hit in cases where users have no tracer, or are using a tracer that samples, there should be some way for instrumentation to avoid work that might currently be done "above the API"--i.e. pulling backtraces, formatting data structures, etc.

An example API might be along the lines of a bool isTracing() method. Thoughts?
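For illustration only, here is a minimal Go sketch of how such a bool guard might be used by instrumentation; the IsTracing() method and the Tracer/noopTracer types below are hypothetical, not part of the current API:

    package main

    import (
        "fmt"
        "runtime"
    )

    // Tracer stands in for whatever handle instrumentation holds; IsTracing is
    // the proposed guard, Log is a placeholder logging call.
    type Tracer interface {
        IsTracing() bool
        Log(key string, value interface{})
    }

    // noopTracer models the default no-op tracer: the guard reports false, so
    // instrumentation skips the expensive work entirely.
    type noopTracer struct{}

    func (noopTracer) IsTracing() bool                   { return false }
    func (noopTracer) Log(key string, value interface{}) {}

    func handleRequest(t Tracer) {
        // ... real request work ...

        // Only pay for the backtrace when a real, sampled tracer is attached.
        if t.IsTracing() {
            buf := make([]byte, 4096)
            n := runtime.Stack(buf, false)
            t.Log("stack", string(buf[:n]))
        }
        fmt.Println("request handled")
    }

    func main() {
        handleRequest(noopTracer{})
    }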

@bhs
Contributor

bhs commented Jan 18, 2016

@dankosaur this is an important and rich topic, thanks for bringing it up.

As food for thought, I'll link to this thing (which I have mixed feelings about for other reasons, but I'll try to stay on topic!): https://godoc.org/golang.org/x/net/trace#Trace

In particular, there are methods like these:

    // LazyLog adds x to the event log. It will be evaluated each time the
    // /debug/requests page is rendered. Any memory referenced by x will be
    // pinned until the trace is finished and later discarded.
    LazyLog(x fmt.Stringer, sensitive bool)

The function signature is not important here, but more the idea that the expensive work (in this case, whatever the fmt.Stringer has to do) is deferred until it's needed. Sometimes that is not feasible (e.g., the stack trace example), but I try to remember that sampling need not be a decision made before-the-fact; i.e., isTracing() may be difficult to define with certainty in a system that makes sampling decisions after a span has finished or nearly-finished. (Not to say we couldn't have an isNoop() method that would just always return false for Tracers that do sampling after the fact)
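As a rough sketch of the deferred-evaluation idea in Go (not the x/net/trace implementation; the lazyLog helper and sampled flag are invented for illustration), the expensive formatting can be wrapped in a fmt.Stringer and evaluated only once the trace is known to be kept:

    package main

    import "fmt"

    // expensiveDump defers formatting a large structure; String() only runs if
    // the trace is actually rendered/kept.
    type expensiveDump struct {
        payload map[string]int
    }

    func (d expensiveDump) String() string {
        // Costly formatting happens lazily, at render time.
        return fmt.Sprintf("%#v", d.payload)
    }

    // lazyLog stands in for an API like LazyLog: it holds on to the Stringer
    // and evaluates it only when the trace turns out to be sampled.
    func lazyLog(sampled bool, x fmt.Stringer) {
        if sampled {
            fmt.Println(x.String()) // the deferred work happens here
        }
        // otherwise x is simply dropped (a real tracer might pin it until a
        // post-hoc sampling decision is made)
    }

    func main() {
        d := expensiveDump{payload: map[string]int{"bytes": 1024}}
        lazyLog(false, d) // no formatting cost for an unsampled trace
        lazyLog(true, d)  // formatting happens only now
    }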

@bogdandrutu

My vote would be for an always-on system. Imagine you have a stuck operation and you are trying to debug it. Agreed, we want to have levels:

  • Is distributed tracing sampled: yes/no
  • Verbosity level

But by default it should be always on, and the cost of adding an annotation (log) + tag should be minimal so as not to affect overall system performance.
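A minimal sketch of that idea, assuming hypothetical verbosity levels and a sampled flag (none of these names exist in the API): cheap annotations are always recorded, expensive ones are gated.

    package main

    import "fmt"

    // Illustrative annotation levels; the names and values are hypothetical.
    const (
        levelBasic   = 1 // cheap tags/logs, always recorded
        levelVerbose = 2 // expensive annotations, recorded only when enabled
    )

    type span struct {
        sampled   bool
        verbosity int
    }

    // annotate always records cheap annotations, but skips verbose ones unless
    // both the verbosity level and the sampling decision allow it.
    func (s *span) annotate(level int, msg string) {
        if level > s.verbosity || (level >= levelVerbose && !s.sampled) {
            return
        }
        fmt.Println("annotation:", msg)
    }

    func main() {
        s := &span{sampled: false, verbosity: levelBasic}
        s.annotate(levelBasic, "request started")     // recorded
        s.annotate(levelVerbose, "full request dump") // skipped: too verbose / not sampled
    }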

@yurishkuro
Member

I can see merits in both points of view. Lazy tagging/logging is definitely the way to go, but it does not completely remove the overhead. Do we know of any performance metrics from tracing systems that do post-trace sampling? I.e., are they feasible in really high-throughput services? If services above a certain qps do require pre-trace sampling, that would be an argument in favor of having a shortcut check like is_traced().

@bhs
Contributor

bhs commented Jan 18, 2016

@yurishkuro there are some truly high-throughput, low-latency systems at Google (e.g., Bigtable) that do deferred evaluation of logging data: i.e., it definitely can scale. Of course there's a trade-off, though, and in that case it was an API that involved (subtle) refcounting/pinning.

@bogdandrutu

In terms of scalability, I think it can scale well enough. If a tracing system costs less than 1% of the total, in my opinion that is acceptable, and even for a very low-latency system (say 100us per op) a 1% budget is usually more than 1us. So we can definitely do that, and we have some good examples which I cannot present right now but which will be presented soon :).

Lazy evaluation is definitely a good thing that needs to be implemented. Also, the cost is directly proportional to the number of logs (annotations), so I think for a low-latency system this can be managed by the owner of the service.

As for post-trace sampling, it is important only if you want to support something like Dapper and push or pull the data out of the task.

Also, something like an is_traced check is useful for very expensive annotations that are helpful but not critical for the system, as I mentioned, in case you want to debug something live.

@dkuebric
Contributor Author

A LazyLog-style API is definitely one way to address this concern, and to the extent that we want to obey the "when in Rome" directive, it may be more or less idiomatic in various languages. It also may provide some even more interesting post-hoc sampling possibilities in a truly sophisticated tracer-- @bensigelman is that what you are saying?

I would be fine with LazyX, though a general bool guard seems useful as a basic cross-platform primitive and is easy to reason about. The disadvantage of the general bool guard is that it may hamstring lazy collection of guarded data. A compromise might be a callback-based system, which would allow for arbitrary code execution and also deferred processing?

Wanted to clarify that I think this is valuable regardless of philosophy on tracing backends--we will need to prove OT instrumentation introduces minimal overhead in the no-op case, and this type of stuff helps ensure that is the case by providing easy safety tools for instrumentation authors.
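To make the callback-based compromise concrete, here is a hedged Go sketch (the spanRecorder type and its methods are invented for illustration): instrumentation hands the tracer a closure, and the closure runs only if the span is actually recorded.

    package main

    import (
        "fmt"
        "runtime"
    )

    // spanRecorder illustrates the callback-based idea: callers pass closures,
    // and a no-op or unsampled recorder never invokes them.
    type spanRecorder struct {
        recording bool
        deferred  []func() string
    }

    // LogEvent defers the expensive work to the callback.
    func (r *spanRecorder) LogEvent(f func() string) {
        if !r.recording {
            return
        }
        r.deferred = append(r.deferred, f)
    }

    // Flush evaluates the deferred callbacks once the trace is known to be kept.
    func (r *spanRecorder) Flush() {
        for _, f := range r.deferred {
            fmt.Println("event:", f())
        }
    }

    func main() {
        r := &spanRecorder{recording: true}
        r.LogEvent(func() string {
            // Expensive work (e.g. capturing a backtrace) runs only if needed.
            buf := make([]byte, 2048)
            n := runtime.Stack(buf, false)
            return string(buf[:n])
        })
        r.Flush()
    }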

@bhs
Contributor

bhs commented Nov 16, 2016

... moved to opentracing/specification#8

@bhs bhs closed this as completed Nov 16, 2016