Rename or rejig Sampler #92
Comments
Here's a related issue with the Ruby tracer, which was accidentally gathering state for spans that are tossed: openzipkin/zipkin-ruby#40 |
I wouldn't do anything about this for the 1.0.0 release. Let's move it for the next big release. |
Since a year has passed, this probably needs another look in case (my) understanding is better now. |
I actually have an idea: we could keep the Sampler name but think about moving the sampling decision as late as possible, actually to the moment when the span is closed. That way users could do adaptive sampling (which is impossible ATM?). WDYT @adriancole ? |
You can think of adaptive sampling as a combination of parts.
First, it is expressed like reservoir sampling, say accepting up to 1k
traces/second. That's the target rate. Your actual rate will be different
and needs to be corrected (adapted).
In the easiest case, you never hit 1k traces/second, so you never have to
correct the rate. How would you know?
First, you need to count sample requests over time, like sample requests
per minute. Then, you need to count how many sample requests you approve
over time. When the latter rate is higher than the target, you need to
adapt the sample rate below 100% (you might even start at 0%).
Commonly, you would "check" your rate at a higher frequency, like once a
second. If you are higher than your target, you use some strategy to lower
the sample rate; if you are lower, you raise it. This is the adaptive
part.
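The counting-and-checking loop described above can be sketched roughly as follows. This is a hypothetical illustration, not Sleuth or Brave API: the class name, the 1% probability floor, and the 1.1 raise factor are all assumptions.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: count sample requests and approvals, and once per
// "check" interval nudge the sample probability toward a target
// approvals-per-second rate.
public class AdaptiveSampler {
    private final long targetPerSecond;        // e.g. accept up to 1k traces/second
    private volatile double probability = 1.0; // start by keeping everything
    private final AtomicLong requested = new AtomicLong();
    private final AtomicLong approved = new AtomicLong();

    public AdaptiveSampler(long targetPerSecond) {
        this.targetPerSecond = targetPerSecond;
    }

    /** The before-the-fact decision, called once per root span. */
    public boolean isSampled() {
        requested.incrementAndGet();
        boolean keep = ThreadLocalRandom.current().nextDouble() < probability;
        if (keep) approved.incrementAndGet();
        return keep;
    }

    /** Called by a scheduler at the "check" frequency, e.g. once a second. */
    public void check() {
        long approvedThisInterval = approved.getAndSet(0);
        requested.getAndSet(0);
        if (approvedThisInterval > targetPerSecond) {
            // over budget: scale the probability down proportionally
            probability = probability * targetPerSecond / (double) approvedThisInterval;
        } else {
            // under budget: raise gently, with a small floor so we can recover from 0
            probability = Math.min(1.0, Math.max(0.01, probability * 1.1));
        }
    }

    public double probability() {
        return probability;
    }
}
```

A scheduled executor calling check() once a second would complete the loop; the strategy for raising and lowering the rate is exactly where real implementations differ.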
This isn't super-smart, because Zipkin doesn't have a capacity in terms of
traces: traces can have an arbitrary number of spans. However, it is
smarter than a fixed percentage of traces (because it allows for a budget).
If you want to get smarter, you can either get smarter about what you keep
(trace with error vs normal trace), or get smarter about how much you can
keep (system capacity).
Getting smarter about what you keep is not possible in Sleuth, as it cannot
see the result of downstream calls, and has no means to coordinate a
sampling decision. For this, you'd need to do something like send 100% to
zipkin-sparkstreaming and move the sample decision there.
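To make "what you keep" concrete, here is a hedged sketch of the kind of after-the-fact decision that could run where the whole trace is visible (e.g. in a zipkin-sparkstreaming job). The class name and the "key=value" tag convention are hypothetical:

```java
import java.util.List;

// Hypothetical after-the-fact decision over a completed trace: always keep
// traces that contain an error, and only a random fraction of normal traces.
public class LateSamplingDecision {
    /**
     * @param tags       tags collected from all spans in the trace (assumed "key=value" strings)
     * @param normalRate fraction of error-free traces to keep, in [0.0, 1.0]
     * @param roll       a uniform random number in [0.0, 1.0)
     */
    public static boolean keep(List<String> tags, double normalRate, double roll) {
        boolean hasError = tags.stream().anyMatch(t -> t.startsWith("error"));
        return hasError || roll < normalRate;
    }
}
```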
Getting smarter about how much you can keep is possible. For example, if
you read back a metric of spans accepted by a collector, you can do some
math to adjust toward a target rate based on that. The tricky part is
coordination, as multiple servers may be sampling (even if they are
equivalent nodes in the same cluster). For example, do you split 10k
spans/second equally between 10 nodes? What if one dies, or 10 more are
added?
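The coordination arithmetic above can be shown in a minimal sketch (hypothetical helper; real coordination, e.g. via ZooKeeper, also has to detect membership changes and rebalance):

```java
// Hedged sketch: divide a cluster-wide spans/second target evenly across
// the nodes currently known to be alive. When a node dies or joins, the
// per-node budget must be recomputed.
public class SampleBudget {
    public static long perNode(long clusterTargetPerSecond, int liveNodes) {
        if (liveNodes <= 0) {
            throw new IllegalArgumentException("need at least one live node");
        }
        return clusterTargetPerSecond / liveNodes; // e.g. 10k/s over 10 nodes -> 1k/s each
    }
}
```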
This gets into rather more complex code, like the ZooKeeper-coordinated
sampler used by Twitter (note this is a collector sampler):
https://github.com/openzipkin/zipkin/tree/master/zipkin-zookeeper
I don't want to distract the issue here too much; just think of adaptive
sampling as some function, where the hard part is modeling the flow of
information that tells you to accept more or less. You can get arbitrarily
fancy, and the edge cases can become really difficult.
|
Thanks for the analysis. Makes perfect sense. I simplified the adaptive sampling issue. I think that we should close this one for now. |
With this pull request we have rewritten the whole Sleuth internals to use Brave. That way we can leverage all the functionality & instrumentation that Brave already has (https://github.com/openzipkin/brave/tree/master/instrumentation). A migration guide is available here: https://github.com/spring-cloud/spring-cloud-sleuth/wiki/Spring-Cloud-Sleuth-2.0-Migration-Guide
fixes #711 - Brave instrumentation
fixes #92 - we move to Brave's Sampler
fixes #143 - Brave is capable of passing context
fixes #255 - we've moved away from the Zipkin Stream server
fixes #305 - Brave has gRPC instrumentation (https://github.com/openzipkin/brave/tree/master/instrumentation/grpc)
fixes #459 - Brave (openzipkin/brave#510) & Zipkin (openzipkin/zipkin#1754) will deal with the AWS X-Ray instrumentation
fixes #577 - messaging instrumentation has been rewritten
Sampled returning false in Sleuth is different than most instrumentation I've seen. For example, a span is managed regardless of whether it is sampled (i.e. it still generates callbacks etc). The Sampled state controls the exportable field, which in turn controls propagation and storage (and a special case for users of TraceManager.addAnnotation(key, value) as opposed to getCurrentSpan()). I'd suggest 2 paths, which could both be taken.
If controlling overhead is an intended goal of the Sampler, then maybe the before-the-fact decision is still the right choice. I'd have a few suggestions on that.