Discuss the possibility of deprecating the tail-based sampling processor #1797
@tigrannajaryan, once you enable "Discussions" for this repository (it's under settings), please convert this into a discussion.
I agree with the new design. What I'm trying now is to use different customized processors to tag spans, using two special tags; this is kinda close to your final proposal, without breaking the tail sampler into multiple processors.
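For a concrete sense of this tagging approach, here is a minimal sketch using the stock `attributes` processor; the `sampling.decision` key and its values are made-up placeholders, not anything the comment author specified:

```yaml
processors:
  # hypothetical example: mark matching spans with a decision attribute
  # that a downstream policy/sampling processor could act on
  attributes/mark-for-sampling:
    actions:
      - key: sampling.decision   # made-up attribute key
        value: "sample"          # e.g. "sample" or "drop"
        action: insert
```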
Otel is still deciding whether to have separate "Discussions" per repo: https://github.com/open-telemetry/community/discussions/583
Internally we are already using a separate processor for sampling and it works well enough, though we are using the tail sampler + always_sample instead of groupbytrace. Before migrating to groupbytrace I'd like to see how it performs compared to the tail sampler. There are these loadtest results in the main repo, but I'm not sure when they were last updated: https://github.com/open-telemetry/opentelemetry-collector/blob/c0e2a27a5aaf40c1556d5d36cb15c3d042669976/docs/performance.md
I think the tail-sampling processor in its current form has limited usability, and the plan sounds great, though splitting the procedure across several processors also expands the configuration section, which might be somewhat error-prone and verbose. One alternative to the tail-sampling processor I would like to throw in is (essentially) a fork I was working on: cascadingfilterprocessor. It is built around several concepts:
For example, let's say that I have a budget of 1000 spans per second. I want to allocate no more than 500 spans/second for traces with duration > 10s and no more than 300 spans/second for traces where one of the spans starts with "foo". I want to fill up whatever is left with random traces. The configuration might look like the following:

```yaml
processors:
  cascading_filter:
    spans_per_second: 1000
    policies:
      [
        {
          name: duration-policy,
          spans_per_second: 500,
          properties: {min_duration: 10s}
        },
        {
          name: operation-name-policy,
          spans_per_second: 300,
          properties: {
            name_pattern: "foo.*",
          }
        },
        {
          name: everything-else,
          spans_per_second: -1
        },
      ]
```

Each time a batch is processed, the output batch is filled up to the global limit. The "cascading" name comes from the idea that the defined rules are first evaluated in the order they are defined, and then the remainder of traces can be "cascaded" to the rule that can fill up the global rate from them. Additionally, the processor allows probabilistic sampling of a portion of the data to have a good representation of all traces.
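To illustrate the cascading behavior, consider a hypothetical batch (the numbers here are invented, assuming the configuration sketched above): suppose the incoming traffic contains 600 spans/second of traces matching the duration policy, 200 spans/second matching the name policy, and 800 spans/second matching neither. The duration policy is capped at its 500 spans/second allocation, the name policy passes all 200 of its 300 spans/second budget, and the `everything-else` rule fills the remainder, 1000 - 500 - 200 = 300 spans/second, from the rest, so the output adds up to the global limit of 1000 spans/second.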
One thing that is missing from the original proposal is that we'll need a library with a base context key to hold the current decision for the trace: sample, drop, undecided. Most processors would set either sample or drop.
There are some questions about the finalizer. How can it decide anything if traces can be modified? For example, when using the batch processor, its context is not propagated to the next processor, so it's not obvious how the decision should be made for a batch. I see only one workaround: manually prohibiting some processors in between.
That's what I had in mind.
@tigrannajaryan, is the plan sound enough to deprecate the tail-based sampling?
I have concerns about the ease of configuration of this proposal. What is a typical configuration going to look like compared to an equivalent configuration using the existing tailbasedsampling processor? Normally I am for composable behavior and for breaking down into smaller processors that can be composed in other ways. However, I would like to understand in what other ways these 3 new processors can be combined to give useful and necessary (not made-up) functionality. Please show some examples. Also, requirements like this "manually prohibit some processors in between" are a red flag for me. (We already have some requirements of this sort for other processors and I think they are a mistake and should be fixed.) The alternative is to try to fix the tailbasedsampling processor instead of killing it. I don't see the arguments which say why tailbasedsampling is bad and why it cannot be fixed.
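For a rough sense of the comparison being asked for here, the sketch below contrasts the two styles. The first block follows the existing tail_sampling processor's documented config shape; the second is hypothetical, since attributesamplingprocessor and ratelimitingsamplingprocessor don't exist yet, so their names and options are guesses based on the proposal:

```yaml
# today: one monolithic processor holding traces and applying policies
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      [
        {
          name: keep-errors,
          type: string_attribute,
          string_attribute: {key: http.status_code, values: ["500"]}
        },
        {
          name: rate-limit,
          type: rate_limiting,
          rate_limiting: {spans_per_second: 100}
        },
      ]
---
# proposal (hypothetical): one small processor per concern, composed in the pipeline
processors:
  groupbytrace:
    wait_duration: 10s        # holds spans in memory until the trace is assembled
  attributesampling:          # made-up name and options, per the proposal
    key: http.status_code
    values: ["500"]
  ratelimitingsampling:       # made-up name and options, per the proposal
    spans_per_second: 100
service:
  pipelines:
    traces:
      processors: [groupbytrace, attributesampling, ratelimitingsampling]
```

Whether the second form is actually easier to configure is exactly the open question in this thread.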
I'll work on this in January, but I'm also not entirely convinced at this point. To recap: this started once I changed the current tail-based sampling to remove features that exist as dedicated processors, like the groupbytrace. All that was left was the policy part, which I extracted as a new processor.
This seems to be blocked, so here's my proposal to unblock it:
My motivation for moving forward with this is that I want to document the appropriate way to achieve tail-based sampling with the collector, and I don't want to document something that won't live much longer. If we do decide that the current tail-based sampling processor is indeed what we want to support in the future, that's the one I'll document.
Talking about sampling logic. How does
Sounds like a good contender as well!
I would like to suggest splitting this roadmap into multiple issues with detailed and definite descriptions of the implementation. This would allow engaging more contributors.
I think the main problem right now is that we don't have a clear answer on whether multiple processors will provide similar performance compared to the current solution. I think they might, and I'm running some tests to assess it: https://github.com/jpkrohling/groupbytrace-tailbasedsampling-perf-comparison. Right now, I'm waiting on cncf/cluster#167 to get more reliable test runs. That said, my current proposal is to:
How about we build a few proofs of concept? @morigs taking care of the filterprocessor and @pmm-sumo on his previous proposal? I think @pmm-sumo's might even be in use already :-)
Okay, now I better understand the idea. Also, speaking about the "manually prohibit some processors in between" concern: I think it's not always bad. This does not always mean that a processor is not composable. It's a matter of dependencies between processors and exporters. We could either combine them together or split them into multiple units for better flexibility at the cost of some manual restrictions. Perhaps otel-collector will have some dependency validation feature someday 🤔 Right now I'm ready to work on open-telemetry/opentelemetry-collector#2310, but I have some unanswered questions and thus don't have a full vision of the target solution.
Hi, I just came across this issue and was wondering: what's the status of this effort? cc @mapno
Given that this processor is part of the contrib, it can be deprecated in the future without huge problems. Given that we are focused on releasing the main distribution v1 soon, we are likely going to take a look at this again only in a few months. That said: I'm convinced that there are some reliability issues with the current tail-based sampling processor, but I'm not convinced they are not solvable, nor that the proposed architecture (groupbytrace + policy processors) is better than the current solution. I've seen your PRs and I think they will be useful no matter where we want to go: if we decide to split the policy aspects of the tail-based sampling (which are working fine), we can reuse your contributions quite easily. If you have some time and want to help on this front, here's more information about the tests: https://github.com/jpkrohling/groupbytrace-tailbasedsampling-perf-comparison. Let me know if this sounds interesting to you. I'll work on giving you access to the bare-metal machines to use with the tests :-)
From time to time, we get questions from people involving this issue. Some of them are concerned about using this only to have it removed in the future. Despite the title of this issue, don't let this prevent you from using this processor! It's still quite useful and we are not going to replace it without giving you an upgrade path. If you use this today and we decide to move forward with the deprecation, we'll make sure the config and behavior are compatible with the current processor. That said, we haven't decided yet on whether we are going to deprecate this or not. This issue here is to discuss the possibility, and it involves a lot of work in terms of building the new components and running performance tests comparing both.
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping
This issue has been closed as inactive because it has been stale for 120 days with no activity. |
Hi @jpkrohling, while going through the README of the Tail Sampling Processor, it was not clear to me what the use case of the rate_sampling policy is and how it works internally. I feel there is a need to update the README to explain the use of the policies in more detail, along with the values supported for the other parameters. For example, the ottl_condition policy has a parameter called error_mode, but nowhere is it explained what values are supported for it. I have opened another issue to track this: #35419
As discussed during yesterday's SIG Collector meeting, I would like to start planning the deprecation of the tail-based sampling. The replacement would be:

- `groupbytrace` processor, which replaces the part of the tail-based sampling that holds spans in memory for a specific amount of time
- `attributesamplingprocessor`, replacing the `string` and `int` policies from the tail-based sampling
- `ratelimitingsamplingprocessor`, replacing the rate-limiting policy

The rate-limiting policy in the current tail-based sampling isn't well documented and it's not very clear how it's supposed to work. For instance, if the policy is set to rate-limit at 10 spans per second, and 10 traces with 2 spans each come in, will 5 traces be dropped? What if 6 of them are marked to be sampled by the string attribute policy? Does the order of the policies matter?
Having each policy as its own processor, the chain of events is well-defined and understood, as it uses the semantics we have in place for the pipeline. It also makes it clear to consumers how it's supposed to work, as they won't need to fork the project to add a custom policy: they can just create their own processors following one of the existing sampling processors.
In a second phase, we should consider adding a rate-limiting feature to the individual sampling processors as well, which could be used with or without a rate-limiting processor.
Related PRs and issues: open-telemetry/opentelemetry-collector#1786, open-telemetry/opentelemetry-collector#1651, #16161, #16162, open-telemetry/opentelemetry-collector#558
cc @tigrannajaryan, @kbrockhoff, @chris-smith-zocdoc, @ZhengHe-MD, @pmm-sumo