
Discuss the possibility of deprecating the tail-based sampling processor #1797

Closed
jpkrohling opened this issue Dec 10, 2020 · 23 comments

@jpkrohling
Member

jpkrohling commented Dec 10, 2020

As discussed during yesterday's SIG Collector meeting, I would like to start planning the deprecation of the tail-based sampling processor. The replacement would be:

  1. (exists) groupbytrace processor, which replaces the part of the tail-based sampling that holds spans in memory for a specific amount of time
  2. (doesn't exist yet) attributesamplingprocessor, replacing the string and int policies from the tail-based sampling
  3. (doesn't exist yet) ratelimitingsamplingprocessor, replacing the rate-limiting policy
  4. no replacement for the "always" policy

The rate limiting policy in the current tail-based sampling isn't well documented and it's not very clear how it's supposed to work. For instance, if the policy is set to rate-limit at 10 spans per sec, and 10 traces with 2 spans come in, will 5 traces be dropped? What if 6 of them are marked to be sampled from the string attribute policy? Does the order of the policies matter?

With each policy as its own processor, the chain of events is well defined and understood, as it uses the semantics we already have in place for the pipeline. It also makes it clear to consumers how it's supposed to work, and they won't need to fork the project to add a custom policy: they can just create their own processors modeled on one of the existing sampling processors. A rough sketch of such a pipeline is shown below.
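For illustration, here's what such a composed pipeline might look like. This is a hypothetical sketch: the attribute_sampling and rate_limiting_sampling processors don't exist yet, so their names and options are placeholders; only groupbytrace and its wait_duration setting exist today.

processors:
  groupbytrace:
    wait_duration: 10s
  # hypothetical: marks traces as sampled when an attribute matches
  attribute_sampling:
    policies:
      - key: "http.status_code"
        min_value: 500
        max_value: 599
  # hypothetical: drops traces above a global spans/second budget
  rate_limiting_sampling:
    spans_per_second: 1000

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [groupbytrace, attribute_sampling, rate_limiting_sampling]
      exporters: [otlp]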

In a second phase, we should consider adding a rate-limiting feature to the individual sampling processors as well, which could be used with or without a rate-limiting processor.

Related PRs and issues: open-telemetry/opentelemetry-collector#1786, open-telemetry/opentelemetry-collector#1651, #16161, #16162, open-telemetry/opentelemetry-collector#558

cc @tigrannajaryan, @kbrockhoff, @chris-smith-zocdoc, @ZhengHe-MD, @pmm-sumo

@jpkrohling
Member Author

jpkrohling commented Dec 10, 2020

@tigrannajaryan, once you enable "Discussions" for this repository (it's under settings), please convert this into a discussion.

@ZhengHe-MD
Contributor

ZhengHe-MD commented Dec 10, 2020

I agree with the new design. What I'm trying now is to use different customized processors to tag spans with two special attributes, sampling.priority and sampling.reason, and then use only one numeric attribute policy in the tail-based sampler:

{
      name: sample_with_high_priority,
      type: numeric_attribute,
      numeric_attribute: {key: "sampling.priority", min_value: 1, max_value: 1}
}

This is kinda close to your final proposal, without breaking the tail sampler into the groupbytrace processor and attributesamplingprocessor. Your solution is a more elegant way to separate concerns.
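For context, that policy sits inside the tail_sampling processor configuration roughly like this (the decision_wait value is just illustrative):

processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      [
        {
          name: sample_with_high_priority,
          type: numeric_attribute,
          numeric_attribute: {key: "sampling.priority", min_value: 1, max_value: 1}
        }
      ]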

@tigrannajaryan
Member

> @tigrannajaryan, once you enable "Discussions" for this repository (it's under settings), please convert this into a discussion.

Otel is still deciding whether to have separate "Discussions" per repo: https://github.com/open-telemetry/community/discussions/583

@chris-smith-zocdoc
Contributor

Internally, we are already using a separate processor for sampling and it works well enough, though we are using the tail sampler + always_sample instead of groupbytrace.

Before migrating to groupbytrace, I'd like to see how it performs compared to the tail sampler. There are load-test results in the main repo, but I'm not sure when they were last updated: https://github.com/open-telemetry/opentelemetry-collector/blob/c0e2a27a5aaf40c1556d5d36cb15c3d042669976/docs/performance.md

@pmm-sumo
Contributor

I think the tail-sampling processor in its current form has limited usability and the plan sounds great, though splitting the procedure across several processors also expands the configuration section, which might be somewhat error-prone and verbose.

One alternative to the tail-sampling processor I would like to throw in is (essentially) a fork I was working on: cascadingfilterprocessor. It is built around several concepts:

  • no separate policy types for numeric/string/other attributes
  • ability to limit the rate at policy level
  • ability to limit the rate at global (processor) level
  • ability to have a policy that fills up the available global rate with random traces (throughout the code, if anyone is willing to look around, it's called "SecondChance")

For example, let's say that I have a budget of 1000 spans per second. I want to allocate no more than 500 spans/second to traces with duration > 10s and no more than 300 spans/second to traces where one of the span names starts with "foo". I want to fill up whatever is left with random traces. The configuration might look like the following:

processors:
  cascading_filter:
    spans_per_second: 1000
    policies:
      [
          {
            name: duration-policy,
            spans_per_second: 500,
            properties: {min_duration: 10s }
          },
          {
            name: operation-name-policy,
            spans_per_second: 300,
            properties: {
              name_pattern: "foo.*",
            }
         },
        {
          name: everything-else,
          spans_per_second: -1
        },
      ]

Now consider the following processed batches:

Batch 1
(a) 100 traces, 10 spans each where duration > 10s
(b) 100 different traces, 10 spans each, where there's a span with operation name = "foo"
(c) 800 more traces, 10 spans each, which do not fit either criterion

On the output we would get 50 traces from (a), 30 from (b), and 20 traces randomly selected from (c) plus the remainder of the not-previously-selected (a) and (b).

Batch 2
(a) 10 traces, 10 spans each where duration > 10s
(b) 100 different traces, 10 spans each, where there's a span with operation name = "foo"
(c) 800 more traces, 10 spans each, which do not fit either criterion

On the output we would get 10 traces from (a), 30 from (b), and 60 traces randomly selected from (c) plus the remainder of the not-previously-selected (b).

So each time, the output batch is filled up to the global limit.

The "cascading" name comes from the idea that defined rules are first evaluated in order they are defined, and then the remainder of traces can be "cascaded" to the rule that can fill up the global rate from them.

Additionally, the processor allows probabilistic sampling of a portion of the data to keep a good representation of all traces (the current sampling.probability is then calculated for each batch).

@jpkrohling
Member Author

One thing that is missing from the original proposal is that we'll need a library with a base context key to hold the current decision for the trace: sample, drop, or undecided. Most processors would set either sample or undecided, but processors like the rate-limiting one would return either drop or undecided. A finalizer would then be needed to effectively sample or drop.

@pkositsyn
Contributor

pkositsyn commented Dec 11, 2020

There are some questions about the finalizer. How can it decide anything if traces can be modified? For example, when using the batch processor (its context is not propagated to the next processor), it's not obvious how the decision should be made for a batch. I see only one workaround - manually prohibit some processors in between.

@jpkrohling
Member Author

> I see only one workaround - manually prohibit some processors in between.

That's what I had in mind.

@jpkrohling
Member Author

@tigrannajaryan, is the plan sound enough to deprecate the tail-based sampling?

@tigrannajaryan
Member

I have concerns about the ease of configuration of this proposal. What is a typical configuration going to look like compared to an equivalent configuration using the existing tailbasedsampling processor?

Normally I am for composable behavior and for breaking things down into smaller processors that can be composed in other ways. However, I would like to understand in what other ways these 3 new processors can be combined to give useful and necessary (not made-up) functionality. Please show some examples.

Also, requirements like this "manually prohibit some processors in between" are a red flag for me. (We already have some requirements of this sort for other processors and I think they are a mistake and should be fixed).

The alternative is to try to fix the tailbasedsampling processor instead of killing it. I don't see arguments for why tailbasedsampling is bad and why it cannot be fixed.

@jpkrohling
Member Author

> What is a typical configuration going to look like compared to an equivalent configuration using the existing tailbasedsampling processor?

I'll work on this in January, but I'm also not entirely convinced at this point. To recap: this started once I changed the current tail-based sampling to remove features that exist as dedicated processors, like the groupbytrace. All that was left was the policy part, which I extracted as a new policysamplingprocessor (#1786), while the tailsamplingprocessor is kept for a while for backwards compatibility.

@jpkrohling
Member Author

This seems to be blocked, so, here's my proposal to unblock this:

  1. proceed with splitting the policy part from the current tail based sampling processor
  2. run a benchmark comparing the current solution (tail based sampling processor) vs. the new one (groupbytrace plus policy sampling processor)
  3. if the new one proves to be better, deprecate the tail-based sampling processor
  4. adapt or create a new sampling processor with the ideas discussed here

My motivation to move with this is that I want to document the appropriate way to achieve tail based sampling with the collector, and I don't want to document something that won't live much longer. If we do decide that the current tail based sampling processor is indeed what we want to support for the future, that's the one I'll document.

@morigs
Contributor

morigs commented Mar 15, 2021

Talking about sampling logic: how does the attributesamplingprocessor relate to the existing filterprocessor (apart from the fact that the latter doesn't support traces yet)?
I believe the filterprocessor sounds better in terms of sampling decision logic because it allows the use of expr expressions. This is a very flexible approach, in contrast to numeric_attribute and string_attribute: it allows describing complex logic using any trace property (theoretically, of course 😄) of any type (it's very unfortunate that I cannot filter on a boolean attribute, for example).
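To make the comparison concrete, here is a purely hypothetical sketch of what an expr-based trace filter could look like if the filterprocessor gained traces support. The spans section, field names, and expression functions below are invented for illustration; only the metrics filter supports expr today:

processors:
  filter/sampling-candidates:
    spans:
      include:
        match_type: expr
        expressions:
          # hypothetical expression functions, loosely modeled on the metrics filter
          - 'Attribute("http.status_code") >= 500'
          - 'SpanName() matches "foo.*"'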

@jpkrohling
Member Author

Sounds like a good contender as well!

@morigs
Contributor

morigs commented Mar 16, 2021

I would like to suggest splitting this roadmap into multiple issues with a detailed and definite description of the implementation. This would allow engaging more contributors

@jpkrohling
Member Author

jpkrohling commented Mar 16, 2021

I think the main problem right now is that we don't have a clear answer on whether multiple processors will provide performance similar to the current solution. I think they might, and I'm running some tests to assess it: https://github.com/jpkrohling/groupbytrace-tailbasedsampling-perf-comparison . Right now, I'm waiting on cncf/cluster#167 to get more reliable test runs.

That said, my current proposal is to:

  • understand the performance implications of splitting the tail-based sampling into different processors
  • create a policy sampling processor with the same policy features as the current tail-based sampling processor. This would allow the tail-based sampling processor to be deprecated, and wouldn't block other sampling processors from being created.
  • agree on what the next sampling processors will look like. The proposal above is flexible, but as @tigrannajaryan mentioned, the complexity in the configuration might not be warranted. The filter processor sounds like a good contender, but I don't have experience with it.

How about we build a few proofs of concept, with @morigs taking care of the filterprocessor and @pmm-sumo of his previous proposal? I think @pmm-sumo's might even be in use already :-)

@morigs
Contributor

morigs commented Mar 16, 2021

Okay, now I better understand the idea. Indeed, the cascadingfilterprocessor as well as the existing tailsamplingprocessor are too complex and at the same time not very flexible. I believe the composition of multiple processors, as @tigrannajaryan mentioned, is a better way. The only question here is how to perform rate limiting similarly to the cascadingfilterprocessor proposal: when you need to apply a different rate limit for each kind of filter, it will require multiple pipelines (a sketch follows below). We can add a rate-limiting feature to the filterprocessor, but that sounds dirty to me.
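For illustration, a rough sketch of that multi-pipeline arrangement; the filter/... and ratelimit/... processors and their names are hypothetical placeholders, not existing components:

service:
  pipelines:
    traces/long-running:
      receivers: [otlp]
      processors: [groupbytrace, filter/long-running, ratelimit/long-running]
      exporters: [otlp]
    traces/errors:
      receivers: [otlp]
      processors: [groupbytrace, filter/errors, ratelimit/errors]
      exporters: [otlp]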

Also speaking about this

> Also, requirements like this "manually prohibit some processors in between" are a red flag for me. (We already have some requirements of this sort for other processors and I think they are a mistake and should be fixed).

I think it's not always bad. It does not always mean that a processor is not composable; it's a matter of dependencies between processors and exporters. We could either combine them together or split them into multiple units for better flexibility at the cost of some manual restrictions. Perhaps the otel-collector will have some dependency validation feature someday 🤔

Right now I'm ready to work on open-telemetry/opentelemetry-collector#2310, but I have some unanswered questions and thus don't have a full vision of the target solution.

@yvrhdn
Contributor

yvrhdn commented Jul 12, 2021

Hi, I just came across this issue and was wondering what the status of this effort is.
I was contributing a couple of new policies to the current tail sampling processor, but if we wish to replace the processor altogether I'd prefer to contribute to the new design. Is there something we can work on to move this discussion along?

cc/ @mapno

@jpkrohling
Member Author

jpkrohling commented Jul 12, 2021

Given that this processor is part of contrib, it can be deprecated in the future without huge problems. And given that we are focused on releasing the main distribution's v1 soon, we are likely only going to take a look at this again in a few months.

That said: I'm convinced that there are some reliability issues with the current tail-based sampling processor but I'm not convinced they are not solvable, nor that the proposed architecture (groupbytrace + policy processors) is better than the current solution.

I've seen your PRs and I think they will be useful no matter where we want to go: if we decide to split the policy aspects of the tail-based sampling (which are working fine), we can reuse your contributions quite easily.

If you have some time and want to help on this front, here's more information about the tests: https://github.com/jpkrohling/groupbytrace-tailbasedsampling-perf-comparison

Let me know if this sounds interesting to you. I'll work on giving you access to the bare-metal machines to use for the tests :-)

@jpkrohling
Member Author

From time to time, we get questions from people referencing this issue. Some of them are concerned about adopting this processor only to have it removed in the future.

Despite the title of this issue, don't let it prevent you from using this processor! It's still quite useful and we are not going to replace it without giving you an upgrade path. If you use it today and we decide to move forward with the deprecation, we'll make sure the replacement's config and behavior are compatible with the current processor.

That said, we haven't decided yet on whether we are going to deprecate this or not. This issue here is to discuss the possibility, and it involves a lot of work in terms of building the new components and running performance tests comparing both.

@jpkrohling jpkrohling changed the title Deprecate the tail-based sampling processor Discuss the possibility of deprecating the tail-based sampling processor Feb 17, 2022
@github-actions
Contributor

github-actions bot commented Jan 9, 2023

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

@github-actions github-actions bot added the Stale label Jan 9, 2023
@github-actions
Contributor

This issue has been closed as inactive because it has been stale for 120 days with no activity.

@DraegoG

DraegoG commented Sep 25, 2024

Hi @jpkrohling, while going through the README of the Tail Sampling Processor, it was not clear to me what the use case of the rate_sampling policy is and how it works internally.
I found this discussion, started here, mentioning the same problem, but it does not seem to have reached a conclusion yet.

I seriously feel there is a need to update the README as well, explaining the use of the policies in more detail and the values supported for other parameters. For example, in the ottl_condition policy there is a parameter called error_mode, but nowhere is it explained what values are supported for that parameter.

I have opened another issue to track the same here: #35419
