Feature to split tailsampling into two phases pre and post sample #30319

tiithansen · 2024-01-06T16:15:45Z

Component(s)

connector/spanmetrics, processor/tailsampling

Is your feature request related to a problem? Please describe.

When generating spanmetrics with exemplars its not possible to specify which traces are going to be stored by tailsampling processor and which not. Currently all traces passing through spanmetrics connector will have exemplars added to them even if traces are not actually stored. This causes frustration for developers trying to debug spikes on graphs.

Describe the solution you'd like

Split tailsampling into two phases:

Presample phase runs policies to detect which traces should be stored. If trace should be stored new attribute will be attached tail_sampling.sample: true. Presample phase will pass all traces to next stages
Postsample phase drops all traces which do not have tail_sampling.sample: true set.

Between these two stages it is possible to run spanmetrics connector which would check tail_sampling.sample attribute to determine if it should export exemplars or not.

Sample pipeline:

processors:
  tail_sampling/presample:
    processor_mode: presample
  tail_sampling/postsample:
    processor_mode: postsample

service:
      telemetry:
        logs:
          encoding: json
      pipelines:
        traces:
          receivers:
            - otlp
          processors:
            - ...
          exporters:
            - loadbalancing
        traces/after_lb:
          receivers:
            - otlp/lb
          processors:
            - tail_sampling/presample
          exporters:
            - forward
            - spanmetrics
        traces/sample:
          receivers:
            - forward
          processors:
            - tail_sampling/postsample
            - batch
          exporters:
            - otlphttp
        metrics:
          receivers:
            - spanmetrics
          processors:
            - attributes/metrics
            - batch/metrics
          exporters:
            - otlphttp/metrics
        logs: null

Describe alternatives you've considered

No response

Additional context

No response

The text was updated successfully, but these errors were encountered:

github-actions · 2024-01-06T16:16:02Z

Pinging code owners:

connector/spanmetrics: @albertteoh
processor/tailsampling: @jpkrohling

See Adding Labels via Comments if you do not have permissions to add labels yourself.

jiekun · 2024-01-08T02:02:15Z

correct me if i am wrong, the first tail sampling processor will no sample any trace, but tag them as sampled / not sampled. So that span metrics connector could attach exemplars of those sampled trace to metrics right?

I understand that it will benefit the metrics, but I feel the tail sampling processor should not be further enhanced to support it. And it takes 2x cpu / memory (cmiiw) to handle.

What if we just add some policy in span metrics connector to spot important traces?

tiithansen · 2024-01-08T06:41:07Z

correct me if i am wrong, the first tail sampling processor will no sample any trace, but tag them as sampled / not sampled. So that span metrics connector could attach exemplars of those sampled trace to metrics right?

Pre-sample phase will do the actual policy execution which determines and tags sampled traces.

I understand that it will benefit the metrics, but I feel the tail sampling processor should not be further enhanced to support it. And it takes 2x cpu / memory (cmiiw) to handle.

It will increase CPU usage but not 2x because post-sample will not run policies. What it does is that it will just iterate over spans in trace and drop or keep it depending if its tagged or not.

What if we just add some policy in span metrics connector to spot important traces?

This approach could increase CPU usage 2x because in order for it to make sense we would need to duplicate same logic in metrics processor. Worst case duplicate code and 2x CPU usage. Plus there is a chance that two processors won't come to same conclusion whether it should keep or drop the span.

jiekun · 2024-01-08T06:55:31Z

Thank you @tiithansen . So the pre-sample phase is actually a dry-run phase right?

// The following comments are just personal point of view. We need maintainer's support here.

In order to isolate the responsibilities of different components, I think a dry-run mode (or any other name) can be added to the tail-sampling processor to tag traces as "sampled" or "not sampled."
These data will then be processed by the span metrics connector.
Finally, we can directly filter out traces with the "not sampled" tag using existing components, without the need for the tail-sampling processor. (I think dropping traces directly based on some tags is not the responsibility of tail-sampling processor)

So all we need to modify on tail-sampling processor (If this proposal is accepted) is a dry-run config. And keep most of thing works as before.

But it's just my thought and I would like to have more input for both issue author and maintainers. Thanks for your contribution again @tiithansen

tiithansen · 2024-01-08T09:11:58Z

Yes, it pretty much is a dry-run in that sense. Using some other more light weight processor to do the actual dropping might make sense because tail sampling has a lot of going on inside. It batches traces together etc... But for sake of simplicity I implemented it in tail sampling for now.

There is a PR open with initial implementation already. Feel free to have a look how its implemented.

portertech · 2024-01-10T17:35:53Z

I would love to know which policy sampled the trace (in presample). Perhaps the policy name would make a good default attribute value? The existence suggests true and the value provides useful context. Perhaps prefix the policy type, e.g. probabilistic/ten-percent.

portertech · 2024-01-11T06:30:49Z

I agree with @jiekun, we should use existing components to filter out traces without the sample attribute. Focusing the implementation on the dry-run. Not yet sure how I feel about "presample" and "dry-run", I suspect there's a better name of this mode of operation 🤔

tiithansen · 2024-01-11T09:28:10Z

Adding policy name which triggered sampling should be fairly easy. But for debugging it would make sense if we could add whole decision tree to the attribute value. For example if we have nested and policies then it does not give any value if we just see root and policy. We should see something like root.nested_and.latency/1s.

Regarding "presample" and "dry-run". Maybe we could just have two types of behaviors in processor_mode decide_and_filter which is current and will remain default and decide_and_tag which would be the new one.

portertech · 2024-01-11T17:46:03Z

@tiithansen excellent point and great suggestions. Is there "prior art" with another component having this kinds of mode switch? Wondering if we start to establish a convention, or we go it alone.

portertech · 2024-01-12T17:41:54Z

FYI Honeycomb Refinery refers to "dry run mode" https://github.com/honeycombio/refinery#dry-run-mode

tiithansen · 2024-01-12T20:12:52Z

To me this dry-run sounds a bit different. Based on the info there, it looks more like helm dry-run where it checks if manifest/rules are correct and working as expected. In our case we are not evaluating tail sampling policy, but we are changing how tailsampling should handle unwanted traces. If it should tag them or drop them.

I will update the pull request soon with changes what have been discussed here.

tiithansen · 2024-01-13T13:49:39Z

@portertech @jiekun I have updated the PR. Tailsampling readme contains a full example also how to use tailsampling, filterprocessor an spanmetrics together.

github-actions · 2024-03-14T03:52:23Z

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

processor/tailsampling: @jpkrohling
connector/spanmetrics: @portertech

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions · 2024-05-13T05:19:53Z

This issue has been closed as inactive because it has been stale for 120 days with no activity.

tiithansen added enhancement New feature or request needs triage New item requiring triage labels Jan 6, 2024

github-actions bot added the connector/spanmetrics label Jan 6, 2024

github-actions bot added the processor/tailsampling Tail sampling processor label Jan 6, 2024

tiithansen mentioned this issue Jan 6, 2024

[connector/spanmetrics] [processor/tailsampling] Tail sampling phases #30320

Closed

github-actions bot mentioned this issue Jan 9, 2024

Weekly Report: 2024-01-02 - 2024-01-09 #30334

Closed

github-actions bot mentioned this issue Jan 16, 2024

Weekly Report: 2024-01-09 - 2024-01-16 #30565

Closed

This was referenced Jan 23, 2024

Weekly Report: 2024-01-16 - 2024-01-23 #30711

Closed

Weekly Report: 2024-01-23 - 2024-01-30 #30848

Closed

github-actions bot mentioned this issue Feb 6, 2024

Weekly Report: 2024-01-30 - 2024-02-06 #31055

Closed

This was referenced Feb 13, 2024

Weekly Report: 2024-02-06 - 2024-02-13 #31192

Closed

Weekly Report: 2024-02-13 - 2024-02-20 #31323

Closed

This was referenced Feb 20, 2024

Weekly Report: 2024-02-13 - 2024-02-20 asuresh4/opentelemetry-collector-contrib#11541

Open

Weekly Report: 2024-02-20 - 2024-02-27 #31422

Closed

Weekly Report: 2024-02-20 - 2024-02-27 asuresh4/opentelemetry-collector-contrib#11542

Open

This was referenced Mar 5, 2024

Weekly Report: 2024-02-27 - 2024-03-05 #31560

Closed

Weekly Report: 2024-02-27 - 2024-03-05 asuresh4/opentelemetry-collector-contrib#11543

Open

Weekly Report: 2024-03-05 - 2024-03-12 #31693

Closed

github-actions bot added the Stale label Mar 14, 2024

This was referenced Mar 19, 2024

Weekly Report: 2024-03-12 - 2024-03-19 #31825

Closed

Weekly Report: 2024-03-12 - 2024-03-19 asuresh4/opentelemetry-collector-contrib#11544

Open

Weekly Report: 2024-03-19 - 2024-03-26 #31947

Closed

This was referenced Apr 2, 2024

Weekly Report: 2024-03-26 - 2024-04-02 #32082

Closed

Weekly Report: 2024-04-02 - 2024-04-09 #32230

Closed

github-actions bot mentioned this issue Apr 16, 2024

Weekly Report: 2024-04-09 - 2024-04-16 #32407

Closed

github-actions bot added the closed as inactive label May 13, 2024

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale May 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature to split tailsampling into two phases pre and post sample #30319

Feature to split tailsampling into two phases pre and post sample #30319

tiithansen commented Jan 6, 2024

github-actions bot commented Jan 6, 2024

jiekun commented Jan 8, 2024 •

edited

Loading

tiithansen commented Jan 8, 2024

jiekun commented Jan 8, 2024

tiithansen commented Jan 8, 2024

portertech commented Jan 10, 2024 •

edited

Loading

portertech commented Jan 11, 2024

tiithansen commented Jan 11, 2024

portertech commented Jan 11, 2024

portertech commented Jan 12, 2024 •

edited

Loading

tiithansen commented Jan 12, 2024

tiithansen commented Jan 13, 2024

github-actions bot commented Mar 14, 2024

github-actions bot commented May 13, 2024

Feature to split tailsampling into two phases pre and post sample #30319

Feature to split tailsampling into two phases pre and post sample #30319

Comments

tiithansen commented Jan 6, 2024

Component(s)

Is your feature request related to a problem? Please describe.

Describe the solution you'd like

Describe alternatives you've considered

Additional context

github-actions bot commented Jan 6, 2024

jiekun commented Jan 8, 2024 • edited Loading

tiithansen commented Jan 8, 2024

jiekun commented Jan 8, 2024

tiithansen commented Jan 8, 2024

portertech commented Jan 10, 2024 • edited Loading

portertech commented Jan 11, 2024

tiithansen commented Jan 11, 2024

portertech commented Jan 11, 2024

portertech commented Jan 12, 2024 • edited Loading

tiithansen commented Jan 12, 2024

tiithansen commented Jan 13, 2024

github-actions bot commented Mar 14, 2024

github-actions bot commented May 13, 2024

jiekun commented Jan 8, 2024 •

edited

Loading

portertech commented Jan 10, 2024 •

edited

Loading

portertech commented Jan 12, 2024 •

edited

Loading