Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor tail-sampling processor - Decision cache #31583

Open
Tracked by #31580
jpkrohling opened this issue Mar 5, 2024 · 11 comments
Open
Tracked by #31580

Refactor tail-sampling processor - Decision cache #31583

jpkrohling opened this issue Mar 5, 2024 · 11 comments
Assignees
Labels
processor/tailsampling Tail sampling processor Stale

Comments

@jpkrohling
Copy link
Member

jpkrohling commented Mar 5, 2024

The decision cache can be a regular cached, where the key is the trace ID and the value is the decision. We can use a ring buffer to limit the number of entries in this cache, but this can be an implementation detail. Whenever a new span arrives, we'd run it through the cache and immediately release/drop based on a previous decision.

@jpkrohling jpkrohling added the processor/tailsampling Tail sampling processor label Mar 5, 2024
Copy link
Contributor

github-actions bot commented Mar 5, 2024

Pinging code owners for processor/tailsampling: @jpkrohling. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@jpkrohling jpkrohling self-assigned this Mar 5, 2024
@kentquirk
Copy link
Member

Drop decisions only need a single bit -- "we dropped this", so there's benefit to splitting the decision cache into a Bloom or Cuckoo filter for drops and a separate cache for keep (attaching metadata to spans like sampling probability may require the keep cache to be more than a filter).

Cache size should be configurable.

@djluck
Copy link

djluck commented Mar 5, 2024

@kentquirk can you elaborate a bit more on why we'd need to capture sampling probability per-trace? My understanding is that by entering a trace id into the "sampled" cache we're stating that all spans for this trace should be sampled.

@jpkrohling
Copy link
Member Author

@kentquirk, do you want to work on this one? I can assign this to you. I'd like to work on converting the metrics to use OTel, which could be helpful in adding observability to the cache.

@kentquirk
Copy link
Member

@djluck You're correct that the probability applies to all spans in a trace, but sampling probability / adjusted count is of interest when analyzing the results. When I see a trace I want to know "how many traces like this occurred?" For a simple example, if I normally sample at 10% but keep all errors, I need to be able to multiply non-error traces by 10 in order to calculate the appropriate fraction of errors. To do that, I need to know what the sampling probability was. That data needs to be attached to the outgoing spans.

@jpkrohling You can assign it to me. It might end up getting split into a couple of pieces.

Copy link
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@jamesmoessis
Copy link
Contributor

jamesmoessis commented May 21, 2024

Hi @kentquirk @jpkrohling this is certainly something I think this processor could use, and would help with early sampling decisions and preventing many spans from being stored in memory for longer than needed.

The discussions around bloom/cuckoo filter are definitely on the right track. Looking through Refinery's code, it uses a dual cuckoo-filter mechanism which seems to be quite memory efficient and relatively low rate of false-positives.

Another possible implementation is using an LRU cache. To make it twice as memory efficient you could also just store the right half of the trace id as a uint64.

As my team is in the midst of implementing tail sampling at quite a large scale, I'd be interested in taking a look at this in the coming weeks.

@jpkrohling
Copy link
Member Author

jpkrohling commented May 21, 2024

I would prefer if someone from Honeycomb could implement the cuckoo filter, to prevent any misunderstandings on the code ending up similar to Refinery's, and while I thought LRU would be the optimal one except for the size, your suggestion of using only the right-half of the IDs is wonderful!

@jpkrohling jpkrohling assigned jpkrohling and unassigned kentquirk Jun 10, 2024
@jpkrohling
Copy link
Member Author

For now, I'll implement the cache as a circular buffer, and we can have alternative implementations right after that.

jpkrohling pushed a commit that referenced this issue Jun 17, 2024
**Description:** 

Adds simple LRU decision cache for sampled trace IDs. 

The design makes it easy to add another cache for non-sampled IDs.

It does not save any other information other than the trace ID that is
sampled. It only holds the right half of the trace ID (as a uint64) in
the cache.

By default the cache remains no-op. Only when the user configures the
cache size will the cache become active.

**Link to tracking Issue:**
#31583

**Testing:**
* unit tests on new code 
* test in `processor_decision_test.go` to test that a trace that was
sampled, cached, but the span data was dropped persists a "keep"
decision.

**Documentation:** Added description to README
Copy link
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

Copy link
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Oct 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
processor/tailsampling Tail sampling processor Stale
Projects
None yet
Development

No branches or pull requests

4 participants