Backpressure Management #2410

mattjohnsonpint · 2023-06-02T16:05:53Z

Problem Statement

The internal BackgroundWorker class manages back-pressure for the SDK. When events, transactions, sessions, etc. are captured, the SentryClient adds them to a ConcurrentQueue within the background worker. The background worker then has a separate task that is responsible for reading from the queue and sending events to the ITransport implementation (typically that's the HttpTransport).

This is generally a good design. However, it suffers from the limitation that there's only one queue for all envelope types, non-prioritized, with a single limit set by SentryOptions.MaxQueueItems - defaulting to 30. If the queue is full, envelopes are dropped when captured, never being added to the queue. Therefore, it's quite possible (especially in a server environment) that a large flood of one type of envelope (such as transactions) can prevent another more important type of envelope (such as error events) from being sent to Sentry.

Solution Brainstorm

We need to support concurrency on the producer side, so the built-in PriorityQueue class (which is not thread-safe) is out. We need to be non-blocking on the consumer side, so BlockingCollection is also out. There is no built-in PriorityConcurrentQueue class. Thus, to implement prioritization, we will probably need to use multiple ConcurrentQueue instances in the background worker - perhaps in a dictionary where the key is the envelope type.

We also should have more than one option for controlling the maximum queue size (this might be separate properties, or a dictionary). Currently we just have MaxQueueItems - which is undocumented.
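To make the brainstorm above concrete, here is a minimal sketch of one bounded ConcurrentQueue per envelope type, each with its own limit. Everything named here (EnvelopeKind, PrioritizedEnvelopeQueue, the priority order and the fallback limit of 30) is hypothetical and for illustration only; it is not the SDK's BackgroundWorker code.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

// Hypothetical categories, declared from highest to lowest priority.
public enum EnvelopeKind { Error, Session, Transaction }

// Hypothetical sketch: one bounded ConcurrentQueue ("lane") per envelope kind.
public sealed class PrioritizedEnvelopeQueue<T>
{
    private sealed class Lane
    {
        public readonly ConcurrentQueue<T> Items = new ConcurrentQueue<T>();
        public readonly int Limit;
        public int Count; // reserved slots, only touched via Interlocked

        public Lane(int limit) { Limit = limit; }
    }

    // Indexed by (int)EnvelopeKind, which doubles as the drain order.
    private readonly Lane[] _lanes;

    public PrioritizedEnvelopeQueue(
        IReadOnlyDictionary<EnvelopeKind, int> limits, int defaultLimit = 30)
    {
        var kinds = (EnvelopeKind[])Enum.GetValues(typeof(EnvelopeKind));
        _lanes = new Lane[kinds.Length];
        foreach (var kind in kinds)
        {
            var limit = limits.TryGetValue(kind, out var configured)
                ? configured
                : defaultLimit;
            _lanes[(int)kind] = new Lane(limit);
        }
    }

    // Producer side: never blocks. Returns false (item dropped) when the
    // lane for this kind is full; other lanes are unaffected.
    public bool TryEnqueue(EnvelopeKind kind, T item)
    {
        var lane = _lanes[(int)kind];
        if (Interlocked.Increment(ref lane.Count) > lane.Limit)
        {
            Interlocked.Decrement(ref lane.Count); // give the slot back
            return false;
        }
        lane.Items.Enqueue(item);
        return true;
    }

    // Consumer side: never blocks. Takes from the highest-priority
    // lane that currently has an item.
    public bool TryDequeue(out T item)
    {
        foreach (var lane in _lanes)
        {
            if (lane.Items.TryDequeue(out item))
            {
                Interlocked.Decrement(ref lane.Count);
                return true;
            }
        }
        item = default!;
        return false;
    }
}
```

With this shape, a flood of transactions can only fill the transaction lane, so an error event captured at the same time is still accepted and is drained first. One trade-off to keep in mind: strict priority draining can starve the lower lanes while higher ones stay busy.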

We should probably set higher default queue sizes for ASP.NET and ASP.NET Core. Currently, MaxQueueItems defaults to 30. We do show setting it higher in some of the sample apps, but with no explanation.

Whatever is decided in getsentry/team-sdks#10 should be taken into consideration also.

We potentially have a solution in this spike (although it may need tweaking):

Experimenting with Greedy Sampling #3167

References

Heap Data Structures (and their use in Priority Queues)

ericsampson · 2023-06-10T12:58:12Z

What about using TPL Dataflow?

bitsandfoxes · 2024-03-04T13:39:36Z

Additional context: internal doc (sorry): https://vanguard.getsentry.net/p/clnlv0iuj0010s60q9e09ba3f

jamescrosswell · 2024-05-02T09:29:34Z

The Python implementation appears to adjust a downsample factor based on whether the queue is full or requests have been rate limited. That factor is then used to reduce the number of traces that get captured.

Curiously, it looks like rate limits across any category would cause downsampling (not just rate limits applied to traces). It would be good to understand if that was deliberate and, if so, what the thinking was behind that design choice.

@bitsandfoxes are we looking to mimic what the Python SDK is doing in .NET, or are we also considering strategies like the one Matt originally proposed (some way to implement priority queues)?

bitsandfoxes · 2024-05-02T15:10:04Z

We do have some docs on backpressure.

> Curiously, it looks like rate limits across any category would cause downsampling (not just rate limits applied to traces). It would be good to understand if that was deliberate and, if so, what the thinking was behind that design choice.

If traces get rate limited errors get downsampled?

jamescrosswell · 2024-05-02T20:32:08Z

> If traces get rate limited errors get downsampled?

No, the other way around. Downsampling only affects traces in the Python implementation. However, if Metrics or Errors get rate limited, this would cause traces to be downsampled, if I've understood that code correctly.
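For anyone following along, the behaviour described in this comment could look roughly like the sketch below if it were ported to C#. This is only an illustration under stated assumptions: the class and member names are hypothetical, and the halving per unhealthy check and the cap of 10 are assumptions, not something taken from the Python or .NET SDK.

```csharp
using System;

// Hypothetical sketch; names, halving rule and cap are assumptions.
public sealed class BackpressureMonitor
{
    private const int MaxDownsampleFactor = 10; // assumed cap

    private readonly Func<bool> _isQueueFull;
    private readonly Func<bool> _isRateLimited; // true if ANY category is rate limited

    public int DownsampleFactor { get; private set; }

    public BackpressureMonitor(Func<bool> isQueueFull, Func<bool> isRateLimited)
    {
        _isQueueFull = isQueueFull;
        _isRateLimited = isRateLimited;
    }

    // Meant to be called periodically (for example from a timer).
    // A healthy check resets the factor; an unhealthy one raises it by one.
    public void CheckHealth()
    {
        var healthy = !(_isQueueFull() || _isRateLimited());
        DownsampleFactor = healthy
            ? 0
            : Math.Min(DownsampleFactor + 1, MaxDownsampleFactor);
    }

    // Only the traces sample rate is scaled; error capture is not affected.
    public double EffectiveTracesSampleRate(double configuredRate)
    {
        return configuredRate / Math.Pow(2, DownsampleFactor);
    }
}
```

In this sketch a rate limit on errors or metrics makes the system count as unhealthy, so traces get downsampled even though traces themselves were not rate limited, which is the behaviour being questioned above.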

bitsandfoxes · 2024-05-03T14:30:11Z

After talking to the other teams, I think we should mimic the other SDKs.
My reasoning:

- It's a lot easier and simpler.
- It's already out there in the wild and proven to work well (enough).
- The idea is to still monitor and ease up on "unhealthy" systems. Transactions are the most likely culprit.
- The main target/scenario we'd like this to work really well for is spike protection.
- It's not so much about "protecting" Relay from too many requests. If the client is in such a bad state that it has trouble sending events (i.e. the queue is overflowing), we'd ideally no-op as much as possible so as not to add to the trauma.

mattjohnsonpint added Feature New feature or request Platform: .NET labels Jun 2, 2023

bitsandfoxes removed Impact: Large labels Dec 4, 2023

jamescrosswell mentioned this issue Feb 23, 2024

Experimenting with Greedy Sampling #3167

Draft

bitsandfoxes removed the Platform: .NET label Feb 29, 2024

bitsandfoxes changed the title ~~Improved background worker~~ Backpressure Management Mar 4, 2024

bitsandfoxes mentioned this issue Mar 4, 2024

SDK Backpressure and 100% Sampling #2946

Closed

jamescrosswell mentioned this issue Apr 18, 2024

Some errors not captured if out of performance units #3283

Closed

jamescrosswell mentioned this issue Aug 26, 2024

Set TracesSampleRate to 1.0 by default #2036

Closed
