receive: Expose write chunk queue as flag #5566
Conversation
This makes sense to me 👍
Can we just add some hints in the docs (or flag description) on what a good value might be for the queue size?
This is looking good to me! Even though it's experimental, as @fpetkovski said, maybe adding a word or two on what it does and some recommended values to try would be nice.
I haven't documented this flag on purpose, and there are a few reasons for that, which is why it is hidden and should remain so for now IMO. As you mentioned, it's an experimental feature flag here, and it is also experimental upstream. I'm still running tests to tweak the value, and I'd be happy to add proper and more thorough documentation once I discover the sweet spot and have a better understanding of the trade-offs it brings (if any). So far I've had good results making the queue size equal to the number of active series being pushed, but I would not yet like to document that as optimal, and as you can see in the description, the existing implementation is actively being iterated on, so any suggested values might well change. Does that reasoning make sense for excluding additional docs right now?
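Purely as an illustration of the heuristic above (queue size roughly equal to the number of active series being pushed), here is a minimal Go sketch. The helper name, the floor value, and the active-series input are hypothetical and not part of Thanos or this PR.

```go
// Hypothetical helper illustrating the sizing heuristic discussed above:
// start --tsdb.write-queue-size at roughly the number of active series
// being pushed, with an arbitrary floor so very small workloads still get
// a non-trivial queue.
package main

import "fmt"

func suggestedWriteQueueSize(activeSeries int) int {
	const floor = 1000 // illustrative floor, not an upstream constant
	if activeSeries < floor {
		return floor
	}
	return activeSeries
}

func main() {
	// e.g. the two-million-series load test described later in this PR.
	fmt.Println(suggestedWriteQueueSize(2_000_000)) // 2000000
}
```

This is only a starting point; as noted above, the optimal value is still being evaluated.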
I like this change.
For the documentation part, I think it is fine for this PR as the flag is marked as hidden.
Signed-off-by: Philip Gough <philip.p.gough@gmail.com>
This sounds good to me 👍
Changes
This change proposes to expose, via the hidden --tsdb.write-queue-size flag, the chunk write queue for receive that was upstreamed to TSDB via prometheus/prometheus#10051.
For clarification: at the time this feature was contributed, the queue was enabled by default, but it was subsequently made opt-in via prometheus/prometheus#10425 for the reasons discussed/reported in prometheus/prometheus#10377.
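To make the mechanics concrete, below is a minimal sketch (not the actual Thanos wiring) of how a hidden CLI flag like this can be plumbed into the upstream TSDB chunk write queue. The kingpin app name, data directory, and the exact option field are simplifications/assumptions rather than the code path this PR changes.

```go
// Minimal sketch: a hidden, experimental flag forwarded into the TSDB
// head chunk write queue option. Illustrative only.
package main

import (
	"log"

	"github.com/prometheus/prometheus/tsdb"
	kingpin "gopkg.in/alecthomas/kingpin.v2"
)

func main() {
	app := kingpin.New("receive-sketch", "sketch of the hidden write-queue flag")

	// Hidden, experimental flag; 0 keeps the queue disabled, matching the
	// upstream opt-in default after prometheus/prometheus#10425.
	writeQueueSize := app.Flag("tsdb.write-queue-size",
		"[EXPERIMENTAL] Size of the TSDB head chunk write queue (0 disables it).").
		Default("0").Hidden().Int()

	if _, err := app.Parse([]string{"--tsdb.write-queue-size=1000000"}); err != nil {
		log.Fatal(err)
	}

	opts := tsdb.DefaultOptions()
	// Assumption: the upstream option introduced for the write queue is
	// HeadChunksWriteQueueSize on tsdb.Options (Prometheus >= 2.33).
	opts.HeadChunksWriteQueueSize = *writeQueueSize

	db, err := tsdb.Open("./data", nil, nil, opts, tsdb.NewDBStats())
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```

In the real change, the value would instead be carried from receive's CLI options into the TSDB options that receive opens its tenant TSDBs with.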
There were some follow-on improvements in:
Verification
I ran three load tests against Thanos Receive, pushing two million active series over a period of 4hrs, ramping up and reaching peak within 1hr:
- v0.27.0 release with 6 receivers
- v0.27.0 release with 12 receivers
- This change with the chunk write queue enabled

They are displayed on the following graphs from left to right.
Shows a significant reduction in error rate on the last run:
Shows we generally see a correlation between higher error rates and an increase in the head chunks metrics.
This appears to be correlated with the duration (latency spikes) in receive and replication.
The result of enabling the queue appears to be a large reduction in p90 latency and a slight improvement in p99, bringing most requests within the default 5s timeout for replication.
Another thing to note is that there was no measurable regression in our query-path SLOs due to enabling this feature.
Memory remained stable throughout, and we even avoided the spike we saw when running with 6 replicas on the latest release.