receive: Expose write chunk queue as flag #5566
Conversation
This makes sense to me 👍
Can we just add some hints in the docs (or flag description) on what a good value might be for the queue size?
This is looking good to me! Even though it's experimental, as @fpetkovski said, maybe adding a word or two on what it does and some recommended values to try would be nice.
I haven't documented this flag on purpose, and there are a few reasons for that, which is why it is hidden and should remain so for now IMO. As you mentioned, it's an experimental feature flag here, and it is also experimental upstream. I'm still running tests to tweak the value, and I'd be happy to add proper and more thorough documentation once I discover the sweet spot and have a better understanding of the trade-offs it brings (if any). So far I've had good results making the queue size equal to the number of active series being pushed, but I would not yet like to document that as optimal, and as you can see in the description, the existing implementation is actively being iterated on, so any suggested values might well change. Does that reasoning make sense for excluding additional docs right now?
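Purely as an illustration of the heuristic above (queue size roughly equal to the number of active series being pushed), here is a minimal Go sketch. The helper name, the floor value, and the active-series input are hypothetical and not part of Thanos or this PR.

```go
// Hypothetical helper illustrating the sizing heuristic discussed above:
// start --tsdb.write-queue-size at roughly the number of active series
// being pushed, with an arbitrary floor so very small workloads still get
// a non-trivial queue.
package main

import "fmt"

func suggestedWriteQueueSize(activeSeries int) int {
	const floor = 1000 // illustrative floor, not an upstream constant
	if activeSeries < floor {
		return floor
	}
	return activeSeries
}

func main() {
	// e.g. the two-million-series load test described later in this PR.
	fmt.Println(suggestedWriteQueueSize(2_000_000)) // 2000000
}
```

This is only a starting point; as noted above, the optimal value is still being evaluated.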
I like this change.
For the documentation part, I think it is fine for this PR as the flag is marked as hidden.
Signed-off-by: Philip Gough <philip.p.gough@gmail.com>
This sounds good to me 👍
Changes
This change proposes to expose, via the hidden --tsdb.write-queue-size flag, the chunk write queue for receive that was upstreamed to TSDB via prometheus/prometheus#10051.
For clarification: at the time this feature was contributed, the queue was enabled by default, but it was subsequently made opt-in via prometheus/prometheus#10425 for the reasons discussed/reported in prometheus/prometheus#10377.
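To make the mechanics concrete, below is a minimal sketch (not the actual Thanos wiring) of how a hidden CLI flag like this can be plumbed into the upstream TSDB chunk write queue. The kingpin app name, data directory, and the exact option field are simplifications/assumptions rather than the code path this PR changes.

```go
// Minimal sketch: a hidden, experimental flag forwarded into the TSDB
// head chunk write queue option. Illustrative only.
package main

import (
	"log"

	"github.com/prometheus/prometheus/tsdb"
	kingpin "gopkg.in/alecthomas/kingpin.v2"
)

func main() {
	app := kingpin.New("receive-sketch", "sketch of the hidden write-queue flag")

	// Hidden, experimental flag; 0 keeps the queue disabled, matching the
	// upstream opt-in default after prometheus/prometheus#10425.
	writeQueueSize := app.Flag("tsdb.write-queue-size",
		"[EXPERIMENTAL] Size of the TSDB head chunk write queue (0 disables it).").
		Default("0").Hidden().Int()

	if _, err := app.Parse([]string{"--tsdb.write-queue-size=1000000"}); err != nil {
		log.Fatal(err)
	}

	opts := tsdb.DefaultOptions()
	// Assumption: the upstream option introduced for the write queue is
	// HeadChunksWriteQueueSize on tsdb.Options (Prometheus >= 2.33).
	opts.HeadChunksWriteQueueSize = *writeQueueSize

	db, err := tsdb.Open("./data", nil, nil, opts, tsdb.NewDBStats())
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```

In the real change, the value would instead be carried from receive's CLI options into the TSDB options that receive opens its tenant TSDBs with.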
There were some follow-on improvements in:
Verification
I ran three load tests against Thanos Receive, pushing two million active series over a period of 4hrs, ramping up and reaching peak within 1hr:
- v0.27.0 release with 6 receivers
- v0.27.0 release with 12 receivers
- This change with the chunk write queue enabled

They are displayed on the following graphs from left to right.
Shows a significant reduction in error rate on the last run:
Shows we generally see a correlation between higher error rates and an increase in the head chunks metrics.
This appears to be correlated with the duration (latency spikes) in receive and replication.
The result of enabling the queue appears to be a large reduction in p90 latency and a slight improvement in p99, bringing most requests within the default 5s timeout for replication.
Another thing to note is that there was no measurable regression in our query-path SLOs due to enabling this feature.
Memory remained stable throughout, and we even avoided the spike we saw when running with 6 replicas on the latest release.