store: add ability to limit max num of samples / concurrent queries #798
Conversation
Hey, the idea is super nice and we definitely need this, but I will have time to review only from next week. I would aim for the same options as Prometheus has, so a sample limit but also a concurrent queries limit. We should aim for the same experience basically (: Thanks for this!
Force-pushed from 49db847 to a85e70f
Force-pushed from fff5a06 to 074d3be
Sorry for the many commits - I noticed after the fact that the index reader only returns chunk references but no data about them. I moved the check way down to the place where the retrieved chunks are converted into the "proper" format -- it seems like Prometheus does the same, I took a look at it too. I guess it would be nice to extend the e2e test to cover this code path. Plus, a new metric which shows how many queries are being executed at the moment would probably be nice too.
ddeb4ed
to
1bc1f59
Compare
Convert raw chunks into XOR-encoded chunks and call the NumSamples() method on them to calculate the number of samples. Extract the samples calculation into a separate function because it is used in two different places.
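For context, a minimal sketch of what that conversion could look like, assuming the TSDB `chunkenc` package of that era and XOR-encoded chunk payloads; the helper name `countSamples` is hypothetical, not the PR's exact code:

```go
package store

import "github.com/prometheus/tsdb/chunkenc"

// countSamples is a hypothetical helper: it decodes raw XOR-encoded chunk
// payloads and sums the number of samples they contain.
func countSamples(rawChunks [][]byte) (int, error) {
	total := 0
	for _, b := range rawChunks {
		// FromData interprets the bytes as an XOR-encoded chunk without re-encoding them.
		c, err := chunkenc.FromData(chunkenc.EncXOR, b)
		if err != nil {
			return 0, err
		}
		total += c.NumSamples()
	}
	return total, nil
}
```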
Force-pushed from cd6cb48 to e87f763
It should actually be 30 - I miscalculated.
I believe this is ready for a review. My only concern is that, AFAICT, there can be huge compacted chunks made by compaction.
Well, each chunk has at most ~120 samples, so I think this is fine if we just estimate it.
Compaction compacts blocks by compacting the index file. Chunks stay the same - they just end up in different "chunk files", but the chunks (or segments) themselves keep the same length.
Thank you for clarifying the part about TSDB/compaction - I wasn't so sure since I don't have a good mental model of how it all works yet. Then it is all good, as we will not have to download huge chunks. As for the first message - are you saying that we should hardcode the number 120 inside the code? It doesn't seem like a good idea to me. Let me explain why:
Let me know what you think of my comments. If you really think this is the way forward and it's the only way it could get merged, then I can change it.
Great overall, some suggestions though. Three major ones regarding:
- unnecessary decoding and even downloading;
- the gate queueing time metric;
- the sample limit code.
> As for the first message - are you saying that we should hardcode the number 120 inside the code? It doesn't seem like a good idea to me. [...] wouldn't want our servers to sit unused just because the software that we use overestimates the load :(
Agree, but look at my comment - we can estimate with great confidence by checking the length of the bytes.
> - Also, what happens if at some point in the future the default length of a TSDB block changes from around ~2 hours to some other duration? The estimation would be even more off;
Block length can be 200 years, but chunks will still be 120 samples maximum unless we change that in TSDB. If we do, we can think about a better solution (: But it won't happen anytime soon. See why here: prometheus-junkyard/tsdb#397
> - Also, I don't see a big problem with downloading the chunks until the limit has been reached. It seems to me that with the other option `--chunk-pool-size` we could achieve very predictable RAM usage, since Go's GC should quickly determine that the RAM has been deallocated, because we aren't allocating ad-hoc objects in this process. The limit is RAM, not network bandwidth. Furthermore, I think the most important thing here is that the query gets stopped at the lowest stage possible so that Thanos Query doesn't receive this huge block of data.
Sure, but we can avoid downloading them all - see my comment.
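To make that trade-off concrete, here is a rough sketch of the estimation step discussed above, done against chunk references before anything is downloaded; `maxSamplesPerChunk` matches the constant discussed in this PR, while `checkSampleLimit` is an illustrative name:

```go
package store

import "fmt"

// maxSamplesPerChunk is the worst case: a TSDB chunk holds at most ~120 samples.
const maxSamplesPerChunk = 120

// checkSampleLimit rejects a query early, before downloading any chunk data,
// if even the worst-case estimate of numChunks * 120 samples exceeds the limit.
func checkSampleLimit(numChunks int, limit uint64) error {
	if limit == 0 {
		return nil // 0 means no limit.
	}
	if estimated := uint64(numChunks) * maxSamplesPerChunk; estimated > limit {
		return fmt.Errorf("estimated %d samples exceed the limit of %d", estimated, limit)
	}
	return nil
}
```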
Generally OK from me, and definitely a useful feature to have, great! :)
Just a few nits and suggestions.
Ah, I see now where that ...
Force-pushed from cd6cb48 to 9d0b8a7
LGTM
docs/components/store.md (outdated):
> efficiency we take 120 as number of samples in chunk, so the actual number of samples might be lower, even though maximum could be hit. Cannot be bigger than 120.
I'm confused. You say it cannot be bigger than 120, but the default is 50000000?
It means that the upper bound of samples per chunk is 120. Maybe we should move this part into parentheses in the preceding sentence: "NOTE: for efficiency we take 120 as the number of samples in a chunk (it cannot be bigger than that), so the actual number of samples might be lower, even though the maximum could be hit."
Would that be clearer?
pkg/store/bucket.go (outdated):
		Help: "Number of queries that were dropped due to the sample limit.",
	})
	m.queriesLimit = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "thanos_bucket_store_queries_limit",
The Prometheus team uses `prometheus_engine_queries_concurrent_max`. Naming is hard, but I figure we could be consistent with them?
Good point. I should rename this to `thanos_bucket_store_queries_concurrent_max`.
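For illustration, a small sketch of what the renamed gauge could look like with client_golang; the helper name is hypothetical and the real wiring lives in `bucket.go`:

```go
package store

import "github.com/prometheus/client_golang/prometheus"

// newQueriesLimitGauge is an illustrative helper showing the renamed metric.
func newQueriesLimitGauge(maxConcurrent int) prometheus.Gauge {
	g := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "thanos_bucket_store_queries_concurrent_max",
		Help: "Maximum number of concurrent Series() calls allowed.",
	})
	g.Set(float64(maxConcurrent))
	return g
}
```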
Great stuff, really like adding this ... I think in a future PR we could probably apply it to some of the wider components (query could probably do with limiting), so making the flags more generic would help with this in the future.
Comments are mainly nits around conforming with some of the styles and conventions that are already in the Prometheus codebase.
For the default limit we could also set it to 0, meaning no limit, and then do a phased rollout ... set it to 0 for the initial release so we know we will not break anyone, and make sure the release notes say that the following release will set a limit ... this way people can do some planning and testing before it is set hard.
cmd/thanos/store.go (outdated):
@@ -36,6 +36,12 @@ func registerStore(m map[string]setupFunc, app *kingpin.Application, name string
	chunkPoolSize := cmd.Flag("chunk-pool-size", "Maximum size of concurrently allocatable bytes for chunks.").
		Default("2GB").Bytes()

	maxSampleCount := cmd.Flag("grpc-sample-limit",
Maybe align the flags with how Prometheus names its flags? `grpc-sample-limit` -> `{query|store|storage}.grpc.read-sample-limit`?
See `storage.remote.read-sample-limit`: https://github.com/prometheus/prometheus/blob/master/cmd/prometheus/main.go#L206
cmd/thanos/store.go (outdated):
		"Maximum amount of samples returned via a single Series call. 0 means no limit. NOTE: for efficiency we take 120 as number of samples in chunk, so the actual number of samples might be lower, even though maximum could be hit. Cannot be bigger than 120.").
		Default("50000000").Uint()

	maxConcurrent := cmd.Flag("grpc-concurrent-limit", "Maximum number of concurrent Series calls. 0 means no limit.").Default("20").Int()
As above: `grpc-concurrent-limit` -> `query.grpc.max-concurrency`?
See `query.max-concurrency`: https://github.com/prometheus/prometheus/blob/master/cmd/prometheus/main.go#L233
Will rename to `store.grpc.max-concurrency` as we are talking about Thanos Store here :P
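For reference, a self-contained sketch of how the renamed flags might be declared with kingpin; it uses the final flag names from the PR description, while the defaults and help text here are assumptions rather than the PR's exact code:

```go
package main

import (
	"fmt"
	"os"

	"gopkg.in/alecthomas/kingpin.v2"
)

func main() {
	app := kingpin.New("store-flags-demo", "Illustration of the renamed Thanos Store flags.")

	// Final flag names as described in the PR; defaults follow the discussion
	// (0 = no sample limit for a phased rollout, 20 concurrent Series calls).
	maxSampleCount := app.Flag("store.grpc.series-sample-limit",
		"Maximum amount of samples returned via a single Series call. 0 means no limit.").
		Default("0").Uint()
	maxConcurrent := app.Flag("store.grpc.series-max-concurrency",
		"Maximum number of concurrent Series calls.").
		Default("20").Int()

	kingpin.MustParse(app.Parse(os.Args[1:]))
	fmt.Println("sample limit:", *maxSampleCount, "max concurrency:", *maxConcurrent)
}
```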
@@ -42,6 +42,14 @@ import (
	"google.golang.org/grpc/status"
)

// Approximately this is the max number of samples that we may have in any given chunk. This is needed
// for precalculating the number of samples that we may have to retrieve and decode for any given query
// without downloading them. Please take a look at https://github.com/prometheus/tsdb/pull/397 to know
Think this is fine 👍
pkg/store/bucket.go (outdated):
	// Query gate which limits the maximum amount of concurrent queries.
	queryGate *Gate

	// Samples limiter which limits the number of samples per each Series() call.
`Samples limiter which` -> `samplesLimiter`
		g: gate.New(maxConcurrent),
	}
	g.inflightQueries = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "queries_in_flight",
`queries_in_flight` -> `thanos_store_queries_in_flight_total`?
pkg/store/gate.go (outdated):
		Subsystem: reg.Subsystem(),
	})
	g.gateTiming = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name: "gate_seconds",
`gate_seconds` -> `thanos_store_gate_duration_seconds`?
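Putting the naming suggestions together, here is a hedged sketch of the gate wrapper described in this PR. It assumes Prometheus' `pkg/gate` package (`gate.New`, `Start`, `Done`) and uses the final metric names from the PR description; the struct layout and method names are illustrative, not the merged code:

```go
package store

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/prometheus/pkg/gate"
)

// Gate wraps Prometheus' concurrency gate and records how many queries are
// in flight and how long they wait for their turn.
type Gate struct {
	g               *gate.Gate
	inflightQueries prometheus.Gauge
	gateTiming      prometheus.Histogram
}

// NewGate builds the gate and registers its metrics, if a registerer is given.
func NewGate(maxConcurrent int, reg prometheus.Registerer) *Gate {
	g := &Gate{
		g: gate.New(maxConcurrent),
		inflightQueries: prometheus.NewGauge(prometheus.GaugeOpts{
			Name: "thanos_bucket_store_queries_in_flight_total",
			Help: "Number of queries that are currently in flight.",
		}),
		gateTiming: prometheus.NewHistogram(prometheus.HistogramOpts{
			Name: "thanos_bucket_store_gate_duration_seconds",
			Help: "How many seconds it took for a query to pass through the gate.",
		}),
	}
	if reg != nil {
		reg.MustRegister(g.inflightQueries, g.gateTiming)
	}
	return g
}

// IsMyTurn blocks until there is room for another concurrent query (or the
// context is cancelled) and records the time spent waiting.
func (g *Gate) IsMyTurn(ctx context.Context) error {
	start := time.Now()
	defer func() { g.gateTiming.Observe(time.Since(start).Seconds()) }()
	if err := g.g.Start(ctx); err != nil {
		return err
	}
	g.inflightQueries.Inc()
	return nil
}

// Done releases the slot taken by IsMyTurn.
func (g *Gate) Done() {
	g.inflightQueries.Dec()
	g.g.Done()
}
```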
CHANGELOG.md (outdated):
New tracing span:
* `store_query_gate_ismyturn` shows how long it took for a query to pass (or not) through the gate.

:warning: **WARNING** :warning: #798 adds new default limits for max samples per one Series() gRPC method call and the maximum number of concurrent Series() gRPC method calls. Consider increasing them if you have a very huge deployment.
Let's do a phased rollout of this ... can we have 0 mean off? That way we can phase the rollout by warning people that we will be setting a sensible default; they should make sure their system can deal with it or set it explicitly for their needs.
Sure, perhaps most users won't want this.
Well, I already approved a week ago, so feel free to merge. Found one nit though.
The original error already informs us about what is going wrong.
It's still useful to know that we are talking specifically about samples here.
Setting it to 0 by default doesn't make sense, since the Go channel becomes unbuffered and all queries would time out. I set it to 20 by default since that's the limit on Thanos Query, so naturally there won't be more than 20 concurrent calls by default.
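A tiny runnable demo of that point, using a minimal channel-based gate in the spirit of Prometheus' gate package (the type and function names here are made up for illustration):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// demoGate illustrates the comment above: with maxConcurrent == 0 the channel
// is unbuffered, so the send in Start blocks until someone calls Done, which
// for the first query never happens -- every query simply times out.
type demoGate struct{ ch chan struct{} }

func newDemoGate(maxConcurrent int) *demoGate {
	return &demoGate{ch: make(chan struct{}, maxConcurrent)}
}

func (g *demoGate) Start(ctx context.Context) error {
	select {
	case g.ch <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func (g *demoGate) Done() { <-g.ch }

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	g := newDemoGate(0)       // 0 would mean "no slots at all", not "no limit".
	fmt.Println(g.Start(ctx)) // prints "context deadline exceeded"
}
```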
I reverted the changes to the default value of ...
Thank you all for the reviews and comments, merging (:
Add the ability to limit the number of samples that can be retrieved in a single query. We do this by checking how many chunks would have to be downloaded to satisfy a query and multiplying that by 120 -- the new `maxSamplesPerChunk` constant. Why 120 was chosen is explained above the constant and in the comments down below.

Add the ability to limit the number of concurrent queries - this is done by using a Prometheus gate. Another type has been added which wraps that gate and records metrics so that users know how many queries are in the queue, what the limit is, and how long it takes for queries to get through the gate.
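A hedged sketch of the sample limiter described in the first paragraph above; the names mirror the PR description, but the exact implementation and wiring are assumptions:

```go
package store

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// Limiter checks a sample count against a configured maximum and bumps a
// counter whenever a Series() call is rejected because of it.
type Limiter struct {
	limit          uint64
	droppedQueries prometheus.Counter
}

// NewLimiter ties the limit to a dropped-queries counter, e.g. one backing
// thanos_bucket_store_queries_dropped_total.
func NewLimiter(limit uint64, droppedQueries prometheus.Counter) *Limiter {
	return &Limiter{limit: limit, droppedQueries: droppedQueries}
}

// Check returns an error if num (here: chunks to download * 120) exceeds the
// limit. A limit of 0 disables the check.
func (l *Limiter) Check(num uint64) error {
	if l.limit == 0 {
		return nil
	}
	if num > l.limit {
		l.droppedQueries.Inc()
		return fmt.Errorf("exceeded sample limit of %d (estimated %d samples)", l.limit, num)
	}
	return nil
}
```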
This is a huge pain point for us as it is impossible to do capacity planning otherwise.
Changes
Two new options:

* `store.grpc.series-sample-limit` lets you control how many samples can be returned at maximum via a single Series() call. It can potentially overestimate the number of samples, but that is noted in the flag's help text.
* `store.grpc.series-max-concurrency` lets you control the maximum number of concurrent Series() calls - they get blocked until there is space in the channel. Users can look at the new metrics to determine if a lot of queries are waiting to be executed.

New metrics:
* `thanos_bucket_store_queries_dropped_total` shows how many queries were dropped due to the samples limit;
* `thanos_bucket_store_queries_limit` is a constant metric which shows how many concurrent queries can come into Thanos Store;
* `thanos_bucket_store_queries_in_flight_total` shows how many queries are currently "in flight", i.e. being executed;
* `thanos_bucket_store_gate_duration_seconds` shows how many seconds it took for queries to pass through the gate, in both cases - when that fails and when it does not.

Verification
The e2e tests, which exercise the new limits, pass; manual testing was done locally.