-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Instrumented servers #6074
Instrumented servers #6074
Conversation
4822670
to
0dbde4f
Compare
pkg/store/telemetry.go
Outdated
seriesRequested: promauto.With(reg).NewHistogram(prometheus.HistogramOpts{ | ||
Name: "thanos_store_server_series_requested", | ||
Help: "Number of requested series for Series calls", | ||
Buckets: []float64{1, 10, 100, 1000, 10000, 100000}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can bump this even more? I have some servers sending millions of series in certain calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I vote for allowing this the bucket to be configured via CLI args.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With native histograms we won't have to configure buckets anymore. Let's add a couple more buckets for chunks temporarily until we have complete native histograms support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😄 perhaps it would be worth including a paragraph or two in the docs about all the different kinds of limits that we have? It's becoming a lot 😄
Great work, @fpetkovski! This will be so useful. I was wondering though about the overall limiting situation that we will end having. Particularly thinking about Thanos Store, which already implements limits through the CLI args Maybe we can deprecate them (and clean up the code later on) and point people to the new options introduced by this PR? I think it will be better to avoid confusion. |
Hm good point, I didn't see those in the bucket store. Should we remove those and rename the newly added ones so that they can be applied to any store? I would like to avoid deprecating arguments for the sake of renaming. |
@fpetkovski personally I like more the new names, as they are component-agnostic and thus great candidates to standardize on. I don't think |
1639b53
to
157889e
Compare
This commit adds an instrumented store server that exposes metrics about Series requests and uses it as a decorator around all Store APIs. The instrumented store currenly exposes only two metrics, series requested and chunks requested. Additional metrics can be added as needed. Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
This commit implements a RateLimited store server which can be used to apply various limits to Series calls in components that implement the Store API. Rate limits are disabled by default but can be enabled selectively for each individual Thanos component. Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
157889e
to
0f5ddff
Compare
So I didn't realize we had series and chunk limiters already, so now the PR reuses what's already available for the bucket store. I've also consolidated the names of all flags so that they're the same across stores. The only exception is the bucket store which does not have a rate limited server wrapped around it. It seems to apply limits deeper in the stack before requesting data from object store. For other stores limits are still applied through a decorator. PTAL again :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some very small remarks, but overall I'm pretty happy with the work here. Great initiative, @fpetkovski!
CHANGELOG.md
Outdated
### Changed | ||
|
||
- [#6035](https://github.com/thanos-io/thanos/pull/6035) Replicate: Support all types of matchers to match blocks for replication. Change matcher parameter from string slice to a single string. | ||
>>>>>>> 0dbde4f1 (Add CHANGELOG entry) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small conflict leftover here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, removed.
pkg/store/limiter.go
Outdated
type RateLimits struct { | ||
SeriesPerRequest uint64 | ||
SamplesPerRequest uint64 | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice to explain in a comment here what's the purpose of this type and/or how the limit works.
For example, I wouldn't even call this type of limiting "rate limiting" because it's not happening over a period of time, like 30 series per second. this limiting is, as the name of the struct fields show, per request limits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a comment on this type to explain how limits are applied. I also renamed the server to limitedServer
and dropped the rate
from it.
pkg/store/limiter.go
Outdated
} | ||
|
||
// rateLimitedStoreServer is a storepb.StoreServer that can apply rate limits against Series requests. | ||
type rateLimitedStoreServer struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add the classic assignment to interface type to ensure this type keeps implementing the storepb.StoreServer
interface at compile time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure thing, done.
pkg/store/limiter.go
Outdated
} | ||
|
||
// rateLimitedServer is a storepb.Store_SeriesServer that tracks statistics about sent series. | ||
type rateLimitedServer struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add the classic assignment to interface type to ensure this type keeps implementing the storepb.Store_SeriesServer
interface at compile time?
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
6dcec87
to
011dbb3
Compare
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks a lot! 🙇
failedRequestsCounter: promauto.With(reg).NewCounterVec(prometheus.CounterOpts{ | ||
Name: "thanos_store_selects_dropped_total", | ||
Help: "Number of select queries that were dropped due to configured limits.", | ||
}, []string{"reason"}), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI we already have this inside inside the limiters created on the lines above. Again, I like more the one introduced here for its standard potential and I think we can deprecate the old one later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, thank you @fpetkovski 🙇 . I have small nits, especially regarding the flags, but I'm happy to get this in current form.
docs/components/query.md
Outdated
--store.grpc.samples-limit=0 | ||
The maximum samples allowed for a single | ||
Series request, The Series call fails if | ||
this limit is exceeded. 0 means no limit. | ||
NOTE: For efficiency the limit is internally | ||
implemented as 'chunks limit' considering each | ||
chunk contains a maximum of 120 samples. | ||
--store.grpc.series-limit=0 | ||
The maximum series allowed for a single Series | ||
request. The Series call fails if this limit is | ||
exceeded. 0 means no limit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct me if I'm wrong, but it seems like we have two distinct store
flag categories:
- One is to configure the stores to which query connects
- The second one added in this PR refers to the store server of the query
I wonder if we could distinguish them more clearly. Maybe store-limit.*
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, should we use something like store.limits.request-series
and store.limits.request-samples
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that would still be mixing both in one bag, since both flag groups are still under flag
. But on the second thought, I think we shot ourself in the foot already since even in existing code we have dual use in query vs store...
And yet now that I'm thinking, we deprecated store
in favor of endpoint
, so eventually any store-connection related flags will be actually under endpoint
.
I guess we can reclaim store
as config for the actual store server, then store.limits.*
is fine 👍.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah in this case store
would be the store server, whereas endpoint
would be something we configure on the client. I updated the names of the flags.
} | ||
|
||
if err := i.seriesLimiter.Reserve(1); err != nil { | ||
return errors.Wrapf(err, "failed to send series") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder once this error bubbles up, if we should use a specific gRPC code. I think currently we use Unknown
for all errors, but perhaps in this case we could use ResourceExhausted
or something similar. Could be useful when debugging issues to see this directly in response or server metrics / span errors.
But can be done additionally, not a blocker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a good idea, but I'm concerned it might grow the scope of the PR even further. Right now we have no way to distinguish error types on the client side, and we mostly rely on surfacing the actual message of the user. In this case they would see something like failed to send series: limit X violated (got Y)
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
8fb0ea1
to
b3d93b7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's fix the conflicts and get this in 🚀
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
Should be ready to merge @matej-g |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @fpetkovski 🙇 , let's get this in!
* Add instrumentation to Store servers This commit adds an instrumented store server that exposes metrics about Series requests and uses it as a decorator around all Store APIs. The instrumented store currenly exposes only two metrics, series requested and chunks requested. Additional metrics can be added as needed. Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> * Add rate limits to Store servers This commit implements a RateLimited store server which can be used to apply various limits to Series calls in components that implement the Store API. Rate limits are disabled by default but can be enabled selectively for each individual Thanos component. Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> * Add CHANGELOG entry Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> * Reuse existing limiters Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> * Fix chunks limit binding Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> * Unify flag names Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> * Run make docs Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> * Add another series bucket Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> * Code review comments Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> * Fix changelog Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> * Rename flags Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> --------- Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
* Add instrumentation to Store servers This commit adds an instrumented store server that exposes metrics about Series requests and uses it as a decorator around all Store APIs. The instrumented store currenly exposes only two metrics, series requested and chunks requested. Additional metrics can be added as needed. Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> * Add rate limits to Store servers This commit implements a RateLimited store server which can be used to apply various limits to Series calls in components that implement the Store API. Rate limits are disabled by default but can be enabled selectively for each individual Thanos component. Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> * Add CHANGELOG entry Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> * Reuse existing limiters Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> * Fix chunks limit binding Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> * Unify flag names Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> * Run make docs Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> * Add another series bucket Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> * Code review comments Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> * Fix changelog Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> * Rename flags Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> --------- Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
This PR adds an InstrumentedStoreServer that exposes metrics
about Series requests and uses it as a decorator around all Store APIs.
The instrumented store currently exposes two histogram metrics, series requested
and chunks requested. Additional metrics can be added as needed.
This PR also implements a RateLimitedStoreServer which can be used
to apply various limits to Series calls in components that implement
the Store API.
Rate limits are disabled by default but can be enabled selectively
for each individual Thanos component.
Changes
Verification
Manual verification and tests.