Instrumented servers #6074

fpetkovski · 2023-01-26T10:43:22Z

This PR adds an InstrumentedStoreServer that exposes metrics
about Series requests and uses it as a decorator around all Store APIs.
The instrumented store currently exposes two histogram metrics, series requested
and chunks requested. Additional metrics can be added as needed.
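
A minimal sketch of the decorator idea described above, not the code in this PR. The `Store`, `SeriesSender`, `Series` and `Observer` types are hypothetical stand-ins for the Store API and for a histogram, so that the example runs with the standard library only. Observing totals only after a successful request is also an assumption.

```go
package main

import "fmt"

// Series is a hypothetical stand-in for one series in a Series response.
type Series struct {
	Chunks int // number of chunks attached to this series
}

// SeriesSender is a hypothetical stand-in for the response stream of a Series call.
type SeriesSender interface {
	Send(s Series) error
}

// Observer is a hypothetical stand-in for the Observe method of a histogram.
type Observer interface {
	Observe(v float64)
}

// Store is a hypothetical stand-in for any component serving Series calls.
type Store interface {
	Series(out SeriesSender) error
}

// countingSender wraps a SeriesSender and tallies what was sent through it.
type countingSender struct {
	SeriesSender
	series, chunks int
}

func (c *countingSender) Send(s Series) error {
	if err := c.SeriesSender.Send(s); err != nil {
		return err
	}
	c.series++
	c.chunks += s.Chunks
	return nil
}

// instrumentedStore decorates a Store and records per-request totals.
type instrumentedStore struct {
	Store
	seriesRequested, chunksRequested Observer
}

func (s *instrumentedStore) Series(out SeriesSender) error {
	counter := &countingSender{SeriesSender: out}
	if err := s.Store.Series(counter); err != nil {
		return err
	}
	// Assumption: totals are observed once, after a successful request.
	s.seriesRequested.Observe(float64(counter.series))
	s.chunksRequested.Observe(float64(counter.chunks))
	return nil
}

// recorder, discard and fakeStore exist only to exercise the sketch.
type recorder struct{ values []float64 }

func (r *recorder) Observe(v float64) { r.values = append(r.values, v) }

type discard struct{}

func (discard) Send(Series) error { return nil }

type fakeStore struct{ out []Series }

func (f fakeStore) Series(out SeriesSender) error {
	for _, s := range f.out {
		if err := out.Send(s); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	series, chunks := &recorder{}, &recorder{}
	store := &instrumentedStore{
		Store:           fakeStore{out: []Series{{Chunks: 3}, {Chunks: 5}}},
		seriesRequested: series,
		chunksRequested: chunks,
	}
	if err := store.Series(discard{}); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("series observed:", series.values, "chunks observed:", chunks.values)
}
```

Because the wrapper only depends on the interface, the same decorator can sit in front of any component that serves Series calls.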

This PR also implements a RateLimitedStoreServer which can be used
to apply various limits to Series calls in components that implement
the Store API.

Rate limits are disabled by default but can be enabled selectively
for each individual Thanos component.

I added CHANGELOG entry for this change.
Change is not relevant to the end user.

Changes

Expose histogram metrics from stores about series and chunks per Series request.
Allow configuring limits for series and chunks per Series request.

Verification

Manual verification and tests.

docs/components/receive.md

pkg/store/ratelimit.go

GiedriusS · 2023-01-27T14:59:19Z

pkg/store/telemetry.go

+		seriesRequested: promauto.With(reg).NewHistogram(prometheus.HistogramOpts{
+			Name:    "thanos_store_server_series_requested",
+			Help:    "Number of requested series for Series calls",
+			Buckets: []float64{1, 10, 100, 1000, 10000, 100000},


Maybe we can bump this even more? I have some servers sending millions of series in certain calls.

I vote for allowing the buckets to be configured via CLI args.

With native histograms we won't have to configure buckets anymore. Let's add a couple more buckets for chunks temporarily until we have complete native histograms support.
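
To illustrate the bucket concern in this thread: a classic histogram counts an observation in every bucket whose upper bound is greater than or equal to the value, so anything above the largest bound is only visible in the implicit +Inf bucket. The sketch below uses the bounds quoted in the diff; the two extra bounds in `extendedBounds` are hypothetical and are not the values that were eventually added.

```go
package main

import "fmt"

// smallestBucket returns the upper bound of the first bucket that contains v.
// The boolean is false when v exceeds every bound and is therefore only
// counted in the implicit +Inf bucket.
func smallestBucket(bounds []float64, v float64) (float64, bool) {
	for _, b := range bounds {
		if v <= b {
			return b, true
		}
	}
	return 0, false
}

// prBounds are the bucket bounds quoted in the diff above.
var prBounds = []float64{1, 10, 100, 1000, 10000, 100000}

// extendedBounds adds two hypothetical larger bounds, for illustration only.
var extendedBounds = append(append([]float64{}, prBounds...), 1e6, 1e7)

func main() {
	const observed = 2_500_000 // a request returning millions of series
	for _, bounds := range [][]float64{prBounds, extendedBounds} {
		if b, ok := smallestBucket(bounds, observed); ok {
			fmt.Printf("%d bounds: lands in le=%.0f\n", len(bounds), b)
		} else {
			fmt.Printf("%d bounds: lands only in le=+Inf\n", len(bounds))
		}
	}
}
```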

GiedriusS

😄 perhaps it would be worth including a paragraph or two in the docs about all the different kinds of limits that we have? It's becoming a lot 😄

douglascamata · 2023-02-06T17:41:44Z

Great work, @fpetkovski! This will be so useful.

I was wondering though about the overall limiting situation that we will end up having.

Particularly thinking about Thanos Store, which already implements limits through the CLI args --store.grpc.touched-series-limit and --store.grpc.series-sample-limit.

Maybe we can deprecate them (and clean up the code later on) and point people to the new options introduced by this PR? I think it will be better to avoid confusion.

fpetkovski · 2023-02-07T09:43:02Z

Hm good point, I didn't see those in the bucket store. Should we remove those and rename the newly added ones so that they can be applied to any store? I would like to avoid deprecating arguments for the sake of renaming.

douglascamata · 2023-02-07T16:09:05Z

@fpetkovski personally I like more the new names, as they are component-agnostic and thus great candidates to standardize on. I don't think touched-series-limit for series limit and series-sample-limit for samples are good names to standardize on... they are confusingly different.

This commit adds an instrumented store server that exposes metrics about Series requests and uses it as a decorator around all Store APIs. The instrumented store currently exposes only two metrics, series requested and chunks requested. Additional metrics can be added as needed. Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

This commit implements a RateLimited store server which can be used to apply various limits to Series calls in components that implement the Store API. Rate limits are disabled by default but can be enabled selectively for each individual Thanos component. Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

fpetkovski · 2023-02-08T09:17:01Z

So I didn't realize we had series and chunk limiters already, so now the PR reuses what's already available for the bucket store. I've also consolidated the names of all flags so that they're the same across stores.

The only exception is the bucket store which does not have a rate limited server wrapped around it. It seems to apply limits deeper in the stack before requesting data from object store. For other stores limits are still applied through a decorator.

PTAL again :)

douglascamata

I have some very small remarks, but overall I'm pretty happy with the work here. Great initiative, @fpetkovski!

douglascamata · 2023-02-08T10:36:53Z

CHANGELOG.md

+### Changed
+
+- [#6035](https://github.com/thanos-io/thanos/pull/6035) Replicate: Support all types of matchers to match blocks for replication. Change matcher parameter from string slice to a single string.
+>>>>>>> 0dbde4f1 (Add CHANGELOG entry)


Small conflict leftover here.

Good catch, removed.

douglascamata · 2023-02-08T10:47:54Z

pkg/store/limiter.go

+type RateLimits struct {
+	SeriesPerRequest  uint64
+	SamplesPerRequest uint64
+}


Would be nice to explain in a comment here what's the purpose of this type and/or how the limit works.

For example, I wouldn't even call this type of limiting "rate limiting" because it's not happening over a period of time, like 30 series per second. This limiting is, as the names of the struct fields show, a per-request limit.

I added a comment on this type to explain how limits are applied. I also renamed the server to limitedServer and dropped the rate from it.
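
A small sketch of the distinction raised here, assuming a limiter with a `Reserve` method similar in spirit to the one used elsewhere in this PR. The `limiter` type and `serveRequest` helper are hypothetical. The point is that the budget is scoped to one request and has no time component, so a later request starts from zero again.

```go
package main

import "fmt"

// limiter enforces a cap on a running total. It has no notion of time.
type limiter struct {
	limit    uint64 // 0 means no limit
	reserved uint64
}

// Reserve adds n to the running total and fails once the total exceeds the limit.
func (l *limiter) Reserve(n uint64) error {
	if l.limit == 0 {
		return nil
	}
	l.reserved += n
	if l.reserved > l.limit {
		return fmt.Errorf("limit %d violated (got %d)", l.limit, l.reserved)
	}
	return nil
}

// serveRequest simulates one Series call that sends n series. A fresh limiter
// is created per call, so nothing carries over between requests.
func serveRequest(seriesLimit uint64, n int) error {
	l := &limiter{limit: seriesLimit}
	for i := 0; i < n; i++ {
		if err := l.Reserve(1); err != nil {
			return fmt.Errorf("failed to send series: %w", err)
		}
	}
	return nil
}

func main() {
	fmt.Println(serveRequest(2, 2))    // within the limit
	fmt.Println(serveRequest(2, 3))    // third series exceeds the limit
	fmt.Println(serveRequest(2, 2))    // a new request starts with a fresh budget
	fmt.Println(serveRequest(0, 1000)) // 0 disables the limit
}
```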

douglascamata · 2023-02-08T10:50:03Z

pkg/store/limiter.go

+}
+
+// rateLimitedStoreServer is a storepb.StoreServer that can apply rate limits against Series requests.
+type rateLimitedStoreServer struct {


Should we add the classic assignment to interface type to ensure this type keeps implementing the storepb.StoreServer interface at compile time?

Sure thing, done.
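
For readers unfamiliar with the idiom mentioned above, here is a self-contained sketch. `StoreServer` is a hypothetical, trimmed-down stand-in for storepb.StoreServer, and `limitedStoreServer` is illustrative only. The `var _ Interface = (*T)(nil)` line costs nothing at runtime and makes the build fail if the type ever stops implementing the interface.

```go
package main

import "fmt"

// StoreServer is a hypothetical, trimmed-down stand-in for storepb.StoreServer.
type StoreServer interface {
	Series(matcher string) ([]string, error)
}

// limitedStoreServer wraps another StoreServer and caps the series per request.
type limitedStoreServer struct {
	next      StoreServer
	maxSeries int // 0 means no limit
}

// Compile-time check: the build breaks if *limitedStoreServer ever stops
// satisfying StoreServer, for example after a method signature changes.
var _ StoreServer = (*limitedStoreServer)(nil)

func (s *limitedStoreServer) Series(matcher string) ([]string, error) {
	out, err := s.next.Series(matcher)
	if err != nil {
		return nil, err
	}
	if s.maxSeries > 0 && len(out) > s.maxSeries {
		return nil, fmt.Errorf("limit %d violated (got %d)", s.maxSeries, len(out))
	}
	return out, nil
}

// staticStore is a fake backend that always returns the same series.
type staticStore []string

func (s staticStore) Series(string) ([]string, error) { return s, nil }

func main() {
	var srv StoreServer = &limitedStoreServer{next: staticStore{"a", "b"}, maxSeries: 1}
	_, err := srv.Series("up")
	fmt.Println("error:", err)
}
```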

douglascamata · 2023-02-08T10:53:07Z

pkg/store/limiter.go

+}
+
+// rateLimitedServer is a storepb.Store_SeriesServer that tracks statistics about sent series.
+type rateLimitedServer struct {


Should we add the classic assignment to interface type to ensure this type keeps implementing the storepb.Store_SeriesServer interface at compile time?

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

douglascamata

LGTM. Thanks a lot! 🙇

douglascamata · 2023-02-08T14:46:56Z

pkg/store/limiter.go

+		failedRequestsCounter: promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
+			Name: "thanos_store_selects_dropped_total",
+			Help: "Number of select queries that were dropped due to configured limits.",
+		}, []string{"reason"}),


FYI we already have this inside the limiters created on the lines above. Again, I like more the one introduced here for its standard potential and I think we can deprecate the old one later.

matej-g

Looks great, thank you @fpetkovski 🙇 . I have small nits, especially regarding the flags, but I'm happy to get this in current form.

matej-g · 2023-02-09T11:04:41Z

docs/components/query.md

+      --store.grpc.samples-limit=0
+                                 The maximum samples allowed for a single
+                                 Series request, The Series call fails if
+                                 this limit is exceeded. 0 means no limit.
+                                 NOTE: For efficiency the limit is internally
+                                 implemented as 'chunks limit' considering each
+                                 chunk contains a maximum of 120 samples.
+      --store.grpc.series-limit=0
+                                 The maximum series allowed for a single Series
+                                 request. The Series call fails if this limit is
+                                 exceeded. 0 means no limit.


Correct me if I'm wrong, but it seems like we have two distinct store flag categories:

One is to configure the stores to which query connects

The second one added in this PR refers to the store server of the query

I wonder if we could distinguish them more clearly. Maybe store-limit.*?

Good point, should we use something like store.limits.request-series and store.limits.request-samples?

I think that would still be mixing both in one bag, since both flag groups are still under the store flag. But on second thought, I think we shot ourselves in the foot already, since even in existing code we have dual use in query vs store...

And yet now that I'm thinking, we deprecated store in favor of endpoint, so eventually any store-connection related flags will be actually under endpoint.

I guess we can reclaim store as config for the actual store server, then store.limits.* is fine 👍.

Yeah in this case store would be the store server, whereas endpoint would be something we configure on the client. I updated the names of the flags.
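
As a side note on the NOTE in the samples limit help text quoted at the top of this thread, the arithmetic can be sketched as follows. The figure of 120 samples per chunk comes from the help text; the use of integer division and the `chunksLimit` helper are assumptions made for illustration.

```go
package main

import "fmt"

// maxSamplesPerChunk is the per-chunk maximum stated in the flag help text.
const maxSamplesPerChunk = 120

// chunksLimit converts a samples limit into a chunks limit.
// Assumption: integer division, so the result rounds down.
func chunksLimit(samplesLimit uint64) uint64 {
	return samplesLimit / maxSamplesPerChunk
}

func main() {
	for _, samples := range []uint64{0, 100, 120, 1_000_000} {
		fmt.Printf("samples limit %d -> chunks limit %d\n", samples, chunksLimit(samples))
	}
}
```

If the conversion does round down like this, a samples limit below 120 would become a chunks limit of 0, which the help text defines as no limit. That is worth keeping in mind when picking small values.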

matej-g · 2023-02-09T11:16:25Z

pkg/store/limiter.go

+	}
+
+	if err := i.seriesLimiter.Reserve(1); err != nil {
+		return errors.Wrapf(err, "failed to send series")


I wonder once this error bubbles up, if we should use a specific gRPC code. I think currently we use Unknown for all errors, but perhaps in this case we could use ResourceExhausted or something similar. Could be useful when debugging issues to see this directly in response or server metrics / span errors.

But can be done additionally, not a blocker.

I think it's a good idea, but I'm concerned it might grow the scope of the PR even further. Right now we have no way to distinguish error types on the client side, and we mostly rely on surfacing the actual message to the user. In this case they would see something like failed to send series: limit X violated (got Y)
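
One possible shape for the follow-up discussed here, sketched with the standard library only. `limitError`, `wrapSend` and `codeFor` are hypothetical names, `fmt.Errorf` with `%w` is swapped in for the `errors.Wrapf` call in the diff, and the returned strings merely mirror gRPC code names rather than using the grpc codes package. The idea is that a typed error survives wrapping, so a server-side layer could map it to a more specific code while the user-facing message stays the same.

```go
package main

import (
	"errors"
	"fmt"
)

// limitError is a hypothetical typed error for a violated per-request limit.
type limitError struct{ limit, got uint64 }

func (e *limitError) Error() string {
	return fmt.Sprintf("limit %d violated (got %d)", e.limit, e.got)
}

// errUpstream is an unrelated failure used for comparison.
var errUpstream = errors.New("upstream unavailable")

// wrapSend adds the same context as the errors.Wrapf call in the diff above.
func wrapSend(err error) error {
	return fmt.Errorf("failed to send series: %w", err)
}

// codeFor maps an error to the name of a status code. The strings mirror gRPC
// code names but are plain local values, not the grpc codes package.
func codeFor(err error) string {
	var le *limitError
	switch {
	case err == nil:
		return "OK"
	case errors.As(err, &le):
		return "ResourceExhausted"
	default:
		return "Unknown"
	}
}

func main() {
	limited := wrapSend(&limitError{limit: 10, got: 11})
	fmt.Println(limited, "=>", codeFor(limited))

	other := wrapSend(errUpstream)
	fmt.Println(other, "=>", codeFor(other))
}
```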

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

matej-g

Let's fix the conflicts and get this in 🚀

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

fpetkovski · 2023-02-13T13:40:25Z

Should be ready to merge @matej-g

matej-g

Thank you @fpetkovski 🙇 , let's get this in!

* Add instrumentation to Store servers

This commit adds an instrumented store server that exposes metrics about Series requests and uses it as a decorator around all Store APIs. The instrumented store currently exposes only two metrics, series requested and chunks requested. Additional metrics can be added as needed.

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Add rate limits to Store servers

This commit implements a RateLimited store server which can be used to apply various limits to Series calls in components that implement the Store API. Rate limits are disabled by default but can be enabled selectively for each individual Thanos component.

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Add CHANGELOG entry

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Reuse existing limiters

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Fix chunks limit binding

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Unify flag names

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Run make docs

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Add another series bucket

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Code review comments

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Fix changelog

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Rename flags

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

---------

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

pull-request-size bot added the size/L label Jan 26, 2023

fpetkovski force-pushed the instrumented-servers branch 2 times, most recently from 4822670 to 0dbde4f Compare January 26, 2023 11:25

GiedriusS reviewed Jan 27, 2023

fpetkovski force-pushed the instrumented-servers branch 3 times, most recently from 1639b53 to 157889e Compare February 8, 2023 09:05

fpetkovski added 8 commits February 8, 2023 10:07

Add CHANGELOG entry

6e57c9b

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

Reuse existing limiters

5a9d02e

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

Fix chunks limit binding

e4d4302

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

Unify flag names

2f9e838

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

Run make docs

6a9a7cd

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

Add another series bucket

0f5ddff

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

fpetkovski force-pushed the instrumented-servers branch from 157889e to 0f5ddff Compare February 8, 2023 09:11

douglascamata suggested changes Feb 8, 2023

Code review comments

011dbb3

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

fpetkovski force-pushed the instrumented-servers branch from 6dcec87 to 011dbb3 Compare February 8, 2023 12:55

Fix changelog

cca1535

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

douglascamata previously approved these changes Feb 8, 2023

douglascamata reviewed Feb 8, 2023

matej-g reviewed Feb 9, 2023

fpetkovski dismissed douglascamata’s stale review via 8fb0ea1 February 9, 2023 13:22

Rename flags

b3d93b7

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

fpetkovski force-pushed the instrumented-servers branch from 8fb0ea1 to b3d93b7 Compare February 9, 2023 13:28

matej-g previously approved these changes Feb 10, 2023

Merge branch 'main' into instrumented-servers

1b996b1

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

fpetkovski dismissed matej-g’s stale review via 1b996b1 February 11, 2023 08:27

matej-g approved these changes Feb 13, 2023

matej-g merged commit c4218c7 into thanos-io:main Feb 13, 2023

matej-g mentioned this pull request Mar 17, 2023

Receive's memory usage continues to grow in v0.31.0-rc.0 #6176

Open

douglascamata mentioned this pull request Apr 6, 2023

Limit data bytes fetched per query and per store #5750

Open
