
Add write limits for tenants in Thanos Receiver #5404

Closed
6 tasks done
douglascamata opened this issue Jun 2, 2022 · 14 comments

Comments

@douglascamata
Contributor

douglascamata commented Jun 2, 2022

Is your proposal related to a problem?

Tenants can overload Thanos Receivers with remote write requests and bring the whole system down.

Describe the solution you'd like

I would like to put upper limits on the size of remote write requests, so that a single tenant is less likely to negatively affect others.

These are the proposed "knobs" for limiting usage of the remote write endpoint (please feel free to propose more and give your opinion):

  • Number of timeseries per request. Labels: tenant and receive instance.
  • Number of samples per request. Labels: tenant and receive instance.
  • Size in bytes of the incoming request body. Labels: tenant and receive instance.
  • Number of concurrent HTTP requests. Labels: receive instance.

Hitting the first two limits should trigger an HTTP response with status code 413 (entity too large), and the last one should trigger a 429 (too many requests). In the future, remote write clients could detect the 413 error and split the data into smaller requests that fit under the limits.
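
To make the behavior concrete, here is a minimal sketch of how such stateless checks could sit in front of the remote write handler. All names (`writeLimits`, `limitMiddleware`) and the exact placement are illustrative assumptions, not the actual Thanos implementation:

```go
package receive

import "net/http"

// writeLimits holds the proposed per-request knobs; zero means "disabled".
// The field names are illustrative, not actual Thanos configuration options.
type writeLimits struct {
	maxBodyBytes   int64 // request body size in bytes -> 413 when exceeded
	maxTimeseries  int   // checked after decoding the request -> 413
	maxSamples     int   // checked after decoding the request -> 413
	maxConcurrency int   // concurrent HTTP requests -> 429 when exceeded
}

// limitMiddleware enforces the stateless limits in front of the remote write
// handler. The timeseries and sample counts need the decoded protobuf, so
// those checks would live inside the handler itself.
func limitMiddleware(l writeLimits, next http.Handler) http.Handler {
	sem := make(chan struct{}, l.maxConcurrency) // simple concurrency semaphore
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if l.maxConcurrency > 0 {
			select {
			case sem <- struct{}{}:
				defer func() { <-sem }()
			default:
				http.Error(w, "too many concurrent write requests", http.StatusTooManyRequests)
				return
			}
		}
		if l.maxBodyBytes > 0 {
			// Reads past the limit fail, letting the handler reply with 413.
			r.Body = http.MaxBytesReader(w, r.Body, l.maxBodyBytes)
		}
		next.ServeHTTP(w, r)
	})
}
```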

I would also like to expose the current values and limits of each "knob" as metrics of Thanos Receive. This would allow easy tracking of the limit system.

To ensure backwards compatibility, limiting should be optional and disabled by default.

For the sake of simplicity and iteration, we can start with a global value for each knob (all tenants share the same upper bound) and later add the possibility of configuring different values per tenant.
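
A rough sketch of what that iteration could look like in code, with a single global default and an optional per-tenant override map (types and field names are hypothetical, not the configuration format that was eventually chosen):

```go
package receive

// requestLimits mirrors the knobs above; a zero value means "no limit".
type requestLimits struct {
	SizeBytesLimit int64
	SeriesLimit    int
	SamplesLimit   int
}

// limitsConfig starts with one global default and can later grow
// per-tenant overrides without changing any call sites.
type limitsConfig struct {
	Default requestLimits
	Tenants map[string]requestLimits // hypothetical per-tenant overrides
}

// limitsFor returns the limits that apply to a tenant, falling back to the
// global default when no override exists.
func (c limitsConfig) limitsFor(tenant string) requestLimits {
	if l, ok := c.Tenants[tenant]; ok {
		return l
	}
	return c.Default
}
```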

Describe alternatives you've considered

  • Ask tenants to be mindful of the number of metrics they are sending
  • Ask tenants to be careful when writing their remote write configurations
  • Adding smarter/stateful rate limit knobs that could track limits across a ring of Receivers. Initial work on this is being started in "receive: Implement head series limits per tenant" #5333. Meanwhile, simpler limits can be added and still be helpful.

Additional context

TODO

@douglascamata
Contributor Author

FYI: I accidentally pressed the button early. I am still writing.

@douglascamata
Contributor Author

OK, I think I am mostly done with the writing at this point.

@douglascamata douglascamata changed the title Add rate limits to the Thanos Receiver Add limits for tenants in Thanos Receiver Jun 2, 2022
@douglascamata
Contributor Author

Also considering adding a limit on the maximum number of labels per timeseries.

@douglascamata
Contributor Author

Well, it's not a good idea to record the number of labels per timeseries: the cardinality would be very high. Keeping it out of the plans for now.

@hanjm
Member

hanjm commented Jun 3, 2022

I think we can use a Bloom filter to track per-tenant cardinality.

@douglascamata
Contributor Author

@hanjm good idea! Thanks for the suggestion, I will investigate how we could use it.
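
To illustrate the suggestion, here is a rough sketch of approximate per-tenant series tracking with a Bloom filter, assuming the github.com/bits-and-blooms/bloom/v3 package; it is only an illustration of the idea, not what ended up being implemented:

```go
package receive

import "github.com/bits-and-blooms/bloom/v3"

// seriesTracker approximates the number of distinct series a tenant has
// written, using a Bloom filter so memory stays bounded at the cost of a
// small false-positive rate (which slightly undercounts new series).
type seriesTracker struct {
	filter *bloom.BloomFilter
	count  uint64
	limit  uint64
}

func newSeriesTracker(expectedSeries uint, limit uint64) *seriesTracker {
	return &seriesTracker{
		// 1% false-positive rate sized for the expected number of series.
		filter: bloom.NewWithEstimates(expectedSeries, 0.01),
		limit:  limit,
	}
}

// observe records a series (identified by its sorted label set) and reports
// whether the tenant is still under its head-series limit.
func (t *seriesTracker) observe(labelSet []byte) bool {
	if !t.filter.TestAndAdd(labelSet) {
		t.count++ // first time we see this series (modulo false positives)
	}
	return t.limit == 0 || t.count <= t.limit
}
```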

@douglascamata douglascamata changed the title Add limits for tenants in Thanos Receiver Add write limits for tenants in Thanos Receiver Jun 7, 2022
@wiardvanrij
Member

Have you seen how Loki implements this? It might give some inspiration (not sure if it's relevant though). For example, limiting on bytes/s might also be useful. Anyhow, sounds really great to have!

@bwplotka
Member

bwplotka commented Jun 9, 2022

From community discussion:

  • It makes sense, let's do this
  • Ideally disabled by default to not break compatibility
  • Provide best practices - what works

@douglascamata
Contributor Author

@wiardvanrij thanks for the tip. I checked out their limits and I believe it makes total sense that we have a limit on request body size too.

I don't want to dive into rate limits (i.e. bytes per second or timeseries per second) in this proposal though, as these add the complication of continuously calculating rates over time. I believe they could be part of a different proposal and possibly take advantage of the outcome of #5415.

This proposal is more for limiting the size of remote write requests, which requires no state and is much easier to implement.

@douglascamata
Contributor Author

The remote write request body size (in bytes) will be exported as a histogram, and I propose these buckets:

  • 1K, 32K, 256K, 512K, 1M, 16M, 32M.

Please let me know if you have other suggestions.

For the number of samples and timeseries per request, I would also like to use histograms. The buckets would be configurable, with a default provided. Which buckets do you think we could provide as the default? Possibly an exponential bucket set generated with prometheus.ExponentialBuckets?
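
As a concrete sketch of the two metrics being discussed, the fixed byte buckets and a prometheus.ExponentialBuckets default for the per-request counts could look like this (metric names and the chosen exponential parameters are hypothetical):

```go
package receive

import "github.com/prometheus/client_golang/prometheus"

var (
	// Proposed fixed buckets: 1K, 32K, 256K, 512K, 1M, 16M, 32M.
	writeRequestSizeBytes = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "thanos_receive_write_request_size_bytes", // hypothetical name
		Help:    "Size of incoming remote write request bodies.",
		Buckets: []float64{1 << 10, 32 << 10, 256 << 10, 512 << 10, 1 << 20, 16 << 20, 32 << 20},
	}, []string{"tenant"})

	// Example default for samples/timeseries per request:
	// 8 exponential buckets starting at 10 with factor 4 (10, 40, 160, ...).
	writeRequestSamples = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "thanos_receive_write_request_samples", // hypothetical name
		Help:    "Number of samples per incoming remote write request.",
		Buckets: prometheus.ExponentialBuckets(10, 4, 8),
	}, []string{"tenant"})
)

func init() {
	prometheus.MustRegister(writeRequestSizeBytes, writeRequestSamples)
}
```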

@douglascamata
Contributor Author

FYI, the remote write request body size metric was added as a summary without any quantiles defined yet.
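
For reference, a summary with no Objectives configured only exposes the _sum and _count series; a minimal sketch with a hypothetical metric name:

```go
package receive

import "github.com/prometheus/client_golang/prometheus"

// With no Objectives set, the summary exports only _sum and _count, which is
// enough to track the average request body size per tenant.
var writeRequestBodySize = prometheus.NewSummaryVec(prometheus.SummaryOpts{
	Name: "thanos_receive_write_request_body_size_bytes", // hypothetical name
	Help: "Size of incoming remote write request bodies.",
}, []string{"tenant"})
```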

@stale

stale bot commented Nov 13, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Nov 13, 2022
@douglascamata
Contributor Author

I'm on vacation, bot. Don't stale me!

@stale stale bot removed the stale label Apr 6, 2023
@douglascamata
Contributor Author

Everything planned in this issue has already been implemented. Closing it.
