
Add write limits for tenants in Thanos Receiver #5404

Closed
6 tasks done
douglascamata opened this issue Jun 2, 2022 · 14 comments

Comments

@douglascamata
Contributor

douglascamata commented Jun 2, 2022

Is your proposal related to a problem?

Tenants can overload Thanos Receivers with remote write requests and bring the whole system down.

Describe the solution you'd like

I would like to put upper limits on the size of remote write requests, so that a single tenant is less likely to negatively affect others.

These are the proposed "knobs" for limiting usage of the remote write endpoint (please feel free to propose more and give your opinion):

  • Number of timeseries per request. Labels: tenant and receive instance.
  • Number of samples per request. Labels: tenant and receive instance.
  • Size in bytes of the incoming request body. Labels: tenant and receive instance.
  • Number of concurrent HTTP requests. Labels: receive instance.

Hitting the first two limits should trigger an HTTP response with status code 413 (entity too large), and the last one should trigger a 429 (too many requests). In the future, remote write clients could detect the 413 error and split the data into smaller requests that fit under the limits.
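
To make the behavior concrete, here is a minimal sketch of how such stateless checks could sit in front of the remote write handler. All names (`writeLimits`, `limitMiddleware`) and the exact placement are illustrative assumptions, not the actual Thanos implementation:

```go
package receive

import "net/http"

// writeLimits holds the proposed per-request knobs; zero means "disabled".
// The field names are illustrative, not actual Thanos configuration options.
type writeLimits struct {
	maxBodyBytes   int64 // request body size in bytes -> 413 when exceeded
	maxTimeseries  int   // checked after decoding the request -> 413
	maxSamples     int   // checked after decoding the request -> 413
	maxConcurrency int   // concurrent HTTP requests -> 429 when exceeded
}

// limitMiddleware enforces the stateless limits in front of the remote write
// handler. The timeseries and sample counts need the decoded protobuf, so
// those checks would live inside the handler itself.
func limitMiddleware(l writeLimits, next http.Handler) http.Handler {
	sem := make(chan struct{}, l.maxConcurrency) // simple concurrency semaphore
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if l.maxConcurrency > 0 {
			select {
			case sem <- struct{}{}:
				defer func() { <-sem }()
			default:
				http.Error(w, "too many concurrent write requests", http.StatusTooManyRequests)
				return
			}
		}
		if l.maxBodyBytes > 0 {
			// Reads past the limit fail, letting the handler reply with 413.
			r.Body = http.MaxBytesReader(w, r.Body, l.maxBodyBytes)
		}
		next.ServeHTTP(w, r)
	})
}
```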

I would also like to expose the current values and limits of each "knob" as metrics of Thanos Receive. This would allow easy tracking of the limit system.

To ensure backwards compatibility, limiting should be optional and disabled by default.

For the sake of simplicity and iteration, we can start with a global value for each knob (all tenants share the same upper bound) and later add the possibility of configuring different values per tenant.
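
A rough sketch of what that iteration could look like in code, with a single global default and an optional per-tenant override map (types and field names are hypothetical, not the configuration format that was eventually chosen):

```go
package receive

// requestLimits mirrors the knobs above; a zero value means "no limit".
type requestLimits struct {
	SizeBytesLimit int64
	SeriesLimit    int
	SamplesLimit   int
}

// limitsConfig starts with one global default and can later grow
// per-tenant overrides without changing any call sites.
type limitsConfig struct {
	Default requestLimits
	Tenants map[string]requestLimits // hypothetical per-tenant overrides
}

// limitsFor returns the limits that apply to a tenant, falling back to the
// global default when no override exists.
func (c limitsConfig) limitsFor(tenant string) requestLimits {
	if l, ok := c.Tenants[tenant]; ok {
		return l
	}
	return c.Default
}
```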

Describe alternatives you've considered

  • Ask tenants to be mindful of the number of metrics they are sending
  • Ask tenants to be careful when writing their remote write configurations
  • Adding smarter/stateful rate limit knobs that could track limits across a ring of Receivers. Initial work on this is being started in "receive: Implement head series limits per tenant" #5333. Meanwhile, simpler limits can be added and still be helpful.

Additional context

TODO

@douglascamata
Contributor Author

FYI: I accidentally pressed the button early. I am still writing.

@douglascamata
Contributor Author

OK, I think I am mostly done with the writing at this point.

@douglascamata douglascamata changed the title Add rate limits to the Thanos Receiver Add limits for tenants in Thanos Receiver Jun 2, 2022
@douglascamata
Contributor Author

Also considering adding a limit on the maximum number of labels per timeseries.

@douglascamata
Contributor Author

Well, it's not a good idea to record the number of labels per timeseries: the cardinality would be very high. Keeping it out of the plans for now.

@hanjm
Member

hanjm commented Jun 3, 2022

I think we can use a Bloom filter to track per-tenant cardinality.

@douglascamata
Contributor Author

@hanjm good idea! Thanks for the suggestion, I will investigate how we could use it.
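
To illustrate the suggestion, here is a rough sketch of approximate per-tenant series tracking with a Bloom filter, assuming the github.com/bits-and-blooms/bloom/v3 package; it is only an illustration of the idea, not what ended up being implemented:

```go
package receive

import "github.com/bits-and-blooms/bloom/v3"

// seriesTracker approximates the number of distinct series a tenant has
// written, using a Bloom filter so memory stays bounded at the cost of a
// small false-positive rate (which slightly undercounts new series).
type seriesTracker struct {
	filter *bloom.BloomFilter
	count  uint64
	limit  uint64
}

func newSeriesTracker(expectedSeries uint, limit uint64) *seriesTracker {
	return &seriesTracker{
		// 1% false-positive rate sized for the expected number of series.
		filter: bloom.NewWithEstimates(expectedSeries, 0.01),
		limit:  limit,
	}
}

// observe records a series (identified by its sorted label set) and reports
// whether the tenant is still under its head-series limit.
func (t *seriesTracker) observe(labelSet []byte) bool {
	if !t.filter.TestAndAdd(labelSet) {
		t.count++ // first time we see this series (modulo false positives)
	}
	return t.limit == 0 || t.count <= t.limit
}
```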

@douglascamata douglascamata changed the title Add limits for tenants in Thanos Receiver Add write limits for tenants in Thanos Receiver Jun 7, 2022
@wiardvanrij
Member

Have you seen how Loki implements this? It might give some inspiration (not sure if it's relevant though). For example, limiting on bytes/s might also be useful. Anyhow, sounds really great to have!

@bwplotka
Member

bwplotka commented Jun 9, 2022

From community discussion:

  • It makes sense, let's do this
  • Ideally disabled by default to not break compatibility
  • Provide best practices - what works

@douglascamata
Contributor Author

@wiardvanrij thanks for the tip. I checked out their limits and I believe it makes total sense that we have a limit on request body size too.

I don't want to dive into rate limits (i.e. bytes per second or timeseries per second) in this proposal though, as these add the complication of continuously calculating rates over time. I believe they could be part of a different proposal and possibly take advantage of the outcome of #5415.

This proposal is more for limiting the size of remote write requests, which requires no state and is much easier to implement.

@douglascamata
Contributor Author

The remote write request body size (in bytes) will be exported as a histogram, and I propose these buckets:

  • 1K, 32K, 256K, 512K, 1M, 16M, 32M.

Please let me know if you have other suggestions.

For the number of samples and timeseries per request, I would also like to use histograms. The buckets would be configurable, with a default provided. Which buckets do you think we could provide as the default? Possibly an exponential bucket set generated with prometheus.ExponentialBuckets?
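
As a concrete sketch of the two metrics being discussed, the fixed byte buckets and a prometheus.ExponentialBuckets default for the per-request counts could look like this (metric names and the chosen exponential parameters are hypothetical):

```go
package receive

import "github.com/prometheus/client_golang/prometheus"

var (
	// Proposed fixed buckets: 1K, 32K, 256K, 512K, 1M, 16M, 32M.
	writeRequestSizeBytes = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "thanos_receive_write_request_size_bytes", // hypothetical name
		Help:    "Size of incoming remote write request bodies.",
		Buckets: []float64{1 << 10, 32 << 10, 256 << 10, 512 << 10, 1 << 20, 16 << 20, 32 << 20},
	}, []string{"tenant"})

	// Example default for samples/timeseries per request:
	// 8 exponential buckets starting at 10 with factor 4 (10, 40, 160, ...).
	writeRequestSamples = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "thanos_receive_write_request_samples", // hypothetical name
		Help:    "Number of samples per incoming remote write request.",
		Buckets: prometheus.ExponentialBuckets(10, 4, 8),
	}, []string{"tenant"})
)

func init() {
	prometheus.MustRegister(writeRequestSizeBytes, writeRequestSamples)
}
```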

@douglascamata
Contributor Author

FYI, the remote write request body size metric was added as a summary without any quantiles defined yet.
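
For reference, a summary with no Objectives configured only exposes the _sum and _count series; a minimal sketch with a hypothetical metric name:

```go
package receive

import "github.com/prometheus/client_golang/prometheus"

// With no Objectives set, the summary exports only _sum and _count, which is
// enough to track the average request body size per tenant.
var writeRequestBodySize = prometheus.NewSummaryVec(prometheus.SummaryOpts{
	Name: "thanos_receive_write_request_body_size_bytes", // hypothetical name
	Help: "Size of incoming remote write request bodies.",
}, []string{"tenant"})
```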

@stale

stale bot commented Nov 13, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Nov 13, 2022
@douglascamata
Contributor Author

I'm on vacation, bot. Don't stale me!

@stale stale bot removed the stale label Apr 6, 2023
@douglascamata
Contributor Author

Everything planned in this issue has already been implemented. Closing it.
