Scaling ingester above 1,500,000 in-memory series each? #3287
-
We're currently running 25 ingester pods at 15 GB RAM each in our Mimir cluster, and we've just started hitting the "ProvisioningTooManyActiveSeries" warning for more than 1.6e6 in-memory series per ingester. I'm wondering how serious a limit this is, from a Mimir engineering perspective. From an operations perspective, we're running on m6a.8xlarge EKS nodes (128 GB RAM each), and ingesters are constrained to a max of 1 per node for redundancy anyway, so we have plenty of headroom to scale the ingesters UP from a RAM perspective. That said, 1.6e6 was presumably chosen for a reason (the runbook says the goal is no more than 1.5 million series per ingester). So I'm looking for some input before I start scaling them OUT (more ingesters) dramatically.
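For reference, here is the back-of-envelope math for the cluster described above (a sketch in Python; the 1.5M target is the runbook figure quoted in the question, and the round-up to whole ingesters is my arithmetic, not a recommendation):

```python
import math

# Current cluster as described above.
ingesters = 25
series_per_ingester = 1.6e6            # current in-memory series per ingester
total_in_memory = ingesters * series_per_ingester  # total incl. replication

# Ingesters needed to scale OUT back under the runbook target of 1.5M each.
target_per_ingester = 1.5e6
needed = math.ceil(total_in_memory / target_per_ingester)

print(int(total_in_memory), needed)    # 40000000 27
```

So scaling out only slightly (25 to 27 ingesters) would bring the cluster back under the recommended target, assuming the series stay evenly balanced.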
Replies: 3 comments 2 replies
-
Others will likely have a better answer to this, but I can give some information. The first thing to bear in mind is what ingesters primarily use memory for: series and queries. The memory recommendation accounts for storing the per-series data structures for 1.5M series, plus the memory required for running queries. The hard cut-off (by default, at least) is 2.5M series per ingester, which means there is actually some spare headroom. This headroom is useful for:
I can't give you a better answer as it depends on your use case; the recommendations are meant to give a good level of safety, but there is always room for tweaking. It might be that on a single-tenant system with well-balanced ingesters, a less intensive query load, and low series churn, you could comfortably run ingesters with more than 1.5M series each. I would recommend paying close attention to memory use while running your typical load of series and queries, to see where your typical usage lands. Edit: I should also say that if you choose to give ingesters more memory, then you can of course very safely raise this limit.
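One way to reason about "more memory means you can raise the limit" is to look at the per-series memory budget implied by the recommended sizing. This is an assumption-laden sketch (actual usage depends on label sizes, churn, and query load, as noted above), but it gives a rough yardstick:

```python
# Rough per-series memory budget implied by the sizing discussed above:
# 15 GiB per ingester pod at the 1.5M series recommendation.
memory_limit_bytes = 15 * 2**30
series = 1.5e6

budget_per_series = memory_limit_bytes / series
print(round(budget_per_series))  # 10737 bytes/series, incl. query headroom
```

If your observed bytes-per-series (pod working set divided by in-memory series) is well below this budget under typical query load, that is a sign you have room to run more series per ingester.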
-
+1 to everything @stevesg said, and I'll just add that we've run with higher targets during load testing without any issue. For example, if you do the math on our 1 billion active series load test, you will see that we targeted 5M active series per ingester. The math is (1e9 series * 3 replication factor) / 600 ingesters = 5M. Here's the blog post I am referencing: https://grafana.com/blog/2022/04/08/how-we-scaled-our-new-prometheus-tsdb-grafana-mimir-to-1-billion-active-series/
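The load-test arithmetic above can be checked directly (same numbers as the blog post; note that replication multiplies the logical series count before dividing across ingesters):

```python
# 1 billion logical active series, replicated 3x, spread over 600 ingesters.
active_series = 1e9
replication_factor = 3
ingesters = 600

per_ingester = active_series * replication_factor / ingesters
print(int(per_ingester))  # 5000000
```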
-
In addition to @stevesg's excellent answer, having more series per ingester will also increase WAL replay time and slow down rollouts.