Scaling ingester above 1,500,000 in-memory series each? #3287
-
We're currently running 25 ingester pods at 15 GB RAM each in our Mimir cluster, and we've just started hitting the "ProvisioningTooManyActiveSeries" warning for more than 1.6e6 in-memory series per ingester. I'm wondering how serious a limit this is, from a Mimir engineering perspective. From an operations perspective, we're running on m6a.8xlarge EKS nodes (128 GB RAM each), and ingesters are constrained to a max of 1 per node for redundancy anyway, so we have plenty of headroom to scale the ingesters UP from a RAM perspective. That said, 1.6e6 was presumably chosen for a reason (the runbook says the goal is no more than 1.5 million series per ingester). So I'm looking for some input before I start scaling them OUT (more ingesters) dramatically.
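For reference, here is the back-of-envelope math for the cluster described above (a sketch in Python; the 1.5M target is the runbook figure quoted in the question, and the round-up to whole ingesters is my arithmetic, not a recommendation):

```python
import math

# Current cluster as described above.
ingesters = 25
series_per_ingester = 1.6e6            # current in-memory series per ingester
total_in_memory = ingesters * series_per_ingester  # total incl. replication

# Ingesters needed to scale OUT back under the runbook target of 1.5M each.
target_per_ingester = 1.5e6
needed = math.ceil(total_in_memory / target_per_ingester)

print(int(total_in_memory), needed)    # 40000000 27
```

So scaling out only slightly (25 to 27 ingesters) would bring the cluster back under the recommended target, assuming the series stay evenly balanced.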
Replies: 3 comments 2 replies
-
Others will likely have a better answer to this, but I can give some information. The first thing to bear in mind is what ingesters primarily use memory for: series and queries. The memory recommendation accounts for storing the per-series data structures for 1.5M series, plus the memory required for running queries. The hard cut-off (by default, at least) is 2.5M series per ingester, which means there is actually some spare headroom. This headroom is useful for:
I can't give you a better answer as it depends on your use case; the recommendations are meant to give a good level of safety, but there is always room for tweaking. It might be that on a single-tenant system with well-balanced ingesters, a less intensive query load, and low series churn, you could comfortably run ingesters with more than 1.5M series each. I would recommend paying close attention to memory use while running your typical load of series and queries, to see where your typical usage lands. Edit: I should also say that if you choose to give ingesters more memory, then you can of course very safely raise this limit.
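One way to reason about "more memory means you can raise the limit" is to look at the per-series memory budget implied by the recommended sizing. This is an assumption-laden sketch (actual usage depends on label sizes, churn, and query load, as noted above), but it gives a rough yardstick:

```python
# Rough per-series memory budget implied by the sizing discussed above:
# 15 GiB per ingester pod at the 1.5M series recommendation.
memory_limit_bytes = 15 * 2**30
series = 1.5e6

budget_per_series = memory_limit_bytes / series
print(round(budget_per_series))  # 10737 bytes/series, incl. query headroom
```

If your observed bytes-per-series (pod working set divided by in-memory series) is well below this budget under typical query load, that is a sign you have room to run more series per ingester.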
-
+1 to everything @stevesg said, and I'll just add that we've run with higher targets during load testing without any issue. For example, if you do the math on our 1 billion active series load test, you will see that we targeted 5M active series per ingester. The math is (1e9 series * 3 replication factor) / 600 ingesters = 5M. Here's the blog post I am referencing: https://grafana.com/blog/2022/04/08/how-we-scaled-our-new-prometheus-tsdb-grafana-mimir-to-1-billion-active-series/
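The load-test arithmetic above can be checked directly (same numbers as the blog post; note that replication multiplies the logical series count before dividing across ingesters):

```python
# 1 billion logical active series, replicated 3x, spread over 600 ingesters.
active_series = 1e9
replication_factor = 3
ingesters = 600

per_ingester = active_series * replication_factor / ingesters
print(int(per_ingester))  # 5000000
```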
-
In addition to @stevesg's excellent answer, having more series per ingester will also increase WAL replay time and slow down rollouts.