Tempo cluster sizing / capacity planning #1540
Comments
Hi, thanks for raising this issue, it's also something we've been thinking about. There are several different forms this tool could take, and some work to identify the important variables and formulas, definitely including the ones you mentioned. A document with approximate calculations is ok, but there is also a need for a more sophisticated and accurate tool, in Tempo and the other databases. See Mimir's discussion for reference. Tempo would likely adopt the same approach. For now I can share some metrics from our internal clusters:
I'd expect these requirements to change over the next few releases as we add support for parquet blocks, likely increasing at first, but then stabilizing as we improve things.
Could you please describe what queries the test was running? Does the lookback or time range affect query resources? Was the query path using functions or just scaled queriers?
Does retention affect resource requirements in any way?
This was gathered from our own clusters, which run real workloads with a mixture of trace lookups and searches, lookbacks of 1h or 24h, and both querier pods and functions. Total querier resource usage is a function of the data volume involved in a search. All queries are sharded into fixed-size sub-jobs, so a 2x time range will scan 2x the data, as will a cluster with 2x the volume across the same time range. Scaling up pods or functions can keep latency down by executing more sub-jobs in parallel.
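The scaling relationship described above can be sketched in a few lines. This is a rough illustration and not Tempo's actual scheduler: the function names, the 100 MB sub-job size, the 1 MB/s data rate, and the worker counts are all assumptions chosen for the example.

```python
import math


def search_sub_jobs(bytes_per_hour: int, lookback_hours: int, job_size_bytes: int) -> int:
    """Sub-jobs needed to scan all data in the time range (hypothetical model)."""
    total_bytes = bytes_per_hour * lookback_hours
    return math.ceil(total_bytes / job_size_bytes)


def execution_waves(sub_jobs: int, parallel_workers: int) -> int:
    """Sequential rounds needed when `parallel_workers` sub-jobs run at once."""
    return math.ceil(sub_jobs / parallel_workers)


# Assumed inputs: about 1 MB/s of stored data and a 100 MB sub-job size.
BYTES_PER_HOUR = 3_600_000_000
JOB_SIZE = 100_000_000

jobs_1h = search_sub_jobs(BYTES_PER_HOUR, 1, JOB_SIZE)
jobs_2h = search_sub_jobs(BYTES_PER_HOUR, 2, JOB_SIZE)

# Doubling the workers halves the number of rounds for the same search.
waves_10_workers = execution_waves(jobs_1h, 10)
waves_20_workers = execution_waves(jobs_1h, 20)
```

In this model, doubling the time range doubles the sub-job count, and adding workers only reduces the number of sequential rounds, which is why scaling out helps latency but not total work.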
Retention affects how many blocks exist, which mostly impacts latency and object store requests. Tempo reads a bloom filter per block, so 2x retention will issue 2x reads to the object store. Latency can be controlled by scaling up queriers to check more bloom filters in parallel (and, more recently, by making use of #1388). A larger block list also causes a small, insignificant increase in memory, since block metadata including name/size/location is kept in memory.
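As a back-of-the-envelope sketch of the retention effect, the snippet below models block count, bloom filter reads per trace lookup, and block list memory. It follows the one-read-per-block description above; the function names, the 500 blocks/day rate, and the 1 KiB of metadata per block are assumptions for illustration, not measured Tempo values.

```python
def block_count(blocks_per_day: int, retention_days: int) -> int:
    """Blocks in the backend once retention is fully populated (hypothetical model)."""
    return blocks_per_day * retention_days


def bloom_reads_per_lookup(blocks: int) -> int:
    """One bloom filter read per block for a trace-by-ID lookup."""
    return blocks


def blocklist_memory_bytes(blocks: int, meta_bytes_per_block: int) -> int:
    """Memory held by in-memory block metadata (name/size/location)."""
    return blocks * meta_bytes_per_block


# Assumed inputs: 500 new blocks per day, 1 KiB of metadata per block.
blocks_14d = block_count(500, 14)
blocks_28d = block_count(500, 28)

reads_14d = bloom_reads_per_lookup(blocks_14d)
reads_28d = bloom_reads_per_lookup(blocks_28d)
meta_28d = blocklist_memory_bytes(blocks_28d, 1024)
```

Even at 28 days the metadata footprint in this example stays in the tens of megabytes, which matches the point that the memory increase is minor compared with the growth in object store reads.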
This issue has been automatically marked as stale because it has not had any activity in the past 60 days.
@mdisibio could you re-open this ticket and perhaps document the resources in the docs? We have used the values from this ticket in the Tempo Kubernetes operator and we would like to keep them updated if storage or other components change.
Got it, reopening. Expecting the requirements to change in Tempo 2.0 with TraceQL and full parquet, will gather new numbers then.
This will not block Tempo 2.0 from releasing so I'm moving it out of the v2.0 milestone.
Heads up to @electron0zero and @mapno that this issue exists. After you do your research please publish some guidelines for the community and close out this issue.
I'm happy to add this information to the documentation when it's ready.
See also #2836
I have someone installing the operator on OpenShift and we kept noticing an OOM error on our After a little investigation, we noticed that the pod consists of two containers ( It would probably be a better use of resources if
@mdisibio may I know how you calculated the ingestion rate of 1MB/sec?
@venkatb-zelar By comparing
@mdisibio for which component? 🤷
Is your feature request related to a problem? Please describe.
I would like to know (approximately) the size of a Tempo cluster and how many resources it will need for a given ingestion rate and retention: number of spans per unit of time, average span size in bytes, and retention of N days (maybe I am missing some input parameters).
Such a document is useful when evaluating Tempo from a cost perspective or for capacity planning.
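As a starting point, the inputs listed above can be combined into a raw data volume estimate. This is a minimal sketch under stated assumptions: the function names and example numbers are hypothetical, and the result ignores compression, replication, and block overhead, so it is not a statement of what Tempo actually stores.

```python
SECONDS_PER_DAY = 86_400


def ingest_bytes_per_second(spans_per_second: int, avg_span_bytes: int) -> int:
    """Raw ingestion rate from span throughput and average span size."""
    return spans_per_second * avg_span_bytes


def raw_retained_bytes(spans_per_second: int, avg_span_bytes: int, retention_days: int) -> int:
    """Uncompressed bytes received over the whole retention window."""
    rate = ingest_bytes_per_second(spans_per_second, avg_span_bytes)
    return rate * SECONDS_PER_DAY * retention_days


# Assumed example: 10,000 spans/s, 500 bytes per span, 14 days of retention.
rate = ingest_bytes_per_second(10_000, 500)
retained = raw_retained_bytes(10_000, 500, 14)
retained_tb = retained / 1e12
```

The raw figure is only an upper-level input for planning; the resources each component needs for that rate would still have to come from measurements on real clusters.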
Describe the solution you'd like
Documentation on Tempo cluster sizing.
Describe alternatives you've considered
Run tests against Tempo
Additional context