Tempo is consuming a lot of CPU and memory. Are there any tuning points? #1946
-
Hello, we are currently using Tempo installed on Google Kubernetes Engine with Google Cloud Storage as the backend storage. Here is the pod replica and resource configuration.
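(For reference, the storage backend portion of the config looks roughly like the sketch below; the bucket name is a placeholder, and the actual replica counts and resource figures were shared as screenshots in the original post.)

```yaml
# Rough sketch of the GCS-backed storage section -- the bucket name is a placeholder.
storage:
  trace:
    backend: gcs
    gcs:
      bucket_name: my-tempo-traces
```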
Here is the current traffic and resource consumption.
-
This is a really interesting set of metrics. Thanks for posting it. These are, I believe, the 1.5 vParquet numbers? It's been a while since we ran that and I've forgotten what the performance was like. So let's do some quick math.

You have ~8MB/s per ingester: (8MB/s * 4 distributors * 1 RF) / 4 ingesters. You are using 8GB of memory per ingester, so roughly 1GB of memory per 1MB/s?

Yeah, that's not very good. Currently in our largest internal cluster we are using ~7.5GB of memory per ingester and each one is receiving ~17MB/s. I don't want to draw too strong a conclusion from that, but it does seem like the tip of main is outperforming 1.5, which is encouraging. Overall we are definitely still working to improve the memory usage of Parquet, and we expect improvements over the next few versions.

Personally, I find the variance in memory usage more concerning than the average. Cutting a large Parquet block can be costly in terms of memory and can cause the working set to spike by multiple GBs. This makes for a system that is challenging to operate. To me your distributor usage seems fine (especially since you are also running the metrics generator), but let me know if you disagree.

Thoughts on Requests/Limits

Config options

I notice here you're ingesting Jaeger thrift_http. Consuming anything that's not OTel requires the distributor to convert it into the OTel object model. If you have teams still using Jaeger, swapping them to OTel may help with resources; a rough sketch of the receiver side is below.
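(As an illustration only: enabling the OTLP receiver alongside the existing Jaeger receiver lets senders migrate gradually. The snippet below is a minimal sketch of a distributor receiver block, not the configuration from this thread.)

```yaml
# Minimal sketch -- which receivers you actually need depends on your clients.
distributor:
  receivers:
    jaeger:
      protocols:
        thrift_http:        # existing Jaeger senders keep working during migration
    otlp:
      protocols:
        grpc:               # native OTLP needs no conversion in the distributor
        http:
```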
Ingester settings listed with current defaults and thoughts:
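(The original reply listed each setting with its current default; that list isn't reproduced here. As an illustrative sketch only, these are the kinds of ingester keys under discussion, with example values rather than defaults or recommendations.)

```yaml
# Example values only -- not defaults and not the recommendations from this thread.
ingester:
  trace_idle_period: 10s        # time since the last span before a trace is flushed to the head block
  max_block_duration: 30m       # cut the head block after this much wall-clock time
  max_block_bytes: 500000000    # ~500MB; cut the head block once it reaches this size
  complete_block_timeout: 15m   # how long completed blocks are retained in the ingester
```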
Those are some immediate thoughts. Let me know how it goes and we can proceed from there. Also, look forward to Tempo 2.0, which will hopefully allow you to run vParquet with lower overhead.
-
Thank you for the answer. Yes, we are using Tempo 1.5 with the Parquet backend enabled, and I agree with you about the resource consumption of the distributors. About your recommendations:

Thoughts on Requests/Limits

Config options

I have a few questions.
-
My guess is not significantly, but I can't say for sure.
Yes, this high variance is due to the cost of cutting a Parquet block. Hopefully reducing …

Another thing I should have mentioned is that we have found that larger batches tend to reduce the CPU/memory requirements of distributors and ingesters. We use a batch size of 1000 spans internally. This is configured on the Grafana Agent or the OTel Collector, depending on your pipeline.
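(As a rough sketch, on the OpenTelemetry Collector this batching is done by the batch processor. The 1000-span figure mirrors the number mentioned above; the endpoint and other values are placeholders, not a recommended configuration.)

```yaml
# Batch spans before export so the Tempo distributor receives fewer, larger requests.
receivers:
  otlp:
    protocols:
      grpc:

processors:
  batch:
    send_batch_size: 1000          # target number of spans per outgoing batch
    timeout: 5s                    # flush a partial batch after this long

exporters:
  otlp:
    endpoint: tempo-distributor:4317   # placeholder address for the Tempo distributor
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```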
-
Hello Joe, thanks to the settings you recommended, there was a noticeable performance improvement. Here is a summary of the results:
We are currently summarizing the results with numbers, and we hope to share the detailed results soon.
-
Hello, I'm working with @lanore78 on the same team. The tuning options you suggested were very helpful in reducing the CPU/memory spikes.

P.S. The data below is not from the environment (Tempo installed on Google Kubernetes Engine) mentioned in the first question.

Test environment
Test result
Case 1)