Tempo is consuming a lot of cpu and memory. Is there any tuning points? #1946

lanore78 · 2022-12-13T09:41:57Z

lanore78
Dec 13, 2022

Hello,

We are currently using Tempo installed on Google Kubernetes Engine and use Google Cloud Storage as backeend storage.
I'm just wondering if there is any more tuning points on the write path.

Here is Pod replica and resource configuration.

distributor X 4 (both resource limit/request : 2Core/2GB)
ingester X 4 (both resource limit/request : 3Core/12GB)

Here is current traffic and resource consumption.
We think if Traces/Secs goes up to 5~6K, current Tempo cluster would not handle.

Traces/Sec : 3K ~ 4K
- Spans / Sec : 70K ~ 90K
- Received Bytes / Sec : 6~9MB
Distributor Resource Status
- CPU : 0.5~1
- Memory : 1GB
Ingester Resource Status
- CPU : 0.5 ~ 2
- Memory : 4GB ~ 10GB

Here is configuration we used.

multitenancy_enabled: false
usage_report:
  reporting_enabled: false
search_enabled: true
metrics_generator_enabled: true
compactor:
  compaction:
    block_retention: 168h
    compaction_window: 30m
  ring:
    kvstore:
      store: memberlist
metrics_generator:
  ring:
    kvstore:
      store: memberlist
  registry:
    stale_duration: 5m
  processor:
    service_graphs:
      max_items: 1e+06
      dimensions:
        - "http.method"
        - "http.status_code"
        - "labels.projectName"
        - "labels.projectId"
        - "labels.zone"
        - "labels.serverGroup"
        - "messaging.system"
        - "db.system"
        - "host.name"        
    span_metrics:
      dimensions:
        - "http.method"
        - "http.status_code"
        - "labels.projectName"
        - "labels.projectId"
        - "labels.zone"
        - "labels.serverGroup"
        - "messaging.system"
        - "db.system"
        - "host.name"
  storage:
    path: /var/tempo/wal
    remote_write:
      - basic_auth:
          password: *************
          username: *************
        headers:
          X-Scope-OrgID: *************
        send_exemplars: true
        url: https://cortex/api/v1/push
distributor:
  ring:
    kvstore:
      store: memberlist  
  receivers:
    jaeger:
      protocols:
        thrift_http:
          endpoint: 0.0.0.0:14268
        grpc:
          endpoint: 0.0.0.0:14250
    otlp:
      protocols:
        http:
          endpoint: 0.0.0.0:55681
        grpc:
          endpoint: 0.0.0.0:4317
query_frontend:
  max_outstanding_per_tenant: 2000
  search:
    target_bytes_per_job: 4e+09
    concurrent_jobs: 2000
    max_duration: 0
querier:
  max_concurrent_queries: 100
  search:
    query_timeout: 60s
  frontend_worker:
    frontend_address: tempo-tempo-distributed-query-frontend-discovery:9095
    grpc_client_config:
      max_recv_msg_size: 536870912
      max_send_msg_size: 536870912
ingester:
  lifecycler:
    ring:
      replication_factor: 1
      kvstore:
        store: memberlist
    tokens_file_path: /var/tempo/tokens.json
  max_block_bytes: 1073741824
memberlist:
  abort_if_cluster_join_fails: false
  join_members:
    - tempo-tempo-distributed-gossip-ring
overrides:
  max_bytes_per_trace: 0
  max_search_bytes_per_trace: 0
  max_traces_per_user: 0
  metrics_generator_processors:
  - span-metrics
  - service-graphs
  per_tenant_override_config: /conf/overrides.yaml
server:
  http_listen_port: 3100
  log_level: info
  log_format: logfmt
  grpc_server_max_recv_msg_size: 5.36870912e+08
  grpc_server_max_send_msg_size: 5.36870912e+08
storage:
  trace:
    block:
      version: vParquet
    backend: gcs
    gcs:
      bucket_name: heimdall-tempo-prod
    blocklist_poll: 5m
    local:
      path: /var/tempo/traces
    wal:
      path: /var/tempo/wal
    cache: memcached
    memcached:
      consistent_hash: true
      host: tempo-tempo-distributed-memcached
      service: memcached-client
      timeout: 500ms

Answered by joe-elliott

Dec 16, 2022

This is a really interesting set of metrics. Thanks for posting it. These are, I believe, the 1.5 vParquet numbers? It's been awhile since we ran that and I've forgotten what the performance was like. So let's do some quick math.

You have ~8MB/sec per ingester: (8MB/s * 4 distributors * 1 RF) / 4 Ingesters
You are using 8GB memory per ingester. So roughly 1GB of memory per 1MB/s?

Yeah, that's not very good. Currently in our largest internal cluster we are using ~7.5 GB of memory per ingester and each one is receiving ~17MB/s. I don't want to draw too strong conclusions from that, but it does seem like tip of main is outperforming 1.5 which is encouraging. Overall we are definitely still w…

View full answer

joe-elliott · 2022-12-16T17:12:01Z

joe-elliott
Dec 16, 2022
Maintainer

This is a really interesting set of metrics. Thanks for posting it. These are, I believe, the 1.5 vParquet numbers? It's been awhile since we ran that and I've forgotten what the performance was like. So let's do some quick math.

You have ~8MB/sec per ingester: (8MB/s * 4 distributors * 1 RF) / 4 Ingesters
You are using 8GB memory per ingester. So roughly 1GB of memory per 1MB/s?

Yeah, that's not very good. Currently in our largest internal cluster we are using ~7.5 GB of memory per ingester and each one is receiving ~17MB/s. I don't want to draw too strong conclusions from that, but it does seem like tip of main is outperforming 1.5 which is encouraging. Overall we are definitely still working to improve the memory usage of parquet and we expect improvements over the next few versions.

Personally, I find the variance in memory usage more concerning than the average. Cutting a large parquet block can be costly in terms of memory and can cause working set to spike multiple GBs. This can make for a challenging to operate system.

To me your distributor usages seem fine (especially since you are also running metrics generator) but let me know if you disagree.

Thoughts on Requests/Limits
This has nothing to do with Tempo, but personally I'd recommend reducing your requests to be p90 ... p99 of your actual usage. The amount you choose will depend on your personal appetite for risk but will help with k8s bin packing and will allow you to get more value out of your hardware.

Config options
Tuning for 2.0 and 1.5 parquet will likely be different and I don't know if I recall exactly what works well for 1.5. I'm going to talk generally about some tuning points to explore and the impact of increasing them or decreasing them. If you are able experiment and report back the community would appreciate it.

I notice here you're taking thrift http. Consuming anything that's not otel requires the distributor to convert it into the otel object model. If you have teams still using jaeger perhaps swapping them to OTEL can help with resources.

distributor:
  receivers:
    jaeger:
      protocols:
        thrift_http:
          endpoint: 0.0.0.0:14268

Ingester settings listed with current defaults and thoughts:

ingester:
  # this is more impactful in a multitenant system and on ingester startup, but having lots of flushes happening at the same time can eat a lot of resources. we have changed this to 4 internally
  concurrent_flushes: 16
  # Try reducing this to clear live traces out faster. 5s?
  flush_check_period: 10s
  # This is a tough one and is highly traffic/pipeline dependent. A lower value will clear traces out of memory faster which reduces memory requirements BUT will increase the number of traces that have to be combined when flushing a block. Recombining traces is costly. Try 5s? 15s? you may find one direction or the other works better for your system.
  trace_idle_period: 10s
  # Try reducing this to decrease the variance. You will flush more, smaller blocks which will increase pressure on compactors, but will likely reduce variance and memory requirements on ingesters.
  max_block_bytes: 1073741824
overrides:
  # outsized traces can have a large negative impact on memory usage. Perhaps set this to something like 10 megabytes and see if you start refusing spans?
  max_bytes_per_trace: 0

Those are some immediate thoughts. Let me know how it goes and we can proceed from there. Also, look forward to Tempo 2.0 which will hopefully allow you to run vParquet with lower overhead.

0 replies

lanore78 · 2022-12-20T09:27:49Z

lanore78
Dec 20, 2022
Author

Thank you for the answer

Yes, we are using Tempo 1.5 with Parque backend enabled, and, I agree with you about the resource consumption of distributor.
We are focusing on tuning ingester side.

About your recommendations

Thoughts on Requests/Limits

That's reasonable. We will reduce these value as you recommend. :)

Config options

All incomming traffics are using OTel already. So, there is nothing to do with this. However we will remove thrift related options.
Other config suggestions.
- It will take times. Hopefully we could share the result by nextweek.

I got few questions.

We are currently receiving OTel traffic over http(s). If we receive OTel traffic over grpc, can reduce cpu or memory consumption?
The gap between max and min value of ingester cpu usage are big. If we can make the gap smaller, so lower the max cpu usage can help reducing cpu usage. Is this make sense?

0 replies

joe-elliott · 2023-01-03T15:07:06Z

joe-elliott
Jan 3, 2023
Maintainer

We are currently receiving OTel traffic over http(s). If we receive OTel traffic over grpc, can reduce cpu or memory consumption?

My guess is not significantly, but I can't say for sure.

The gap between max and min value of ingester cpu usage are big. If we can make the gap smaller, so lower the max cpu usage can help reducing cpu usage. Is this make sense?

Yes, this high variance is due to the cost of cutting a Parquet block. Hopefully reducing max_block_bytes as suggested will reduce variance in CPU usage.

Another thing I should have mentioned is that we have found that larger batches tend to reduce CPU/Memory requirements of distributors and ingesters. We use a batch size of 1000 spans internally. This is configured on the Grafana Agent or OTel Collector depending on your pipeline.

0 replies

lanore78 · 2023-01-08T04:21:10Z

lanore78
Jan 8, 2023
Author

Hello, Joe

Thanks to the settings you recommend, there was a noticeable performance improvement.

Here is summary of result

As you told, lower max_block_bytes reduce CPU/Memory usage and give us almost 2x traffic improvement
The other 3 options(concurrent_flushes , flush_check_period, trace_idle_period) did not significantly affect the performance improvement.
We set max_bytes_per_trace to 10mb. After the change, we have observed 'trace too large' error logs. Most of the cases are caused by traces captured from batch jobs. These traces have average 10,000 spans per trace.
- We had to guide them (who is responsible for the service producing large trace), to avoid large trace.

We are currently summarize the result with numbers, and hopefully share the detailed result soon.

0 replies

shouldwhat · 2023-01-13T09:39:52Z

shouldwhat
Jan 13, 2023

Hello

I'm working with @lanore78 on the same team.

Thanks to the tuning options you suggested, it was very helpful to reduce the CPU/Memory spike.
I will share the test result of max_block_bytes option which was most helpful to us.

p.s) The data below is not a result from the environment(Tempo installed on Google Kubernetes Engine) mentioned in the first question.

Test environment

distributor X 1 (resource limit/request : 2Core / 2GB)
ingester X 1 (resource limit/request : 3Core / 12GB)
traffic
- Trace / sec : 3K
- Span / sec : 11K
- Received Bytes / sec : 4MB

Test result

As max_block_bytes is lowered, the frequency of block cutting increased, but the CPU/Memory resource usage seemed to decrease.
The frequency of compaction seems to have increased in the compactors, but I don't know if it affects CPU/Memory usage.

	(Ingester) CPU usage	(Ingester) Memory usage	(Compactor) CPU usage	(Compactor) Memory usage
Case 1) `max_block_bytes: 1gb`	0.3 ~ 1.8	4gb ~ 8.5gb	0 ~ 0.9	0.1gb ~ 2.5gb
Case 2) `max_block_bytes: 512mb`	0.3 ~ 1.6	3gb ~ 5.5gb	0 ~ 0.9	0.3gb ~ 4gb
Case 3) `max_block_bytes: 256mb`	0.5 ~ 1.0	2gb ~ 3.5gb	0 ~ 0.9	0.3gb ~ 3.5gb
Case 4) `max_block_bytes: 128mb`	0.5 ~ 1.0	2gb ~ 3gb	0 ~ 0.9	0.1gb ~ 4gb

Case 1) `max_block_bytes: 1gb`

Case 2) `max_block_bytes: 512mb`

Case 3) `max_block_bytes: 256mb`

Case 4) `max_block_bytes: 128mb`

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tempo is consuming a lot of cpu and memory. Is there any tuning points? #1946

{{title}}

Replies: 5 comments

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Tempo is consuming a lot of cpu and memory. Is there any tuning points? #1946

lanore78 Dec 13, 2022

Replies: 5 comments

joe-elliott Dec 16, 2022 Maintainer

lanore78 Dec 20, 2022 Author

Thoughts on Requests/Limits

Config options

joe-elliott Jan 3, 2023 Maintainer

lanore78 Jan 8, 2023 Author

shouldwhat Jan 13, 2023

Test environment

Test result

Case 1) max_block_bytes: 1gb

Case 2) max_block_bytes: 512mb

Case 3) max_block_bytes: 256mb

Case 4) max_block_bytes: 128mb

lanore78
Dec 13, 2022

joe-elliott
Dec 16, 2022
Maintainer

lanore78
Dec 20, 2022
Author

joe-elliott
Jan 3, 2023
Maintainer

lanore78
Jan 8, 2023
Author

shouldwhat
Jan 13, 2023

Case 1) `max_block_bytes: 1gb`

Case 2) `max_block_bytes: 512mb`

Case 3) `max_block_bytes: 256mb`

Case 4) `max_block_bytes: 128mb`