
otelcol can't get cpu limit and may cause performance issues. #4459

Open
awx-fuyuanchu opened this issue Nov 19, 2021 · 14 comments

@awx-fuyuanchu

Describe the bug

2021-11-19T03:39:27.622Z info service/collector.go:230 Starting otelcontribcol... {"Version": "v0.37.1", "NumCPU": 4}

Steps to reproduce

Limit the CPU resource to 1 on a K8s node that has 4 CPUs.

What did you expect to see?

otelcol detects the CPU limit (1).

What did you see instead?

otelcol reports the CPU count of the node (4).

What version did you use?
Version: v0.37.1

What config did you use?

Environment

GKE v1.19.14-gke.1900

Additional context

So far, the CPU count is only used in a limited set of features such as the batchProcessor and converter. It could become a risk in the future.
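
As a hedged illustration (not taken from the collector source), the mismatch can be reproduced with a few lines of Go: on the Go releases in use around the time of this thread, `runtime.NumCPU` reports the host's logical CPUs and is not aware of the cgroup CPU quota, and `GOMAXPROCS` defaults to that same value unless overridden.

```go
// Minimal sketch: run this inside a pod with limits.cpu: 1 on a 4-CPU node and
// both values typically print 4, because the Go runtime reads the host CPU
// count rather than the container's CPU quota.
package main

import (
	"fmt"
	"runtime"
)

func main() {
	fmt.Println("NumCPU:", runtime.NumCPU())          // logical CPUs of the node
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0)) // defaults to NumCPU unless overridden
}
```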

@awx-fuyuanchu added the bug label Nov 19, 2021
@bogdandrutu
Member

That is just a log message. The CPU limitation comes from k8s itself, which throttles the process when it reaches the limit. We don't do anything with "NumCPU" except print it. If you believe it is confusing, we can remove that message.

@bogdandrutu removed the bug label Nov 19, 2021
@morigs
Contributor

morigs commented Dec 2, 2021

It's not a problem, of course, but it is actually used here

@bogdandrutu
Member

@morigs interesting. We could have a long debate here: usually the operation executed after batching is an I/O operation (not that CPU intensive). If we limit that to 1 core (in your example), we will probably never even hit 0.7 cores.

So not sure what is the best in this case.

@morigs
Contributor

morigs commented Jan 5, 2022

@bogdandrutu
In the case of the batch processor it's not a problem, it's just a channel size. The real issue is the number of threads.
Processors as well as exporters can perform CPU-intensive tasks (complex sampling, serialization, etc.), so they will try to utilize as many cores as possible. This will lead to throttling (which is a bad thing).
IMO there are two solutions:

  1. Document how to use otel-collector in K8s (setting correct limits and GOMAXPROCS), and probably fix this issue (if not already fixed) in the otel operator.
  2. Use something like this (see the sketch below)
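
A minimal sketch of option 2, assuming the linked approach is uber-go/automaxprocs (which later comments in this thread reference explicitly): a blank import whose init adjusts GOMAXPROCS to match the container's CPU quota rather than the node's CPU count.

```go
package main

import (
	"fmt"
	"runtime"

	// Sets GOMAXPROCS from the cgroup CPU quota during package init.
	_ "go.uber.org/automaxprocs"
)

func main() {
	// With limits.cpu: 1 this would report 1 rather than the node's CPU count.
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}
```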

@Serpent6877

Serpent6877 commented Dec 6, 2022

I am curious about the same potential issue. We use the OpenTelemetry Collector as a sidecar on GKE. We use 0.48.0 and I see this in the logs:

service/collector.go:252 Starting otelcol-contrib... {"Version": "0.48.0", "NumCPU": 16}

which is the node's virtual CPU count. We allocate 1 to 4 CPUs depending on the deployment. So for the 1 CPU pods, are we potentially having issues? We handle pretty high volumes of traffic.

@gebn

gebn commented Jan 25, 2023

Prometheus has a currently-experimental --enable-feature auto-gomaxprocs flag which triggers uber-go/automaxprocs and has worked really well for us.

@jpkrohling
Member

Document how to use otel-collector in K8s (setting correct limits and GOMAXPROCS), and probably fix this issue (if not already fixed) in the otel operator.

I'm in favor of giving this a try. @open-telemetry/operator-approvers, what do you think?

@morigs
Contributor

morigs commented Jan 25, 2023

Should this be implemented as a core feature enabled by default? Or as an extension?

@pavolloffay
Member

I'm in favor of giving this a try. https://github.com/orgs/open-telemetry/teams/operator-approvers, what do you think?

Agree on improving this. What changes are proposed for the operator? Should the operator set GOMAXPROCS?

@jpkrohling
Member

Should the operator set GOMAXPROCS?

Yes, I think it would be a good start.

@frzifus
Member

frzifus commented Feb 2, 2023

How would a CPU limit of 0.9 or 1.1 then be reflected? Is it just GOMAXPROCS=1?

@jpkrohling
Member

I would round it up: 0.9 becomes 1, 1.1 becomes 2.
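
A hedged sketch of this rounding (illustration only, not operator code; the helper name is made up): convert a fractional CPU limit expressed in millicores to a GOMAXPROCS value by taking the ceiling, with a floor of 1.

```go
package main

import (
	"fmt"
	"math"
)

// gomaxprocsForLimit is a hypothetical helper used only for illustration.
func gomaxprocsForLimit(cpuLimitMillicores int64) int {
	n := int(math.Ceil(float64(cpuLimitMillicores) / 1000.0))
	if n < 1 {
		n = 1 // GOMAXPROCS must be at least 1
	}
	return n
}

func main() {
	fmt.Println(gomaxprocsForLimit(900))  // 0.9 CPU -> 1
	fmt.Println(gomaxprocsForLimit(1100)) // 1.1 CPU -> 2
}
```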

@edwintye

I accidentally stumbled onto the same problem, as we were experiencing a lot of throttling during a spike of traffic. It seems to me that there already exists a mechanism to set GOMAXPROCS by adding

env:
  - name: GOMAXPROCS
    valueFrom:
      resourceFieldRef:
        containerName: otc-container
        # limits.cpu is divided by the divisor (default "1") and rounded up,
        # so a fractional limit such as 1.5 is exposed as 2
        resource: limits.cpu

to either the CR for the operator or directly into the deployment. Is there a scenario where using the roundup mechanism of native k8s is not as good?

@max-frank

Is there a scenario where using the roundup mechanism of native k8s is not as good?

Any container environment other than k8s that does not easily support a mechanism like resourceFieldRef, e.g. GCP Cloud Run.

matthewhughes-uw added a commit to utilitywarehouse/opentelemetry-manifests that referenced this issue Sep 30, 2024
By setting `GOMAXPROCS` to the number of CPUs available to the pod via the
'downward API'[1]. This is based on [2]; otherwise the collector will use
`runtime.NumCPU` (i.e. the number of processors available to the _node_) when
setting up batch processing.

[1] https://kubernetes.io/docs/concepts/workloads/pods/downward-api/#downwardapi-resourceFieldRef
[2] open-telemetry/opentelemetry-collector#4459 (comment)
matthewhughes-uw added a commit to utilitywarehouse/opentelemetry-manifests that referenced this issue Oct 1, 2024
By setting `GOMAXPROCS` to the number of CPUs available to the pod via the
'downward API'[1]. This is based on [2]; otherwise the collector will use
`runtime.NumCPU` (i.e. the number of processors available to the _node_) when
setting up batch processing.

Some experimentation: I modified our pubsub producer to start producing
_lots_ of messages (and hence events) to stress test the collector. Before
this change, once the collector queue was full, it seemed like nothing, not
even adding a bunch more pods, would resolve the issue. Here is the rate of
span production:


![span_rate](https://github.com/user-attachments/assets/a9373130-6b9e-46c3-8ab0-982f373b9dc2)

Which quickly filled up the collector queue:


![collector](https://github.com/user-attachments/assets/aaf2e2d8-934e-48ed-af3d-4152d2864ab0)

However, even if I added a bunch more pods they would just start
throttling:


![throttle](https://github.com/user-attachments/assets/a1fda4b9-bd2b-4542-9ef7-bc5b438dbc98)

Though the CPU usage never looked alarming (i.e. it never approached the
limit of `2`):


![cpu](https://github.com/user-attachments/assets/e7502883-3a67-4007-a4ad-3ff1f36f606f)

With this change in place we were able to process more events with the
same number of pods:


![span_rate_fix](https://github.com/user-attachments/assets/16517473-a77e-4d68-bd94-dc67c7425464)

without _any_ CPU throttling. The queue did still build up, but I expect this
could be resolved by adjusting our `HorizontalPodAutoscaler`.

For comparison, here's CPU usage with the fix:


![cpu_fix](https://github.com/user-attachments/assets/c31d1b9a-29bb-41ec-bad0-11e0956f812e)

[1]
https://kubernetes.io/docs/concepts/workloads/pods/downward-api/#downwardapi-resourceFieldRef
[2]
open-telemetry/opentelemetry-collector#4459 (comment)