
otelcol can't get cpu limit and may cause performance issues. #4459

Open
awx-fuyuanchu opened this issue Nov 19, 2021 · 14 comments

@awx-fuyuanchu

Describe the bug

2021-11-19T03:39:27.622Z info service/collector.go:230 Starting otelcontribcol... {"Version": "v0.37.1", "NumCPU": 4}

Steps to reproduce

Limit the CPU resource to 1 on a K8s node that has 4 CPUs.

What did you expect to see?

otelcol detects the CPU limit (1).

What did you see instead?

otelcol reports the CPU count of the node (4).

What version did you use?
Version: v0.37.1

What config did you use?

Environment

GKE v1.19.14-gke.1900

Additional context

So far, the CPU count is only used in a limited set of features such as the batchProcessor and converter. It could become a risk in the future.
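
As a hedged illustration (not taken from the collector source), the mismatch can be reproduced with a few lines of Go: on the Go releases in use around the time of this thread, `runtime.NumCPU` reports the host's logical CPUs and is not aware of the cgroup CPU quota, and `GOMAXPROCS` defaults to that same value unless overridden.

```go
// Minimal sketch: run this inside a pod with limits.cpu: 1 on a 4-CPU node and
// both values typically print 4, because the Go runtime reads the host CPU
// count rather than the container's CPU quota.
package main

import (
	"fmt"
	"runtime"
)

func main() {
	fmt.Println("NumCPU:", runtime.NumCPU())          // logical CPUs of the node
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0)) // defaults to NumCPU unless overridden
}
```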

@awx-fuyuanchu added the bug label Nov 19, 2021
@bogdandrutu
Member

That is just a log message. The CPU limitation comes from k8s itself, which throttles the process when it reaches the limit. We don't do anything with "NumCPU" except print it. If you believe it is confusing, we can remove that message.

@bogdandrutu removed the bug label Nov 19, 2021
@morigs
Contributor

morigs commented Dec 2, 2021

It's not a problem, of course, but it is actually used here

@bogdandrutu
Member

@morigs interesting. We could have a long debate here: usually the operation executed after batching is an I/O operation (not that CPU intensive). If we limit that to 1 core (in your example), we will probably never even hit 0.7 cores.

So not sure what is the best in this case.

@morigs
Contributor

morigs commented Jan 5, 2022

@bogdandrutu
In the case of the batch processor it's not a problem, it's just a channel size. The real issue is the number of threads.
Processors as well as exporters can perform CPU-intensive tasks (complex sampling, serialization, etc.), so they will try to utilize as many cores as possible. This will lead to throttling (which is a bad thing).
IMO there are two solutions:

  1. Document how to use otel-collector in K8s (setting correct limits and GOMAXPROCS), and probably fix this issue (if not already fixed) in the otel operator.
  2. Use something like this (see the sketch below)
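
A minimal sketch of option 2, assuming the linked approach is uber-go/automaxprocs (which later comments in this thread reference explicitly): a blank import whose init adjusts GOMAXPROCS to match the container's CPU quota rather than the node's CPU count.

```go
package main

import (
	"fmt"
	"runtime"

	// Sets GOMAXPROCS from the cgroup CPU quota during package init.
	_ "go.uber.org/automaxprocs"
)

func main() {
	// With limits.cpu: 1 this would report 1 rather than the node's CPU count.
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}
```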

@Serpent6877

Serpent6877 commented Dec 6, 2022

I am curious about the same potential issue. We use the OpenTelemetry Collector as a sidecar on GKE. We use 0.48.0 and I see this in the logs:

service/collector.go:252 Starting otelcol-contrib... {"Version": "0.48.0", "NumCPU": 16}

which is the node's virtual CPU count. We allocate 1 to 4 CPUs depending on the deployment. So for the 1 CPU pods, are we potentially having issues? We handle pretty high volumes of traffic.

@gebn

gebn commented Jan 25, 2023

Prometheus has a currently-experimental --enable-feature auto-gomaxprocs flag which triggers uber-go/automaxprocs and has worked really well for us.

@jpkrohling
Member

Document how to use otel-collector in K8s (setting correct limits and GOMAXPROCS), and probably fix this issue (if not already fixed) in the otel operator.

I'm in favor of giving this a try. @open-telemetry/operator-approvers, what do you think?

@morigs
Contributor

morigs commented Jan 25, 2023

Should this be implemented as a core feature enabled by default? Or as an extension?

@pavolloffay
Member

I'm in favor of giving this a try. https://github.com/orgs/open-telemetry/teams/operator-approvers, what do you think?

Agree on improving this. What changes are proposed for the operator? Should the operator set GOMAXPROCS?

@jpkrohling
Member

Should the operator set GOMAXPROCS?

Yes, I think it would be a good start.

@frzifus
Member

frzifus commented Feb 2, 2023

How would a CPU limit of 0.9 or 1.1 then be reflected? Is it just GOMAXPROCS=1?

@jpkrohling
Member

I would round it up: 0.9 becomes 1, 1.1 becomes 2.
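
A hedged sketch of this rounding (illustration only, not operator code; the helper name is made up): convert a fractional CPU limit expressed in millicores to a GOMAXPROCS value by taking the ceiling, with a floor of 1.

```go
package main

import (
	"fmt"
	"math"
)

// gomaxprocsForLimit is a hypothetical helper used only for illustration.
func gomaxprocsForLimit(cpuLimitMillicores int64) int {
	n := int(math.Ceil(float64(cpuLimitMillicores) / 1000.0))
	if n < 1 {
		n = 1 // GOMAXPROCS must be at least 1
	}
	return n
}

func main() {
	fmt.Println(gomaxprocsForLimit(900))  // 0.9 CPU -> 1
	fmt.Println(gomaxprocsForLimit(1100)) // 1.1 CPU -> 2
}
```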

@edwintye

I accidentally stumbled onto the same problem, as we were experiencing a lot of throttling during a spike of traffic. It seems to me that there already exists a mechanism to set GOMAXPROCS by adding

env:
  - name: GOMAXPROCS
    valueFrom:
      resourceFieldRef:
        containerName: otc-container
        # limits.cpu is divided by the divisor (default "1") and rounded up,
        # so a fractional limit such as 1.5 is exposed as 2
        resource: limits.cpu

to either the CR for the operator or directly into the deployment. Is there a scenario where using the roundup mechanism of native k8s is not as good?

@max-frank

Is there a scenario where using the roundup mechanism of native k8s is not as good?

Any container environment other than k8s that does not easily support a mechanism like resourceFieldRef, e.g. GCP Cloud Run.

matthewhughes-uw added a commit to utilitywarehouse/opentelemetry-manifests that referenced this issue Sep 30, 2024
By setting `GOMAXPROCS` to the number of CPUs available to the pod via the
'downward API'[1]. This is based on [2]; otherwise the collector will use
`runtime.NumCPU` (i.e. the number of processors available to the _node_) when
setting up batch processing.

[1] https://kubernetes.io/docs/concepts/workloads/pods/downward-api/#downwardapi-resourceFieldRef
[2] open-telemetry/opentelemetry-collector#4459 (comment)
matthewhughes-uw added a commit to utilitywarehouse/opentelemetry-manifests that referenced this issue Oct 1, 2024
By setting `GOMAXPROCS` to the number of CPUs available to the pod via the
'downward API'[1]. This is based on [2]; otherwise the collector will use
`runtime.NumCPU` (i.e. the number of processors available to the _node_) when
setting up batch processing.

Some experimentation: I modified our pubsub producer to start producing
_lots_ of messages (and hence events) to stress test the collector. Before
this change, once the collector queue was full, it seemed like nothing, not
even adding a bunch more pods, would resolve the issue. Here is the rate of
span production:


![span_rate](https://github.com/user-attachments/assets/a9373130-6b9e-46c3-8ab0-982f373b9dc2)

Which quickly filled up the collector queue:


![collector](https://github.com/user-attachments/assets/aaf2e2d8-934e-48ed-af3d-4152d2864ab0)

However, even if I added a bunch more pods they would just start
throttling:


![throttle](https://github.com/user-attachments/assets/a1fda4b9-bd2b-4542-9ef7-bc5b438dbc98)

Though the CPU usage never looked alarming (i.e. it never approached the
limit of `2`):


![cpu](https://github.com/user-attachments/assets/e7502883-3a67-4007-a4ad-3ff1f36f606f)

With this change in place we were able to process more events with the
same number of pods:


![span_rate_fix](https://github.com/user-attachments/assets/16517473-a77e-4d68-bd94-dc67c7425464)

without _any_ CPU throttling. The queue did still build up, but I expect this
could be resolved by adjusting our `HorizontalPodAutoscaler`.

For comparison, here's CPU usage with the fix:


![cpu_fix](https://github.com/user-attachments/assets/c31d1b9a-29bb-41ec-bad0-11e0956f812e)

[1]
https://kubernetes.io/docs/concepts/workloads/pods/downward-api/#downwardapi-resourceFieldRef
[2]
open-telemetry/opentelemetry-collector#4459 (comment)