HostMetrics process scraper high CPU usage during collection on Windows Server 2019 #32947

Open
drewftw opened this issue May 8, 2024 · 19 comments
Labels: bug, receiver/hostmetrics

Comments

drewftw commented May 8, 2024

Component(s)

receiver/hostmetrics

What happened?

Description

The OpenTelemetry Collector running on Windows Server 2019 was observed to have high CPU spikes (3-7%) each time the hostmetrics receiver ran a collection, which was configured at a 1-minute interval.

[screenshot]

After testing, the issue was narrowed down to the process scraper. The following shows the collector's CPU usage when only the process scraper is enabled.

[screenshot]

After re-enabling all other hostmetrics scrapers except for the process scraper, the magnitude of the CPU spikes comes down significantly (<0.5%).

[screenshot, 2024-04-30]

Steps to Reproduce

1. On a machine running Windows Server 2019, download the v0.94.0 release of the OpenTelemetry Collector from https://github.com/open-telemetry/opentelemetry-collector-releases/releases/tag/v0.94.0.

2. Modify config.yaml to enable the hostmetrics process scraper and set the collection interval (see the config attached to this issue, or the minimal sketch after these steps).

3. Run the collector executable.

4. Monitor the collector's CPU usage in Task Manager, or graph it with perfmon.
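
A minimal sketch of a config that isolates the process scraper (the debug exporter here stands in for whatever exporter is actually in use; it is not the full config attached to this issue):

receivers:
  hostmetrics:
    collection_interval: 1m
    scrapers:
      process:
        mute_process_exe_error: true

exporters:
  debug:
    verbosity: normal

service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      exporters: [debug]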

Expected Result

CPU usage comparable to observed levels on Linux collectors (<0.5%)

Actual Result

CPU spikes to 3-7%

Collector version

v0.93.0

Environment information

Environment

OS: Windows Server 2019

OpenTelemetry Collector configuration

receivers:
  hostmetrics:
    collection_interval: 1m
    scrapers:
      cpu:
        metrics:
          system.cpu.utilization:
            enabled: true
      disk:
      load:
      filesystem:
        metrics:
          system.filesystem.utilization:
            enabled: true
      memory:
        metrics:
          system.memory.utilization:
            enabled: true
      network:
      paging:
        metrics:
          system.paging.utilization:
            enabled: true
      processes:
      process:
        mute_process_exe_error: true
        metrics:
          process.cpu.utilization:
            enabled: true
          process.memory.utilization:
            enabled: true
  docker_stats:
    collection_interval: 1m
    metrics:
      container.cpu.throttling_data.periods:
        enabled: true
      container.cpu.throttling_data.throttled_periods:
        enabled: true
      container.cpu.throttling_data.throttled_time:
        enabled: true
  prometheus:
    config:
      scrape_configs:
        - job_name: $InstanceId/otel-self-metrics-collector-$Region
          scrape_interval: 1m
          static_configs:
            - targets: ['0.0.0.0:9999']
  otlp:
    protocols:
      grpc:
      http:

exporters:
  debug:
    verbosity: normal
  otlp:
    endpoint: <endpoint>

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 500
    spike_limit_mib: 100
  batch:
    send_batch_size: 8192
    send_batch_max_size: 8192
    timeout: 2000ms
  filter:
    metrics:
      exclude:
        match_type: strict
        metric_names:
          # comment a metric to remove from exclusion rule
          - otelcol_exporter_queue_capacity
          - otelcol_exporter_enqueue_failed_spans
          - otelcol_exporter_enqueue_failed_log_records
          - otelcol_exporter_enqueue_failed_metric_points
          - otelcol_exporter_send_failed_metric_points
          - otelcol_process_runtime_heap_alloc_bytes
          - otelcol_process_runtime_total_alloc_bytes
          - otelcol_processor_batch_timeout_trigger_send
          - otelcol_process_runtime_total_sys_memory_bytes
          - otelcol_process_uptime
          - otelcol_scraper_errored_metric_points
          - otelcol_scraper_scraped_metric_points
          - scrape_samples_scraped
          - scrape_samples_post_metric_relabeling
          - scrape_series_added
          - scrape_duration_seconds
          # - up
  resourcedetection:
    detectors: [ec2, env, system]
    ec2:
      tags:
        - ^Environment$
    system:
      hostname_sources: ["os"]
      resource_attributes:
        host.id:
          enabled: true

extensions:
  health_check:
  pprof:
  zpages:

service:
  telemetry:
    metrics:
      level: detailed
      address: 0.0.0.0:9999
  extensions: [pprof, zpages, health_check]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug, otlp]
      processors: [memory_limiter, batch, resourcedetection]
    metrics:
      receivers: [otlp, hostmetrics, prometheus]
      exporters: [debug, otlp]
      processors: [memory_limiter, batch, resourcedetection, filter]
    logs:
      receivers: [otlp]
      exporters: [debug, otlp]
      processors: [memory_limiter, batch, resourcedetection]

Log output

No response

Additional context

Additional details: Windows Server 2019 was running on an m5x.large EC2 instance.

drewftw added the bug and needs triage labels on May 8, 2024
github-actions bot (Contributor) commented May 8, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

braydonk (Contributor) commented May 8, 2024

Hi @drewftw,

I made improvements to the CPU usage of the process scraper in v0.99.0 of the collector. Would you be able to update the collector and give that a try? Hopefully that should make it better.

drewftw (Author) commented May 8, 2024

Hey @braydonk, thanks for your quick response! Sure, I can try v0.99 and see if it helps the issue. I've been testing with v0.94 since that's the version my users are on; they haven't upgraded yet.

braydonk (Contributor) commented May 8, 2024

Here's the issue with the explanation of the CPU usage and how it was fixed in v0.99.0: #28849

We can't be 100% sure you aren't running into something different since this was focused on Linux, but it's worth seeing if this helps in your scenario.

drewftw (Author) commented May 23, 2024

@braydonk We're still observing a similar pattern after upgrading to v0.99.0: CPU spiking to 5% when metrics are being scraped. Anything I can investigate to provide more info?

[screenshot, 2024-05-23]

braydonk (Contributor) commented:

Thanks for the info @drewftw. I don't expect I'll need anything from your environment; I expect this is the same thing many users are experiencing rather than a specific breakage. The inefficiencies that existed on Linux may exist in different ways on Windows. I'll replicate the same research I did on Linux in my Windows environment.

I expect I can set aside some time next week; I will keep this issue updated with progress.

braydonk (Contributor) commented May 29, 2024

I had time to investigate this today and I opened a PR with details and a fix!

crobert-1 removed the needs triage label on May 29, 2024
crobert-1 (Member) commented:

Removed needs triage as a code owner has opened a PR to resolve this issue.

github-actions bot (Contributor) commented:

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot (Contributor) commented:

This issue has been closed as inactive because it has been stale for 120 days with no activity.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Sep 27, 2024
braydonk (Contributor) commented:

Not Stale. The PR for this is open and ready for review. Can be marked stale-exempt.

rustyautopsy commented Feb 19, 2025

We are getting hit by this issue as well. It would be a shame to have to pivot to something like the Prometheus Windows Exporter to capture these metrics. I suppose we could fork the code for this scraper, which would also be less than ideal.

braydonk (Contributor) commented:

What would your plan be if you were forking? My experiment in #35337 turned out not to pay off; the memory usage of WMI ends up being really high anyway so there wasn't really any gain overall. Do you have any other ideas?

I could take just the portion of that PR that worked: detecting whether the Parent PID resource attribute is disabled and, if so, skipping that collection. That did have big performance gains, provided you don't need the Parent PID.

braydonk reopened this on Feb 19, 2025
rustyautopsy commented:

The high-level plan was to start playing with the PR as a base and see if that helped our issues. However, seeing that your tests didn't result in the performance gains, we're not sure we would stumble across something the SMEs have not already tried.

So, we are back to square one.

I could take just the portion of that PR that worked, which was detecting if the Parent PID resource attribute was disabled and skipping collection, which did have big performance gains provided you don't need that Parent PID.

We would be happy to try this to see if it helps us.

rustyautopsy commented:

Further tinkering has shown that even if we disable all the default metrics, the CPU usage is still high. We figure the scraper is still collecting the data and simply discarding it.

Additionally, the hosts where CPU usage is high have ~800 processes running. Hosts with ~100 processes are not affected.

We thought we would share in case this helps.
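
For reference, a sketch of the kind of change we tried, disabling the process scraper's default metrics (the metric names here are the scraper's documented defaults; our exact config may differ slightly):

receivers:
  hostmetrics:
    collection_interval: 1m
    scrapers:
      process:
        metrics:
          process.cpu.time:
            enabled: false
          process.disk.io:
            enabled: false
          process.memory.usage:
            enabled: false
          process.memory.virtual:
            enabled: false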

braydonk (Contributor) commented:

Were you trying that with the PR from earlier? If not, disabling metrics or resource attributes does not currently stop the scraper from collecting the expensive attribute, which is the Parent PID. I'm working on a PR that makes it so the Parent PID is not collected when that resource attribute is disabled, which should reduce CPU usage. In my testing on the PR above with the Parent PID disabled, it seemed pretty effective.
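
With that change, skipping the expensive Parent PID lookup would come down to disabling the resource attribute; a sketch, assuming the resource attribute keeps its current process.parent_pid name:

receivers:
  hostmetrics:
    scrapers:
      process:
        resource_attributes:
          process.parent_pid:
            enabled: false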

braydonk (Contributor) commented Feb 20, 2025

I went back and redid my old experiments to evaluate the tradeoffs of #35337; I think I was overestimating how badly WMI is affected. I don't think it actually "outpaces" the CPU gains on the collector; the usage goes up, but it really doesn't seem that bad (green = no fix, red = fix, blue = WMI):

[screenshot: Comparing two OTel Collectors, one with the WMI parent PID feature]

Even graphing other metrics like Working Set shows the collector not having a substantial impact immediately, but I'm not super familiar with the depths of WMI.

[screenshot: Working Set usage of two OTel Collectors and WMI]

(Side note: I don't know why the working set of the collector with the fix is so much lower; I didn't think my fix would have that impact.)

The main WmiPrvSE process doesn't change much, but WmiPrvSE#1, which I assume is some manner of worker process that the process scraper ends up starting, goes away when the Collector isn't running. So there is definitely a memory impact as well, but it doesn't seem too bad in my experiments.

[screenshot: Working Set of just WMI]

Now, it might be that WMI does some manner of optimization, holding onto old data or something; maybe this wasn't scientific enough. I can definitely use something like Windows Performance Recorder to do a more in-depth investigation.

But with this in mind, I'm inclined to actually revive my old PR instead. I feel a bit more confident that the change is a net-positive. I will rebase the PR to current HEAD and start to address the comments there.

rustyautopsy commented:

To close the loop: we did not try the PR; we will wait for your changes before trying again.

pjanotti (Contributor) commented Mar 5, 2025

One suggestion for a roadmap for the process scraper on Windows is described in #35337 (comment).
