Splunk exporter causing CPU regression with version update #29560

Closed
sarkisjad opened this issue Nov 29, 2023 · 6 comments
Labels: bug (Something isn't working), exporter/splunkhec

Comments

@sarkisjad

Describe the bug
When comparing versions v0.54.0 and v0.87.0, we see a major CPU regression. The regression appears when we add the Splunk exporter to our YAML config.

Steps to reproduce
To reproduce the issue, I ran the pprof profiler for the two versions against the same application that uses the OpenTelemetry Collector.
We add pprof in the extensions section of the config YAML file:

extensions:
  pprof:
    # the CPU profile is written to this file when the collector shuts down
    save_to_file: ${file_name}

The saved file is a binary profile that can be rendered as a flame graph using Graphviz.
We can visualize the results by running go tool pprof -http=: ${file_name} on the command line.
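
As a side note, pprof can also diff two saved profiles directly, which makes the version-to-version comparison easier; the profile file names below are placeholders:

go tool pprof -http=: -diff_base=profile_v0.54.0 profile_v0.87.0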

Here are the results (flame graph screenshots for v0.54.0 and v0.87.0 attached).

What did you expect to see?
I expected not to see a big difference in CPU usage between the two versions.

What did you see instead?
Here is a table outlining what is causing the main issue (table screenshot attached).
Indeed, there is a 656% increase in CPU usage.

What version did you use?
Version: v0.54.0, v0.87.0

What config did you use?

receivers:
  otlp: # pushed by agents
    protocols:
      grpc:
        endpoint: :${AGENT_RECEIVER_PORT}
  otlp/groupby: # pushed by env probes
    protocols:
      grpc:
        endpoint: :${OTLP_RECEIVER_PORT}

exporters:
  splunk_hec/splunk_metrics: # push to Splunk.
    token: ${SPLUNK_TOKEN}
    endpoint: ${SPLUNK_ENDPOINT}
    source: "mx"
    sourcetype: "otel"
    index: "metrics_test"
    tls:
      insecure_skip_verify: true
    retry_on_failure:
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 3600s
    sending_queue:
      queue_size: 100000
      storage: file_storage # requires a file_storage extension declared under extensions

extensions:
  pprof:
    save_to_file: profiling_results

service:
  extensions: [pprof]
  pipelines:
    metrics/splunk/env_probes:
      receivers: [otlp/groupby]
      exporters: [splunk_hec/splunk_metrics]
    metrics/splunk/agents:
      receivers: [otlp]
      exporters: [splunk_hec/splunk_metrics]
sarkisjad added the bug label on Nov 29, 2023
mx-psi transferred this issue from open-telemetry/opentelemetry-collector on Nov 29, 2023

Pinging code owners for exporter/splunkhec: @atoulme @dmitryax. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@atoulme
Contributor

atoulme commented Nov 29, 2023

You might be comparing apples and oranges here because of the vast range of version changes. We had quite a few fixes go in to improve how metrics are exported. You could test intermediate versions to zoom in on the one where the issue first appears, though I don't expect you to do that.

Here are things to look at:

  • Your sending queue capacity is quite big; are you sure you mean to queue up 100000 metrics? Typically we drop metrics if they fall behind instead of trying to persist them. I guess, combined with retries, that you're aiming to sustain traffic during an outage for 3600s. That is fine, but I still wanted to point out it's not what we typically ship with.
  • You can try the multi-metric HEC event format to reduce traffic and computation by setting use_multi_metric_format to true. It works with newer versions of Splunk (> 8 iirc). It batches and retries differently, and that might help a lot with computation.
  • Look at max_content_length_metrics to see if you can safely increase it. This setting was not applied correctly in 0.54.0. A sketch of both settings is shown below.
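
For reference, a minimal sketch of how those two settings could sit in the existing exporter block; the values are illustrative, not recommendations:

exporters:
  splunk_hec/splunk_metrics:
    token: ${SPLUNK_TOKEN}
    endpoint: ${SPLUNK_ENDPOINT}
    # send metrics as multi-metric HEC events (needs a recent Splunk version)
    use_multi_metric_format: true
    # maximum payload size in bytes for metric requests; 2 MiB shown as an example
    max_content_length_metrics: 2097152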

@atoulme
Contributor

atoulme commented Nov 29, 2023

While you're in there, also make sure to move to 0.90.0 to pick up the change introduced by #27776, which should reduce the CPU costs quite a bit (a 78% to 92% reduction).

@sarkisjad
Author

Hello @atoulme,
Thank you for getting back to me.

There was something I forgot to add that I thought might affect the performance of the Splunk exporter.
Do you think processors might have an effect on the regression?
I am adding the following processors to the metrics pipelines:

processors:
  batch/metrics:
    timeout: 200ms
    send_batch_size: 1000
    send_batch_max_size: 1000
  resource/run_id:
    attributes:
      - key: ${KEY}
        value: ${RUN_ID}
        action: insert
  filter/non_functional: # filter metrics sent to Splunk
    metrics:
      exclude:
        match_type: strict
        metric_names:
          - ${METRIC_NAME}
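
For context, and assuming the same pipelines as in the config above, these processors would be wired in roughly like this (a sketch, not the exact production config):

service:
  pipelines:
    metrics/splunk/env_probes:
      receivers: [otlp/groupby]
      processors: [batch/metrics, resource/run_id, filter/non_functional]
      exporters: [splunk_hec/splunk_metrics]
    metrics/splunk/agents:
      receivers: [otlp]
      processors: [batch/metrics, resource/run_id, filter/non_functional]
      exporters: [splunk_hec/splunk_metrics]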

@atoulme
Contributor

atoulme commented Nov 30, 2023

None of the data you showed pointed to processors. Any processor in the pipeline adds overhead, but that should be manageable.

@sarkisjad
Author

Alright, thank you for your input!
