Splunk exporter causing CPU regression with version update #29560

Closed
sarkisjad opened this issue Nov 29, 2023 · 6 comments
Labels: bug (Something isn't working), exporter/splunkhec

Comments

@sarkisjad

Describe the bug
When comparing versions v0.54.0 and v0.87.0, we see a major CPU regression. The regression appears when we add the Splunk exporter to our YAML config.

Steps to reproduce
To reproduce the issue, I ran the pprof profiler for the two versions against the same application that uses the OpenTelemetry Collector.
We add pprof in the extensions section of the config YAML file:

extensions:
  pprof:
    # the CPU profile is written to this file when the collector shuts down
    save_to_file: ${file_name}

The saved file is a binary profile that can be rendered as a flame graph using Graphviz.
We can visualize the results by running go tool pprof -http=: ${file_name} on the command line.
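
As a side note, pprof can also diff two saved profiles directly, which makes the version-to-version comparison easier; the profile file names below are placeholders:

go tool pprof -http=: -diff_base=profile_v0.54.0 profile_v0.87.0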

Here are the results (flame graph screenshots for v0.54.0 and v0.87.0 attached).

What did you expect to see?
I expected not to see a big difference in CPU usage between the two versions.

What did you see instead?
Here is a table outlining what is causing the main issue (table screenshot attached).
Indeed, there is a 656% increase in CPU usage.

What version did you use?
Version: v0.54.0, v0.87.0

What config did you use?

receivers:
  otlp: # pushed by agents
    protocols:
      grpc:
        endpoint: :${AGENT_RECEIVER_PORT}
  otlp/groupby: # pushed by env probes
    protocols:
      grpc:
        endpoint: :${OTLP_RECEIVER_PORT}

exporters:
  splunk_hec/splunk_metrics: # push to Splunk.
    token: ${SPLUNK_TOKEN}
    endpoint: ${SPLUNK_ENDPOINT}
    source: "mx"
    sourcetype: "otel"
    index: "metrics_test"
    tls:
      insecure_skip_verify: true
    retry_on_failure:
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 3600s
    sending_queue:
      queue_size: 100000
      storage: file_storage # requires a file_storage extension declared under extensions

extensions:
  pprof:
    save_to_file: profiling_results

service:
  extensions: [pprof]
  pipelines:
    metrics/splunk/env_probes:
      receivers: [otlp/groupby]
      exporters: [splunk_hec/splunk_metrics]
    metrics/splunk/agents:
      receivers: [otlp]
      exporters: [splunk_hec/splunk_metrics]
sarkisjad added the bug label on Nov 29, 2023
mx-psi transferred this issue from open-telemetry/opentelemetry-collector on Nov 29, 2023

Pinging code owners for exporter/splunkhec: @atoulme @dmitryax. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@atoulme
Contributor

atoulme commented Nov 29, 2023

You might be comparing apples and oranges here because of the vast range of version changes. We had quite a few fixes go in to improve how metrics are exported. You could test intermediate versions to zoom in on the one where the issue first appears, though I don't expect you to do that.

Here are things to look at:

  • Your sending queue capacity is quite big; are you sure you mean to queue up 100000 metrics? Typically we drop metrics if they fall behind instead of trying to persist them. I guess, combined with retries, that you're aiming to sustain traffic during an outage for 3600s. That is fine, but I still wanted to point out it's not what we typically ship with.
  • You can try the multi-metric HEC event format to reduce traffic and computation by setting use_multi_metric_format to true. It works with newer versions of Splunk (> 8 iirc). It batches and retries differently, and that might help a lot with computation.
  • Look at max_content_length_metrics to see if you can safely increase it. This setting was not applied correctly in 0.54.0. A sketch of both settings is shown below.
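
For reference, a minimal sketch of how those two settings could sit in the existing exporter block; the values are illustrative, not recommendations:

exporters:
  splunk_hec/splunk_metrics:
    token: ${SPLUNK_TOKEN}
    endpoint: ${SPLUNK_ENDPOINT}
    # send metrics as multi-metric HEC events (needs a recent Splunk version)
    use_multi_metric_format: true
    # maximum payload size in bytes for metric requests; 2 MiB shown as an example
    max_content_length_metrics: 2097152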

@atoulme
Contributor

atoulme commented Nov 29, 2023

While you're in there, also make sure to move to 0.90.0 to pick up the change introduced by #27776, which should reduce the CPU costs quite a bit (a 78% to 92% reduction).

@sarkisjad
Author

Hello @atoulme,
Thank you for getting back to me.

There was something I forgot to add that I thought might affect the performance of the Splunk exporter.
Do you think processors might have an effect on the regression?
I am adding the following processors to the metrics pipelines:

processors:
  batch/metrics:
    timeout: 200ms
    send_batch_size: 1000
    send_batch_max_size: 1000
  resource/run_id:
    attributes:
      - key: ${KEY}
        value: ${RUN_ID}
        action: insert
  filter/non_functional: # filter metrics sent to Splunk
    metrics:
      exclude:
        match_type: strict
        metric_names:
          - ${METRIC_NAME}
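
For context, and assuming the same pipelines as in the config above, these processors would be wired in roughly like this (a sketch, not the exact production config):

service:
  pipelines:
    metrics/splunk/env_probes:
      receivers: [otlp/groupby]
      processors: [batch/metrics, resource/run_id, filter/non_functional]
      exporters: [splunk_hec/splunk_metrics]
    metrics/splunk/agents:
      receivers: [otlp]
      processors: [batch/metrics, resource/run_id, filter/non_functional]
      exporters: [splunk_hec/splunk_metrics]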

@atoulme
Contributor

atoulme commented Nov 30, 2023

None of the data you showed pointed to processors. Any processor in the pipeline adds overhead, but that should be manageable.

@sarkisjad
Author

Alright, thank you for your input!
