OpenTelemetry Collector does not gracefully shutdown, losing metrics on spot instance termination #33441

Open
Rommmmm opened this issue Jun 9, 2024 · 9 comments
Labels
bug (Something isn't working) · exporter/datadog (Datadog components) · priority:p2 (Medium) · Stale

Comments

Rommmmm commented Jun 9, 2024

Component(s)

datadogexporter

What happened?

Description

We are currently experiencing an issue with the OpenTelemetry Collector running in our Kubernetes cluster, which is managed by Karpenter. Our setup involves spot instances, and we've noticed that when Karpenter terminates these instances, the OpenTelemetry Collector does not seem to shut down gracefully. Consequently, we are losing metrics and traces that are presumably still in the process of being processed or exported.

Steps to Reproduce

  1. Deploy the OpenTelemetry Collector on a Kubernetes cluster with Karpenter managing spot instances.
  2. Simulate a spot instance termination (or just terminate a node in the cluster).
  3. Observe that the metrics and traces during the termination period are lost.

Expected Result

The OpenTelemetry Collector should flush all pending metrics and traces before shutting down to ensure no data is lost during spot instance termination.

Actual Result

During a spot termination event triggered by Karpenter, the OpenTelemetry Collector shuts down without flushing all the data, causing loss of metrics and traces.

Collector version

0.95.0

Environment information

Environment

Kubernetes Version: 1.27
Karpenter Version: 0.35.2
Cloud Provider: AWS

OpenTelemetry Collector configuration

connectors:
  datadog/connector: null
exporters:
  datadog:
    api:
      fail_on_invalid_key: true
      key: <KEY>
      site: <SITE>
    host_metadata:
      enabled: false
    metrics:
      histograms:
        mode: distributions
        send_count_sum_metrics: true
      instrumentation_scope_metadata_as_tags: true
      resource_attributes_as_tags: true
      sums:
        cumulative_monotonic_mode: raw_value
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_elapsed_time: 600s
      max_interval: 20s
    sending_queue:
      enabled: true
      num_consumers: 100
      queue_size: 3000
    traces:
      trace_buffer: 30
  debug: {}
  logging: {}
extensions:
  health_check:
    endpoint: <HEALTHCHECK>
processors:
  batch:
    send_batch_max_size: 3000
    send_batch_size: 2000
    timeout: 3s
  memory_limiter:
    check_interval: 5s
    limit_mib: 1800
    spike_limit_mib: 750
receivers:
  carbon:
    endpoint: <CARBON>
  otlp:
    protocols:
      grpc:
        endpoint: <ENDPOINT>
      http:
        endpoint: <ENDPOINT>
  prometheus:
    config:
      scrape_configs:
      - job_name: <JOB_NAME>
        scrape_interval: 30s
        static_configs:
        - targets:
          - <ENDPOINT>
  statsd:
    aggregation_interval: 60s
    endpoint: <ENDPOINT>
service:
  extensions:
  - health_check
  pipelines:
    logs:
      exporters:
      - datadog
      processors:
      - memory_limiter
      - batch
      - resource
      receivers:
      - otlp
    metrics:
      exporters:
      - datadog
      processors:
      - memory_limiter
      - batch
      - resource
      receivers:
      - otlp
      - carbon
      - statsd
      - prometheus
      - datadog/connector
    traces:
      exporters:
      - datadog
      - datadog/connector
      processors:
      - memory_limiter
      - batch
      - resource
      receivers:
      - otlp
  telemetry:
    metrics:
      address: <ENDPOINT>

Log output

No response

Additional context

I noticed that there is a terminationGracePeriodSeconds configuration in the Kubernetes Deployment spec that can give workloads more time to shut down. However, this option does not seem to be exposed in the OpenTelemetry Collector Helm chart.

I would like to suggest the following enhancements:

  1. Expose the terminationGracePeriodSeconds parameter in the Helm chart to allow users to specify a custom grace period (see the sketch after this list).
  2. Review the shutdown procedure of the OpenTelemetry Collector to ensure that it attempts to flush all buffered data before exiting.
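
For reference, here is a minimal sketch of where that field lives on a plain Kubernetes Deployment (the names, image tag, and 630-second value are illustrative assumptions, not taken from the Helm chart):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector              # illustrative name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      # Time the kubelet waits after sending SIGTERM before force-killing
      # the pod (defaults to 30s); the value here is illustrative.
      terminationGracePeriodSeconds: 630
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.102.0   # illustrative tag
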
Pinging code owners for exporter/datadog: @mx-psi @dineshg13 @liustanley @songy23 @mackjmr @ankitpatel96. See Adding Labels via Comments if you do not have permissions to add labels yourself.

songy23 commented Jun 10, 2024

@Rommmmm could you try upgrading to v0.102.0 and see if the issue persists? This should have been fixed in #33291.

@kevinh-canva

Hi, I'm seeing the same issue. Updating to v0.102 doesn't help; we are still losing metrics.

Rommmmm commented Jun 24, 2024

@songy23 Sorry for taking so long, but unfortunately upgrading didn't help.

@ancostas

@Rommmmm Does the collector not gracefully shut down at all, or is it being killed before it can shut down gracefully?

The mention of terminationGracePeriodSeconds makes it sound like the latter, which may be user error (i.e. a process can't finish its exit routine if it is forcefully interrupted and killed in the middle of it).

Rommmmm commented Jul 8, 2024

It's not shutting down gracefully.

ancostas commented Aug 7, 2024

@Rommmmm is it being killed or terminated? A process being killed is not a graceful shutdown scenario, AFAIK.

What I'm guessing is happening is that your terminationGracePeriodSeconds is too short, so while the process is shutting down gracefully (e.g. flushing queued data to a vendor backend), the control plane simply kills it since it took too long.
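
To make that concrete with the configuration posted above (the numbers are illustrative arithmetic, not a recommendation from this thread): the batch processor can hold data for up to 3s and the exporter can retry for up to retry_on_failure.max_elapsed_time = 600s, so in the worst case a fully graceful flush needs a grace period longer than that, along the lines of:

# Pod spec fragment (illustrative). The Kubernetes default grace period is
# 30s, far shorter than the exporter's worst-case retry window of 600s, so
# the kubelet would SIGKILL the collector mid-flush.
spec:
  terminationGracePeriodSeconds: 630   # ~600s retry window plus headroom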

github-actions bot commented Oct 7, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on Oct 7, 2024