OpenTelemetry Collector does not gracefully shutdown, losing metrics on spot instance termination #33441

Rommmmm · 2024-06-09T07:27:36Z

Component(s)

datadogexporter

What happened?

Description

We are currently experiencing an issue with the OpenTelemetry Collector running in our Kubernetes cluster, which is managed by Karpenter. Our setup involves spot instances, and we've noticed that when Karpenter terminates these instances, the OpenTelemetry Collector does not seem to shut down gracefully. Consequently, we are losing metrics and traces that are presumably still in the process of being processed or exported.

Steps to Reproduce

Deploy the OpenTelemetry Collector on a Kubernetes cluster with Karpenter managing spot instances.
Simulate a spot instance termination (or just teminate a node in the cluster).
Observe that the metrics and traces during the termination period are lost.

Expected Result

The OpenTelemetry Collector should flush all pending metrics and traces before shutting down to ensure no data is lost during spot instance termination.

Actual Result

During a spot termination event triggered by Karpenter, the OpenTelemetry Collector shuts down without flushing all the data, causing loss of metrics and traces.

Collector version

0.95.0

Environment information

Environment

Kubernetes Version: 1.27
Karpenter Version: 0.35.2
Cloud Provider: AWS

OpenTelemetry Collector configuration

connectors:
  datadog/connector: null
exporters:
  datadog:
    api:
      fail_on_invalid_key: true
      key: <KEY>
      site: <SITE>
    host_metadata:
      enabled: false
    metrics:
      histograms:
        mode: distributions
        send_count_sum_metrics: true
      instrumentation_scope_metadata_as_tags: true
      resource_attributes_as_tags: true
      sums:
        cumulative_monotonic_mode: raw_value
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_elapsed_time: 600s
      max_interval: 20s
    sending_queue:
      enabled: true
      num_consumers: 100
      queue_size: 3000
    traces:
      trace_buffer: 30
  debug: {}
  logging: {}
extensions:
  health_check:
    endpoint: <HEALTHCHECK>
processors:
  batch:
    send_batch_max_size: 3000
    send_batch_size: 2000
    timeout: 3s
  memory_limiter:
    check_interval: 5s
    limit_mib: 1800
    spike_limit_mib: 750
receivers:
  carbon:
    endpoint: <CARBON>
  otlp:
    protocols:
      grpc:
        endpoint: <ENDPOINT>
      http:
        endpoint: <ENDPOINT>
  prometheus:
    config:
      scrape_configs:
      - job_name: <JOB_NAME>
        scrape_interval: 30s
        static_configs:
        - targets:
          - <ENDPOINT>
  statsd:
    aggregation_interval: 60s
    endpoint: <ENDPOINT>
service:
  extensions:
  - health_check
  pipelines:
    logs:
      exporters:
      - datadog
      processors:
      - memory_limiter
      - batch
      - resource
      receivers:
      - otlp
    metrics:
      exporters:
      - datadog
      processors:
      - memory_limiter
      - batch
      - resource
      receivers:
      - otlp
      - carbon
      - statsd
      - prometheus
      - datadog/connector
    traces:
      exporters:
      - datadog
      - datadog/connector
      processors:
      - memory_limiter
      - batch
      - resource
      receivers:
      - otlp
  telemetry:
    metrics:
      address: <ENDPOINT>

Log output

No response

Additional context

I noticed that there is a terminationGracePeriodSeconds configuration in Kubernetes deployment that can give workloads more time to shutdown. However, this option does not seem to be exposed in the OpenTelemetry Collector Helm chart.

I would like to suggest the following enhancements:

Expose the terminationGracePeriodSeconds parameter in the Helm chart to allow users to specify a custom grace period.
Review the shutdown procedure of the OpenTelemetry Collector to ensure that it attempts to flush all buffered data before exiting.

github-actions · 2024-06-10T19:30:15Z

Pinging code owners for exporter/datadog: @mx-psi @dineshg13 @liustanley @songy23 @mackjmr @ankitpatel96. See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions · 2024-06-10T19:30:17Z

Pinging code owners for exporter/datadog: @mx-psi @dineshg13 @liustanley @songy23 @mackjmr @ankitpatel96. See Adding Labels via Comments if you do not have permissions to add labels yourself.

songy23 · 2024-06-10T19:35:31Z

@Rommmmm could you try upgrade to v0.102.0 and see if the issue persists? This should have been fixed in #33291.

kevinh-canva · 2024-06-23T23:34:08Z

Hi, I'm seeing the same issue. And updating to v0.102 doesn't help, we are still losing metrics

Rommmmm · 2024-06-24T05:17:35Z

@songy23 sorry for taking so long but unfortunately upgrading didn't help

ancostas · 2024-06-25T18:04:47Z

@Rommmmm Does the collector not gracefully shut down at all, or is it being killed before it can shut down gracefully?

The mention of terminationGracePeriodSeconds makes it sound like the latter, which may be user error (i.e. a process can't finish it's exit routine if it is forcefully interrupted and killed in the middle of it).

Rommmmm · 2024-07-08T14:08:23Z

@Rommmmm Does the collector not gracefully shut down at all, or is it being killed before it can shut down gracefully?

The mention of terminationGracePeriodSeconds makes it sound like the latter, which may be user error (i.e. a process can't finish it's exit routine if it is forcefully interrupted and killed in the middle of it).

Its not gracefully shut down

ancostas · 2024-08-07T17:42:54Z

@Rommmmm is it being killed or terminated? Processes being killed is not a graceful shutdown scenario AFAIK.

What I'm guessing is happening is that your terminationGracePeriodSeconds is too short, so while the process is shutting down gracefully (e.g. flushing queued data to a vendor backend), the control plane simply kills it since it took too long.

github-actions · 2024-10-07T03:38:27Z

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

exporter/datadog: @mx-psi @dineshg13 @liustanley @songy23 @mackjmr @ankitpatel96

See Adding Labels via Comments if you do not have permissions to add labels yourself.

Rommmmm added bug Something isn't working needs triage New item requiring triage labels Jun 9, 2024

This was referenced Jun 10, 2024

Weekly Report: 2024-06-03 - 2024-06-10 LucaLanziani/opentelemetry-collector-contrib#6

Closed

Weekly Report: 2024-06-03 - 2024-06-10 LucaLanziani/opentelemetry-collector-contrib#7

Closed

crobert-1 added exporter/datadog Datadog components labels Jun 10, 2024

songy23 added waiting for author and removed needs triage New item requiring triage labels Jun 10, 2024

This was referenced Jun 10, 2024

Weekly Report: 2024-06-03 - 2024-06-10 LucaLanziani/opentelemetry-collector-contrib#8

Closed

Weekly Report: 2024-06-03 - 2024-06-10 LucaLanziani/opentelemetry-collector-contrib#9

Closed

mx-psi added priority:p2 Medium and removed waiting for author labels Jun 24, 2024

github-actions bot added the Stale label Oct 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OpenTelemetry Collector does not gracefully shutdown, losing metrics on spot instance termination #33441

OpenTelemetry Collector does not gracefully shutdown, losing metrics on spot instance termination #33441

Rommmmm commented Jun 9, 2024 •

edited

Loading

github-actions bot commented Jun 10, 2024

github-actions bot commented Jun 10, 2024

songy23 commented Jun 10, 2024

kevinh-canva commented Jun 23, 2024

Rommmmm commented Jun 24, 2024

ancostas commented Jun 25, 2024

Rommmmm commented Jul 8, 2024

ancostas commented Aug 7, 2024 •

edited

Loading

github-actions bot commented Oct 7, 2024

OpenTelemetry Collector does not gracefully shutdown, losing metrics on spot instance termination #33441

OpenTelemetry Collector does not gracefully shutdown, losing metrics on spot instance termination #33441

Comments

Rommmmm commented Jun 9, 2024 • edited Loading

Component(s)

What happened?

Description

Steps to Reproduce

Expected Result

Actual Result

Collector version

Environment information

Environment

OpenTelemetry Collector configuration

Log output

Additional context

github-actions bot commented Jun 10, 2024

github-actions bot commented Jun 10, 2024

songy23 commented Jun 10, 2024

kevinh-canva commented Jun 23, 2024

Rommmmm commented Jun 24, 2024

ancostas commented Jun 25, 2024

Rommmmm commented Jul 8, 2024

ancostas commented Aug 7, 2024 • edited Loading

github-actions bot commented Oct 7, 2024

Rommmmm commented Jun 9, 2024 •

edited

Loading

ancostas commented Aug 7, 2024 •

edited

Loading