
Cluster receiver crashes after helm update when file_storage extension is enabled #800

Closed
wojtekzyla opened this issue May 29, 2023 · 5 comments
Labels
bug Something isn't working

Comments

@wojtekzyla (Contributor)

What happened?

Description

While working on PRs #675 and #753 I discovered that when file_storage is enabled for the cluster receiver, making any change in the values.yaml file under the clusterReceiver section and running helm upgrade causes the cluster receiver pod to crash. Using file_storage in the agent works fine.

Steps to Reproduce

  1. Set up the cluster receiver config as described in the chart configuration below and start the otel collector.
  2. Make any change in values.yaml under the clusterReceiver section and run helm upgrade (a minimal sketch of these commands follows below).
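
For illustration, a minimal sketch of the reproduction commands (the release name sck-otel is taken from the chart configuration below; the repo alias and values file name are assumptions, not from the issue):

    # hypothetical repo alias and values file; release name from this issue's chart configuration
    helm repo add splunk-otel-collector-chart https://signalfx.github.io/splunk-otel-collector-chart
    helm install sck-otel splunk-otel-collector-chart/splunk-otel-collector -f values.yaml
    # edit anything under clusterReceiver in values.yaml, then:
    helm upgrade sck-otel splunk-otel-collector-chart/splunk-otel-collector -f values.yaml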

Expected Result

Cluster receiver pod keeps working correctly after the upgrade.

Actual Result

Cluster Receiver pod crashes.

Chart version

0.71.0

Environment information

Environment

Chart configuration

Name:         sck-otel-splunk-otel-collector-otel-k8s-cluster-receiver
Namespace:    default
Labels:       app=splunk-otel-collector
              app.kubernetes.io/instance=sck-otel
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=splunk-otel-collector
              app.kubernetes.io/version=0.71.0
              chart=splunk-otel-collector-0.71.0
              helm.sh/chart=splunk-otel-collector-0.71.0
              heritage=Helm
              release=sck-otel
Annotations:  meta.helm.sh/release-name: sck-otel
              meta.helm.sh/release-namespace: default

Data
====
relay:
----
exporters:
  splunk_hec/platform_metrics:
    disable_compression: true
    endpoint: <MY_ENDPOINT>
    index: metric
    max_connections: 200
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_elapsed_time: 300s
      max_interval: 30s
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000
      storage: file_storage/persistent_queue
    source: kubernetes
    splunk_app_name: splunk-otel-collector
    splunk_app_version: 0.71.0
    timeout: 10s
    tls:
      insecure_skip_verify: true
    token: ${SPLUNK_PLATFORM_HEC_TOKEN}
extensions:
  file_storage/persistent_queue:
    directory: /var/addon/splunk/persist/clusterReceiver
  health_check: null
  memory_ballast:
    size_mib: ${SPLUNK_BALLAST_SIZE_MIB}
processors:
  batch: null
  memory_limiter:
    check_interval: 2s
    limit_mib: ${SPLUNK_MEMORY_LIMIT_MIB}
  resource:
    attributes:
    - action: insert
      key: metric_source
      value: kubernetes
    - action: upsert
      key: k8s.cluster.name
      value: sck-otel
  resource/add_collector_k8s:
    attributes:
    - action: insert
      key: k8s.node.name
      value: ${K8S_NODE_NAME}
    - action: insert
      key: k8s.pod.name
      value: ${K8S_POD_NAME}
    - action: insert
      key: k8s.pod.uid
      value: ${K8S_POD_UID}
    - action: insert
      key: k8s.namespace.name
      value: ${K8S_NAMESPACE}
  resource/k8s_cluster:
    attributes:
    - action: insert
      key: receiver
      value: k8scluster
  resourcedetection:
    detectors:
    - env
    - system
    override: true
    timeout: 10s
receivers:
  k8s_cluster:
    auth_type: serviceAccount
  prometheus/k8s_cluster_receiver:
    config:
      scrape_configs:
      - job_name: otel-k8s-cluster-receiver
        scrape_interval: 10s
        static_configs:
        - targets:
          - ${K8S_POD_IP}:8889
service:
  extensions:
  - health_check
  - memory_ballast
  - file_storage/persistent_queue
  pipelines:
    metrics:
      exporters:
      - splunk_hec/platform_metrics
      processors:
      - memory_limiter
      - batch
      - resource
      - resource/k8s_cluster
      receivers:
      - k8s_cluster
    metrics/collector:
      exporters:
      - splunk_hec/platform_metrics
      processors:
      - memory_limiter
      - batch
      - resource/add_collector_k8s
      - resourcedetection
      - resource
      receivers:
      - prometheus/k8s_cluster_receiver
  telemetry:
    logs:
      level: "debug"
    metrics:
      address: 0.0.0.0:8889

Events:  <none>
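
For reference, a hedged sketch of the kind of values.yaml override that could produce the file_storage parts of the rendered config above (clusterReceiver.config is the chart's config-override hook; the exact keys shown are reconstructed from the rendered config, not copied from the issue):

    clusterReceiver:
      config:
        extensions:
          file_storage/persistent_queue:
            directory: /var/addon/splunk/persist/clusterReceiver
        exporters:
          splunk_hec/platform_metrics:
            sending_queue:
              storage: file_storage/persistent_queue
        service:
          extensions:
            - health_check
            - memory_ballast
            - file_storage/persistent_queue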

Log output

2023/04/07 15:44:07 settings.go:331: Set config to [/conf/relay.yaml]
2023/04/07 15:44:07 settings.go:384: Set ballast to 297 MiB
2023/04/07 15:44:07 settings.go:400: Set memory limit to 810 MiB
2023-04-07T15:44:07.429Z	info	service/telemetry.go:90	Setting up own telemetry...
2023-04-07T15:44:07.429Z	info	service/telemetry.go:116	Serving Prometheus metrics	{"address": "0.0.0.0:8889", "level": "Basic"}
2023-04-07T15:44:07.429Z	debug	extension/extension.go:146	Beta component. May change in the future.	{"kind": "extension", "name": "health_check"}
2023-04-07T15:44:07.429Z	debug	extension/extension.go:146	Beta component. May change in the future.	{"kind": "extension", "name": "memory_ballast"}
2023-04-07T15:44:07.429Z	debug	extension/extension.go:146	Beta component. May change in the future.	{"kind": "extension", "name": "file_storage/persistent_queue"}
2023-04-07T15:44:07.429Z	debug	exporter/exporter.go:284	Beta component. May change in the future.	{"kind": "exporter", "data_type": "metrics", "name": "splunk_hec/platform_metrics"}
2023-04-07T15:44:07.429Z	debug	processor/processor.go:298	Beta component. May change in the future.	{"kind": "processor", "name": "resource/k8s_cluster", "pipeline": "metrics"}
2023-04-07T15:44:07.429Z	debug	processor/processor.go:298	Beta component. May change in the future.	{"kind": "processor", "name": "resource", "pipeline": "metrics"}
2023-04-07T15:44:07.429Z	debug	processor/processor.go:298	Beta component. May change in the future.	{"kind": "processor", "name": "resource", "pipeline": "metrics/collector"}
2023-04-07T15:44:07.429Z	debug	processor/processor.go:298	Beta component. May change in the future.	{"kind": "processor", "name": "resourcedetection", "pipeline": "metrics/collector"}
2023-04-07T15:44:07.429Z	debug	processor/processor.go:298	Beta component. May change in the future.	{"kind": "processor", "name": "resource/add_collector_k8s", "pipeline": "metrics/collector"}
2023-04-07T15:44:07.430Z	debug	processor/processor.go:298	Stable component.	{"kind": "processor", "name": "batch", "pipeline": "metrics"}
2023-04-07T15:44:07.430Z	debug	processor/processor.go:298	Stable component.	{"kind": "processor", "name": "batch", "pipeline": "metrics/collector"}
2023-04-07T15:44:07.430Z	debug	processor/processor.go:298	Beta component. May change in the future.	{"kind": "processor", "name": "memory_limiter", "pipeline": "metrics"}
2023-04-07T15:44:07.430Z	info	memorylimiterprocessor@v0.71.0/memorylimiter.go:113	Memory limiter configured	{"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "limit_mib": 810, "spike_limit_mib": 162, "check_interval": 2}
2023-04-07T15:44:07.430Z	debug	receiver/receiver.go:305	Beta component. May change in the future.	{"kind": "receiver", "name": "k8s_cluster", "data_type": "metrics"}
2023-04-07T15:44:07.430Z	debug	processor/processor.go:298	Beta component. May change in the future.	{"kind": "processor", "name": "memory_limiter", "pipeline": "metrics/collector"}
2023-04-07T15:44:07.430Z	debug	receiver/receiver.go:305	Beta component. May change in the future.	{"kind": "receiver", "name": "prometheus/k8s_cluster_receiver", "data_type": "metrics"}
2023-04-07T15:44:07.449Z	info	service/service.go:140	Starting otelcol...	{"Version": "v0.71.0", "NumCPU": 4}
2023-04-07T15:44:07.449Z	info	extensions/extensions.go:41	Starting extensions...
2023-04-07T15:44:07.449Z	info	extensions/extensions.go:44	Extension is starting...	{"kind": "extension", "name": "health_check"}
2023-04-07T15:44:07.449Z	info	healthcheckextension@v0.71.0/healthcheckextension.go:45	Starting health_check extension	{"kind": "extension", "name": "health_check", "config": {"Endpoint":"0.0.0.0:13133","TLSSetting":null,"CORS":null,"Auth":null,"MaxRequestBodySize":0,"IncludeMetadata":false,"Path":"/","CheckCollectorPipeline":{"Enabled":false,"Interval":"5m","ExporterFailureThreshold":5}}}
2023-04-07T15:44:07.449Z	warn	internal/warning.go:51	Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks	{"kind": "extension", "name": "health_check", "documentation": "https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks"}
2023-04-07T15:44:07.449Z	info	extensions/extensions.go:48	Extension started.	{"kind": "extension", "name": "health_check"}
2023-04-07T15:44:07.449Z	info	extensions/extensions.go:44	Extension is starting...	{"kind": "extension", "name": "memory_ballast"}
2023-04-07T15:44:07.520Z	info	ballastextension@v0.71.0/memory_ballast.go:52	Setting memory ballast	{"kind": "extension", "name": "memory_ballast", "MiBs": 297}
2023-04-07T15:44:07.520Z	info	extensions/extensions.go:48	Extension started.	{"kind": "extension", "name": "memory_ballast"}
2023-04-07T15:44:07.520Z	info	extensions/extensions.go:44	Extension is starting...	{"kind": "extension", "name": "file_storage/persistent_queue"}
2023-04-07T15:44:07.520Z	info	extensions/extensions.go:48	Extension started.	{"kind": "extension", "name": "file_storage/persistent_queue"}
2023-04-07T15:44:08.473Z	info	service/service.go:166	Starting shutdown...
2023-04-07T15:44:08.474Z	info	healthcheck/handler.go:129	Health Check state change	{"kind": "extension", "name": "health_check", "status": "unavailable"}
2023-04-07T15:44:08.474Z	info	extensions/extensions.go:55	Stopping extensions...
2023-04-07T15:44:08.474Z	info	service/service.go:180	Shutdown complete.
Error: cannot start pipelines: timeout; failed to shutdown pipelines: no existing monitoring routine is running; no existing monitoring routine is running
2023/04/07 15:44:08 main.go:115: application run finished with error: cannot start pipelines: timeout; failed to shutdown pipelines: no existing monitoring routine is running; no existing monitoring routine is running

Additional context

No response

wojtekzyla added the bug label on May 29, 2023
@jvoravong (Contributor)

@wojtekzyla did you try adding a new volume and volume mount to the deployment-cluster-receiver.yaml for where the persistent queue data would be stored?
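
For reference, a minimal sketch of what such a volume and volume mount could look like in deployment-cluster-receiver.yaml (hostPath as the backing store and the names checkpoint-dir and otel-collector are assumptions; the mount path matches the file_storage directory in the configuration above):

    spec:
      template:
        spec:
          containers:
            - name: otel-collector
              volumeMounts:
                - name: checkpoint-dir
                  mountPath: /var/addon/splunk/persist
          volumes:
            - name: checkpoint-dir
              hostPath:
                path: /var/addon/splunk/persist
                type: DirectoryOrCreate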

@VihasMakwana (Contributor)

VihasMakwana commented Jun 26, 2023

So, this is what happens when we update the cluster receiver deployment:
  1. The old pod is running and has acquired the lock on the file_storage path used for the persistent queue.
  2. A new pod is created and tries to acquire the lock, but gives up after 1s (the default timeout) because the old pod is still up, due to the deployment's default rollout strategy.

To solve this issue we need to (see the sketch below):
  1. Update the deployment's pod replacement strategy and set .spec.strategy.rollingUpdate.maxUnavailable to 1 (we can make this configurable, like the agent), so the old pod is terminated while the new pod is created. This would work the same way as the agent pods.
  2. Set timeout to 0, i.e. wait indefinitely to acquire the lock. Acquisition would succeed once the previous pod has terminated successfully.
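
Put together, a sketch of the two changes, assuming a single-replica cluster receiver deployment (field placement is illustrative; timeout is the file_storage extension's lock-acquisition timeout, 1s by default):

    # deployment-cluster-receiver.yaml (sketch)
    spec:
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 1   # allow the old pod to terminate before the new one is Ready
          maxSurge: 0

    # collector config (sketch)
    extensions:
      file_storage/persistent_queue:
        directory: /var/addon/splunk/persist/clusterReceiver
        timeout: 0s           # per the comment above: wait indefinitely for the file lock instead of giving up after 1s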

@VihasMakwana (Contributor)

VihasMakwana commented Jun 26, 2023

The old pod isn't terminating because the new pod hasn't successfully started yet, and the new pod is trying to acquire a lock that can only be released when the old pod terminates.
A deadlock ;(
That's why we need to update the pod replacement strategy of the deployment for this special case (default settings sketched below).
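
For context, the Kubernetes rolling-update defaults that produce this deadlock for a single-replica deployment (standard Deployment defaults, not chart-specific values):

    strategy:
      type: RollingUpdate
      rollingUpdate:
        maxSurge: 25%         # rounds up to 1 for one replica: the new pod is created first
        maxUnavailable: 25%   # rounds down to 0: the old pod, which holds the file lock, stays until the new pod is Ready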

@VihasMakwana (Contributor)

@dmitryax ^^

@VihasMakwana (Contributor)

I think we can close this one, as the behavior is as designed.
