otelcol_exporter_queue_size not working when using persistent queue #30608

Closed
chkp-yairt opened this issue Jan 16, 2024 · 3 comments
Labels
bug, extension/storage/filestorage, needs triage

Comments

@chkp-yairt

Component(s)

extension/storage/filestorage

What happened?

Description

We use the otelcol_exporter_queue_size metric to alert on the status of our OpenTelemetry Collector.
Recently we switched to the persistent queue to avoid losing data when the collector restarts.
Since making that change, otelcol_exporter_queue_size no longer reflects the actual state of the queue.
We tested this by blocking the receiver and observing the PVC attached to the collector pod: the queue directory on disk grew, but the queue size metric did not change.
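For context, the alert we drive from this metric is roughly of the following shape (a minimal sketch; the group name, rule name, threshold, and labels here are illustrative, not our exact rule):

groups:
  - name: otel-collector
    rules:
      - alert: OtelExporterQueueFilling
        # otelcol_exporter_queue_size is exposed on the collector's own
        # telemetry endpoint (0.0.0.0:8888 in the configuration below).
        expr: otelcol_exporter_queue_size > 800
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Sending queue of exporter {{ $labels.exporter }} is filling up"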

Expected Result

otelcol_exporter_queue_size should behave the same way as it does when the queue runs in memory.

Actual Result

otelcol_exporter_queue_size does not change even though the queue is actually growing.

Collector version

0.92.0

Environment information

Environment

OS: Amazon Linux

OpenTelemetry Collector configuration

receivers:
      prometheus:
        config:
          global:
            scrape_interval: 30s
            evaluation_interval: 30s
          scrape_configs:
            - job_name: "kubernetes-nodes"
              kubernetes_sd_configs:
                - role: node
              scheme: https
              tls_config:
                ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                insecure_skip_verify: true
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              relabel_configs:
                - action: labelmap
                  regex: __meta_kubernetes_node_label_(.+)
                - target_label: __address__
                  replacement: kubernetes.default.svc:443
                - source_labels: [__meta_kubernetes_node_name]
                  regex: (.+)
                  target_label: __metrics_path__
                  replacement: /api/v1/nodes/$$1/proxy/metrics
                - source_labels: [__meta_kubernetes_node_name]
                  action: replace
                  target_label: node
                - source_labels: [__meta_kubernetes_node_name]
                  action: replace
                  target_label: machine
            - job_name: k8s-node-kube-system
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              honor_labels: false
              kubernetes_sd_configs:
              - role: node
              metrics_path: /metrics/cadvisor
              relabel_configs:
              - action: labelmap
                regex: __meta_kubernetes_service_label_(.+)
              - action: replace
                source_labels:
                - __meta_kubernetes_node_name
                target_label: kubernetes_node_name
              scheme: https
              tls_config:
                ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                insecure_skip_verify: true
extensions:
  health_check: {}
  memory_ballast:
    size_in_percentage: 40
  file_storage:
    timeout: 10s
    directory: /data/otelcol
processors:
  batch:
    send_batch_size: 1000
    timeout: 1s
    send_batch_max_size: 1500

  resource/standard:
    attributes:
      - key: ClusterName
        value: $CLUSTER_NAME
        action: upsert
      - key: k8s.cluster.name
        from_attribute: k8s-cluster
        action: upsert
exporters:
  logging:
    verbosity: detailed
  otlp:
    compression: zstd
    endpoint: internal-otel-collector:4317
    sending_queue: 
      enabled: true 
      queue_size: 1000
      storage: file_storage

service:
  telemetry:
    logs:
      level: "debug"
    metrics:
      address: 0.0.0.0:8888
  extensions: [health_check, memory_ballast, file_storage]
  pipelines:
    metrics:
      receivers: [otlp, prometheus]
      processors: [batch, resource/standard]
      exporters: [otlp]

extraVolumeMounts:
  - name: data-volume
    mountPath: /data/otelcol
initContainers: 
  - name: init
    image: busybox
    command: ['sh', '-c', 'chown -R 10001: /data/otelcol' ]
    volumeMounts:
      - name: data-volume
        mountPath: /data/otelcol
statefulset:
  # volumeClaimTemplates for a statefulset
  volumeClaimTemplates: 
    - metadata:
        name: data-volume
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
  podManagementPolicy: "Parallel"

Log output

No response

Additional context

No response

chkp-yairt added the bug and needs triage labels on Jan 16, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@dmitryax
Member

dmitryax commented Jan 16, 2024

@chkp-yairt Which distribution of the collector do you use?

I cannot reproduce the issue with the contrib build. I see otelcol_exporter_queue_size reported correctly for both the in-memory and the persistent queue.

I ran the contrib collector build v0.92.0 with this config:

extensions:
  file_storage:
    directory: ./tmp

receivers:
  otlp:
    protocols:
      http:

exporters:
  otlphttp:
    endpoint: https://invalid-invalid.com
    sending_queue:
      enabled: true
      num_consumers: 1
      storage: file_storage

service:
  extensions: [file_storage]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]

I sent 100 dummy OTLP traces and can see:

otelcol_exporter_queue_size{exporter="otlphttp",service_instance_id="24801a50-05a6-408a-9028-0e4c7ec5b26f",service_name="otelcontribcol",service_version="0.92.0-dev"} 99

Metrics use exactly the same logic.
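(So the same reproduction should apply to metrics with only the pipeline type changed; a minimal sketch of that variant, reusing the receivers and exporters above:)

service:
  extensions: [file_storage]
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlphttp]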

Are you sure that metrics are not delivered? What logs do you see?

@chkp-yairt
Copy link
Author

Hi @dmitryax, thanks for the quick reply.
We changed our test and can now see the metric working.
I think that because we blocked the destination, the data was dropped at enqueue time and never reached the sending queue itself.
Thanks again for your help and the quick reply.
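For anyone landing here with the same symptom: data rejected before it reaches the sending queue is counted by the collector's otelcol_exporter_enqueue_failed_* metrics, so alerting on those alongside otelcol_exporter_queue_size also covers the enqueue-drop case. A minimal sketch of such a rule (rule name and threshold are illustrative; check your collector's /metrics endpoint for the exact metric names your version exposes):

groups:
  - name: otel-collector-enqueue
    rules:
      - alert: OtelExporterEnqueueFailures
        # Counts metric points the exporter failed to add to its sending queue;
        # spans and log records have sibling metrics.
        expr: rate(otelcol_exporter_enqueue_failed_metric_points[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Exporter {{ $labels.exporter }} is failing to enqueue metric points"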
