[loadbalancingexporter] using the loadbalancingexporter k8s resolver breaks the internal metrics #30697

Closed
grzn opened this issue Jan 21, 2024 · 9 comments
Assignees: jpkrohling
Labels: bug, exporter/loadbalancing

Comments

grzn commented Jan 21, 2024

Component(s)

exporter/loadbalancing

What happened?

Description

The collector's internal metrics endpoint fails to serve metrics.

Steps to Reproduce

Enable the load balancing exporter with the k8s resolver:

    loadbalancing/traces:
      resolver:
        k8s:
          service: opentelemetry-collector.default

Expected Result

`curl http://<podip>:9090/metrics` to succeed

Actual Result

$ curl http://10.0.46.15:9090/metrics
An error has occurred while serving metrics:

collected metric "otelcol_exporter_queue_size" { label:{name:"exporter" value:"loadbalancing/traces"} label:{name:"service_instance_id" value:"16abe78b-1a05-4f33-909c-cc9e9cb4b73e


Collector version

0.92.0

Environment information

Docker image `otel/opentelemetry-collector-contrib:0.92.0`

OpenTelemetry Collector configuration

```yaml
exporters:
  loadbalancing/traces:
    resolver:
      k8s:
        service: opentelemetry-collector.default
    protocol:
      otlp:
        tls:
          insecure: true
receivers:
  otlp:
    protocols:
      grpc: {}
      http: {}
  prometheus:
    config:
      scrape_configs:
      - job_name: opentelemetry-agent
        scrape_interval: 10s
        static_configs:
        - targets:
          - ${K8S_POD_IP}:9090
service:
  telemetry:
    logs:
      level: info
    metrics:
      address: 0.0.0.0:9090
  extensions:
  - health_check
  - memory_ballast
  pipelines:
    traces:
      exporters:
      - loadbalancing/traces
      processors:
      - memory_limiter
      - k8sattributes
      - resource
      - resource/add_cluster_name
      - resource/add_environment
      - resourcedetection
      receivers:
      - otlp
```

Log output

The logs are clean.

Additional context

No response

grzn added the bug and needs triage labels on Jan 21, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

grzn commented Jan 21, 2024

Looks like this is happening after the k8s service restarts and the pods change

grzn commented Jan 21, 2024

In our setup, we have an agent daemonset and a collector deployment; the agent sends metrics to the k8s service for that deployment, using the config mentioned in the description.

To reproduce:

  1. start the collector/deployment
  2. start the agent/daemonset
  3. curl to one of the agent pods at http://:8080/metrics, all works
  4. rollout restart the collector/deployment
  5. run the curl again, them metrics endpoint will return an error as mentioned in the issue description
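
A minimal shell sketch of those steps, assuming the collector deployment is named `opentelemetry-collector` in the `default` namespace and using the 9090 telemetry port from the configuration in the description (names and IPs here are illustrative, not from the original report):

```shell
# 3. scrape an agent pod's internal metrics endpoint; this succeeds at first
curl http://<agent-pod-ip>:9090/metrics

# 4. restart the collector deployment so its pods (and the resolved endpoints) change
kubectl -n default rollout restart deployment/opentelemetry-collector

# 5. scrape the same agent pod again; the endpoint now fails with the
#    duplicate "otelcol_exporter_queue_size" error from the description
curl http://<agent-pod-ip>:9090/metrics
```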

Everything works when doing the same but with the sending queue disabled, e.g.

      loadbalancing/traces:
        protocol:
          otlp:
            retry_on_failure:
              enabled: false
            sending_queue:
              enabled: false
            timeout: 3s
            tls:
              insecure: true
        resolver:
          k8s:
            service: opentelemetry-collector.default

jpkrohling (Member) commented:

Might be related to #16826

jpkrohling removed the needs triage label on Jan 25, 2024
jpkrohling self-assigned this on Jan 25, 2024

Juliaj commented Jan 28, 2024

@jpkrohling, we're also hitting this issue with our trace ingestion infra. Our setup is: Tier 1 (2 OTel load balancers, deployment, k8s resolver) -> Tier 2 (3 OTel Collectors, statefulset) -> trace storage backend. I can reproduce this on demand, with steps similar to those @grzn outlined above, by terminating one of the OTel Collectors at Tier 2. I'm debugging the issue at the moment and would like to collaborate if possible.

This doesn't repro if the OTel load balancer uses the DNS resolver.
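
For comparison, a rough sketch of the same exporter using the DNS resolver instead of the k8s resolver (the hostname below is illustrative; it should point at a headless service so it resolves to the individual backend pod IPs):

```yaml
exporters:
  loadbalancing/traces:
    resolver:
      dns:
        # illustrative headless service name; re-resolved periodically
        hostname: opentelemetry-collector-headless.default.svc.cluster.local
    protocol:
      otlp:
        tls:
          insecure: true
```

The DNS resolver re-resolves the hostname on an interval rather than watching Kubernetes endpoints, which may explain why the duplicate-metric collision was not observed with it.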

grzn changed the title from "[loadbalancingexporter] using the loadbalancingexporter breaks the internal metrics" to "[loadbalancingexporter] using the loadbalancingexporter k8s resolver breaks the internal metrics" on Jan 29, 2024

Juliaj commented Feb 2, 2024

@grzn, a few of us are investigating this. See more info in the issue mentioned above, #30477.

jpkrohling (Member) commented:

Given that #30477 seems to have been fixed in 0.94.0, can you confirm whether this is still reproducible with the latest version, @grzn?


grzn commented Feb 15, 2024

We've been running v0.94.0 for a few hours; looks good so far.

jpkrohling (Member) commented:

Alright, I'm closing this, but let me know if this needs to be reopened.
