error encoding and sending metric family: write tcp 172.31.204.123:8889->172.31.42.221:60282: write: broken pipe #32371

Open · zhulei-pacvue opened this issue on Apr 15, 2024 · 10 comments
Labels: bug (Something isn't working), exporter/prometheus

@zhulei-pacvue

Component(s)

exporter/prometheus

Describe the issue you're reporting

When I use the Prometheus exporter, otelcol frequently reports errors like the following:

2024-04-15T01:39:07.597Z error prometheusexporter@v0.97.0/log.go:23 error encoding and sending metric family: write tcp 172.31.204.123:8889->172.31.42.221:60282: write: broken pipe
{"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter.(*promLogger).Println
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter@v0.97.0/log.go:23
github.com/prometheus/client_golang/prometheus/promhttp.HandlerForTransactional.func1.2
github.com/prometheus/client_golang@v1.19.0/prometheus/promhttp/http.go:192
github.com/prometheus/client_golang/prometheus/promhttp.HandlerForTransactional.func1
github.com/prometheus/client_golang@v1.19.0/prometheus/promhttp/http.go:210
net/http.HandlerFunc.ServeHTTP
net/http/server.go:2166
net/http.(*ServeMux).ServeHTTP
net/http/server.go:2683
go.opentelemetry.io/collector/config/confighttp.(*decompressor).ServeHTTP
go.opentelemetry.io/collector/config/confighttp@v0.97.0/compression.go:160
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*middleware).serveHTTP
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.49.0/handler.go:225
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.NewMiddleware.func1.1
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.49.0/handler.go:83
net/http.HandlerFunc.ServeHTTP
net/http/server.go:2166
go.opentelemetry.io/collector/config/confighttp.(*clientInfoHandler).ServeHTTP
go.opentelemetry.io/collector/config/confighttp@v0.97.0/clientinfohandler.go:26
net/http.serverHandler.ServeHTTP
net/http/server.go:3137
net/http.(*conn).serve
net/http/server.go:2039

Version:
otelcol:0.97.0

Important configuration:

exporters:
  otlp:
    endpoint: 'jaeger-collector:4317'
    tls:
      insecure: true  
  prometheus:
    endpoint: ${env:MY_POD_IP}:8889
    namespace:
    send_timestamps: false
    metric_expiration: 10m
    add_metric_suffixes: false   
zhulei-pacvue added the needs triage label on Apr 15, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

crobert-1 added the bug label on Apr 15, 2024
@crobert-1 (Member)

Hello @zhulei-pacvue, did this error happen on startup, or after the collector had been running for some time? Have you seen this happen repeatedly when running the collector, or was this only one time? Can you share more about what kind of environment the collector was running in?

@zhulei-pacvue (Author)

@crobert-1 Thank you! This error often occurs after the collector has been running for some time. After the service is restarted, it runs normally again.

@AndreasPetersen

We're experiencing the same issue:

error	prometheusexporter@v0.99.0/log.go:23	error encoding and sending metric family: write tcp 100.78.45.37:8889->100.78.6.13:46202: write: broken pipe
	{"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter.(*promLogger).Println
	github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter@v0.99.0/log.go:23
github.com/prometheus/client_golang/prometheus/promhttp.HandlerForTransactional.func1.2
	github.com/prometheus/client_golang@v1.19.0/prometheus/promhttp/http.go:192
github.com/prometheus/client_golang/prometheus/promhttp.HandlerForTransactional.func1
	github.com/prometheus/client_golang@v1.19.0/prometheus/promhttp/http.go:210
net/http.HandlerFunc.ServeHTTP
	net/http/server.go:2166
net/http.(*ServeMux).ServeHTTP
	net/http/server.go:2683
go.opentelemetry.io/collector/config/confighttp.(*decompressor).ServeHTTP
	go.opentelemetry.io/collector/config/confighttp@v0.99.0/compression.go:160
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*middleware).serveHTTP
	go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.50.0/handler.go:214
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.NewMiddleware.func1.1
	go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.50.0/handler.go:72
net/http.HandlerFunc.ServeHTTP
	net/http/server.go:2166
go.opentelemetry.io/collector/config/confighttp.(*clientInfoHandler).ServeHTTP
	go.opentelemetry.io/collector/config/confighttp@v0.99.0/clientinfohandler.go:26
net/http.serverHandler.ServeHTTP
	net/http/server.go:3137
net/http.(*conn).serve
	net/http/server.go:2039

Usually there are about 10 or 20 such log entries around the same time. Some of them happen at the exact same instant, while others are a few milliseconds apart.

In the last week alone, this has happened about once or twice a day.


We're running this on OpenShift using Kubernetes v1.26.13+8f85140.

The OpenTelemetry Collector runs as a container in the same pod as a Quarkus service, which sends metrics and traces to the Collector.

Here is the Kubernetes Deployment resource:

kind: Deployment
apiVersion: apps/v1
metadata:
  annotations:
    # ...
  name: service
  namespace: my-namespace
  labels:
    # ...
spec:
  replicas: 2
  selector:
    matchLabels:
      # ...
  template:
    metadata:
      labels:
        # ...
      annotations:
        # ...
        prometheus.io/path: /metrics
        prometheus.io/port: '8889'
        prometheus.io/scrape: 'true'
    spec:
      volumes:
        - name: otel-config
          configMap:
            name: otel-config
            defaultMode: 420
        - name: opa-conf
          configMap:
            name: opa
            defaultMode: 420
      containers:
        - resources:
            limits:
              cpu: 500m
              memory: 1536Mi
            requests:
              cpu: 10m
              memory: 1536Mi
          readinessProbe:
            httpGet:
              path: /q/health/ready
              port: container-port
              scheme: HTTP
            initialDelaySeconds: 10
            timeoutSeconds: 10
            periodSeconds: 7
            successThreshold: 1
            failureThreshold: 7
          terminationMessagePath: /dev/termination-log
          lifecycle:
            preStop:
              exec:
                command:
                  - sleep
                  - '5'
          name: service
          livenessProbe:
            httpGet:
              path: /q/health/live
              port: container-port
              scheme: HTTP
            initialDelaySeconds: 60
            timeoutSeconds: 5
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          env:
            # ...
          ports:
            - name: container-port
              containerPort: 8080
              protocol: TCP
          imagePullPolicy: IfNotPresent
          terminationMessagePolicy: File
          image: my-service
        - resources:
            limits:
              cpu: 10m
              memory: 150Mi
            requests:
              cpu: 10m
              memory: 150Mi
          readinessProbe:
            httpGet:
              path: /
              port: otel-health
              scheme: HTTP
            initialDelaySeconds: 10
            timeoutSeconds: 1
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - '-c'
                  - >-
                    while curl -X GET http://localhost:8080/q/health/live ; do
                    sleep 1; done
          name: otel-collector
          livenessProbe:
            httpGet:
              path: /
              port: otel-health
              scheme: HTTP
            initialDelaySeconds: 10
            timeoutSeconds: 1
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          env:
            - name: TZ
              value: Europe/Copenhagen
          ports:
            - name: otel-health
              containerPort: 13133
              protocol: TCP
            - name: otlp-grpc
              containerPort: 4317
              protocol: TCP
            - name: otlp-http
              containerPort: 4318
              protocol: TCP
            - name: jaeger-grpc
              containerPort: 14260
              protocol: TCP
            - name: jaeger-http
              containerPort: 14268
              protocol: TCP
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: otel-config
              mountPath: /etc/config/otel
          terminationMessagePolicy: File
          # We use RedHat's ubi8-minimal as a base image, and install OpenTelemetry otelcol_VERSION_linux_amd64.rpm from https://github.com/open-telemetry/opentelemetry-collector-releases/releases onto it
          image: >-
            my-repo/opentelemetry-collector:0.99.0
          args:
            - otelcol
            - '--config=/etc/config/otel/otel-collector-config.yaml'
        - resources:
            limits:
              cpu: 50m
              memory: 150Mi
            requests:
              cpu: 10m
              memory: 150Mi
          readinessProbe:
            httpGet:
              path: /health?bundle=true
              port: opa-port
              scheme: HTTP
            initialDelaySeconds: 10
            timeoutSeconds: 10
            periodSeconds: 7
            successThreshold: 1
            failureThreshold: 7
          terminationMessagePath: /dev/termination-log
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - '-c'
                  - >-
                    while curl -X GET http://localhost:8080/q/health/live ; do
                    sleep 1; done
          name: opa
          livenessProbe:
            httpGet:
              path: /health
              port: opa-port
              scheme: HTTP
            initialDelaySeconds: 10
            timeoutSeconds: 5
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          env:
            - name: TZ
              value: Europe/Copenhagen
          ports:
            - name: opa-port
              containerPort: 8181
              protocol: TCP
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: opa-conf
              readOnly: true
              mountPath: /opa-conf
          terminationMessagePolicy: File
          image: >-
            opa:0.63.0
          args:
            - run
            - '--ignore=.*'
            - '--server'
            - '--log-level'
            - error
            - '--config-file'
            - /opa-conf/opa-conf.yaml
            - '--watch'
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1

The OpenTelemetry Collector config:

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "opa"
          scrape_interval: 15s
          metrics_path: "/metrics"
          static_configs:
            - targets:
                - "0.0.0.0:8181"
  otlp:
    protocols:
      grpc:
        endpoint:
          0.0.0.0:4317
      http:
        endpoint:
          0.0.0.0:4318
  jaeger:
    protocols:
      grpc:
        endpoint:
          0.0.0.0:14260
      thrift_http:
        endpoint:
          0.0.0.0:14268
processors:
  batch:
    timeout: 1s
exporters:
  prometheus:
    add_metric_suffixes: false
    endpoint:
      0.0.0.0:8889
  otlp:
    endpoint: "ose-jaeger-collector-headless.ose-jaeger.svc.cluster.local:4317"
    tls:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt
      insecure: false
      min_version: "1.2"
extensions:
  health_check:
    endpoint:
      0.0.0.0:13133
service:
  extensions: [ health_check ]
  pipelines:
    metrics:
      receivers: [ otlp ]
      processors: [ batch ]
      exporters: [ prometheus ]
    traces:
      receivers: [ otlp, jaeger ]
      processors: [ batch ]
      exporters: [ otlp ]
  telemetry:
    metrics:
      address: 0.0.0.0:8888

CPU and memory usage of the OpenTelemetry Collector show nothing abnormal when the error occurs.


@L3o-pold

Same issue since we upgraded from 0.88.0 to 0.100.0.

@OfekCyberX commented on Jul 3, 2024

Hi @crobert-1, is there any ongoing effort to fix or mitigate this issue? We have a deployment of otel-collector in Kubernetes and we face the exact same issue, with the same stack trace from the prometheusexporter error:

github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter.(*promLogger).Println
	github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter@v0.87.0/log.go:23

We also tried upgrading to the latest release, 0.97.0, but it didn't change anything.

We have the otel-collector running on a dedicated node with no hard limits, only requests:

resources:
  requests:
    cpu: 32
    memory: 128Gi

Memory and CPU usage are well within these values: memory is at 12%, CPU at 6%.

It seems to start happening when we increase the metrics ingest volume. We have clients sending metrics every minute, and the error started appearing after we increased the number of clients.

Is there any fix or mitigation for this issue, for example via a parameter or config change?

Thanks

@crobert-1 (Member)

My apologies, I'm largely unfamiliar with this component and its functionality. I'm not aware of any ongoing effort to address this. @Aneurysm9 do you have any suggestions here?

@rickdgeerling

Adding an extra data point: I stumbled upon this issue when I configured a k8s livenessProbe pointing at the prometheusexporter endpoint.
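
For context, here is a minimal sketch of that kind of probe (the values are hypothetical; the port matches the prometheus exporter endpoint used elsewhere in this thread). Kubelet's default probe timeout is 1 second, so if serving /metrics takes longer than that, kubelet closes the connection and the exporter's subsequent write fails with a broken pipe:

livenessProbe:
  httpGet:
    path: /metrics     # probing the prometheus exporter directly
    port: 8889         # prometheus exporter endpoint
  timeoutSeconds: 1    # kubelet default; the connection is closed once this elapses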

@dashpole (Contributor) commented on Oct 9, 2024

Getting to this now. Based on https://groups.google.com/g/prometheus-users/c/7UFP7MVJjRk/m/gB7R6goxAwAJ, it sounds like this happens when the client disconnects before the metrics are returned. Maybe you should increase the timeout for the client calling the exporter?
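
A minimal sketch of that mitigation, assuming the client is a Prometheus server scraping the collector's prometheus exporter endpoint (the job name and target below are hypothetical): raising scrape_timeout (default 10s, and it must not exceed scrape_interval) gives the exporter more time to finish writing the response before the scraper closes the connection.

scrape_configs:
  - job_name: "otel-collector"              # hypothetical job name
    scrape_interval: 30s
    scrape_timeout: 25s                     # default is 10s; must be <= scrape_interval
    static_configs:
      - targets: ["otel-collector:8889"]    # hypothetical target; the exporter listens on :8889 in the configs above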

Otherwise, the best we could do here is probably to silence the error message by removing the logging line in the exporter, or to offer a config option to do that.

@karlkfi commented on Oct 30, 2024

I suspect this issue is caused by k8s liveness and readiness probes, which may close the connection before the response has been written.

Otherwise, I think you would want to know if this is happening, but in this case it seems like the intended behavior.

Is there some path or URL parameter we can use for probing that will skip sending metrics but still be a good indicator of health?
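
One possible approach, sketched under the assumption that the collector's health_check extension is enabled (as in the collector config earlier in this thread, listening on 0.0.0.0:13133): point the kubelet probes at the health_check endpoint instead of the prometheus exporter's metrics port, so probes never open and then abandon a metrics scrape.

livenessProbe:
  httpGet:
    path: /            # default path served by the health_check extension
    port: 13133        # health_check endpoint from the collector config above
readinessProbe:
  httpGet:
    path: /
    port: 13133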
