Skip to content

monitoringartist/opentelemetry-collector-monitoring

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

33 Commits
 
 
 
 
 
 
 
 

Repository files navigation

OpenTelemetry (OTEL) collector monitoring

Metrics

Collector can expose Prometheus metrics locally on port 8888 and path /metrics. For containerized environments it may be desirable to expose this port on a public interface instead of just locally.

service:
  telemetry:
    metrics:
      readers:
        - pull
          exporter:
            prometheus: 127.0.0.1
            port: 8888 

Collector can scrape own metric via own metric pipeline, so real configuration can looks like:

extensions:
  sigv4auth/aws:

receivers:
  prometheus:
    config:
      scrape_configs:
      - job_name: otel-collector-metrics
        scrape_interval: 10s
        static_configs:
          - targets: ['127.0.0.1:8888']

exporters:
  prometheusremotewrite/aws:
    endpoint: ${PROMETHEUS_ENDPOINT}
    auth:
      authenticator: sigv4auth/aws
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_interval: 10s
      max_elapsed_time: 30s

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: []
      exporters: [prometheusremotewrite/aws]
  telemetry:
    metrics:
      readers:
        - pull
          exporter:
            prometheus: 127.0.0.1
            port: 8888
    logs:
      encoding: json
      level: info

Grafana dashboard for OpenTelemetry collector metrics

OpenTelemetry collector dashboard

This dashboard can also be used for Grafana Alloy monitoring.

Prometheus alerts

Recommended Prometheus alerts for OpenTelemetry collector metrics:

# keep in mind that metrics may have "_total" suffixes - check your metrics/configuration first
groups:
  - name: opentelemetry-collector
    rules:
      - alert: processor-dropped-spans
        expr: sum(rate(otelcol_processor_dropped_spans{}[1m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Some spans have been dropped by processor
          description: Maybe collector has received non standard spans or it reached some limits
      - alert: processor-dropped-metrics
        expr: sum(rate(otelcol_processor_dropped_metric_points{}[1m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Some metric points have been dropped by processor
          description: Maybe collector has received non standard metric points or it reached some limits
      - alert: processor-dropped-logs
        expr: sum(rate(otelcol_processor_dropped_log_records{}[1m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Some log records have been dropped by processor
          description: Maybe collector has received non standard log records or it reached some limits
      - alert: receiver-refused-spans
        expr: sum(rate(otelcol_receiver_refused_spans{}[1m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Some spans have been refused by receiver
          description: Maybe collector has received non standard spans or it reached some limits
      - alert: receiver-refused-metrics
        expr: sum(rate(otelcol_receiver_refused_metric_points{}[1m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Some metric points have been refused by receiver
          description: Maybe collector has received non standard metric points or it reached some limits
      - alert: receiver-refused-logs
        expr: sum(rate(otelcol_receiver_refused_log_records{}[1m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Some log records have been refused by receiver
          description: Maybe collector has received non standard log records or it reached some limits
      - alert: exporter-enqueued-spans
        expr: sum(rate(otelcol_exporter_enqueue_failed_spans{}[1m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Some spans have been enqueued by exporter
          description: Maybe used destination has a problem or used payload is not correct
      - alert: exporter-enqueued-metrics
        expr: sum(rate(otelcol_exporter_enqueue_failed_metric_points{}[1m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Some metric points have been enqueued by exporter
          description: Maybe used destination has a problem or used payload is not correct
      - alert: exporter-enqueued-logs
        expr: sum(rate(otelcol_exporter_enqueue_failed_log_records{}[1m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Some log records have been enqueued by exporter
          description: Maybe used destination has a problem or used payload is not correct
      - alert: exporter-failed-requests
        expr: sum(rate(otelcol_exporter_send_failed_requests{}[1m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Some exporter requests failed
          description: Maybe used destination has a problem or used payload is not correct
      - alert: high-cpu-usage
        expr: max(rate(otelcol_process_cpu_seconds{}[1m])*100) > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: High max CPU usage
          description: Collector needs to scale up

Documentation