[Feature Request] Implement OpenTelemetry Metrics in Gerbil #25

@marcschaeferger

Add OpenTelemetry-based observability to Gerbil

Reference: fosrl/pangolin#1429

Summary / Goal

Instrument Gerbil with OpenTelemetry Metrics (OTel) following CNCF / industry best practices so that:

  • Metrics are emitted using the OpenTelemetry Go SDK (vendor-neutral API).
  • Metrics are backend-agnostic and exportable to Prometheus‑compatible backends and any OTLP‑supporting system via the OpenTelemetry Collector.
  • Semantic conventions and base units are enforced: durations in seconds and sizes in bytes, reflected in the _seconds and _bytes name suffixes.
  • Labels are stable and low‑cardinality (e.g., ifname, peer, site_id), avoiding per‑request unique values.
  • An out‑of‑the‑box /metrics endpoint (Prometheus exporter) is provided for Prometheus scraping, along with an example OTel Collector config for production pipelines.
  • The focus is metrics first; the design should allow adding traces and logs later.
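
One way to keep labels low-cardinality is to check label keys against a fixed allowlist before a metric is recorded. The sketch below is only an illustration: the `allowedLabels` set and the `rejectedLabels` helper are hypothetical, and the keys are taken from the examples in this issue rather than from a final catalog.

```go
package main

import (
	"fmt"
	"sort"
)

// allowedLabels is a hypothetical allowlist. The keys come from the examples
// in this issue and are not a confirmed final set.
var allowedLabels = map[string]bool{
	"ifname":  true,
	"peer":    true,
	"site_id": true,
	"result":  true,
}

// rejectedLabels returns the label keys that are not on the allowlist,
// sorted so the result is deterministic.
func rejectedLabels(labels map[string]string) []string {
	var bad []string
	for k := range labels {
		if !allowedLabels[k] {
			bad = append(bad, k)
		}
	}
	sort.Strings(bad)
	return bad
}

func main() {
	labels := map[string]string{"ifname": "wg0", "request_id": "a1b2c3"}
	// request_id is a per-request value and is not on the allowlist.
	fmt.Println(rejectedLabels(labels))
}
```

An allowlist only constrains keys. Bounding the number of distinct values per key (for example, the number of peers) would need a separate check.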

Why this is needed

Gerbil manages WireGuard interfaces, keys, and peer state — all of which are critical for connectivity, security and performance. Operator visibility into handshake success/failure, per‑peer traffic, RTT, key rotations, netlink errors and config reloads is essential for:

  • Detecting connectivity regressions and degraded tunnels
  • Tracking authentication / handshake failures and key rotation issues
  • Capacity planning (peer counts, bandwidth)
  • Alerting (interface down, excessive errors, handshake failures)
  • Correlating with Pangolin and other components for end‑to‑end troubleshooting

OpenTelemetry provides a vendor‑neutral way to emit metrics, and the Collector allows flexible export to Prometheus (scrape or remote_write), Grafana Mimir, or other backends.


Recommended Gerbil Metrics

Interface / Peer Metrics

| Metric name | Type | Labels | Description / Units |
| --- | --- | --- | --- |
| gerbil_wg_interface_up | Gauge (0/1) | ifname, instance | Interface operational state (1=up, 0=down) |
| gerbil_wg_peers_total | Gauge | ifname | Number of configured peers on interface |
| gerbil_wg_peer_connected | Gauge (0/1) | ifname, peer | Peer connected state (1=connected) |
| gerbil_wg_handshakes_total | Counter | ifname, peer, result | Handshake attempts (result: success/failure) |
| gerbil_wg_handshake_latency_seconds | Histogram | ifname, peer | Handshake latency distribution (seconds) |
| gerbil_wg_peer_rtt_seconds | Histogram | ifname, peer | Observed RTT to peer (seconds) |
| gerbil_wg_bytes_received_total | Counter | ifname, peer | Bytes received from peer |
| gerbil_wg_bytes_transmitted_total | Counter | ifname, peer | Bytes transmitted to peer |
| gerbil_allowed_ips_count | Gauge | ifname, peer | Number of allowed IP entries per peer |
| gerbil_key_rotation_total | Counter | ifname, reason | Key rotation events (manual/auto/expired) |

System Metrics

| Metric name | Type | Labels | Description |
| --- | --- | --- | --- |
| gerbil_netlink_events_total | Counter | event_type | Netlink events processed (link/addr/rule changes) |
| gerbil_netlink_errors_total | Counter | component, error_type | Netlink or kernel error counts |
| gerbil_sync_duration_seconds | Histogram | component | Duration of reconciliation/sync loops (seconds) |
| gerbil_workqueue_depth | Gauge | queue | Length of internal workqueues |
| gerbil_kernel_module_loads_total | Counter | result | Kernel module load attempts (success/failure) |
| gerbil_firewall_rules_applied_total | Counter | result, chain | IPTables/NFT rules applied count |

Operational / Admin / Security

| Metric name | Type | Labels | Description |
| --- | --- | --- | --- |
| gerbil_config_reloads_total | Counter | result | Config reloads (success/failure) |
| gerbil_restart_count_total | Counter | none | Process restarts count |
| gerbil_auth_failures_total | Counter | peer, reason | Auth or peer validation failures |
| gerbil_acl_denied_total | Counter | ifname, peer, policy | Access-control denied events |
| gerbil_certificate_expiry_days | Gauge | cert_name, ifname | Days until certificate expiry (if TLS used) |

Platform / Runtime (recommended alongside OTel)

  • Standard Go runtime/process metrics (goroutines, heap, GC, CPU) should be enabled either via OTel runtime instrumentation or exposed alongside OTel metrics for Prometheus scraping.
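
For orientation, the values behind these runtime metrics are available from the Go standard library. The sketch below reads a few of them directly. It is not how the OTel runtime instrumentation is wired, and the `runtimeSnapshot` type and its field names are made up for illustration.

```go
package main

import (
	"fmt"
	"runtime"
)

// runtimeSnapshot holds a few of the values named above. The field names are
// illustrative, not metric names defined by this issue.
type runtimeSnapshot struct {
	Goroutines int
	HeapAlloc  uint64 // bytes of allocated heap objects
	NumGC      uint32 // completed GC cycles
}

// readRuntime samples the current process using only the standard library.
func readRuntime() runtimeSnapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return runtimeSnapshot{
		Goroutines: runtime.NumGoroutine(),
		HeapAlloc:  m.HeapAlloc,
		NumGC:      m.NumGC,
	}
}

func main() {
	s := readRuntime()
	fmt.Printf("goroutines=%d heap_alloc_bytes=%d gc_cycles=%d\n",
		s.Goroutines, s.HeapAlloc, s.NumGC)
}
```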

Implementation Plan

  1. Dependencies (example packages)

    • Add OpenTelemetry Go modules to go.mod:
      • go.opentelemetry.io/otel
      • go.opentelemetry.io/otel/sdk/metric
      • go.opentelemetry.io/otel/exporters/prometheus
      • go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc (or OTLP/HTTP variant)
      • Optional:
        • go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp (if HTTP APIs are exposed)
        • go.opentelemetry.io/contrib/instrumentation/runtime (for Go runtime metrics)
      • ...
  2. Central metrics module

    • Create internal/metrics/ that:
      • Initializes OTel MeterProvider.
      • Registers Prometheus exporter (when enabled) and exposes the exporter handler on /metrics (or mounts to existing HTTP server route).
      • Optionally registers OTLP exporter when configured via env vars.
      • Defines and pre‑registers all Gerbil metrics instruments:
        • Counters, Histograms, Gauges — with constants for names, descriptions, and label keys.
      • Exposes helper methods:
        • Inc(name string, labels ...attribute.KeyValue)
        • Observe(name string, value float64, labels ...attribute.KeyValue)
        • SetGauge(name string, value float64, labels ...attribute.KeyValue)
      • Provides Shutdown() to flush and close exporters.
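
The helper surface described in step 2 could look roughly like the following. This is a standard-library stand-in so it runs without the OTel modules: `Label` takes the place of `attribute.KeyValue`, `Registry` is a hypothetical in-memory store rather than a real MeterProvider, and `Observe` and `Shutdown()` are left out for brevity.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// Label is a stand-in for attribute.KeyValue so the sketch needs no OTel import.
type Label struct{ Key, Value string }

// Registry is a hypothetical in-memory stand-in for the metrics module.
type Registry struct {
	mu       sync.Mutex
	counters map[string]float64
	gauges   map[string]float64
}

func NewRegistry() *Registry {
	return &Registry{counters: map[string]float64{}, gauges: map[string]float64{}}
}

// seriesKey builds a stable key from the metric name and sorted labels,
// so label order at the call site does not create duplicate series.
func seriesKey(name string, labels []Label) string {
	sorted := append([]Label(nil), labels...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Key < sorted[j].Key })
	parts := make([]string, 0, len(sorted))
	for _, l := range sorted {
		parts = append(parts, l.Key+"="+l.Value)
	}
	return name + "{" + strings.Join(parts, ",") + "}"
}

// Inc adds one to a counter series.
func (r *Registry) Inc(name string, labels ...Label) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.counters[seriesKey(name, labels)]++
}

// SetGauge overwrites a gauge series with the latest value.
func (r *Registry) SetGauge(name string, value float64, labels ...Label) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.gauges[seriesKey(name, labels)] = value
}

func main() {
	r := NewRegistry()
	// Same labels in a different order land on the same series.
	r.Inc("gerbil_wg_handshakes_total", Label{"ifname", "wg0"}, Label{"result", "success"})
	r.Inc("gerbil_wg_handshakes_total", Label{"result", "success"}, Label{"ifname", "wg0"})
	r.SetGauge("gerbil_wg_interface_up", 1, Label{"ifname", "wg0"})
	fmt.Println(r.counters["gerbil_wg_handshakes_total{ifname=wg0,result=success}"]) // 2
}
```

In a real implementation the maps would be replaced by pre-registered OTel instruments looked up by name; the helper signatures are the part this sketch is meant to illustrate.
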
  3. Instrumentation approach

    • WireGuard interface management:
      • Gauge for number of interfaces managed.
      • Gauge 0/1 per interface for status (up/down).
      • Counters for RX/TX bytes per interface.
      • Counter for uptime seconds per interface.
    • Peer management:
      • Gauge for peers per interface.
      • Gauge 0/1 for peer connection status.
      • Gauge for seconds since last handshake.
      • Counters for RX/TX bytes per peer.
      • Histogram for handshake latency.
      • Counter for peer connection failures (labelled with reason).
    • Configuration operations:
      • Counters for config reloads, interface add/remove (labelled with result: success/fail).
    • System/runtime:
      • Process uptime counter.
      • Go runtime metrics (goroutines, memory alloc).
    • All metrics should be updated where Gerbil processes WireGuard status or events.
  4. Histograms & buckets

    • Configure histogram buckets per spec:
      • Duration buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30]
      • Byte-size buckets: [512, 1024, 4096, 16384, 65536, 262144, 1048576]
    • Use seconds for all durations; bytes for all sizes.
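
To make the bucket boundaries concrete, the sketch below tallies a few durations against the duration buckets listed above, treating each boundary as an inclusive upper bound with one extra overflow bucket. The `bucketIndex` and `observe` helpers are hypothetical; in practice the SDK's histogram aggregation does this work.

```go
package main

import (
	"fmt"
	"sort"
)

// durationBuckets are the upper bounds (in seconds) listed above.
var durationBuckets = []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30}

// bucketIndex returns the index of the first bucket whose upper bound is
// >= v, treating bounds as inclusive. A result equal to len(durationBuckets)
// means the value falls in the implicit overflow bucket.
func bucketIndex(v float64) int {
	return sort.SearchFloat64s(durationBuckets, v)
}

// observe tallies a value into per-bucket (non-cumulative) counts.
func observe(counts []uint64, v float64) {
	counts[bucketIndex(v)]++
}

func main() {
	counts := make([]uint64, len(durationBuckets)+1) // +1 for overflow
	for _, v := range []float64{0.003, 0.05, 0.2, 45} {
		observe(counts, v)
	}
	fmt.Println(counts) // [1 0 0 1 0 1 0 0 0 0 0 0 1]
}
```

A 45-second observation lands in the overflow bucket, which is a hint that the top boundary (30) should cover the slowest operation worth distinguishing.
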
  5. Exporter configuration (runtime)

    • Environment variables (suggested defaults):
      • GERBIL_METRICS_PROMETHEUS_ENABLED=true
      • GERBIL_METRICS_OTLP_ENABLED=false
      • OTEL_EXPORTER_OTLP_ENDPOINT (when OTLP enabled)
      • OTEL_EXPORTER_OTLP_PROTOCOL (http/protobuf or grpc)
      • OTEL_SERVICE_NAME=gerbil
      • OTEL_RESOURCE_ATTRIBUTES (e.g., service.instance.id=...)
      • OTEL_METRIC_EXPORT_INTERVAL (ms)
  6. Local testing

    • Provide docker-compose.metrics.yml with:
      • Gerbil
      • OpenTelemetry Collector (example config)
      • Prometheus (scraping /metrics or Collector)
      • Grafana (optional)
    • Validate both direct Prometheus scrape and OTLP → Collector → remote_write flows.
  7. Collector example

    • Include examples/collector.yaml demonstrating:
      • OTLP receiver
      • Transform processor to promote resource attributes (e.g., wg_interface, peer, site_id)
      • Prometheus remote_write exporter (generic endpoint)
      • Notes on:
        • Metric name normalization for Prometheus
        • out_of_order_time_window if sending OTLP to Prometheus
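
As background for the normalization note, Prometheus metric names are limited to the characters `[a-zA-Z0-9_:]` and may not start with a digit. The sketch below applies only that character rule. It is a simplification, not the Collector's actual normalization logic, which also deals with unit and type suffixes. `sanitizeName` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeName replaces every character that is not valid in a Prometheus
// metric name ([a-zA-Z0-9_:]) with an underscore, and prefixes an underscore
// when the name would otherwise start with a digit.
func sanitizeName(name string) string {
	var b strings.Builder
	for _, r := range name {
		switch {
		case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z', r >= '0' && r <= '9', r == '_', r == ':':
			b.WriteRune(r)
		default:
			b.WriteByte('_')
		}
	}
	out := b.String()
	if out != "" && out[0] >= '0' && out[0] <= '9' {
		out = "_" + out
	}
	return out
}

func main() {
	// A dotted, hyphenated name is used here only to show the replacement.
	fmt.Println(sanitizeName("gerbil.wg.bytes-received")) // gerbil_wg_bytes_received
}
```

Defining the metrics with underscore names from the start, as in the tables above, avoids depending on this rewriting at all.
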
  8. Documentation

    • observability.md:
      • Metric catalog (name, type, labels, units, description)
      • How to enable/disable Prometheus exporter and OTLP exporter via env vars
      • How to run Docker Compose test stack
      • How to add a new metric (naming, labels, buckets)
  9. Testing & validation

    • Manual test: start compose, generate traffic, curl /metrics, verify metrics names, units, labels and histogram buckets.
    • Include sample /metrics output in the PR.
    • ...
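
Part of the manual check in step 9 could be automated by scanning the /metrics output for the expected names. The sketch below does that for Prometheus text exposition. `missingMetrics` is a hypothetical helper, and the sample text is hand-written for the example, not captured from a Gerbil build.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// missingMetrics scans Prometheus text exposition output and returns the
// expected metric names that never appear as a sample line.
func missingMetrics(exposition string, expected []string) []string {
	seen := map[string]bool{}
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		// The metric name ends at the first '{' (labels) or space (value).
		end := strings.IndexAny(line, "{ ")
		if end < 0 {
			continue
		}
		seen[line[:end]] = true
	}
	var missing []string
	for _, name := range expected {
		if !seen[name] {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// Hand-written sample, not captured from a real Gerbil build.
	sample := `# TYPE gerbil_wg_interface_up gauge
gerbil_wg_interface_up{ifname="wg0"} 1
gerbil_config_reloads_total{result="success"} 3
`
	fmt.Println(missingMetrics(sample, []string{
		"gerbil_wg_interface_up", "gerbil_config_reloads_total", "gerbil_wg_peers_total",
	}))
}
```

Histograms appear in the exposition as `_bucket`, `_sum`, and `_count` series, so a check for a histogram would need to look for those suffixed names instead of the base name.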
