Description
Add OpenTelemetry-based observability to Gerbil
Reference: fosrl/pangolin#1429
Summary / Goal
Instrument Gerbil with OpenTelemetry Metrics (OTel) following CNCF / industry best practices so that:
- Metrics are emitted using the OpenTelemetry Go SDK (vendor-neutral API).
- Metrics are backend-agnostic and exportable to Prometheus‑compatible backends and any OTLP‑supporting system via the OpenTelemetry Collector.
- Semantic conventions, SI units (_seconds, _bytes), and low‑cardinality labels are enforced.
- Labels are stable and low‑cardinality (e.g., `ifname`, `peer`, `site_id`), avoiding per‑request unique values.
- Provide an out‑of‑the‑box `/metrics` endpoint for Prometheus scraping (Prometheus exporter) and an example OTel Collector config for production pipelines.
- Focus is metrics first; design should allow adding traces and logs later.
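One way the low-cardinality requirement above could be enforced is an allow-list of label keys checked before a measurement is recorded. The sketch below is only an illustration: `allowedLabelKeys` and `validateLabels` are hypothetical names, and the list of keys is assumed from the examples in this issue.

```go
package main

import (
	"fmt"
	"sort"
)

// allowedLabelKeys is a hypothetical allow-list. The keys are taken from the
// examples in this issue; the list itself is an assumption.
var allowedLabelKeys = map[string]bool{
	"ifname":  true,
	"peer":    true,
	"site_id": true,
	"result":  true,
	"reason":  true,
}

// validateLabels returns an error naming every label key that is not on the
// allow-list, so unbounded keys (request IDs, timestamps) are caught early.
func validateLabels(labels map[string]string) error {
	var unknown []string
	for k := range labels {
		if !allowedLabelKeys[k] {
			unknown = append(unknown, k)
		}
	}
	if len(unknown) == 0 {
		return nil
	}
	sort.Strings(unknown) // keep the error text deterministic
	return fmt.Errorf("label keys not allowed: %v", unknown)
}

func main() {
	fmt.Println(validateLabels(map[string]string{"ifname": "wg0", "peer": "abc"}))
	// prints "<nil>"
	fmt.Println(validateLabels(map[string]string{"ifname": "wg0", "request_id": "42"}))
	// prints "label keys not allowed: [request_id]"
}
```

A check like this only guards label keys. Bounding the number of distinct values per key (for example, the number of peers) would need a separate limit.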
Why this is needed
Gerbil manages WireGuard interfaces, keys, and peer state — all of which are critical for connectivity, security and performance. Operator visibility into handshake success/failure, per‑peer traffic, RTT, key rotations, netlink errors and config reloads is essential for:
- Detecting connectivity regressions and degraded tunnels
- Tracking authentication / handshake failures and key rotation issues
- Capacity planning (peer counts, bandwidth)
- Alerting (interface down, excessive errors, handshake failures)
- Correlating with Pangolin and other components for end‑to‑end troubleshooting
OpenTelemetry provides a vendor‑neutral way to emit metrics, and the Collector allows flexible export to Prometheus (scrape or remote_write), Grafana Mimir, or other backends.
Recommended Gerbil Metrics
Interface / Peer Metrics
| Metric name | Type | Labels | Description / Units |
|---|---|---|---|
| `gerbil_wg_interface_up` | Gauge (0/1) | ifname, instance | Interface operational state (1=up, 0=down) |
| `gerbil_wg_peers_total` | Gauge | ifname | Number of configured peers on interface |
| `gerbil_wg_peer_connected` | Gauge (0/1) | ifname, peer | Peer connected state (1=connected) |
| `gerbil_wg_handshakes_total` | Counter | ifname, peer, result | Handshake attempts (result: success/failure) |
| `gerbil_wg_handshake_latency_seconds` | Histogram | ifname, peer | Handshake latency distribution (seconds) |
| `gerbil_wg_peer_rtt_seconds` | Histogram | ifname, peer | Observed RTT to peer (seconds) |
| `gerbil_wg_bytes_received_total` | Counter | ifname, peer | Bytes received from peer |
| `gerbil_wg_bytes_transmitted_total` | Counter | ifname, peer | Bytes transmitted to peer |
| `gerbil_allowed_ips_count` | Gauge | ifname, peer | Number of allowed IP entries per peer |
| `gerbil_key_rotation_total` | Counter | ifname, reason | Key rotation events (manual/auto/expired) |
System Metrics
| Metric name | Type | Labels | Description |
|---|---|---|---|
| `gerbil_netlink_events_total` | Counter | event_type | Netlink events processed (link/addr/rule changes) |
| `gerbil_netlink_errors_total` | Counter | component, error_type | Netlink or kernel error counts |
| `gerbil_sync_duration_seconds` | Histogram | component | Duration of reconciliation/sync loops (seconds) |
| `gerbil_workqueue_depth` | Gauge | queue | Length of internal workqueues |
| `gerbil_kernel_module_loads_total` | Counter | result | Kernel module load attempts (success/failure) |
| `gerbil_firewall_rules_applied_total` | Counter | result, chain | IPTables/NFT rules applied count |
Operational / Admin / Security
| Metric name | Type | Labels | Description |
|---|---|---|---|
| `gerbil_config_reloads_total` | Counter | result | Config reloads (success/failure) |
| `gerbil_restart_count_total` | Counter | (none) | Process restarts count |
| `gerbil_auth_failures_total` | Counter | peer, reason | Auth or peer validation failures |
| `gerbil_acl_denied_total` | Counter | ifname, peer, policy | Access-control denied events |
| `gerbil_certificate_expiry_days` | Gauge | cert_name, ifname | Days until certificate expiry (if TLS used) |
Platform / Runtime (recommended alongside OTel)
- Standard Go runtime/process metrics (goroutines, heap, GC, CPU) should be enabled either via OTel runtime instrumentation or exposed alongside OTel metrics for Prometheus scraping.
Implementation Plan
- Dependencies (example packages)
  - Add OpenTelemetry Go modules to `go.mod`:
    - `go.opentelemetry.io/otel`
    - `go.opentelemetry.io/otel/sdk/metric`
    - `go.opentelemetry.io/otel/exporters/prometheus`
    - `go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc` (or OTLP/HTTP variant)
  - Optional:
    - `go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp` (if HTTP APIs are exposed)
    - `go.opentelemetry.io/contrib/instrumentation/runtime` (for Go runtime metrics)
  - ...
- Central metrics module
  - Create `internal/metrics/` that:
    - Initializes the OTel `MeterProvider`.
    - Registers the Prometheus exporter (when enabled) and exposes the exporter handler on `/metrics` (or mounts it on an existing HTTP server route).
    - Optionally registers the OTLP exporter when configured via env vars.
    - Defines and pre‑registers all Gerbil metric instruments:
      - Counters, Histograms, Gauges, with constants for names, descriptions, and label keys.
    - Exposes helper methods:
      - `Inc(name string, labels ...attribute.KeyValue)`
      - `Observe(name string, value float64, labels ...attribute.KeyValue)`
      - `SetGauge(name string, value float64, labels ...attribute.KeyValue)`
    - Provides `Shutdown()` to flush and close exporters.
- Instrumentation approach
  - WireGuard interface management:
    - Gauge for number of interfaces managed.
    - Gauge 0/1 per interface for status (up/down).
    - Counters for RX/TX bytes per interface.
    - Counter for uptime seconds per interface.
  - Peer management:
    - Gauge for peers per interface.
    - Gauge 0/1 for peer connection status.
    - Gauge for seconds since last handshake.
    - Counters for RX/TX bytes per peer.
    - Histogram for handshake latency.
    - Counter for peer connection failures (labelled with reason).
  - Configuration operations:
    - Counters for config reloads, interface add/remove (labelled with result: success/fail).
  - System/runtime:
    - Process uptime counter.
    - Go runtime metrics (goroutines, memory alloc).
  - All metrics should be updated where Gerbil processes WireGuard status or events.
- Histograms & buckets
  - Configure histogram buckets per spec:
    - Duration buckets: `[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30]`
    - Byte-size buckets: `[512, 1024, 4096, 16384, 65536, 262144, 1048576]`
  - Use seconds for all durations; bytes for all sizes.
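To illustrate how the boundaries above partition observations, the sketch below maps a value to the upper bound of the bucket it would fall into, assuming inclusive upper bounds as in Prometheus `le` semantics. `bucketUpperBound` is a hypothetical helper for checking expectations in tests. In the OTel Go API the same slices would normally be passed to the histogram instrument via `metric.WithExplicitBucketBoundaries`.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
)

// Boundaries copied from the plan above (seconds and bytes respectively).
var (
	durationBuckets = []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30}
	byteBuckets     = []float64{512, 1024, 4096, 16384, 65536, 262144, 1048576}
)

// bucketUpperBound returns the upper bound of the bucket that v falls into,
// treating upper bounds as inclusive. Values above the last boundary land in
// the overflow bucket, reported here as "+Inf". bounds must be sorted.
func bucketUpperBound(bounds []float64, v float64) string {
	i := sort.SearchFloat64s(bounds, v) // first index with bounds[i] >= v
	if i == len(bounds) {
		return "+Inf"
	}
	return strconv.FormatFloat(bounds[i], 'f', -1, 64)
}

func main() {
	fmt.Println(bucketUpperBound(durationBuckets, 0.03)) // prints "0.05"
	fmt.Println(bucketUpperBound(durationBuckets, 0.05)) // prints "0.05"
	fmt.Println(bucketUpperBound(durationBuckets, 45))   // prints "+Inf"
	fmt.Println(bucketUpperBound(byteBuckets, 1500))     // prints "4096"
}
```

With these boundaries, any handshake slower than 30 seconds is only visible in the overflow bucket, which may be worth keeping in mind when choosing alert thresholds.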
- Exporter configuration (runtime)
  - Environment variables (suggested defaults):
    - `GERBIL_METRICS_PROMETHEUS_ENABLED=true`
    - `GERBIL_METRICS_OTLP_ENABLED=false`
    - `OTEL_EXPORTER_OTLP_ENDPOINT` (when OTLP enabled)
    - `OTEL_EXPORTER_OTLP_PROTOCOL` (http/protobuf or grpc)
    - `OTEL_SERVICE_NAME=gerbil`
    - `OTEL_RESOURCE_ATTRIBUTES` (e.g., `service.instance.id=...`)
    - `OTEL_METRIC_EXPORT_INTERVAL` (ms)
- Local testing
  - Provide `docker-compose.metrics.yml` with:
    - Gerbil
    - OpenTelemetry Collector (example config)
    - Prometheus (scraping `/metrics` or Collector)
    - Grafana (optional)
  - Validate both direct Prometheus scrape and OTLP → Collector → remote_write flows.
- Collector example
  - Include `examples/collector.yaml` demonstrating:
    - OTLP receiver
    - Transform processor to promote resource attributes (e.g., `wg_interface`, `peer`, `site_id`)
    - Prometheus remote_write exporter (generic endpoint)
  - Notes on:
    - Metric name normalization for Prometheus
    - `out_of_order_time_window` if sending OTLP to Prometheus
- Documentation
  - `observability.md`:
    - Metric catalog (name, type, labels, units, description)
    - How to enable/disable Prometheus exporter and OTLP exporter via env vars
    - How to run Docker Compose test stack
    - How to add a new metric (naming, labels, buckets)
- Testing & validation
  - Manual test: start compose, generate traffic, curl `/metrics`, verify metric names, units, labels and histogram buckets.
  - Include sample `/metrics` output in the PR.
  - ...
🔗 References & Best Practices
- Traefik - Metrics (observability) -- Traefik metrics configuration and exporter options.
- OpenTelemetry - Go: Getting Started / Instrumentation Guide -- How to instrument Go applications with OpenTelemetry.
- OpenTelemetry - Go: Exporters -- Exporter options for Go (OTLP, Prometheus, etc.).
Guides & integrations
- Prometheus - OpenTelemetry guide -- Guidance for integrating Prometheus with OpenTelemetry.
- Prometheus blog - Commitment to OpenTelemetry (Mar 2024) -- Prometheus project notes and recommended OTLP ingestion patterns.
Practical walkthroughs & blog posts
- OpenTelemetry blog - Prometheus + OpenTelemetry (2024) -- Practical notes on combining Prometheus and OpenTelemetry.
- Grafana Blog - A practical guide to data collection with OpenTelemetry and Prometheus (Jul 2023) -- Hands-on examples and best practices for OTEL + Prometheus.
- BetterStack - OpenTelemetry for Go -- Practical guide for instrumenting Go apps with OpenTelemetry.
- BetterStack - OpenTelemetry metrics vs Prometheus metrics -- Comparison and guidance when to use OTEL vs Prometheus metric