[Feature Request] Implement OpenTelemetry Metrics in Pangolin #1429

@marcschaeferger

Add OpenTelemetry-based observability to Pangolin

Summary / Goal

Instrument Pangolin with OpenTelemetry Metrics (OTel) following CNCF / industry best practices so that:

  • Metrics are emitted using the OpenTelemetry JS SDK (vendor-neutral API).
  • Metrics are backend-agnostic and exportable to Prometheus‑compatible backends and any OTLP‑supporting system via the OpenTelemetry Collector.
  • Semantic conventions, SI units (_seconds, _bytes), and low‑cardinality labels are enforced.
  • Focus is metrics first; design should allow adding traces and logs later.
  • Provide an out‑of‑the‑box /metrics endpoint for Prometheus scraping (Prometheus exporter) and example OTel Collector config for production pipelines.

Why OpenTelemetry (OTel)

  • OTel is the CNCF standard for multi‑signal observability (metrics, traces, logs).
  • Instrument once, export anywhere (Prometheus, Grafana Mimir, Thanos/Cortex, cloud vendors).
  • Use the OTel Collector for attribute enrichment, normalization, batching, and remote_write.

Requirements & Constraints

  • Use the OpenTelemetry JavaScript/TypeScript SDKs and official instrumentation packages.
  • Provide a /metrics endpoint in Prometheus exposition format via the OTel Prometheus exporter.
  • Report all durations in seconds and all sizes in bytes. Metric names should carry the unit where applicable (_seconds, _bytes), and counters should end in _total.
  • Enforce low‑cardinality label design (e.g., site_id, resource_id). Do not use per‑request unique values as labels.
  • All exporters must be configurable at runtime through environment variables or configuration (no code change to switch exporters).
  • Provide an example OTel Collector configuration that demonstrates OTLP ingestion, attribute promotion and prometheusremotewrite usage.
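
The naming rules above could be checked mechanically, for example in a unit test over the metric catalog. The following is a minimal, dependency-free sketch; `checkMetricName`, `InstrumentKind`, and the deny-list of scaled unit suffixes are hypothetical and not part of any OpenTelemetry package.

```typescript
type InstrumentKind = "counter" | "gauge" | "histogram";

// Hypothetical helper: returns naming problems for a metric (empty array = OK).
function checkMetricName(name: string, kind: InstrumentKind): string[] {
  const problems: string[] = [];
  // snake_case: lowercase words of letters/digits joined by single underscores.
  if (!/^[a-z][a-z0-9]*(_[a-z0-9]+)*$/.test(name)) {
    problems.push("not snake_case");
  }
  if (!name.startsWith("pangolin_")) {
    problems.push("missing pangolin_ prefix");
  }
  if (kind === "counter" && !name.endsWith("_total")) {
    problems.push("counter without _total suffix");
  }
  // Assumed deny-list of scaled unit suffixes; extend as needed.
  if (/_(ms|milliseconds|kb|mb|gb)(_|$)/.test(name)) {
    problems.push("scaled unit; use _seconds or _bytes");
  }
  return problems;
}
```

For instance, `checkMetricName("pangolin_tunnel_latency_ms", "histogram")` reports the scaled-unit problem, while `pangolin_resource_requests_total` passes as a counter.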

Recommended Pangolin Metrics (TypeScript implementation)

Use snake_case names and include unit and _total suffixes where applicable.

| Category | Metric Name | Type | Labels | Unit / Notes |
|---|---|---|---|---|
| Site / Global | pangolin_site_active_sites | Gauge | site_id, region | count |
| | pangolin_site_online | Gauge (0/1) | site_id, transport | bool |
| | pangolin_site_bandwidth_bytes_total | Counter | site_id, direction, protocol | bytes |
| | pangolin_site_uptime_seconds_total | Counter | site_id | seconds |
| | pangolin_site_connection_drops_total | Counter | site_id | count |
| | pangolin_site_handshake_latency_seconds | Histogram | site_id, transport | seconds |
| Resource / App | pangolin_resource_requests_total | Counter | site_id, resource_id, backend, method, status | count |
| | pangolin_resource_request_duration_seconds | Histogram | site_id, resource_id, backend, method | seconds |
| | pangolin_resource_active_connections | Gauge | site_id, resource_id, protocol | count |
| | pangolin_resource_errors_total | Counter | site_id, resource_id, backend, error_type | count |
| | pangolin_resource_bandwidth_bytes_total | Counter | site_id, resource_id, direction | bytes |
| Tunnel / Transport | pangolin_tunnel_up | Gauge (0/1) | site_id, transport | bool |
| | pangolin_tunnel_reconnects_total | Counter | site_id, transport, reason | count |
| | pangolin_tunnel_latency_seconds | Histogram | site_id, transport | seconds |
| | pangolin_tunnel_bytes_total | Counter | site_id, transport, direction | bytes |
| | pangolin_wg_handshake_total | Counter | site_id, result | count |
| Backend | pangolin_backend_health_status | Gauge (1/0) | backend, site_id | bool |
| | pangolin_backend_connection_errors_total | Counter | backend, site_id, error_type | count |
| | pangolin_backend_response_size_bytes | Histogram | backend, site_id | bytes |
| Auth / Identity | pangolin_auth_requests_total | Counter | site_id, auth_method, result | count |
| | pangolin_auth_request_duration_seconds | Histogram | auth_method, result | seconds |
| | pangolin_auth_active_users | Gauge | site_id, auth_method | count |
| | pangolin_auth_failure_reasons_total | Counter | site_id, reason, auth_method | count |
| Tokens / Sessions | pangolin_token_issued_total | Counter | site_id, auth_method | count |
| | pangolin_token_revoked_total | Counter | reason | count |
| | pangolin_token_refresh_total | Counter | site_id, result | count |
| UI / API | pangolin_ui_requests_total | Counter | endpoint, method, status | count |
| | pangolin_ui_active_sessions | Gauge | | count |
| Operational | pangolin_config_reloads_total | Counter | result | count |
| | pangolin_restart_count_total | Counter | | count |
| | pangolin_background_jobs_total | Counter | job_type, status | count |
| | pangolin_certificates_expiry_days | Gauge | site_id, resource_id | days |

Label guidelines: prefer site_id/resource_id. Avoid per‑request unique labels (user IDs, full URLs). Use enums and stable identifiers.
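
One way to enforce these guidelines at the call site is to filter labels through a per-metric allowlist before they reach an instrument. The sketch below is illustrative only: `ALLOWED_LABELS`, `HTTP_METHODS`, and `sanitizeLabels` are hypothetical names, the allowlist copies two rows from the catalog above, and folding unknown HTTP methods into `OTHER` is an assumed policy.

```typescript
type RawLabels = Record<string, string | undefined>;
type Labels = Record<string, string>;

// Hypothetical allowlist: permitted label keys per metric, copied from the catalog.
const ALLOWED_LABELS: Record<string, readonly string[]> = {
  pangolin_resource_requests_total: ["site_id", "resource_id", "backend", "method", "status"],
  pangolin_auth_requests_total: ["site_id", "auth_method", "result"],
};

// Bounded set of method values; anything else is folded into "OTHER" (assumed policy).
const HTTP_METHODS = new Set(["GET", "HEAD", "POST", "PUT", "PATCH", "DELETE", "OPTIONS"]);

// Keeps only allowlisted keys, so per-request values such as user IDs never become labels.
function sanitizeLabels(metric: string, labels: RawLabels): Labels {
  const allowed = ALLOWED_LABELS[metric];
  if (!allowed) throw new Error(`unknown metric: ${metric}`);
  const out: Labels = {};
  for (const key of allowed) {
    const value = labels[key];
    if (value === undefined) continue;
    out[key] = key === "method" && !HTTP_METHODS.has(value) ? "OTHER" : value;
  }
  return out;
}
```

Dropping unknown keys silently keeps the hot path cheap; throwing instead would surface mistakes earlier during development. Either choice bounds the number of time series per metric.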


Implementation Plan

  1. Dependencies (example packages)

    • Add OpenTelemetry JS packages to the Node app (install via npm/yarn):
      • @opentelemetry/api
      • @opentelemetry/sdk-metrics (or current stable metrics SDK)
      • @opentelemetry/exporter-prometheus
      • @opentelemetry/exporter-metrics-otlp-http (or OTLP exporter variant)
      • @opentelemetry/instrumentation-http
      • Framework instrumentation if used (e.g., @opentelemetry/instrumentation-express, Next.js instrumentation patterns)
      • ...
  2. Central metrics module

    • Create src/metrics/ (or server/metrics/) that:
      • Initializes OTel MeterProvider.
      • Registers Prometheus exporter (when enabled) and exposes the exporter handler on /metrics (or mounts to existing server route).
      • Optionally registers OTLP exporter when configured via env vars.
      • Exposes a singleton metrics object with helper functions:
        • inc(name, labels), observe(name, value, labels), setGauge(name, value, labels) — mapped to pre-registered instruments.
      • Provides shutdown() to flush metrics.
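
A rough sketch of the singleton facade described in this step follows. It is deliberately dependency-free: `CounterLike` and `HistogramLike` are hypothetical structural stand-ins (OpenTelemetry's `Counter.add` and `Histogram.record` have this shape), and the `Metrics` class, its registration methods, and `readGauge` are assumptions, not an existing Pangolin or OTel API. Exporter wiring and `shutdown()` are left out.

```typescript
type Attrs = Record<string, string>;

// Structural stand-ins for the instruments this facade needs.
interface CounterLike { add(value: number, attributes?: Attrs): void; }
interface HistogramLike { record(value: number, attributes?: Attrs): void; }

class Metrics {
  private readonly counters = new Map<string, CounterLike>();
  private readonly histograms = new Map<string, HistogramLike>();
  private readonly gauges = new Map<string, Map<string, number>>();

  registerCounter(name: string, c: CounterLike): void { this.counters.set(name, c); }
  registerHistogram(name: string, h: HistogramLike): void { this.histograms.set(name, h); }

  // Increments a pre-registered counter; unknown names fail loudly.
  inc(name: string, labels: Attrs = {}, by = 1): void {
    const c = this.counters.get(name);
    if (!c) throw new Error(`counter not registered: ${name}`);
    c.add(by, labels);
  }

  // Records one observation on a pre-registered histogram.
  observe(name: string, value: number, labels: Attrs = {}): void {
    const h = this.histograms.get(name);
    if (!h) throw new Error(`histogram not registered: ${name}`);
    h.record(value, labels);
  }

  // Stores the latest value per label set; a collection-time callback can read it.
  setGauge(name: string, value: number, labels: Attrs = {}): void {
    let series = this.gauges.get(name);
    if (!series) {
      series = new Map<string, number>();
      this.gauges.set(name, series);
    }
    series.set(Metrics.key(labels), value);
  }

  readGauge(name: string, labels: Attrs = {}): number | undefined {
    return this.gauges.get(name)?.get(Metrics.key(labels));
  }

  // Order-independent key for a label set.
  private static key(labels: Attrs): string {
    return Object.keys(labels).sort().map((k) => `${k}=${labels[k]}`).join(",");
  }
}

// Single shared instance for the process.
const metrics = new Metrics();
```

Pre-registering instruments means a typo in a metric name throws instead of silently creating a new series. In a real module the registered objects would be instruments created from the OTel meter.
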
  3. Instrumentation approach

    • HTTP: use @opentelemetry/instrumentation-http plus a manual wrapper to label proxied requests with site_id, resource_id, backend, etc.
    • Proxy logic: instrument where Pangolin forwards requests to backends; record counts, statuses and latencies.
    • Auth: instrument login/logout flows, failed attempts, active sessions gauge.
    • Tunnel events: instrument connect/disconnect/reconnect and throughput/latency where Pangolin has visibility.
    • Background jobs, config reloads, cert expiry checks: instrument events and counters.
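
The manual wrapper for proxied requests could look roughly like the sketch below. `Recorder` and `recordProxied` are hypothetical names, the injected `nowMs` clock exists only to make the function testable, and reporting `status="error"` when the forward call throws is an assumed convention.

```typescript
type Attrs = Record<string, string>;

// Hypothetical recording surface; the metrics module's inc/observe helpers fit it.
interface Recorder {
  inc(name: string, labels: Attrs): void;
  observe(name: string, value: number, labels: Attrs): void;
}

// Wraps one forwarded request: records its duration in seconds and counts it by status.
async function recordProxied<T extends { status: number }>(
  rec: Recorder,
  labels: Attrs, // e.g. site_id, resource_id, backend, method
  forward: () => Promise<T>,
  nowMs: () => number = () => Date.now(),
): Promise<T> {
  const start = nowMs();
  let status = "error"; // assumed value when forward() throws
  try {
    const res = await forward();
    status = String(res.status);
    return res;
  } finally {
    const seconds = (nowMs() - start) / 1000;
    rec.observe("pangolin_resource_request_duration_seconds", seconds, labels);
    rec.inc("pangolin_resource_requests_total", { ...labels, status });
  }
}
```

Recording in `finally` ensures failed forwards are counted too, and the status label is added only to the counter, matching the label sets in the catalog.
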
  4. Histograms & buckets

    • Configure histogram buckets per spec (duration buckets and byte-size buckets).
    • Use seconds for durations; bytes for sizes.
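
Since no concrete boundaries are listed here, the following sketch only illustrates how duration and byte-size buckets could be defined. The duration values match the Prometheus client-library defaults; the byte-size range, `exponentialBuckets`, and `bucketIndex` are assumptions for illustration. How the arrays are handed to the SDK depends on the SDK version and is not shown.

```typescript
// Illustrative duration boundaries in seconds; tune to Pangolin's real latency profile.
const DURATION_BUCKETS_SECONDS: readonly number[] = [
  0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10,
];

// Generates `count` boundaries: start, start*factor, start*factor^2, ...
function exponentialBuckets(start: number, factor: number, count: number): number[] {
  if (start <= 0 || factor <= 1 || count < 1) {
    throw new Error("need start > 0, factor > 1, count >= 1");
  }
  const out: number[] = [];
  let bound = start;
  for (let i = 0; i < count; i++) {
    out.push(bound);
    bound *= factor;
  }
  return out;
}

// Assumed byte-size boundaries: 1 KiB up to 1 MiB in steps of x4.
const SIZE_BUCKETS_BYTES: readonly number[] = exponentialBuckets(1024, 4, 6);

// Index of the first bucket whose inclusive upper bound is >= value;
// returns bounds.length for the overflow (+Inf) bucket.
function bucketIndex(value: number, bounds: readonly number[]): number {
  const i = bounds.findIndex((upper) => value <= upper);
  return i === -1 ? bounds.length : i;
}
```

A 30 ms request (0.03 s) lands in the bucket with upper bound 0.05, which is why boundaries must be expressed in seconds rather than milliseconds.
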
  5. Exporter configuration (runtime)

    • Environment variables (suggested defaults):
      • PANGOLIN_METRICS_PROMETHEUS_ENABLED=true
      • PANGOLIN_METRICS_OTLP_ENABLED=false
      • OTEL_EXPORTER_OTLP_ENDPOINT (when OTLP enabled)
      • OTEL_EXPORTER_OTLP_PROTOCOL (http/protobuf or grpc)
      • OTEL_SERVICE_NAME=pangolin
      • OTEL_RESOURCE_ATTRIBUTES (e.g., service.instance.id=...)
      • OTEL_METRIC_EXPORT_INTERVAL (ms)
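
The variables above could be read into a typed config object at startup, so switching exporters needs no code change. This is a sketch under assumptions: `MetricsConfig`, `parseBool`, and `loadMetricsConfig` are hypothetical names; the 60000 ms interval and `http/protobuf` protocol fallbacks follow the OpenTelemetry environment-variable defaults; the other defaults are the ones suggested in this list.

```typescript
type Env = Record<string, string | undefined>;

interface MetricsConfig {
  prometheusEnabled: boolean;
  otlpEnabled: boolean;
  otlpEndpoint: string | undefined; // left undefined so the exporter can apply its own default
  otlpProtocol: "http/protobuf" | "grpc";
  serviceName: string;
  exportIntervalMs: number;
}

// Accepts "true"/"false" in any case; anything else falls back to the default.
function parseBool(raw: string | undefined, fallback: boolean): boolean {
  if (raw === undefined) return fallback;
  const v = raw.trim().toLowerCase();
  if (v === "true") return true;
  if (v === "false") return false;
  return fallback;
}

// Builds the config from an env-like object (pass process.env in the app).
function loadMetricsConfig(env: Env): MetricsConfig {
  const interval = Number(env["OTEL_METRIC_EXPORT_INTERVAL"]);
  return {
    prometheusEnabled: parseBool(env["PANGOLIN_METRICS_PROMETHEUS_ENABLED"], true),
    otlpEnabled: parseBool(env["PANGOLIN_METRICS_OTLP_ENABLED"], false),
    otlpEndpoint: env["OTEL_EXPORTER_OTLP_ENDPOINT"],
    otlpProtocol: env["OTEL_EXPORTER_OTLP_PROTOCOL"] === "grpc" ? "grpc" : "http/protobuf",
    serviceName: env["OTEL_SERVICE_NAME"] ?? "pangolin",
    exportIntervalMs: Number.isFinite(interval) && interval > 0 ? interval : 60000,
  };
}
```

Taking the environment as a parameter instead of reading `process.env` directly keeps the function pure and easy to test.
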
  6. Local testing

    • Provide docker-compose.metrics.yml with:
      • Pangolin
      • OpenTelemetry Collector (example config)
      • Prometheus (scraping /metrics or Collector)
      • Grafana (optional)
    • Validate both direct Prometheus scrape and OTLP → Collector → remote_write flows.
  7. Collector example

    • Include example collector.yaml demonstrating:
      • OTLP receiver
      • Transform processor to promote resource attributes (e.g., site_id, resource_id)
      • Prometheus remote_write exporter (generic endpoint)
      • Notes on name normalization and out‑of‑order ingestion if sending OTLP to Prometheus
  8. Documentation

    • observability.md:
      • Metric catalog (name, type, labels, units, description)
      • How to enable/disable Prometheus exporter and OTLP exporter via env vars
      • How to run Docker Compose test stack
      • How to add a new metric (naming, labels, buckets)
  9. Testing & validation

    • Manual test: start compose, generate traffic, curl /metrics, and verify metric names, units, labels, and histogram buckets.
    • Include sample /metrics output in the PR.
    • ...
