Closed
Description
Add OpenTelemetry-based observability to Pangolin
Summary / Goal
Instrument Pangolin with OpenTelemetry Metrics (OTel) following CNCF / industry best practices so that:
- Metrics are emitted using the OpenTelemetry JS SDK (vendor-neutral API).
- Metrics are backend-agnostic and exportable to Prometheus‑compatible backends and any OTLP‑supporting system via the OpenTelemetry Collector.
- Semantic conventions, SI units (`_seconds`, `_bytes`), and low‑cardinality labels are enforced.
- Focus is metrics first; design should allow adding traces and logs later.
- Provide an out‑of‑the‑box `/metrics` endpoint for Prometheus scraping (Prometheus exporter) and example OTel Collector config for production pipelines.
Why OpenTelemetry (OTel)
- OTel is the CNCF standard for multi‑signal observability (metrics, traces, logs).
- Instrument once, export anywhere (Prometheus, Grafana Mimir, Thanos/Cortex, cloud vendors).
- Use the OTel Collector for attribute enrichment, normalization, batching, and remote_write.
Requirements & Constraints
- Use the OpenTelemetry JavaScript/TypeScript SDKs and official instrumentation packages.
- Provide a `/metrics` endpoint in Prometheus exposition format via the OTel Prometheus exporter.
- All durations in seconds and sizes in bytes. Metric names should carry units where applicable (`_seconds`, `_bytes`) and counters use `_total`.
- Enforce low‑cardinality label design (e.g., `site_id`, `resource_id`). Do not use per‑request unique values as labels.
- All exporters configurable at runtime through environment variables or configuration (no code change to switch exporter).
- Provide an example OTel Collector configuration that demonstrates OTLP ingestion, attribute promotion and `prometheusremotewrite` usage.
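The naming rules above could be checked mechanically, for example in a unit test over the metric catalog. The following is a minimal sketch, not part of Pangolin: `validateMetricName`, `MetricSpec`, and the required `pangolin_` prefix are assumptions made for illustration.

```typescript
// Hypothetical helper: checks a metric name against the naming rules listed above.
type MetricType = "counter" | "gauge" | "histogram";
type Unit = "seconds" | "bytes" | "count" | "bool";

interface MetricSpec {
  name: string;
  type: MetricType;
  unit: Unit;
}

function validateMetricName(spec: MetricSpec): string[] {
  const problems: string[] = [];

  // snake_case with a fixed prefix (the prefix is an assumption of this sketch).
  if (!/^pangolin_[a-z0-9]+(?:_[a-z0-9]+)*$/.test(spec.name)) {
    problems.push("name must be snake_case and start with pangolin_");
  }

  // Counters carry _total; other instrument types must not.
  const hasTotal = spec.name.endsWith("_total");
  if (spec.type === "counter" && !hasTotal) {
    problems.push("counter names must end with _total");
  }
  if (spec.type !== "counter" && hasTotal) {
    problems.push("only counters may end with _total");
  }

  // Durations and sizes carry their unit, placed before any _total suffix.
  if (spec.unit === "seconds" || spec.unit === "bytes") {
    const base = hasTotal ? spec.name.slice(0, -"_total".length) : spec.name;
    if (!base.endsWith(`_${spec.unit}`)) {
      problems.push(`name must carry the _${spec.unit} unit suffix`);
    }
  }
  return problems;
}
```

An empty result means the name passes; each string in the result describes one violated rule.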
Recommended Pangolin Metrics (TypeScript implementation)
Use snake_case names and include unit and `_total` suffixes where applicable.
| Category | Metric Name | Type | Labels | Unit / Notes |
|---|---|---|---|---|
| Site / Global | `pangolin_site_active_sites` | Gauge | `site_id, region` | count |
| | `pangolin_site_online` | Gauge 0/1 | `site_id, transport` | bool |
| | `pangolin_site_bandwidth_bytes_total` | Counter | `site_id, direction, protocol` | bytes |
| | `pangolin_site_uptime_seconds_total` | Counter | `site_id` | seconds |
| | `pangolin_site_connection_drops_total` | Counter | `site_id` | count |
| | `pangolin_site_handshake_latency_seconds` | Histogram | `site_id, transport` | seconds |
| Resource / App | `pangolin_resource_requests_total` | Counter | `site_id,resource_id,backend,method,status` | count |
| | `pangolin_resource_request_duration_seconds` | Histogram | `site_id,resource_id,backend,method` | seconds |
| | `pangolin_resource_active_connections` | Gauge | `site_id,resource_id,protocol` | count |
| | `pangolin_resource_errors_total` | Counter | `site_id,resource_id,backend,error_type` | count |
| | `pangolin_resource_bandwidth_bytes_total` | Counter | `site_id,resource_id,direction` | bytes |
| Tunnel / Transport | `pangolin_tunnel_up` | Gauge 0/1 | `site_id,transport` | bool |
| | `pangolin_tunnel_reconnects_total` | Counter | `site_id,transport,reason` | count |
| | `pangolin_tunnel_latency_seconds` | Histogram | `site_id,transport` | seconds |
| | `pangolin_tunnel_bytes_total` | Counter | `site_id,transport,direction` | bytes |
| | `pangolin_wg_handshake_total` | Counter | `site_id,result` | count |
| Backend | `pangolin_backend_health_status` | Gauge 1/0 | `backend,site_id` | bool |
| | `pangolin_backend_connection_errors_total` | Counter | `backend,site_id,error_type` | count |
| | `pangolin_backend_response_size_bytes` | Histogram | `backend,site_id` | bytes |
| Auth / Identity | `pangolin_auth_requests_total` | Counter | `site_id,auth_method,result` | count |
| | `pangolin_auth_request_duration_seconds` | Histogram | `auth_method,result` | seconds |
| | `pangolin_auth_active_users` | Gauge | `site_id,auth_method` | count |
| | `pangolin_auth_failure_reasons_total` | Counter | `site_id,reason,auth_method` | count |
| Tokens / Sessions | `pangolin_token_issued_total` | Counter | `site_id,auth_method` | count |
| | `pangolin_token_revoked_total` | Counter | `reason` | count |
| | `pangolin_token_refresh_total` | Counter | `site_id,result` | count |
| UI / API | `pangolin_ui_requests_total` | Counter | `endpoint,method,status` | count |
| | `pangolin_ui_active_sessions` | Gauge | | count |
| Operational | `pangolin_config_reloads_total` | Counter | `result` | count |
| | `pangolin_restart_count_total` | Counter | | count |
| | `pangolin_background_jobs_total` | Counter | `job_type,status` | count |
| | `pangolin_certificates_expiry_days` | Gauge | `site_id,resource_id` | days |
Label guidelines: prefer site_id/resource_id. Avoid per‑request unique labels (user IDs, full URLs). Use enums and stable identifiers.
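One way to enforce the label guidelines at the call site is an allowlist that drops any label key not declared up front, so a stray `user_id` or full URL never reaches the exporter. This is a sketch under assumptions: `sanitizeLabels` and `ALLOWED_LABELS` are hypothetical names, and the allowlist shown covers only a subset of the labels in the table.

```typescript
// Hypothetical allowlist built from a subset of the label names in the table above.
const ALLOWED_LABELS: ReadonlySet<string> = new Set([
  "site_id",
  "resource_id",
  "backend",
  "method",
  "status",
  "transport",
  "direction",
  "protocol",
  "error_type",
  "auth_method",
  "result",
  "reason",
]);

// Returns a copy of `labels` containing only allowlisted keys.
// High-cardinality keys (user IDs, full URLs, request IDs) are silently dropped.
function sanitizeLabels(
  labels: Record<string, string>,
  allowed: ReadonlySet<string> = ALLOWED_LABELS,
): Record<string, string> {
  const kept: Record<string, string> = {};
  for (const [key, value] of Object.entries(labels)) {
    if (allowed.has(key)) {
      kept[key] = value;
    }
  }
  return kept;
}
```

An allowlist bounds the set of label keys but not their values, so fields such as `reason` or `error_type` would still need to be mapped to a fixed enum before recording.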
Implementation Plan
- Dependencies (example packages)
  - Add OpenTelemetry JS packages to the Node app (install via npm/yarn):
    - `@opentelemetry/api`
    - `@opentelemetry/sdk-metrics` (or current stable metrics SDK)
    - `@opentelemetry/exporter-prometheus`
    - `@opentelemetry/exporter-metrics-otlp-http` (or OTLP exporter variant)
    - `@opentelemetry/instrumentation-http`
    - Framework instrumentation if used (e.g., `@opentelemetry/instrumentation-express`, Next.js instrumentation patterns)
    - ...
- Central metrics module
  - Create `src/metrics/` (or `server/metrics/`) that:
    - Initializes OTel MeterProvider.
    - Registers Prometheus exporter (when enabled) and exposes the exporter handler on `/metrics` (or mounts to existing server route).
    - Optionally registers OTLP exporter when configured via env vars.
    - Exposes a singleton `metrics` object with helper functions `inc(name, labels)`, `observe(name, value, labels)`, and `setGauge(name, value, labels)`, each mapped to pre-registered instruments.
    - Provides `shutdown()` to flush metrics.
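The helper object described above might look roughly like the sketch below. To stay self-contained it does not import the OpenTelemetry packages: `CounterLike`, `HistogramLike`, and `GaugeLike` are local stand-ins shaped like the `add`/`record` methods of OTel instruments, and `Metrics`, its `register*` methods, and `RecordingInstrument` are hypothetical names. In a real module the instruments would presumably come from a `Meter`.

```typescript
type Labels = Record<string, string>;

// Local stand-ins for OTel instrument shapes (assumption of this sketch).
interface CounterLike { add(value: number, attributes?: Labels): void; }
interface HistogramLike { record(value: number, attributes?: Labels): void; }
interface GaugeLike { record(value: number, attributes?: Labels): void; }

class Metrics {
  private readonly counters = new Map<string, CounterLike>();
  private readonly histograms = new Map<string, HistogramLike>();
  private readonly gauges = new Map<string, GaugeLike>();

  registerCounter(name: string, counter: CounterLike): void { this.counters.set(name, counter); }
  registerHistogram(name: string, histogram: HistogramLike): void { this.histograms.set(name, histogram); }
  registerGauge(name: string, gauge: GaugeLike): void { this.gauges.set(name, gauge); }

  // Increment a pre-registered counter by one.
  inc(name: string, labels: Labels = {}): void {
    Metrics.lookup(this.counters, "counter", name).add(1, labels);
  }

  // Record one observation on a pre-registered histogram.
  observe(name: string, value: number, labels: Labels = {}): void {
    Metrics.lookup(this.histograms, "histogram", name).record(value, labels);
  }

  // Set the current value of a pre-registered gauge.
  setGauge(name: string, value: number, labels: Labels = {}): void {
    Metrics.lookup(this.gauges, "gauge", name).record(value, labels);
  }

  // Unknown names fail loudly so that typos cannot create ad-hoc metrics.
  private static lookup<T>(map: Map<string, T>, kind: string, name: string): T {
    const instrument = map.get(name);
    if (instrument === undefined) {
      throw new Error(`unregistered ${kind}: ${name}`);
    }
    return instrument;
  }
}

// In-memory instrument used here only to demonstrate and test the facade.
class RecordingInstrument implements CounterLike, HistogramLike {
  readonly calls: Array<{ value: number; attributes: Labels }> = [];
  add(value: number, attributes: Labels = {}): void { this.calls.push({ value, attributes }); }
  record(value: number, attributes: Labels = {}): void { this.calls.push({ value, attributes }); }
}

// Single shared instance, as the plan suggests.
const metrics = new Metrics();
```

Throwing on unregistered names is one possible design choice; logging and ignoring would be gentler in production but makes catalog drift harder to notice.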
- Instrumentation approach
  - HTTP: use `@opentelemetry/instrumentation-http` plus manual wrapper to label proxied requests with `site_id`, `resource_id`, `backend`, etc.
  - Proxy logic: instrument where Pangolin forwards requests to backends; record counts, statuses and latencies.
  - Auth: instrument login/logout flows, failed attempts, active sessions gauge.
  - Tunnel events: instrument connect/disconnect/reconnect and throughput/latency where Pangolin has visibility.
  - Background jobs, config reloads, cert expiry checks: instrument events and counters.
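The manual wrapper for proxied requests could be as small as a start/finish pair around the forwarding call. The sketch below is an illustration only: `startProxyTimer`, `Recorder`, `ProxyLabels`, and `Outcome` are hypothetical names, the metric names and label sets are taken from the table in this issue, and the clock is injectable so the duration can be tested deterministically.

```typescript
type ProxyLabels = { site_id: string; resource_id: string; backend: string; method: string };
type Outcome = { status: number } | { errorType: string };

// Minimal recording surface, matching the inc/observe helpers proposed in this issue.
interface Recorder {
  inc(name: string, labels: Record<string, string>): void;
  observe(name: string, value: number, labels: Record<string, string>): void;
}

// Call before forwarding; call the returned function once the backend responds or fails.
function startProxyTimer(
  recorder: Recorder,
  labels: ProxyLabels,
  now: () => number = Date.now, // milliseconds
): (outcome: Outcome) => void {
  const startedAtMs = now();
  return (outcome) => {
    // Durations are reported in seconds, per the requirements.
    const seconds = (now() - startedAtMs) / 1000;
    if ("status" in outcome) {
      recorder.inc("pangolin_resource_requests_total", {
        ...labels,
        status: String(outcome.status),
      });
    } else {
      // error_type should be a small fixed enum, never a raw error message.
      const { site_id, resource_id, backend } = labels;
      recorder.inc("pangolin_resource_errors_total", {
        site_id,
        resource_id,
        backend,
        error_type: outcome.errorType,
      });
    }
    recorder.observe("pangolin_resource_request_duration_seconds", seconds, labels);
  };
}
```

A caller would invoke `startProxyTimer(...)` just before forwarding and pass either `{ status }` or `{ errorType }` to the returned function in the response and error handlers.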
- Histograms & buckets
  - Configure histogram buckets per spec (duration buckets and byte-size buckets).
  - Use seconds for durations; bytes for sizes.
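The issue does not fix concrete bucket boundaries, so the values below are illustrative assumptions rather than a recommendation. The sketch shows one common way to derive boundaries, exponential spacing, with a hypothetical `exponentialBuckets` helper producing one set in seconds and one in bytes.

```typescript
// Generates `count` upper bounds: start, start*factor, start*factor^2, ...
function exponentialBuckets(start: number, factor: number, count: number): number[] {
  if (start <= 0 || factor <= 1 || count < 1) {
    throw new RangeError("start must be > 0, factor > 1, and count >= 1");
  }
  return Array.from({ length: count }, (_, i) => start * factor ** i);
}

// Illustrative duration buckets in seconds: 5 ms doubling up to 10.24 s.
const DURATION_BUCKETS_SECONDS = exponentialBuckets(0.005, 2, 12);

// Illustrative size buckets in bytes: 1 KiB growing 4x up to 16 MiB.
const SIZE_BUCKETS_BYTES = exponentialBuckets(1024, 4, 8);
```

Arrays like these would then be supplied to the metrics SDK as explicit histogram boundaries for the `_seconds` and `_bytes` histograms respectively.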
- Exporter configuration (runtime)
  - Environment variables (suggested defaults):
    - `PANGOLIN_METRICS_PROMETHEUS_ENABLED=true`
    - `PANGOLIN_METRICS_OTLP_ENABLED=false`
    - `OTEL_EXPORTER_OTLP_ENDPOINT` (when OTLP enabled)
    - `OTEL_EXPORTER_OTLP_PROTOCOL` (http/protobuf or grpc)
    - `OTEL_SERVICE_NAME=pangolin`
    - `OTEL_RESOURCE_ATTRIBUTES` (e.g., `service.instance.id=...`)
    - `OTEL_METRIC_EXPORT_INTERVAL` (ms)
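Reading these variables into a typed config object keeps exporter selection out of the instrumentation code. The sketch below is hypothetical (`loadMetricsConfig`, `MetricsConfig`, and `parseBool` are illustrative names). The enabled flags and service name default to the values suggested above, while the `http/protobuf` protocol and 60000 ms interval fallbacks are assumptions of this sketch.

```typescript
interface MetricsConfig {
  prometheusEnabled: boolean;
  otlpEnabled: boolean;
  otlpEndpoint: string | undefined;
  otlpProtocol: string;
  serviceName: string;
  exportIntervalMs: number;
}

type Env = Record<string, string | undefined>;

// Accepts "true"/"1" and "false"/"0" (case-insensitive); anything else falls back.
function parseBool(value: string | undefined, fallback: boolean): boolean {
  if (value === undefined) return fallback;
  const normalized = value.trim().toLowerCase();
  if (normalized === "true" || normalized === "1") return true;
  if (normalized === "false" || normalized === "0") return false;
  return fallback;
}

function loadMetricsConfig(env: Env): MetricsConfig {
  const interval = Number(env["OTEL_METRIC_EXPORT_INTERVAL"]);
  return {
    prometheusEnabled: parseBool(env["PANGOLIN_METRICS_PROMETHEUS_ENABLED"], true),
    otlpEnabled: parseBool(env["PANGOLIN_METRICS_OTLP_ENABLED"], false),
    // Left undefined when unset so the exporter can apply its own default.
    otlpEndpoint: env["OTEL_EXPORTER_OTLP_ENDPOINT"],
    otlpProtocol: env["OTEL_EXPORTER_OTLP_PROTOCOL"] ?? "http/protobuf", // assumed fallback
    serviceName: env["OTEL_SERVICE_NAME"] ?? "pangolin",
    // Invalid or missing values fall back to 60 s (assumed fallback).
    exportIntervalMs: Number.isFinite(interval) && interval > 0 ? interval : 60000,
  };
}
```

In a Node process this would typically be called once at startup as `loadMetricsConfig(process.env)`.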
- Local testing
  - Provide `docker-compose.metrics.yml` with:
    - Pangolin
    - OpenTelemetry Collector (example config)
    - Prometheus (scraping `/metrics` or Collector)
    - Grafana (optional)
  - Validate both direct Prometheus scrape and OTLP → Collector → remote_write flows.
- Collector example
  - Include `example collector.yaml` demonstrating:
    - OTLP receiver
    - Transform processor to promote resource attributes (e.g., `site_id`, `resource_id`)
    - Prometheus remote_write exporter (generic endpoint)
    - Notes on name normalization and out‑of‑order ingestion if sending OTLP to Prometheus
- Documentation
  - `observability.md`:
    - Metric catalog (name, type, labels, units, description)
    - How to enable/disable Prometheus exporter and OTLP exporter via env vars
    - How to run Docker Compose test stack
    - How to add a new metric (naming, labels, buckets)
- Testing & validation
  - Manual test: start compose, generate traffic, curl `/metrics`, and verify metric names, units, labels and histogram buckets.
  - Include sample `/metrics` output in the PR.
  - ...
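Part of the manual check, verifying histogram buckets in the scraped output, could be scripted. The following sketch parses Prometheus text exposition lines and returns the `le` bounds for one histogram. `extractBucketBounds` is a hypothetical helper, and the sample text is made up for illustration, not real Pangolin output.

```typescript
// Collects the distinct `le` values of `<metric>_bucket{...}` lines, in order of appearance.
function extractBucketBounds(exposition: string, metric: string): string[] {
  const bounds: string[] = [];
  for (const line of exposition.split("\n")) {
    if (!line.startsWith(`${metric}_bucket{`)) continue;
    const le = /le="([^"]+)"/.exec(line)?.[1];
    if (le !== undefined && !bounds.includes(le)) {
      bounds.push(le);
    }
  }
  return bounds;
}

// Made-up sample in Prometheus text exposition format.
const sampleExposition = [
  "# TYPE pangolin_resource_request_duration_seconds histogram",
  'pangolin_resource_request_duration_seconds_bucket{site_id="s1",le="0.005"} 1',
  'pangolin_resource_request_duration_seconds_bucket{site_id="s1",le="0.01"} 3',
  'pangolin_resource_request_duration_seconds_bucket{site_id="s1",le="+Inf"} 4',
  'pangolin_resource_request_duration_seconds_sum{site_id="s1"} 0.03',
  'pangolin_resource_request_duration_seconds_count{site_id="s1"} 4',
].join("\n");

const observedBounds = extractBucketBounds(
  sampleExposition,
  "pangolin_resource_request_duration_seconds",
);
```

The returned bounds can then be compared against the configured bucket list to confirm the exporter is emitting what was intended.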
References & Best Practices
- Traefik – Metrics (observability) – Traefik metrics configuration and exporter options (Prometheus, OpenTelemetry).
- OpenTelemetry – JavaScript/TypeScript: Getting Started / Instrumentation Guide – How to instrument JavaScript/TypeScript/Node.js applications with OpenTelemetry.
- OpenTelemetry – JavaScript/TypeScript: Exporters – Exporter options for Node.js and browser (OTLP, Prometheus, etc.).
Practical walkthroughs & blog posts
- OpenTelemetry Blog – Prometheus + OpenTelemetry (2024) – Practical notes on combining Prometheus and OpenTelemetry.
- Grafana Blog – A Practical Guide to Data Collection with OpenTelemetry and Prometheus (Jul 2023) – Hands‑on examples and best practices for OTEL + Prometheus.
- BetterStack – OpenTelemetry for Node.js – Practical guide for instrumenting Node.js apps with OpenTelemetry.
- BetterStack – OpenTelemetry Metrics vs Prometheus Metrics – Comparison and guidance on when to use OTEL vs Prometheus metrics.