[Feature Request] Implement OpenTelemetry Metrics in Pangolin

Add OpenTelemetry-based observability to Pangolin
---

## Summary / Goal

Instrument Pangolin with **OpenTelemetry Metrics (OTel)** following CNCF / industry best practices so that:

- Metrics are emitted using the OpenTelemetry JS SDK (vendor-neutral API).
- Metrics are backend-agnostic and exportable to Prometheus‑compatible backends and any OTLP‑supporting system via the OpenTelemetry Collector.
- Semantic conventions, base units (`_seconds`, `_bytes`), and low‑cardinality labels are enforced.
- Focus is metrics first; design should allow adding traces and logs later.
- Provide an out‑of‑the‑box `/metrics` endpoint for Prometheus scraping (Prometheus exporter) and example OTel Collector config for production pipelines.

---

## Why OpenTelemetry (OTel)

- OTel is the CNCF standard for multi‑signal observability (metrics, traces, logs).
- Instrument once, export anywhere (Prometheus, Grafana Mimir, Thanos/Cortex, cloud vendors).
- Use the OTel Collector for attribute enrichment, normalization, batching, and remote_write.

---

## Requirements & Constraints

- Use the **OpenTelemetry JavaScript/TypeScript SDKs** and official instrumentation packages.
- Provide a `/metrics` endpoint in Prometheus exposition format via the OTel Prometheus exporter.
- All durations in **seconds** and sizes in **bytes**. Metric names should carry units where applicable (`_seconds`, `_bytes`) and counters use `_total`.
- Enforce low‑cardinality label design (e.g., `site_id`, `resource_id`). Do not use per‑request unique values as labels.
- All exporters configurable at runtime through environment variables or configuration (no code change to switch exporter).
- Provide an example **OTel Collector** configuration that demonstrates OTLP ingestion, attribute promotion and `prometheusremotewrite` usage.
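
The naming constraints above could be checked mechanically, for example in a unit test that walks the metric catalog. The following is a minimal sketch under that assumption; `MetricSpec` and `validateMetricName` are hypothetical names used for illustration and are not part of Pangolin or OpenTelemetry.

```typescript
// Hypothetical catalog entry; the field names are assumptions for this sketch.
type MetricKind = "counter" | "gauge" | "histogram";
type MetricUnit = "seconds" | "bytes" | "count" | "bool";

interface MetricSpec {
  name: string;
  kind: MetricKind;
  unit: MetricUnit;
}

// Returns a list of rule violations; an empty list means the name conforms.
function validateMetricName(spec: MetricSpec): string[] {
  const problems: string[] = [];

  // snake_case with the pangolin_ prefix.
  if (!/^pangolin_[a-z0-9]+(_[a-z0-9]+)*$/.test(spec.name)) {
    problems.push("name must be snake_case and start with pangolin_");
  }

  // Counters, and only counters, carry the _total suffix.
  const hasTotal = spec.name.endsWith("_total");
  if (spec.kind === "counter" && !hasTotal) {
    problems.push("counter names must end in _total");
  }
  if (spec.kind !== "counter" && hasTotal) {
    problems.push("only counters may end in _total");
  }

  // The unit suffix comes before _total, e.g. ..._bytes_total.
  const base = spec.name.replace(/_total$/, "");
  if (spec.unit === "seconds" && !base.endsWith("_seconds")) {
    problems.push("duration metrics must carry the _seconds suffix");
  }
  if (spec.unit === "bytes" && !base.endsWith("_bytes")) {
    problems.push("size metrics must carry the _bytes suffix");
  }
  return problems;
}
```

Running such a check in CI is one way to keep new metrics aligned with the rules without relying on review alone.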

---

## Recommended Pangolin Metrics (TypeScript implementation)

Use snake_case names and include unit and `_total` suffixes where applicable.

| Category          | Metric Name                                   | Type       | Labels                                                                                       | Unit / Notes |
|-------------------|-----------------------------------------------|------------|----------------------------------------------------------------------------------------------|--------------|
| **Site / Global** | `pangolin_site_active_sites`                  | Gauge      | `site_id`, `region`                                                                          | count        |
|                   | `pangolin_site_online`                        | Gauge 0/1  | `site_id`, `transport`                                                                       | bool         |
|                   | `pangolin_site_bandwidth_bytes_total`         | Counter    | `site_id`, `direction`, `protocol`                                                           | bytes        |
|                   | `pangolin_site_uptime_seconds_total`          | Counter    | `site_id`                                                                                    | seconds      |
|                   | `pangolin_site_connection_drops_total`        | Counter    | `site_id`                                                                                    | count        |
|                   | `pangolin_site_handshake_latency_seconds`     | Histogram  | `site_id`, `transport`                                                                       | seconds      |
| **Resource / App**| `pangolin_resource_requests_total`             | Counter    | `site_id`,`resource_id`,`backend`,`method`,`status`                                          | count        |
|                   | `pangolin_resource_request_duration_seconds`  | Histogram  | `site_id`,`resource_id`,`backend`,`method`                                                   | seconds      |
|                   | `pangolin_resource_active_connections`        | Gauge      | `site_id`,`resource_id`,`protocol`                                                           | count        |
|                   | `pangolin_resource_errors_total`              | Counter    | `site_id`,`resource_id`,`backend`,`error_type`                                               | count        |
|                   | `pangolin_resource_bandwidth_bytes_total`     | Counter    | `site_id`,`resource_id`,`direction`                                                          | bytes        |
| **Tunnel / Transport** | `pangolin_tunnel_up`                      | Gauge 0/1  | `site_id`,`transport`                                                                        | bool         |
|                   | `pangolin_tunnel_reconnects_total`            | Counter    | `site_id`,`transport`,`reason`                                                               | count        |
|                   | `pangolin_tunnel_latency_seconds`             | Histogram  | `site_id`,`transport`                                                                        | seconds      |
|                   | `pangolin_tunnel_bytes_total`                 | Counter    | `site_id`,`transport`,`direction`                                                            | bytes        |
|                   | `pangolin_wg_handshake_total`                 | Counter    | `site_id`,`result`                                                                           | count        |
| **Backend**       | `pangolin_backend_health_status`              | Gauge 0/1  | `backend`,`site_id`                                                                          | bool         |
|                   | `pangolin_backend_connection_errors_total`    | Counter    | `backend`,`site_id`,`error_type`                                                             | count        |
|                   | `pangolin_backend_response_size_bytes`        | Histogram  | `backend`,`site_id`                                                                          | bytes        |
| **Auth / Identity**| `pangolin_auth_requests_total`                | Counter    | `site_id`,`auth_method`,`result`                                                             | count        |
|                   | `pangolin_auth_request_duration_seconds`      | Histogram  | `auth_method`,`result`                                                                       | seconds      |
|                   | `pangolin_auth_active_users`                  | Gauge      | `site_id`,`auth_method`                                                                      | count        |
|                   | `pangolin_auth_failure_reasons_total`         | Counter    | `site_id`,`reason`,`auth_method`                                                             | count        |
| **Tokens / Sessions** | `pangolin_token_issued_total`              | Counter    | `site_id`,`auth_method`                                                                      | count        |
|                   | `pangolin_token_revoked_total`                | Counter    | `reason`                                                                                     | count        |
|                   | `pangolin_token_refresh_total`                | Counter    | `site_id`,`result`                                                                           | count        |
| **UI / API**      | `pangolin_ui_requests_total`                   | Counter    | `endpoint`,`method`,`status`                                                                 | count        |
|                   | `pangolin_ui_active_sessions`                 | Gauge      |                                                                                              | count        |
| **Operational**   | `pangolin_config_reloads_total`                | Counter    | `result`                                                                                     | count        |
|                   | `pangolin_restart_count_total`                | Counter    |                                                                                              | count        |
|                   | `pangolin_background_jobs_total`              | Counter    | `job_type`,`status`                                                                          | count        |
|                   | `pangolin_certificates_expiry_days`           | Gauge      | `site_id`,`resource_id`                                                                      | days         |

_Label guidelines:_ prefer `site_id`/`resource_id`. Avoid per‑request unique labels (user IDs, full URLs). Use enums and stable identifiers.
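
One way to keep labels such as `method`, `status`, and `endpoint` bounded is to normalize raw request values before they are attached to a metric. The sketch below is illustrative only: the helper names are hypothetical, and collapsing status codes into classes is one option, not something this proposal mandates.

```typescript
// Known HTTP methods; anything else collapses into a single "OTHER" value.
const KNOWN_METHODS = new Set(["GET", "HEAD", "POST", "PUT", "PATCH", "DELETE", "OPTIONS"]);

function normalizeMethod(method: string): string {
  const upper = method.toUpperCase();
  return KNOWN_METHODS.has(upper) ? upper : "OTHER";
}

// Collapse a status code into its class ("2xx", "4xx", ...).
function statusClass(status: number): string {
  if (!Number.isInteger(status) || status < 100 || status > 599) {
    return "unknown";
  }
  return `${Math.floor(status / 100)}xx`;
}

// Drop the query string and fragment, then replace ID-like path segments
// with placeholders so that every URL maps to a small set of templates.
function normalizeEndpoint(path: string): string {
  const pathname = path.replace(/[?#].*$/, "");
  const uuid = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
  const segments = pathname.split("/").map((segment) => {
    if (/^\d+$/.test(segment)) return ":id";
    if (uuid.test(segment)) return ":uuid";
    return segment;
  });
  return segments.join("/") || "/";
}
```

Where the framework exposes the matched route template, using that template directly is usually preferable to pattern-based rewriting.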

---

## Implementation Plan

1. Dependencies (example packages)
   - Add OpenTelemetry JS packages to the Node app (install via npm/yarn):
     - `@opentelemetry/api`
     - `@opentelemetry/sdk-metrics` (or current stable metrics SDK)
     - `@opentelemetry/exporter-prometheus`
     - `@opentelemetry/exporter-metrics-otlp-http` (or OTLP exporter variant)
     - `@opentelemetry/instrumentation-http`
     - Framework instrumentation if used (e.g., `@opentelemetry/instrumentation-express`, Next.js instrumentation patterns)
     - ...

2. Central metrics module
   - Create `src/metrics/` (or `server/metrics/`) that:
     - Initializes OTel MeterProvider.
     - Registers Prometheus exporter (when enabled) and exposes the exporter handler on `/metrics` (or mounts to existing server route).
     - Optionally registers OTLP exporter when configured via env vars.
     - Exposes a singleton `metrics` object with helper functions:
       - `inc(name, labels)`, `observe(name, value, labels)`, `setGauge(name, value, labels)` — mapped to pre-registered instruments.
     - Provides `shutdown()` to flush metrics.
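
The helper surface described in step 2 could look roughly like the sketch below. It is written against small structural interfaces (`MeterLike`, `CounterLike`, `RecorderLike`) that are assumed to mirror the shape of the `@opentelemetry/api` meter and instruments, so the block runs without the SDK installed; these interface names and the `Metrics` class are hypothetical, and the exact API surface (in particular a synchronous `createGauge`) should be verified against the installed OpenTelemetry version.

```typescript
// Label values attached to a measurement (OTel calls these "attributes").
type Labels = Record<string, string | number | boolean>;

// Minimal structural interfaces, assumed to match the instrument shapes in
// @opentelemetry/api (Counter.add, Histogram.record, Gauge.record).
interface CounterLike {
  add(value: number, labels?: Labels): void;
}
interface RecorderLike {
  record(value: number, labels?: Labels): void;
}
interface InstrumentOptions {
  description?: string;
  unit?: string;
}
interface MeterLike {
  createCounter(name: string, options?: InstrumentOptions): CounterLike;
  createHistogram(name: string, options?: InstrumentOptions): RecorderLike;
  createGauge(name: string, options?: InstrumentOptions): RecorderLike;
}

// Hypothetical facade: instruments are registered once at startup and then
// addressed by name through inc / observe / setGauge.
class Metrics {
  private readonly meter: MeterLike;
  private readonly counters = new Map<string, CounterLike>();
  private readonly histograms = new Map<string, RecorderLike>();
  private readonly gauges = new Map<string, RecorderLike>();

  constructor(meter: MeterLike) {
    this.meter = meter;
  }

  registerCounter(name: string, options?: InstrumentOptions): void {
    this.counters.set(name, this.meter.createCounter(name, options));
  }
  registerHistogram(name: string, options?: InstrumentOptions): void {
    this.histograms.set(name, this.meter.createHistogram(name, options));
  }
  registerGauge(name: string, options?: InstrumentOptions): void {
    this.gauges.set(name, this.meter.createGauge(name, options));
  }

  inc(name: string, labels: Labels = {}, by = 1): void {
    Metrics.lookup(this.counters, name).add(by, labels);
  }
  observe(name: string, value: number, labels: Labels = {}): void {
    Metrics.lookup(this.histograms, name).record(value, labels);
  }
  setGauge(name: string, value: number, labels: Labels = {}): void {
    Metrics.lookup(this.gauges, name).record(value, labels);
  }

  // Fail loudly on unregistered names so typos do not create silent gaps.
  private static lookup<T>(instruments: Map<string, T>, name: string): T {
    const instrument = instruments.get(name);
    if (instrument === undefined) {
      throw new Error(`metric not registered: ${name}`);
    }
    return instrument;
  }
}
```

Taking the meter as a constructor argument keeps the facade testable with a fake meter and leaves exporter wiring and `shutdown()` to the code that owns the MeterProvider.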

3. Instrumentation approach
   - HTTP: use `@opentelemetry/instrumentation-http` plus a manual wrapper to label proxied requests with `site_id`, `resource_id`, `backend`, etc.
   - Proxy logic: instrument where Pangolin forwards requests to backends; record counts, statuses and latencies.
   - Auth: instrument login/logout flows, failed attempts, active sessions gauge.
   - Tunnel events: instrument connect/disconnect/reconnect and throughput/latency where Pangolin has visibility.
   - Background jobs, config reloads, cert expiry checks: instrument events and counters.
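
For the proxy and auth paths in step 3, latency recording might be wrapped in a small timing helper so that durations are always reported in seconds. This is a sketch, not Pangolin's actual code: `startTimer`, `timed`, and the injectable clock are hypothetical, and `Date.now` is used as the default clock only to keep the example dependency-free (a monotonic clock is generally a better fit in production).

```typescript
// Receives a duration in seconds plus the labels for the measurement.
type DurationRecorder = (seconds: number, labels: Record<string, string>) => void;

// Clock returning milliseconds; injectable so tests can control time.
type Clock = () => number;

// Starts timing now and returns a function that records the elapsed time.
function startTimer(
  record: DurationRecorder,
  clock: Clock = Date.now,
): (labels: Record<string, string>) => number {
  const startMs = clock();
  return (labels) => {
    const seconds = (clock() - startMs) / 1000;
    record(seconds, labels);
    return seconds;
  };
}

// Times an async operation and records its duration whether it resolves
// or rejects; the original result or error is passed through unchanged.
async function timed<T>(
  record: DurationRecorder,
  labels: Record<string, string>,
  operation: () => Promise<T>,
  clock: Clock = Date.now,
): Promise<T> {
  const stop = startTimer(record, clock);
  try {
    return await operation();
  } finally {
    stop(labels);
  }
}
```

A proxy handler could then wrap the forwarding call in `timed(...)` and pass a recorder that feeds `pangolin_resource_request_duration_seconds`.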

4. Histograms & buckets
   - Configure histogram buckets per spec (duration buckets and byte-size buckets).
   - Use seconds for durations; bytes for sizes.
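
Bucket boundaries for step 4 could be generated rather than hand-written. The sketch below shows one way to do that; the concrete ranges (5 ms to about 10 s, 1 KiB to 16 MiB) are assumptions chosen for illustration, not values specified by this proposal, and how the boundaries are applied (views or instrument advice) depends on the OpenTelemetry SDK version in use.

```typescript
// Generates `count` boundaries: start, start*factor, start*factor^2, ...
function exponentialBuckets(start: number, factor: number, count: number): number[] {
  if (start <= 0 || factor <= 1 || count < 1) {
    throw new RangeError("start must be > 0, factor > 1, and count >= 1");
  }
  return Array.from({ length: count }, (_, i) => start * factor ** i);
}

// Illustrative duration boundaries in seconds: 0.005, 0.01, ..., 10.24.
const DURATION_BUCKETS_SECONDS = exponentialBuckets(0.005, 2, 12);

// Illustrative size boundaries in bytes: 1 KiB, 4 KiB, ..., 16 MiB.
const SIZE_BUCKETS_BYTES = exponentialBuckets(1024, 4, 8);
```

Keeping the boundaries in one module makes it easier to document them in the metric catalog and to keep dashboards consistent.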

5. Exporter configuration (runtime)
   - Environment variables (suggested defaults):
     - `PANGOLIN_METRICS_PROMETHEUS_ENABLED=true`
     - `PANGOLIN_METRICS_OTLP_ENABLED=false`
     - `OTEL_EXPORTER_OTLP_ENDPOINT` (when OTLP enabled)
     - `OTEL_EXPORTER_OTLP_PROTOCOL` (http/protobuf or grpc)
     - `OTEL_SERVICE_NAME=pangolin`
     - `OTEL_RESOURCE_ATTRIBUTES` (e.g., `service.instance.id=...`)
     - `OTEL_METRIC_EXPORT_INTERVAL` (ms)
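
The variables in step 5 could be read into a typed configuration object at startup. The sketch below assumes the suggested defaults from the list above plus a 60000 ms export interval (the default defined for `OTEL_METRIC_EXPORT_INTERVAL`); `MetricsConfig` and `loadMetricsConfig` are hypothetical names, and the environment is passed in explicitly so the function stays easy to test.

```typescript
type Env = Record<string, string | undefined>;

interface MetricsConfig {
  prometheusEnabled: boolean;
  otlpEnabled: boolean;
  otlpEndpoint?: string;
  otlpProtocol: "http/protobuf" | "grpc";
  serviceName: string;
  exportIntervalMs: number;
}

// Accepts common truthy spellings; unset or empty values use the fallback.
function parseBool(value: string | undefined, fallback: boolean): boolean {
  if (value === undefined || value.trim() === "") {
    return fallback;
  }
  return ["1", "true", "yes", "on"].includes(value.trim().toLowerCase());
}

function loadMetricsConfig(env: Env): MetricsConfig {
  const interval = Number(env.OTEL_METRIC_EXPORT_INTERVAL);
  return {
    prometheusEnabled: parseBool(env.PANGOLIN_METRICS_PROMETHEUS_ENABLED, true),
    otlpEnabled: parseBool(env.PANGOLIN_METRICS_OTLP_ENABLED, false),
    // Left undefined when unset so the exporter can apply its own default.
    otlpEndpoint: env.OTEL_EXPORTER_OTLP_ENDPOINT,
    otlpProtocol: env.OTEL_EXPORTER_OTLP_PROTOCOL === "grpc" ? "grpc" : "http/protobuf",
    serviceName: env.OTEL_SERVICE_NAME || "pangolin",
    // Fall back to 60000 ms when the value is missing or not a positive number.
    exportIntervalMs: Number.isFinite(interval) && interval > 0 ? interval : 60000,
  };
}
```

In the Node process this would typically be called once as `loadMetricsConfig(process.env)` before any exporter is constructed.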

6. Local testing
   - Provide `docker-compose.metrics.yml` with:
     - Pangolin
     - OpenTelemetry Collector (example config)
     - Prometheus (scraping `/metrics` or Collector)
     - Grafana (optional)
   - Validate both direct Prometheus scrape and OTLP → Collector → remote_write flows.

7. Collector example
   - Include an example `collector.yaml` demonstrating:
     - OTLP receiver
     - Transform processor to promote resource attributes (e.g., `site_id`, `resource_id`)
     - Prometheus remote_write exporter (generic endpoint)
     - Notes on name normalization and out‑of‑order ingestion if sending OTLP to Prometheus

8. Documentation
   - `observability.md`:
     - Metric catalog (name, type, labels, units, description)
     - How to enable/disable Prometheus exporter and OTLP exporter via env vars
     - How to run Docker Compose test stack
     - How to add a new metric (naming, labels, buckets)

9. Testing & validation
   - Manual test: start the Compose stack, generate traffic, curl `/metrics`, and verify metric names, units, labels, and histogram buckets.
   - Include sample `/metrics` output in the PR.
   - ...
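
The manual check in step 9 could be partly automated by parsing the `/metrics` response and comparing it with the metric catalog. The sketch below only inspects `# TYPE` lines of the classic Prometheus text format; `parseMetricTypes` and `checkExposition` are hypothetical helpers, and the exact names emitted (for example how `_total` appears) depend on the exporter and exposition format in use.

```typescript
// Collects metric names and types declared via "# TYPE <name> <type>" lines.
function parseMetricTypes(exposition: string): Map<string, string> {
  const types = new Map<string, string>();
  for (const line of exposition.split("\n")) {
    const match = /^# TYPE (\S+) (\S+)\s*$/.exec(line);
    if (match) {
      types.set(match[1], match[2]);
    }
  }
  return types;
}

// Compares the exposition against expected name -> type pairs and returns
// a human-readable list of problems (empty when everything matches).
function checkExposition(exposition: string, expected: Record<string, string>): string[] {
  const actual = parseMetricTypes(exposition);
  const problems: string[] = [];
  for (const [name, type] of Object.entries(expected)) {
    const found = actual.get(name);
    if (found === undefined) {
      problems.push(`missing metric: ${name}`);
    } else if (found !== type) {
      problems.push(`${name}: expected ${type}, found ${found}`);
    }
  }
  return problems;
}
```

Fed with the body returned by `curl` against `/metrics`, a non-empty result would point at metrics that are missing or registered with the wrong type.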

---

## References & Best Practices

- [Traefik – Metrics (observability)](https://doc.traefik.io/traefik/reference/install-configuration/observability/metrics/) – Traefik metrics configuration and exporter options (Prometheus, OpenTelemetry).  
- [OpenTelemetry – JavaScript/TypeScript: Getting Started / Instrumentation Guide](https://opentelemetry.io/docs/languages/js/) – How to instrument JavaScript/TypeScript/Node.js applications with OpenTelemetry.  
- [OpenTelemetry – JavaScript/TypeScript: Exporters](https://opentelemetry.io/docs/languages/js/exporters/) – Exporter options for Node.js and browser (OTLP, Prometheus, etc.).

**Practical walkthroughs & blog posts**  

- [OpenTelemetry Blog – Prometheus + OpenTelemetry (2024)](https://opentelemetry.io/blog/2024/prom-and-otel/) – Practical notes on combining Prometheus and OpenTelemetry.  
- [Grafana Blog – A Practical Guide to Data Collection with OpenTelemetry and Prometheus (Jul 2023)](https://grafana.com/blog/2023/07/20/a-practical-guide-to-data-collection-with-opentelemetry-and-prometheus/) – Hands‑on examples and best practices for OTEL + Prometheus.
- [BetterStack – OpenTelemetry for Node.js](https://betterstack.com/community/guides/observability/opentelemetry-metrics-nodejs/) – Practical guide for instrumenting Node.js apps with OpenTelemetry.  
- [BetterStack – OpenTelemetry Metrics vs Prometheus Metrics](https://betterstack.com/community/guides/observability/opentelemetry-metrics-vs-prometheus-metrics/) – Comparison and guidance on when to use OTEL vs Prometheus metrics.
