Epic 3.5: Observability & Operations #190


Status: 🚧 IN PROGRESS (Dashboards Deployed, Logging Pending)

Overview

Implement comprehensive observability for M3W production deployment using Grafana Cloud (free tier) as the centralized monitoring platform, with a clear migration path to self-hosted when needed.


Architecture Decision (2025-12-21)

Why Grafana Cloud over Self-hosted?

| Consideration | Self-hosted | Grafana Cloud |
| --- | --- | --- |
| HA when node crashes | ❌ Monitoring dies with app | ✅ External, always available |
| Cost | ~$30/mo for dedicated VM | $0 (free tier) |
| Maintenance | Need to manage stack | Zero ops |
| Resource usage | 550m CPU, 960Mi RAM | ~40m CPU, ~70Mi RAM (Alloy only) |

Decision: Start with Grafana Cloud Free (50GB logs/mo, 10k metrics series, 14-day retention)

Why no Sentry?

Sentry's error aggregation features can be replicated with:

  • Structured JSON logs with traceId (see the sketch after this list)
  • Loki queries: `{app="m3w-backend"} | json | level="error"`
  • Grafana alerting on error rate
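
As a rough sketch of the first bullet (not the actual M3W backend code), a Pino logger that stamps every line with an `app` label and a per-request `traceId` could look like this; the helper name, header, and route path are illustrative assumptions:

```ts
// Hypothetical sketch: structured JSON logs with a traceId, using Pino on Node.
import pino from "pino";
import { randomUUID } from "node:crypto";

const logger = pino({
  level: "info",
  base: { app: "m3w-backend" },
  // Emit the level as a string ("error") so the Loki filter level="error" matches.
  formatters: { level: (label) => ({ level: label }) },
});

// One child logger per request; every line it emits carries the same traceId.
function requestLogger(incomingTraceId?: string) {
  return logger.child({ traceId: incomingTraceId ?? randomUUID() });
}

const log = requestLogger();
log.info({ path: "/api/example" }, "request handled");
// Failures land in Loki and are grouped by the error query above,
// which is what replaces Sentry-style aggregation.
log.error({ err: new Error("boom") }, "request failed");
```

Each line is a single JSON object on stdout, so Alloy can ship it to Loki unchanged.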

Current Architecture (Grafana Cloud)

```
┌─────────────────────────────────────────────────────────────────────────┐
│                      Grafana Cloud (SaaS, Free)                         │
│  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐             │
│  │    Grafana     │  │     Loki       │  │  Prometheus    │             │
│  │   (Dashboard)  │  │    (Logs)      │  │   (Mimir)      │             │
│  └────────────────┘  └────────────────┘  └────────────────┘             │
└─────────────────────────────────────────────────────────────────────────┘
                                ▲ HTTPS push
┌───────────────────────────────┴─────────────────────────────────────────┐
│  K3s / Gateway VMs                                                      │
│  ┌────────────────────────┐                                             │
│  │ Grafana Alloy          │  ← Unified agent for logs + metrics         │
│  │ ├─ logs → Loki         │                                             │
│  │ └─ metrics → Mimir     │                                             │
│  └────────────────────────┘                                             │
└─────────────────────────────────────────────────────────────────────────┘
```

Log Sources

| Source | Type | Format | Collection |
| --- | --- | --- | --- |
| Backend (Pino) | Business logs | JSON | Pod stdout → Alloy |
| Frontend | Client errors | JSON | /api/logs → Backend → Alloy (see sketch below) |
| Gateway (OpenResty) | Access logs | JSON | Docker logs → Alloy |
| PostgreSQL | DB logs | Text | Pod stderr → Alloy |
| MinIO | Storage logs | Text | Pod stdout → Alloy |
| System (journald) | K3s, WireGuard | Text | journald → Alloy |
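
The Frontend row is the only indirect path: the browser cannot reach Alloy, so client errors are POSTed to the backend, which re-logs them to stdout in the same JSON shape. A minimal sketch, assuming an Express backend; the payload fields and the `m3w-frontend` label are illustrative, not the real API:

```ts
// Hypothetical sketch of the /api/logs relay for frontend client errors.
import express from "express";
import pino from "pino";

const logger = pino({
  base: { app: "m3w-frontend" }, // keep client logs distinguishable from backend logs
  formatters: { level: (label) => ({ level: label }) },
});

const app = express();
app.use(express.json({ limit: "16kb" })); // cap payload size from untrusted clients

app.post("/api/logs", (req, res) => {
  const { level = "error", message = "", traceId, stack } = req.body ?? {};
  // Whitelist the level so clients cannot spoof arbitrary severities.
  const write = level === "warn" ? logger.warn.bind(logger) : logger.error.bind(logger);
  write({ traceId, stack, userAgent: req.headers["user-agent"] }, String(message));
  res.status(204).end();
});
```

Capping the body size and whitelisting the level keeps the endpoint from turning into a log-injection vector.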

Sub-Issues

Infrastructure (m3w-k8s)

Application (m3w)


Deployed Dashboards & Alerts (2025-12-28)

Dashboards

| Dashboard | URL | Data Source | Status |
| --- | --- | --- | --- |
| System Overview | m3w-system-overview | Prometheus | ✅ Working |
| Application | m3w-application | Prometheus + Loki | ⚠️ Partial |
| Log Explorer | m3w-log-explorer | Loki | ⏳ Pending logs |

Alert Rules

| Alert | Severity | Condition | Status |
| --- | --- | --- | --- |
| Node Down | Critical | No metrics for 2 min | ✅ Active |
| Disk Usage Critical | Critical | >90% | ✅ Active |
| Memory Usage Critical | Critical | >90% | ✅ Active |
| High CPU | Warning | >80% for 10 min | ✅ Active |
| High Memory | Warning | >80% for 10 min | ✅ Active |
| Disk Usage Warning | Warning | >80% | ✅ Active |

Resource Budget (Current: Grafana Cloud)

| Component | Per Node | Total (4 nodes) |
| --- | --- | --- |
| Alloy | 30m CPU, 50Mi RAM | 120m CPU, 200Mi RAM |

Grafana Cloud Free Tier Limits

| Resource | Limit | Our Usage (est.) |
| --- | --- | --- |
| Metrics series | 10,000 | ~400 |
| Logs | 50 GB/month | ~30 GB |
| Traces | 50 GB/month | 0 (future) |
| Retention | 14 days | Sufficient (can export via LogCLI) |
| Users | 3 | 1 |
| Alert rules | 100 | 6 |

Success Criteria

Phase 1 (Grafana Cloud) - ✅ MOSTLY COMPLETE

Phase 2 (Future: Distributed Tracing) - 🔮 LONG TERM

When to trigger: Need to debug slow requests or complex call chains

  • Backend OpenTelemetry integration (auto-instrumentation; see the sketch after this list)
  • Alloy OTLP receiver configuration
  • Traces → Grafana Tempo
  • Log ↔ Trace correlation (click traceId to see full trace)
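
For reference, the backend half of this phase would likely be a single bootstrap file. The sketch below assumes the standard OpenTelemetry Node packages; the Alloy OTLP endpoint is a placeholder, since no receiver is configured yet:

```ts
// tracing.ts — hypothetical Phase 2 bootstrap, imported before the app's entry point.
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

const sdk = new NodeSDK({
  serviceName: "m3w-backend",
  // Placeholder: Alloy's OTLP/HTTP receiver would listen on 4318 and forward to Tempo.
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTLP_ENDPOINT ?? "http://alloy.monitoring.svc:4318/v1/traces",
  }),
  // Auto-instruments HTTP, Express, pg, etc., so request handlers need no changes.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush remaining spans on shutdown so the last requests are not lost.
process.on("SIGTERM", () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```

With this in place, log ↔ trace correlation is mostly a matter of writing the active span's trace ID into the Pino `traceId` field, so the existing Loki queries keep working.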

Why not now?

  • Current traceId in logs satisfies 90% of debugging needs
  • Low traffic, no complex performance bottlenecks yet
  • OTel adds ~15MB dependencies, +5% latency overhead
  • Will implement when slow-request debugging becomes difficult or microservices are introduced

Phase 3 (Future: Self-Hosted Migration Ready)

  • Alloy config uses environment variables for endpoints
  • Documented migration runbook
  • Terraform/Ansible ready for monitoring VM

Long-term Roadmap

| Phase | Trigger | Direction |
| --- | --- | --- |
| ✅ Current | - | Grafana Cloud free tier + traceId logs |
| 🔮 Phase 2 | Slow request debugging needed | OpenTelemetry → Tempo (distributed tracing) |
| 🔮 Phase 3 | Exceed free quota (50 GB logs/mo) | Upgrade to paid tier OR self-host Loki |
| 🔮 Phase 4 | Multi-team collaboration | On-call rotation, PagerDuty integration |

Documentation
