Epic 3.5: Observability & Operations #190


Status: 🚧 IN PROGRESS (Dashboards Deployed, Logging Pending)

Overview

Implement comprehensive observability for M3W production deployment using Grafana Cloud (free tier) as the centralized monitoring platform, with a clear migration path to self-hosted when needed.


Architecture Decision (2025-12-21)

Why Grafana Cloud over Self-hosted?

| Consideration | Self-hosted | Grafana Cloud |
| --- | --- | --- |
| HA when node crashes | ❌ Monitoring dies with app | ✅ External, always available |
| Cost | ~$30/mo for dedicated VM | $0 (free tier) |
| Maintenance | Need to manage stack | Zero ops |
| Resource usage | 550m CPU, 960Mi RAM | ~40m CPU, ~70Mi RAM (Alloy only) |

Decision: Start with Grafana Cloud Free (50GB logs/mo, 10k metrics series, 14-day retention)

Why no Sentry?

Sentry's error aggregation features can be replicated with:

  • Structured JSON logs with traceId (see the sketch after this list)
  • Loki queries: `{app="m3w-backend"} | json | level="error"`
  • Grafana alerting on error rate
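
As a rough sketch of the first bullet (not the actual M3W backend code), a Pino logger that stamps every line with an `app` label and a per-request `traceId` could look like this; the helper name, header, and route path are illustrative assumptions:

```ts
// Hypothetical sketch: structured JSON logs with a traceId, using Pino on Node.
import pino from "pino";
import { randomUUID } from "node:crypto";

const logger = pino({
  level: "info",
  base: { app: "m3w-backend" },
  // Emit the level as a string ("error") so the Loki filter level="error" matches.
  formatters: { level: (label) => ({ level: label }) },
});

// One child logger per request; every line it emits carries the same traceId.
function requestLogger(incomingTraceId?: string) {
  return logger.child({ traceId: incomingTraceId ?? randomUUID() });
}

const log = requestLogger();
log.info({ path: "/api/example" }, "request handled");
// Failures land in Loki and are grouped by the error query above,
// which is what replaces Sentry-style aggregation.
log.error({ err: new Error("boom") }, "request failed");
```

Each line is a single JSON object on stdout, so Alloy can ship it to Loki unchanged.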

Current Architecture (Grafana Cloud)

```
┌─────────────────────────────────────────────────────────────────────────┐
│                      Grafana Cloud (SaaS, Free)                         │
│  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐             │
│  │    Grafana     │  │     Loki       │  │  Prometheus    │             │
│  │   (Dashboard)  │  │    (Logs)      │  │   (Mimir)      │             │
│  └────────────────┘  └────────────────┘  └────────────────┘             │
└─────────────────────────────────────────────────────────────────────────┘
                                ▲ HTTPS push
┌───────────────────────────────┴─────────────────────────────────────────┐
│  K3s / Gateway VMs                                                      │
│  ┌────────────────────────┐                                             │
│  │ Grafana Alloy          │  ← Unified agent for logs + metrics         │
│  │ ├─ logs → Loki         │                                             │
│  │ └─ metrics → Mimir     │                                             │
│  └────────────────────────┘                                             │
└─────────────────────────────────────────────────────────────────────────┘
```

Log Sources

| Source | Type | Format | Collection |
| --- | --- | --- | --- |
| Backend (Pino) | Business logs | JSON | Pod stdout → Alloy |
| Frontend | Client errors | JSON | /api/logs → Backend → Alloy (see sketch below) |
| Gateway (OpenResty) | Access logs | JSON | Docker logs → Alloy |
| PostgreSQL | DB logs | Text | Pod stderr → Alloy |
| MinIO | Storage logs | Text | Pod stdout → Alloy |
| System (journald) | K3s, WireGuard | Text | journald → Alloy |
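
The Frontend row is the only indirect path: the browser cannot reach Alloy, so client errors are POSTed to the backend, which re-logs them to stdout in the same JSON shape. A minimal sketch, assuming an Express backend; the payload fields and the `m3w-frontend` label are illustrative, not the real API:

```ts
// Hypothetical sketch of the /api/logs relay for frontend client errors.
import express from "express";
import pino from "pino";

const logger = pino({
  base: { app: "m3w-frontend" }, // keep client logs distinguishable from backend logs
  formatters: { level: (label) => ({ level: label }) },
});

const app = express();
app.use(express.json({ limit: "16kb" })); // cap payload size from untrusted clients

app.post("/api/logs", (req, res) => {
  const { level = "error", message = "", traceId, stack } = req.body ?? {};
  // Whitelist the level so clients cannot spoof arbitrary severities.
  const write = level === "warn" ? logger.warn.bind(logger) : logger.error.bind(logger);
  write({ traceId, stack, userAgent: req.headers["user-agent"] }, String(message));
  res.status(204).end();
});
```

Capping the body size and whitelisting the level keeps the endpoint from turning into a log-injection vector.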

Sub-Issues

Infrastructure (m3w-k8s)

Application (m3w)


Deployed Dashboards & Alerts (2025-12-28)

Dashboards

| Dashboard | URL | Data Source | Status |
| --- | --- | --- | --- |
| System Overview | m3w-system-overview | Prometheus | ✅ Working |
| Application | m3w-application | Prometheus + Loki | ⚠️ Partial |
| Log Explorer | m3w-log-explorer | Loki | ⏳ Pending logs |

Alert Rules

| Alert | Severity | Condition | Status |
| --- | --- | --- | --- |
| Node Down | Critical | No metrics for 2 min | ✅ Active |
| Disk Usage Critical | Critical | >90% | ✅ Active |
| Memory Usage Critical | Critical | >90% | ✅ Active |
| High CPU | Warning | >80% for 10 min | ✅ Active |
| High Memory | Warning | >80% for 10 min | ✅ Active |
| Disk Usage Warning | Warning | >80% | ✅ Active |

Resource Budget (Current: Grafana Cloud)

| Component | Per Node | Total (4 nodes) |
| --- | --- | --- |
| Alloy | 30m CPU, 50Mi RAM | 120m CPU, 200Mi RAM |

Grafana Cloud Free Tier Limits

| Resource | Limit | Our Usage (est.) |
| --- | --- | --- |
| Metrics series | 10,000 | ~400 |
| Logs | 50 GB/month | ~30 GB |
| Traces | 50 GB/month | 0 (future) |
| Retention | 14 days | Sufficient (can export via LogCLI) |
| Users | 3 | 1 |
| Alert rules | 100 | 6 |

Success Criteria

Phase 1 (Grafana Cloud) - ✅ MOSTLY COMPLETE

Phase 2 (Future: Distributed Tracing) - 🔮 LONG TERM

When to trigger: Need to debug slow requests or complex call chains

  • Backend OpenTelemetry integration (auto-instrumentation; see the sketch after this list)
  • Alloy OTLP receiver configuration
  • Traces → Grafana Tempo
  • Log ↔ Trace correlation (click traceId to see full trace)
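
For reference, the backend half of this phase would likely be a single bootstrap file. The sketch below assumes the standard OpenTelemetry Node packages; the Alloy OTLP endpoint is a placeholder, since no receiver is configured yet:

```ts
// tracing.ts — hypothetical Phase 2 bootstrap, imported before the app's entry point.
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

const sdk = new NodeSDK({
  serviceName: "m3w-backend",
  // Placeholder: Alloy's OTLP/HTTP receiver would listen on 4318 and forward to Tempo.
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTLP_ENDPOINT ?? "http://alloy.monitoring.svc:4318/v1/traces",
  }),
  // Auto-instruments HTTP, Express, pg, etc., so request handlers need no changes.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush remaining spans on shutdown so the last requests are not lost.
process.on("SIGTERM", () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```

With this in place, log ↔ trace correlation is mostly a matter of writing the active span's trace ID into the Pino `traceId` field, so the existing Loki queries keep working.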

Why not now?

  • Current traceId in logs satisfies 90% of debugging needs
  • Low traffic, no complex performance bottlenecks yet
  • OTel adds ~15MB dependencies, +5% latency overhead
  • Will implement when slow-request debugging becomes difficult or microservices are introduced

Phase 3 (Future: Self-Hosted Migration Ready)

  • Alloy config uses environment variables for endpoints
  • Documented migration runbook
  • Terraform/Ansible ready for monitoring VM

Long-term Roadmap

| Phase | Trigger | Direction |
| --- | --- | --- |
| ✅ Current | - | Grafana Cloud free tier + traceId logs |
| 🔮 Phase 2 | Slow request debugging needed | OpenTelemetry → Tempo (distributed tracing) |
| 🔮 Phase 3 | Exceed free quota (50 GB logs/mo) | Upgrade to paid tier OR self-host Loki |
| 🔮 Phase 4 | Multi-team collaboration | On-call rotation, PagerDuty integration |

Documentation
