Epic 3.5: Observability & Operations
Status: 🚧 IN PROGRESS (Dashboards Deployed, Logging Pending)
Overview
Implement comprehensive observability for M3W production deployment using Grafana Cloud (free tier) as the centralized monitoring platform, with a clear migration path to self-hosted when needed.
Architecture Decision (2025-12-21)
Why Grafana Cloud over Self-hosted?
| Consideration | Self-hosted | Grafana Cloud |
|---|---|---|
| HA when node crashes | ❌ Monitoring dies with app | ✅ External, always available |
| Cost | ~$30/mo for dedicated VM | $0 (free tier) |
| Maintenance | Need to manage stack | Zero ops |
| Resource usage | 550m CPU, 960Mi RAM | ~40m CPU, ~70Mi (Alloy only) |
Decision: Start with Grafana Cloud Free (50GB logs/mo, 10k metrics series, 14-day retention)
Why no Sentry?
Sentry's error aggregation features can be replicated with:
- Structured JSON logs with traceId
- Loki queries: `{app="m3w-backend"} | json | level="error"`
- Grafana alerting on error rate
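For the query above to work, each backend log line must be a single JSON object with a string `level` and a `traceId`. A minimal TypeScript sketch of that shape (field names follow Pino's defaults; note Pino emits numeric levels unless configured with a level-label formatter, which this sketch assumes):

```typescript
// Sketch of a structured JSON log line as a Pino-style backend might emit it.
// String `level` values assume Pino is configured with a level-label formatter.
interface LogEntry {
  level: "info" | "warn" | "error";
  time: number;       // epoch millis, Pino's default timestamp
  traceId: string;    // request-scoped id for cross-service correlation
  msg: string;
}

function logLine(level: LogEntry["level"], traceId: string, msg: string): string {
  const entry: LogEntry = { level, time: Date.now(), traceId, msg };
  // One JSON object per line, so Loki's `| json` parser can extract fields
  return JSON.stringify(entry);
}

// An error line that `{app="m3w-backend"} | json | level="error"` would match
console.log(logLine("error", "req-123", "payment failed"));
```

This one-object-per-line discipline is what lets Loki replace Sentry's grouping: filtering and rate queries operate on the parsed fields instead of an error-tracking SDK.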
Current Architecture (Grafana Cloud)
┌─────────────────────────────────────────────────────────────────────────┐
│ Grafana Cloud (SaaS, Free) │
│ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ │
│ │ Grafana │ │ Loki │ │ Prometheus │ │
│ │ (Dashboard) │ │ (Logs) │ │ (Mimir) │ │
│ └────────────────┘ └────────────────┘ └────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
▲ HTTPS push
┌───────────────────────────────┴─────────────────────────────────────────┐
│ K3s / Gateway VMs │
│ ┌────────────────────────┐ │
│ │ Grafana Alloy │ ← Unified agent for logs + metrics │
│ │ ├─ logs → Loki │ │
│ │ └─ metrics → Mimir │ │
│ └────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
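The HTTPS push path in the diagram corresponds roughly to two Alloy write components. A sketch (URLs and credentials are placeholders, not the real stack values):

```
// Alloy write pipeline to Grafana Cloud; endpoints below are placeholders.
loki.write "grafana_cloud" {
  endpoint {
    url = "https://logs-prod-XXX.grafana.net/loki/api/v1/push"
    basic_auth {
      username = "<tenant-id>"
      password = "<api-token>"
    }
  }
}

prometheus.remote_write "grafana_cloud" {
  endpoint {
    url = "https://prometheus-prod-XXX.grafana.net/api/prom/push"
    basic_auth {
      username = "<tenant-id>"
      password = "<api-token>"
    }
  }
}
```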
Log Sources
| Source | Type | Format | Collection |
|---|---|---|---|
| Backend (Pino) | Business logs | JSON | Pod stdout → Alloy |
| Frontend | Client errors | JSON | /api/logs → Backend → Alloy |
| Gateway (OpenResty) | Access logs | JSON | Docker logs → Alloy |
| PostgreSQL | DB logs | Text | Pod stderr → Alloy |
| MinIO | Storage logs | Text | Pod stdout → Alloy |
| System (journald) | K3s, WireGuard | Text | journald → Alloy |
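For the frontend row, client errors are relayed through the backend's `/api/logs` endpoint. A hedged TypeScript sketch of the payload a logger-client might POST (the endpoint name comes from the table above; the field names and helper are illustrative assumptions, not the actual m3w schema):

```typescript
// Illustrative shape for a client error report relayed via /api/logs.
// Field names here are assumptions for the sketch, not the real schema.
interface ClientErrorPayload {
  level: "error";
  msg: string;
  stack?: string;
  url: string;        // page where the error occurred
  userAgent: string;
  ts: string;         // ISO 8601 timestamp
}

function buildClientErrorPayload(
  err: Error,
  url: string,
  userAgent: string,
): ClientErrorPayload {
  return {
    level: "error",
    msg: err.message,
    stack: err.stack,
    url,
    userAgent,
    ts: new Date().toISOString(),
  };
}

// In the browser this would be sent fire-and-forget, e.g.:
//   navigator.sendBeacon("/api/logs", JSON.stringify(payload));
```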
Sub-Issues
Infrastructure (m3w-k8s)
- Set up Grafana Cloud + deploy Alloy agents #251 ✅ COMPLETED
- OpenResty JSON access log format #252 ✅ COMPLETED
- Grafana dashboards and alerting rules #248 ✅ COMPLETED (via Terraform)
- Set up Prometheus + Grafana monitoring stack #191 → Closed, replaced by #251
- Set up Loki log aggregation #192 → Closed, replaced by #251
Application (m3w)
- Backend structured logging with traceId #249 ✅ PR #267 (ready, blocked by bugfixes)
- Frontend error logging SDK #250 ✅ PR #265 (ready, blocked by bugfixes)
- refactor(frontend): remove logger-client dependency on authStore #269
- fix: Backend HTTP logger outputs text instead of JSON in production #274
- Integrate Sentry error tracking #193 → Closed, not needed (use Loki)
- Add metrics endpoint and structured logging to backend #200 → Closed, deferred (Alloy collects node metrics)
Deployed Dashboards & Alerts (2025-12-28)
Dashboards
| Dashboard | URL | Data Source | Status |
|---|---|---|---|
| System Overview | m3w-system-overview | Prometheus | ✅ Working |
| Application | m3w-application | Prometheus + Loki | |
| Log Explorer | m3w-log-explorer | Loki | ⏳ Pending logs |
Alert Rules
| Alert | Severity | Condition | Status |
|---|---|---|---|
| Node Down | Critical | No metrics for 2 min | ✅ Active |
| Disk Usage Critical | Critical | >90% | ✅ Active |
| Memory Usage Critical | Critical | >90% | ✅ Active |
| High CPU | Warning | >80% for 10 min | ✅ Active |
| High Memory | Warning | >80% for 10 min | ✅ Active |
| Disk Usage Warning | Warning | >80% | ✅ Active |
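As a sketch of how these conditions map to queries (metric names are standard node_exporter series; the deployed rules live in Terraform and may differ in detail):

```
# Disk Usage Critical: root filesystem above 90% (illustrative PromQL)
(1 - node_filesystem_avail_bytes{mountpoint="/"}
   / node_filesystem_size_bytes{mountpoint="/"}) > 0.9

# Node Down: instance not scraped, evaluated with `for: 2m`
up == 0
```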
Resource Budget (Current: Grafana Cloud)
| Component | Per Node | Total (4 nodes) |
|---|---|---|
| Alloy | 30m CPU, 50Mi RAM | 120m CPU, 200Mi RAM |
Grafana Cloud Free Tier Limits
| Resource | Limit | Our Usage (est.) |
|---|---|---|
| Metrics series | 10,000 | ~400 |
| Logs | 50 GB/month | ~30 GB |
| Traces | 50 GB/month | 0 (future) |
| Retention | 14 days | Sufficient (can export via LogCLI) |
| Users | 3 | 1 |
| Alert rules | 100 | 6 |
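Since retention is 14 days, anything worth keeping longer needs a periodic export. A hedged example with LogCLI (the label selector and exact flags should be checked against the installed version):

```shell
# Export the last 24h of backend logs as JSON lines (selector is illustrative)
logcli query '{app="m3w-backend"}' --since=24h --limit=100000 -o jsonl \
  > backend-$(date +%F).jsonl
```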
Success Criteria
Phase 1 (Grafana Cloud) - ✅ MOSTLY COMPLETE
- All 4 nodes sending logs to Grafana Cloud
- All 4 nodes sending metrics to Grafana Cloud
- Gateway access logs in Loki with request tracing
- Dashboard showing system health at a glance
- Email alerts for critical errors and downtime
- Frontend errors visible in Loki (blocked by PR #265)
- Backend structured logs (blocked by PR #267)
Phase 2 (Future: Distributed Tracing) - 🔮 LONG TERM
When to trigger: Need to debug slow requests or complex call chains
- Backend OpenTelemetry integration (auto-instrumentation)
- Alloy OTLP receiver configuration
- Traces → Grafana Tempo
- Log ↔ Trace correlation (click traceId to see full trace)
Why not now?
- Current traceId in logs satisfies 90% of debugging needs
- Low traffic, no complex performance bottlenecks yet
- OTel adds ~15MB dependencies, +5% latency overhead
- Will implement when: slow request debugging is difficult, or microservices are introduced
Phase 3 (Future: Self-Hosted Migration Ready)
- Alloy config uses environment variables for endpoints
- Documented migration runbook
- Terraform/Ansible ready for monitoring VM
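For the first bullet, Alloy's config language can resolve endpoints from the environment, which is what makes a self-hosted cutover a redeploy rather than a config rewrite. A sketch (`LOKI_WRITE_URL` is an assumed variable name; newer Alloy releases spell the function `sys.env`):

```
// Endpoint read from an environment variable so the same config can point
// at Grafana Cloud today or a self-hosted Loki later.
loki.write "default" {
  endpoint {
    url = env("LOKI_WRITE_URL")
  }
}
```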
Long-term Roadmap
| Phase | Trigger | Direction |
|---|---|---|
| ✅ Current | - | Grafana Cloud free tier + traceId logs |
| 🔮 Phase 2 | Slow request debugging needed | OpenTelemetry → Tempo (distributed tracing) |
| 🔮 Phase 3 | Exceed free quota (50GB logs/mo) | Upgrade to paid tier OR self-host Loki |
| 🔮 Phase 4 | Multi-team collaboration | On-call rotation, PagerDuty integration |
Documentation
- observability-design.md - Architecture decisions & interview talking points
- observability-guide.md - Detailed config & query examples