Summary
Add alert state analytics capabilities to Alertmanager to track state transitions of alerts over time. This will provide visibility into how alerts move between unprocessed, active, and suppressed states, including tracking which alerts are inhibited and by which other alerts.
Table of Contents
GH issues don't support ToC 😔
Motivation
The primary motivation for alert state analytics comes from the need to validate proposed enhancements to Alertmanager's clustering behavior, specifically #4315 which proposes making inhibitions part of the gossip protocol.
Currently, when investigating issues like:
- Failed inhibitions during instance restarts (#4064)
- Ready endpoint reporting ready before gossip settles (#3026)
- Duplicate alert notifications in clustered deployments
...we lack the data to:
- Quantify the impact - How often do inhibition failures occur in production?
- Validate solutions - Would making inhibitions part of gossip actually solve the problem?
- Measure improvements - Can we prove that a change reduced the frequency of issues?
- Debug production issues - What state transitions led to unexpected behavior?
Without analytics, we're making architectural decisions based on theory rather than data.
Real-World Impact
At Cloudflare, we use inhibitions heavily in our alerting infrastructure. We've had numerous reports from users who were notified about an alert that should have been inhibited. Without analytical data, it's extremely difficult to:
- Determine if this was actually an inhibition failure or a misconfiguration
- Identify patterns in when inhibition failures occur
- Correlate failures with specific cluster events or topology changes
- Provide evidence-based answers to users about what happened
Having this analytical data would allow us to accurately debug whether inhibition failures are occurring and why, distinguishing between misconfigurations and actual bugs in the system.
Goals
- Track all alert state transitions including:
  - unprocessed → active
  - active → suppressed (by silence or inhibition)
  - suppressed → active
  - Any state → resolved (when alert's EndsAt timestamp is in the past)
  - resolved → deleted (when garbage collector removes it and marker.Delete() is called)
  - State changes during cluster topology changes
- Capture suppression relationships:
  - Which alerts were suppressed (by silence or inhibition)
  - What caused the suppression (silence ID or inhibiting alert fingerprint)
  - When the suppression was established and released
- Provide an interface to expose the data:
  - For database integration: REST API endpoints for querying state history
  - For event-based systems: Publish events to external message bus/queue (e.g., Kafka, Redis)
  - Enable retrieval of state history for specific alerts and time ranges
- Minimize performance impact:
  - Asynchronous writes to not block alert processing
  - Efficient storage to handle high-cardinality alert environments
  - Optional feature (can be disabled if not needed)
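The transitions listed in the goals can be written down as a small lookup table. The following is only a sketch: the type, constant, and function names are illustrative and are not taken from Alertmanager, and excluding resolved and deleted as sources of "any state → resolved" is an assumption.

```go
package main

import "fmt"

// AlertState names the states from the goals; the identifiers are hypothetical.
type AlertState string

const (
	Unprocessed AlertState = "unprocessed"
	Active      AlertState = "active"
	Suppressed  AlertState = "suppressed"
	Resolved    AlertState = "resolved"
	Deleted     AlertState = "deleted"
)

// Transition is one recorded state change, including the suppression
// relationship when the new state is Suppressed.
type Transition struct {
	Fingerprint  string
	From, To     AlertState
	SuppressedBy []string // silence IDs or inhibiting alert fingerprints
	Reason       string   // "silence" or "inhibition"
}

// tracked lists the explicit transitions from the goals.
// "Any state → resolved" is handled separately in IsTracked.
var tracked = map[AlertState][]AlertState{
	Unprocessed: {Active},
	Active:      {Suppressed},
	Suppressed:  {Active},
	Resolved:    {Deleted},
}

// IsTracked reports whether a from→to change is one the goals enumerate.
func IsTracked(from, to AlertState) bool {
	if to == Resolved {
		// Assumption: an alert cannot re-resolve or resolve after deletion.
		return from != Resolved && from != Deleted
	}
	for _, next := range tracked[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	t := Transition{
		Fingerprint:  "abc123",
		From:         Active,
		To:           Suppressed,
		SuppressedBy: []string{"def456"},
		Reason:       "inhibition",
	}
	fmt.Println(t.Fingerprint, t.From, "->", t.To, IsTracked(t.From, t.To))
}
```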
Non-Goals
- Real-time alerting or dashboarding (analytics is for post-hoc analysis)
- Long-term storage (retention should be configurable and limited)
- Complex query DSL (simple API endpoints are sufficient)
- Replication across cluster members - each instance operates independently; consumers are responsible for merging/aggregating data from multiple instances if needed
Proposed Solutions
Option 1: Direct Database Integration with State-Aware Marker
Architecture:
- Wrap the existing MemMarker with a StateAwareMarker that records state changes to a database
- Use an embedded analytical database (e.g., DuckDB or SQLite)
- Employ high-performance bulk insert APIs for minimal overhead
- Add REST API endpoints to query the analytics data
Key Components:
1. State-Aware Marker
// StateAppender records alert state changes to storage
type StateAppender interface {
    Append(fingerprint model.Fingerprint, state AlertState)
    AppendSuppressed(fingerprint model.Fingerprint, state AlertState, suppressedBy []string)
    Flush() error
    Close() error
}

// StateAwareMarker decorates the existing marker with state tracking
type StateAwareMarker interface {
    AlertMarker
    GroupMarker
    Flush() error
}
- Decorates the existing MemMarker implementation
- Intercepts calls to SetActiveOrSilenced(), SetInhibited(), and Delete()
- Appends state changes to the database asynchronously via StateAppender
- Maintains backward compatibility with existing code
- Tracks both when alerts become resolved (EndsAt in past) and when they are deleted from the marker (GC cleanup)
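One way the decorator could look is sketched below. It is not the proposed implementation: the Marker interface is a simplified stand-in whose signatures differ from Alertmanager's real ones, the in-memory slice stands in for the database, and Drain is a hypothetical helper. The sketch also ignores silences when deciding the recorded state.

```go
package main

import (
	"fmt"
	"sync"
)

// Marker is a simplified, hypothetical subset of the marker methods
// the proposal intercepts.
type Marker interface {
	SetInhibited(fp string, inhibitedBy ...string)
	Delete(fp string)
}

// noopMarker plays the role of the wrapped MemMarker.
type noopMarker struct{}

func (noopMarker) SetInhibited(string, ...string) {}
func (noopMarker) Delete(string)                  {}

// record is one state change queued for storage.
type record struct {
	Fingerprint  string
	State        string
	SuppressedBy []string
}

// StateAwareMarker forwards each call to the wrapped marker, then queues a
// record on a buffered channel so the caller never waits on storage.
type StateAwareMarker struct {
	Marker
	queue  chan record
	wg     sync.WaitGroup
	stored []record // stands in for the database
}

func NewStateAwareMarker(inner Marker, buf int) *StateAwareMarker {
	m := &StateAwareMarker{Marker: inner, queue: make(chan record, buf)}
	m.wg.Add(1)
	go func() {
		defer m.wg.Done()
		for r := range m.queue {
			m.stored = append(m.stored, r)
		}
	}()
	return m
}

func (m *StateAwareMarker) SetInhibited(fp string, by ...string) {
	m.Marker.SetInhibited(fp, by...)
	state := "suppressed"
	if len(by) == 0 {
		state = "active" // simplification: silences are not considered here
	}
	m.enqueue(record{Fingerprint: fp, State: state, SuppressedBy: by})
}

func (m *StateAwareMarker) Delete(fp string) {
	m.Marker.Delete(fp)
	m.enqueue(record{Fingerprint: fp, State: "deleted"})
}

// enqueue drops the record when the buffer is full instead of blocking.
func (m *StateAwareMarker) enqueue(r record) {
	select {
	case m.queue <- r:
	default:
	}
}

// Drain stops the writer and returns what was stored.
// The marker must not be used after Drain.
func (m *StateAwareMarker) Drain() []record {
	close(m.queue)
	m.wg.Wait()
	return m.stored
}

func main() {
	m := NewStateAwareMarker(noopMarker{}, 16)
	m.SetInhibited("abc123", "def456")
	m.Delete("abc123")
	for _, r := range m.Drain() {
		fmt.Println(r.Fingerprint, r.State, r.SuppressedBy)
	}
}
```

Dropping on a full buffer keeps alert processing unblocked at the cost of gaps in the history; a real implementation would probably count drops in a metric.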
2. Analytics Subscriber
// Writer defines the interface for writing alert data to storage
type Writer interface {
    InsertAlert(ctx context.Context, alert *Alert) error
}

// Subscriber subscribes to alert updates and persists them
type Subscriber interface {
    Run(ctx context.Context)
}
- Subscribes to the alert provider's alert stream
- Writes alert metadata (labels, annotations) to the database
- Runs in a separate goroutine to avoid blocking alert processing
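A minimal sketch of the subscriber loop follows. The channel stands in for the alert provider's stream, and the Alert struct, memWriter, and runOnce are hypothetical names used only for illustration.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// Alert is a reduced, hypothetical alert record.
type Alert struct {
	Fingerprint string
	Labels      map[string]string
}

// Writer matches the interface sketched in the proposal.
type Writer interface {
	InsertAlert(ctx context.Context, alert *Alert) error
}

// memWriter is an in-memory stand-in for the database writer.
type memWriter struct {
	mu     sync.Mutex
	alerts []*Alert
}

func (w *memWriter) InsertAlert(_ context.Context, a *Alert) error {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.alerts = append(w.alerts, a)
	return nil
}

// subscriber drains the alert stream and persists each alert.
type subscriber struct {
	alerts <-chan *Alert
	w      Writer
}

// Run returns when the context is cancelled or the stream is closed.
func (s *subscriber) Run(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case a, ok := <-s.alerts:
			if !ok {
				return
			}
			if err := s.w.InsertAlert(ctx, a); err != nil {
				fmt.Println("analytics: insert failed:", err)
			}
		}
	}
}

// runOnce feeds alerts through a subscriber running in its own goroutine
// and returns how many were written.
func runOnce(in []*Alert) int {
	ch := make(chan *Alert)
	w := &memWriter{}
	s := &subscriber{alerts: ch, w: w}
	done := make(chan struct{})
	go func() {
		s.Run(context.Background())
		close(done)
	}()
	for _, a := range in {
		ch <- a
	}
	close(ch)
	<-done
	return len(w.alerts)
}

func main() {
	n := runOnce([]*Alert{{Fingerprint: "abc123"}, {Fingerprint: "def456"}})
	fmt.Println("alerts written:", n)
}
```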
3. Database Storage
Schema:
-- Alerts table
CREATE TABLE alerts (
    id UUID PRIMARY KEY,
    fingerprint VARCHAR NOT NULL UNIQUE,
    alertname VARCHAR NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- State changes table
CREATE TABLE alert_states (
    id UUID PRIMARY KEY,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    alert_fingerprint VARCHAR NOT NULL,
    state VARCHAR NOT NULL,    -- 'unprocessed', 'active', 'suppressed', 'resolved', 'deleted'
    suppressed_by VARCHAR,     -- Fingerprint of inhibiting alert or silence ID (only for suppressed state)
    suppressed_reason VARCHAR, -- 'silence' or 'inhibition' (only for suppressed state)
    FOREIGN KEY (alert_fingerprint) REFERENCES alerts(fingerprint)
);

-- Labels and annotations (normalized)
CREATE TABLE labels (...);
CREATE TABLE annotations (...);
- Uses deterministic UUIDs (UUIDv5) to avoid duplicate inserts
- Maintains in-memory maps to skip already-seen fingerprints
- Transactions ensure consistency
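To illustrate the deterministic-ID and in-memory dedup ideas, here is a sketch using only the standard library. Deriving the alert ID from the fingerprint alone is an assumption, the DNS namespace UUID is reused purely as a fixed value, and uuidV5 and dedup are hypothetical names.

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// namespace is the RFC 4122 DNS namespace UUID, used here only as a fixed
// namespace; a real deployment would likely pick its own.
var namespace = [16]byte{
	0x6b, 0xa7, 0xb8, 0x10, 0x9d, 0xad, 0x11, 0xd1,
	0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8,
}

// uuidV5 derives a name-based UUID: SHA-1 over namespace||name, truncated
// to 16 bytes, with the version and variant bits set.
func uuidV5(name string) string {
	h := sha1.New()
	h.Write(namespace[:])
	h.Write([]byte(name))
	var u [16]byte
	copy(u[:], h.Sum(nil))
	u[6] = (u[6] & 0x0f) | 0x50 // version 5
	u[8] = (u[8] & 0x3f) | 0x80 // RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", u[0:4], u[4:6], u[6:8], u[8:10], u[10:16])
}

// dedup remembers IDs that were already written so repeat inserts are skipped.
type dedup struct {
	seen map[string]struct{}
}

func newDedup() *dedup { return &dedup{seen: make(map[string]struct{})} }

// firstTime reports whether id is new, and marks it as seen.
func (d *dedup) firstTime(id string) bool {
	if _, ok := d.seen[id]; ok {
		return false
	}
	d.seen[id] = struct{}{}
	return true
}

func main() {
	d := newDedup()
	for _, fp := range []string{"abc123", "def456", "abc123"} {
		id := uuidV5(fp)
		fmt.Println(fp, id, "insert:", d.firstTime(id))
	}
}
```

Because the same fingerprint always maps to the same ID, a retried insert collides on the primary key instead of creating a duplicate row, and the map avoids even attempting the write.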
4. Storage Interface
// Database defines the interface for analytics storage
type Database interface {
    Reader
    Writer
}

type Reader interface {
    GetAlertStatesByFingerprint(ctx context.Context, fingerprint model.Fingerprint) ([]*AlertState, error)
    GetAllAlertsAndTheirStates(ctx context.Context) ([]*Alert, error)
}
5. REST API Endpoints
New endpoints:
- GET /api/v2/alerts/states - Get all alerts with their recent states
- GET /api/v2/alerts/{fingerprint}/states - Get state history for a specific alert
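As a rough sketch of the per-alert endpoint, the handler below serves state history from an in-memory map that stands in for Reader.GetAlertStatesByFingerprint. The handler name, the stateEntry struct, and the manual path parsing are illustrative assumptions; Alertmanager's actual v2 API is generated from an OpenAPI spec and would be wired differently.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// stateEntry is a hypothetical JSON shape for one state change.
type stateEntry struct {
	Timestamp        string `json:"timestamp"`
	State            string `json:"state"`
	SuppressedBy     string `json:"suppressed_by,omitempty"`
	SuppressedReason string `json:"suppressed_reason,omitempty"`
}

// history stands in for the analytics database.
var history = map[string][]stateEntry{
	"abc123": {
		{Timestamp: "2025-11-13T14:00:00Z", State: "active"},
		{Timestamp: "2025-11-13T14:30:00Z", State: "suppressed",
			SuppressedBy: "def456", SuppressedReason: "inhibition"},
	},
}

// statesHandler serves GET /api/v2/alerts/{fingerprint}/states.
func statesHandler(w http.ResponseWriter, r *http.Request) {
	rest := strings.TrimPrefix(r.URL.Path, "/api/v2/alerts/")
	fp := strings.TrimSuffix(rest, "/states")
	// Reject anything that lacks the prefix, the suffix, or a single-segment fingerprint.
	if r.Method != http.MethodGet || rest == r.URL.Path || fp == rest ||
		fp == "" || strings.Contains(fp, "/") {
		http.NotFound(w, r)
		return
	}
	states, ok := history[fp]
	if !ok {
		http.NotFound(w, r)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(map[string]interface{}{
		"fingerprint": fp,
		"states":      states,
	})
}

// get issues an in-process request and returns the status code and body.
func get(path string) (int, string) {
	rec := httptest.NewRecorder()
	statesHandler(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := get("/api/v2/alerts/abc123/states")
	fmt.Println(code)
	fmt.Print(body)
}
```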
Advantages:
- Minimal code changes to core alert processing logic
- High performance (bulk insert APIs can handle millions of rows/sec)
- Embedded database (no external dependencies)
- SQL queries for flexible analysis
- Relatively straightforward implementation
Disadvantages:
- Tight coupling between marker and database
- Requires embedded database dependency
- Database file management (rotation, cleanup)
- Potential for write amplification in high-cardinality environments
Performance Considerations:
- Bulk insert APIs provide extremely fast writes
- In-memory maps reduce duplicate writes by ~90%
- Async writes don't block alert processing
- Configurable retention (default: 7 days recommended)
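Retention enforcement can be sketched as a periodic prune relative to a cutoff. The version below works on an in-memory slice with hypothetical names; against the embedded database it would more likely be a single DELETE on alert_states filtered by created_at.

```go
package main

import (
	"fmt"
	"time"
)

// stateRow is a reduced, hypothetical view of one alert_states row.
type stateRow struct {
	Fingerprint string
	State       string
	CreatedAt   time.Time
}

// prune returns the rows that are still inside the retention window.
func prune(rows []stateRow, now time.Time, retention time.Duration) []stateRow {
	cutoff := now.Add(-retention)
	kept := make([]stateRow, 0, len(rows))
	for _, r := range rows {
		if !r.CreatedAt.Before(cutoff) {
			kept = append(kept, r)
		}
	}
	return kept
}

// example prunes two rows with the 7-day retention suggested above:
// one row is 8 days old, the other is 1 hour old.
func example() []stateRow {
	now := time.Date(2025, 11, 13, 15, 0, 0, 0, time.UTC)
	rows := []stateRow{
		{Fingerprint: "abc123", State: "active", CreatedAt: now.Add(-8 * 24 * time.Hour)},
		{Fingerprint: "abc123", State: "suppressed", CreatedAt: now.Add(-time.Hour)},
	}
	return prune(rows, now, 168*time.Hour)
}

func main() {
	for _, r := range example() {
		fmt.Println(r.Fingerprint, r.State, r.CreatedAt.Format(time.RFC3339))
	}
}
```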
Option 2: Event-Based Architecture
Architecture:
- Introduce an event system for alert lifecycle events
- Emit events for state changes without modifying the marker
- Publish events to external message bus/queue systems (e.g., Kafka, Redis, RabbitMQ)
- No built-in storage or REST API - consumers handle data persistence and querying
Key Components:
1. Event System
type AlertEventMetadata struct {
    Alertname        string
    Labels           model.LabelSet
    Annotations      model.LabelSet
    SuppressedBy     []string // Silence IDs or inhibiting alert fingerprints
    SuppressedReason string   // 'silence' or 'inhibition'
}

type AlertEvent struct {
    Timestamp   time.Time
    Fingerprint model.Fingerprint
    EventType   EventType // StateChanged, Suppressed, Unsuppressed, Resolved, Deleted
    OldState    AlertState
    NewState    AlertState
    Metadata    AlertEventMetadata
}

type EventHandler interface {
    HandleEvent(ctx context.Context, event AlertEvent) error
}

type EventBus interface {
    Subscribe(handler EventHandler)
    Publish(ctx context.Context, event AlertEvent) error
}
2. Event Emission Points
- Modify MemMarker.SetActiveOrSilenced() to emit events (for active/silenced transitions)
- Modify MemMarker.SetInhibited() to emit events (for inhibition transitions)
- Modify MemMarker.Delete() to emit events (for alert deletion)
- Hook into alert resolution detection (when EndsAt timestamp passes)
- Emit events with full context including suppression details in metadata
- Events include timestamps for ordering; consumers can use timestamp or UUIDv7 to handle out-of-order delivery
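The EventBus interface above could be satisfied by something as small as the in-memory bus sketched here. It delivers synchronously and returns the first handler error, which is one possible policy and not one the proposal specifies; AlertEvent is reduced to plain strings, and memBus, HandlerFunc, and publishDemo are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// AlertEvent is a reduced version of the struct in the proposal.
type AlertEvent struct {
	Timestamp   time.Time
	Fingerprint string
	EventType   string
	OldState    string
	NewState    string
}

type EventHandler interface {
	HandleEvent(ctx context.Context, event AlertEvent) error
}

// HandlerFunc adapts a plain function to EventHandler.
type HandlerFunc func(ctx context.Context, event AlertEvent) error

func (f HandlerFunc) HandleEvent(ctx context.Context, e AlertEvent) error { return f(ctx, e) }

// memBus is a minimal in-process bus.
type memBus struct {
	mu       sync.RWMutex
	handlers []EventHandler
}

func (b *memBus) Subscribe(h EventHandler) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.handlers = append(b.handlers, h)
}

// Publish calls every handler in subscription order and returns the first error.
func (b *memBus) Publish(ctx context.Context, e AlertEvent) error {
	b.mu.RLock()
	hs := append([]EventHandler(nil), b.handlers...)
	b.mu.RUnlock()
	var first error
	for _, h := range hs {
		if err := h.HandleEvent(ctx, e); err != nil && first == nil {
			first = err
		}
	}
	return first
}

// publishDemo subscribes two counting handlers and publishes one event.
func publishDemo() (int, error) {
	bus := &memBus{}
	delivered := 0
	count := HandlerFunc(func(context.Context, AlertEvent) error {
		delivered++
		return nil
	})
	bus.Subscribe(count)
	bus.Subscribe(count)
	err := bus.Publish(context.Background(), AlertEvent{
		Timestamp:   time.Date(2025, 11, 13, 14, 30, 0, 0, time.UTC),
		Fingerprint: "abc123",
		EventType:   "state_changed",
		OldState:    "active",
		NewState:    "suppressed",
	})
	return delivered, err
}

func main() {
	n, err := publishDemo()
	fmt.Println("deliveries:", n, "error:", err)
}
```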
3. Event Publisher
type EventPublisher interface {
    EventHandler
}
// Implementation would publish events to external message bus (Kafka, Redis, etc.)
// Examples: KafkaPublisher, RedisPublisher, RabbitMQPublisher
Advantages:
- Loose coupling - analytics doesn't affect core logic
- Extensible - easy to add new event handlers
- Could be used for other features (webhooks, audit logs)
- Easier to disable or configure
- Offloads storage and querying to external systems
- Can integrate with existing event processing infrastructure
Disadvantages:
- More invasive changes to MemMarker
- Event bus adds complexity
- Potential for event loss if handlers are slow
- Need to implement event buffering/retries
- Requires external infrastructure (message bus)
- No built-in querying capability - consumers must implement their own storage/queries
- More operational overhead
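The buffering and retry concerns in the list above might be addressed along these lines. This is a sketch with hypothetical names: the queue drops and counts events when full so producers never block, and the retry helper omits the backoff delay a real publisher would want between attempts.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

var errTransient = errors.New("transient publish failure")

// boundedQueue buffers events for a slow handler and counts what it drops.
type boundedQueue struct {
	ch      chan string
	dropped int64
}

func newBoundedQueue(size int) *boundedQueue {
	return &boundedQueue{ch: make(chan string, size)}
}

// offer enqueues without blocking; it returns false when the event was dropped.
func (q *boundedQueue) offer(event string) bool {
	select {
	case q.ch <- event:
		return true
	default:
		atomic.AddInt64(&q.dropped, 1)
		return false
	}
}

// Dropped returns how many events were discarded so far.
func (q *boundedQueue) Dropped() int64 { return atomic.LoadInt64(&q.dropped) }

// publishWithRetry calls send until it succeeds or attempts run out.
// It returns the number of attempts made and the last error.
func publishWithRetry(send func() error, attempts int) (int, error) {
	var err error
	for i := 1; i <= attempts; i++ {
		if err = send(); err == nil {
			return i, nil
		}
	}
	return attempts, err
}

func main() {
	q := newBoundedQueue(1)
	fmt.Println(q.offer("event-1"), q.offer("event-2"), "dropped:", q.Dropped())

	calls := 0
	n, err := publishWithRetry(func() error {
		calls++
		if calls < 3 {
			return errTransient
		}
		return nil
	}, 5)
	fmt.Println("attempts:", n, "error:", err)
}
```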
Configuration
Option 1: Database Integration Configuration
# alertmanager.yml
analytics:
  enabled: true
  type: database
  storage:
    path: /data/analytics.db
  retention: 168h  # 7 days
  # Optional: limit database size
  max_size_mb: 1024
  # Optional: sample rate (1.0 = 100%, 0.1 = 10%)
  sample_rate: 1.0
Command-line flags:
--analytics.enabled
--analytics.type=database
--analytics.storage.path=/data/analytics.db
--analytics.retention=168h
Option 2: Event Publisher Configuration
# alertmanager.yml
analytics:
  enabled: true
  type: event_publisher
  publisher:
    type: kafka  # or redis, rabbitmq
    brokers:
      - kafka1:9092
      - kafka2:9092
    topic: alertmanager-state-events
  # Optional: sample rate
  sample_rate: 1.0
Command-line flags:
--analytics.enabled
--analytics.type=event_publisher
--analytics.publisher.type=kafka
--analytics.publisher.brokers=kafka1:9092,kafka2:9092
--analytics.publisher.topic=alertmanager-state-events
API Examples (Option 1 Only)
Get all alerts with recent state changes
GET /api/v2/alerts/states
Response:
[
  {
    "fingerprint": "abc123",
    "alertname": "HighCPU",
    "labels": {...},
    "annotations": {...},
    "states": [
      {
        "id": "uuid",
        "timestamp": "2025-11-13T14:30:00Z",
        "state": "active"
      },
      {
        "id": "uuid",
        "timestamp": "2025-11-13T14:35:00Z",
        "state": "suppressed",
        "suppressed_by": "def456",
        "suppressed_reason": "inhibition"
      }
    ]
  }
]
Get state history for a specific alert
GET /api/v2/alerts/{fingerprint}/states
Response:
{
  "fingerprint": "abc123",
  "states": [
    {
      "id": "uuid",
      "timestamp": "2025-11-13T14:00:00Z",
      "state": "active"
    },
    {
      "id": "uuid",
      "timestamp": "2025-11-13T14:30:00Z",
      "state": "suppressed",
      "suppressed_by": "def456",
      "suppressed_reason": "inhibition"
    },
    {
      "id": "uuid",
      "timestamp": "2025-11-13T14:45:00Z",
      "state": "active"
    }
  ]
}
Event Examples (Option 2 Only)
Alert State Change Event
{
  "timestamp": "2025-11-13T14:30:00Z",
  "fingerprint": "abc123",
  "event_type": "state_changed",
  "old_state": "active",
  "new_state": "suppressed",
  "metadata": {
    "alertname": "HighCPU",
    "labels": {...},
    "suppressed_by": "def456",
    "suppressed_reason": "inhibition"
  }
}
Alert Deletion Event
{
  "timestamp": "2025-11-13T15:00:00Z",
  "fingerprint": "abc123",
  "event_type": "deleted",
  "old_state": "resolved",
  "new_state": "deleted",
  "metadata": {
    "alertname": "HighCPU",
    "labels": {...}
  }
}
Open Questions
- Retention (Option 1 only): What's the right default retention period?
  - Proposal: 7 days (168 hours)
  - Rationale: Sufficient for post-mortem analysis, limited disk usage
  - Configurable for different use cases
- Schema Evolution: How do we handle schema changes?
  - Option 1: Version the schema in the database, provide migration path
  - Option 2: Version events, consumers handle different event versions
  - Consider forward/backward compatibility in both cases
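For the event-versioning route, a consumer could peek at a version field before decoding the full payload. The sketch below assumes a top-level "version" field, which the event examples in this proposal do not include, and treats a missing version as version 1; envelope, eventV1, and decode are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// envelope reads only the (hypothetical) version field.
type envelope struct {
	Version int `json:"version"`
}

// eventV1 covers the top-level fields shown in the event examples.
type eventV1 struct {
	Fingerprint string `json:"fingerprint"`
	EventType   string `json:"event_type"`
	OldState    string `json:"old_state"`
	NewState    string `json:"new_state"`
}

// decode dispatches on the version and rejects versions it does not know.
func decode(raw []byte) (eventV1, error) {
	var env envelope
	if err := json.Unmarshal(raw, &env); err != nil {
		return eventV1{}, err
	}
	switch env.Version {
	case 0, 1: // assumption: events without a version are version 1
		var e eventV1
		err := json.Unmarshal(raw, &e)
		return e, err
	default:
		return eventV1{}, fmt.Errorf("unsupported event version %d", env.Version)
	}
}

func main() {
	raw := []byte(`{"version":1,"fingerprint":"abc123","event_type":"deleted","old_state":"resolved","new_state":"deleted"}`)
	e, err := decode(raw)
	fmt.Println(e.Fingerprint, e.OldState, "->", e.NewState, err)

	_, err = decode([]byte(`{"version":2}`))
	fmt.Println(err)
}
```

Unknown JSON fields are ignored by encoding/json, so fields added within a version remain backward compatible for this consumer; only a version bump forces a code change.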