Proposal: Alert State Analytics for Alertmanager #4732


## Summary

Add alert state analytics capabilities to Alertmanager to track state transitions of alerts over time. This will provide visibility into how alerts move between `unprocessed`, `active`, and `suppressed` states, including tracking which alerts are inhibited and by which other alerts.

## Table of Contents

GH issues don't support ToC 😔

## Motivation

The primary motivation for alert state analytics is the need to validate proposed enhancements to Alertmanager's clustering behavior, specifically [#4315](https://github.com/prometheus/alertmanager/issues/4315), which proposes making inhibitions part of the gossip protocol.

Currently, when investigating issues like:
- Failed inhibitions during instance restarts ([#4064](https://github.com/prometheus/alertmanager/issues/4064))
- Ready endpoint reporting ready before gossip settles ([#3026](https://github.com/prometheus/alertmanager/issues/3026))
- Duplicate alert notifications in clustered deployments

...we lack the data to:
1. **Quantify the impact** - How often do inhibition failures occur in production?
2. **Validate solutions** - Would making inhibitions part of gossip actually solve the problem?
3. **Measure improvements** - Can we prove that a change reduced the frequency of issues?
4. **Debug production issues** - What state transitions led to unexpected behavior?

Without analytics, we're making architectural decisions based on theory rather than data.

### Real-World Impact

At Cloudflare, we use inhibitions heavily in our alerting infrastructure. We've had numerous cases of users reporting that they were notified about alerts that should have been inhibited. Without analytical data, it's extremely difficult to:
- Determine if this was actually an inhibition failure or a misconfiguration
- Identify patterns in when inhibition failures occur
- Correlate failures with specific cluster events or topology changes
- Provide evidence-based answers to users about what happened

Having this analytical data would allow us to accurately debug whether inhibition failures are occurring and why, distinguishing between misconfigurations and actual bugs in the system.

## Goals

1. **Track all alert state transitions** including:
   - `unprocessed` → `active`
   - `active` → `suppressed` (by silence or inhibition)
   - `suppressed` → `active`
   - Any state → `resolved` (when the alert's `EndsAt` timestamp is in the past)
   - `resolved` → `deleted` (when garbage collector removes it and `marker.Delete()` is called)
   - State changes during cluster topology changes

2. **Capture suppression relationships**:
   - Which alerts were suppressed (by silence or inhibition)
   - What caused the suppression (silence ID or inhibiting alert fingerprint)
   - When the suppression was established and released

3. **Provide an interface to expose the data**:
   - For database integration: REST API endpoints for querying state history
   - For event-based systems: Publish events to external message bus/queue (e.g., Kafka, Redis)
   - Enable retrieval of state history for specific alerts and time ranges

4. **Minimize performance impact**:
   - Asynchronous writes, so that recording state never blocks alert processing
   - Efficient storage to handle high-cardinality alert environments
   - Optional feature (can be disabled if not needed)

## Non-Goals

- Real-time alerting or dashboarding (analytics is for post-hoc analysis)
- Long-term storage (retention should be configurable and limited)
- Complex query DSL (simple API endpoints are sufficient)
- Replication across cluster members - each instance operates independently; consumers are responsible for merging/aggregating data from multiple instances if needed

## Proposed Solutions

### Option 1: Direct Database Integration with State-Aware Marker

**Architecture:**
- Wrap the existing `MemMarker` with a `StateAwareMarker` that records state changes to a database
- Use an embedded analytical database (e.g., DuckDB or SQLite)
- Employ high-performance bulk insert APIs for minimal overhead
- Add REST API endpoints to query the analytics data

**Key Components:**

#### 1. State-Aware Marker
```go
// StateAppender records alert state changes to storage
type StateAppender interface {
    Append(fingerprint model.Fingerprint, state AlertState)
    AppendSuppressed(fingerprint model.Fingerprint, state AlertState, suppressedBy []string)
    Flush() error
    Close() error
}

// StateAwareMarker decorates the existing marker with state tracking
type StateAwareMarker interface {
    AlertMarker
    GroupMarker
    Flush() error
}
```

- Decorates the existing `MemMarker` implementation
- Intercepts calls to `SetActiveOrSilenced()`, `SetInhibited()`, and `Delete()`
- Appends state changes to the database asynchronously via `StateAppender`
- Maintains backward compatibility with existing code
- Tracks both when alerts become resolved (`EndsAt` in the past) and when they are deleted from the marker (GC cleanup)
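The decorator described above could look roughly like the following sketch. It is not the proposed implementation: `Fingerprint`, the trimmed-down `Marker` interface, `noopMarker`, and `memAppender` are hypothetical stand-ins so the example compiles on its own, and the real marker methods take more arguments than shown here.

```go
package main

import "fmt"

// Fingerprint and AlertState are simplified stand-ins (assumptions) for
// model.Fingerprint and the proposal's AlertState type.
type (
	Fingerprint uint64
	AlertState  string
)

const (
	StateActive     AlertState = "active"
	StateSuppressed AlertState = "suppressed"
	StateDeleted    AlertState = "deleted"
)

// Marker is a hypothetical, trimmed-down version of the marker methods
// named in the proposal; the real signatures carry more arguments.
type Marker interface {
	SetActiveOrSilenced(fp Fingerprint, silencedBy []string)
	SetInhibited(fp Fingerprint, inhibitedBy ...string)
	Delete(fp Fingerprint)
}

// StateAppender mirrors the append methods proposed above.
type StateAppender interface {
	Append(fp Fingerprint, state AlertState)
	AppendSuppressed(fp Fingerprint, state AlertState, suppressedBy []string)
}

// stateAwareMarker forwards each call to the wrapped marker, then records
// the resulting state.
type stateAwareMarker struct {
	next     Marker
	appender StateAppender
}

// record is a simplification: a real implementation would read back the
// combined status, since an alert can be silenced and inhibited at once.
func (m *stateAwareMarker) record(fp Fingerprint, suppressedBy []string) {
	if len(suppressedBy) > 0 {
		m.appender.AppendSuppressed(fp, StateSuppressed, suppressedBy)
		return
	}
	m.appender.Append(fp, StateActive)
}

func (m *stateAwareMarker) SetActiveOrSilenced(fp Fingerprint, silencedBy []string) {
	m.next.SetActiveOrSilenced(fp, silencedBy)
	m.record(fp, silencedBy)
}

func (m *stateAwareMarker) SetInhibited(fp Fingerprint, inhibitedBy ...string) {
	m.next.SetInhibited(fp, inhibitedBy...)
	m.record(fp, inhibitedBy)
}

func (m *stateAwareMarker) Delete(fp Fingerprint) {
	m.next.Delete(fp)
	m.appender.Append(fp, StateDeleted)
}

// noopMarker and memAppender exist only to make the sketch runnable.
type noopMarker struct{}

func (noopMarker) SetActiveOrSilenced(Fingerprint, []string) {}
func (noopMarker) SetInhibited(Fingerprint, ...string)       {}
func (noopMarker) Delete(Fingerprint)                        {}

type memAppender struct{ records []string }

func (a *memAppender) Append(fp Fingerprint, s AlertState) {
	a.records = append(a.records, fmt.Sprintf("%d:%s", fp, s))
}

func (a *memAppender) AppendSuppressed(fp Fingerprint, s AlertState, by []string) {
	a.records = append(a.records, fmt.Sprintf("%d:%s:%v", fp, s, by))
}

func main() {
	app := &memAppender{}
	var m Marker = &stateAwareMarker{next: noopMarker{}, appender: app}
	m.SetActiveOrSilenced(1, nil)
	m.SetInhibited(1, "def456")
	m.Delete(1)
	fmt.Println(app.records)
}
```

Because the wrapper satisfies the same interface as the marker it wraps, callers would not need to change, which is the backward-compatibility property listed above.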

#### 2. Analytics Subscriber
```go
// Writer defines the interface for writing alert data to storage
type Writer interface {
    InsertAlert(ctx context.Context, alert *Alert) error
}

// Subscriber subscribes to alert updates and persists them
type Subscriber interface {
    Run(ctx context.Context)
}
```

- Subscribes to the alert provider's alert stream
- Writes alert metadata (labels, annotations) to the database
- Runs in a separate goroutine to avoid blocking alert processing
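As an illustration of the subscriber loop, here is a minimal sketch that drains alerts in its own goroutine and hands each one to a `Writer`. The channel stands in for the alert provider's subscription, and `Alert`, `chanSubscriber`, `memWriter`, and `persistAll` are hypothetical names used only for this example.

```go
package main

import (
	"context"
	"fmt"
	"log"
)

// Alert is a pared-down stand-in (assumption) for the stored alert metadata.
type Alert struct {
	Fingerprint string
	Labels      map[string]string
}

// Writer matches the interface proposed above.
type Writer interface {
	InsertAlert(ctx context.Context, alert *Alert) error
}

// chanSubscriber reads alerts from a channel, which stands in for the
// alert provider's subscription, and persists each one.
type chanSubscriber struct {
	alerts <-chan *Alert
	writer Writer
}

// Run blocks until the context is cancelled or the alert stream closes.
// Insert errors are logged rather than returned so one bad row does not
// stop the stream.
func (s *chanSubscriber) Run(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case a, ok := <-s.alerts:
			if !ok {
				return
			}
			if err := s.writer.InsertAlert(ctx, a); err != nil {
				log.Printf("analytics: insert %s failed: %v", a.Fingerprint, err)
			}
		}
	}
}

// memWriter is an in-memory Writer used only to make the sketch runnable.
type memWriter struct{ seen []string }

func (w *memWriter) InsertAlert(_ context.Context, a *Alert) error {
	w.seen = append(w.seen, a.Fingerprint)
	return nil
}

// persistAll feeds alerts through a subscriber running in its own
// goroutine and waits for it to finish.
func persistAll(w Writer, alerts ...*Alert) {
	ch := make(chan *Alert)
	done := make(chan struct{})
	sub := &chanSubscriber{alerts: ch, writer: w}
	go func() {
		defer close(done)
		sub.Run(context.Background())
	}()
	for _, a := range alerts {
		ch <- a
	}
	close(ch)
	<-done
}

func main() {
	w := &memWriter{}
	persistAll(w, &Alert{Fingerprint: "abc123"}, &Alert{Fingerprint: "def456"})
	fmt.Println(w.seen)
}
```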

#### 3. Database Storage

**Schema:**
```sql
-- Alerts table
CREATE TABLE alerts (
    id UUID PRIMARY KEY,
    fingerprint VARCHAR NOT NULL UNIQUE,
    alertname VARCHAR NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- State changes table
CREATE TABLE alert_states (
    id UUID PRIMARY KEY,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    alert_fingerprint VARCHAR NOT NULL,
    state VARCHAR NOT NULL,  -- 'unprocessed', 'active', 'suppressed', 'resolved', 'deleted'
    suppressed_by VARCHAR,  -- Fingerprint of inhibiting alert or silence ID (only for suppressed state)
    suppressed_reason VARCHAR,  -- 'silence' or 'inhibition' (only for suppressed state)
    FOREIGN KEY (alert_fingerprint) REFERENCES alerts(fingerprint)
);

-- Labels and annotations (normalized)
CREATE TABLE labels (...);
CREATE TABLE annotations (...);
```

- Uses deterministic UUIDs (UUIDv5) to avoid duplicate inserts
- Maintains in-memory maps to skip already-seen fingerprints
- Transactions ensure consistency
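To make the deduplication idea concrete, the sketch below derives a UUIDv5 (SHA-1 over a namespace plus a name) using only the standard library and pairs it with an in-memory "seen" set. The namespace used here is the RFC 4122 DNS namespace, chosen purely for illustration; the composition of the name in `stateRowID` and the `dedup` type are assumptions, not part of the proposal.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// namespace is the RFC 4122 DNS namespace UUID, used here only as a
// placeholder; a real implementation would pick its own namespace.
var namespace = [16]byte{
	0x6b, 0xa7, 0xb8, 0x10, 0x9d, 0xad, 0x11, 0xd1,
	0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8,
}

// uuidV5 returns the name-based (SHA-1) UUID for name within ns.
// The same inputs always produce the same UUID.
func uuidV5(ns [16]byte, name string) string {
	h := sha1.New()
	h.Write(ns[:])
	h.Write([]byte(name))
	sum := h.Sum(nil)

	var u [16]byte
	copy(u[:], sum[:16])
	u[6] = (u[6] & 0x0f) | 0x50 // version 5
	u[8] = (u[8] & 0x3f) | 0x80 // RFC 4122 variant

	s := hex.EncodeToString(u[:])
	return s[0:8] + "-" + s[8:12] + "-" + s[12:16] + "-" + s[16:20] + "-" + s[20:32]
}

// stateRowID builds a deterministic ID for a state row. Which fields go
// into the name is an assumption made for this sketch.
func stateRowID(fingerprint, state string, unixNano int64) string {
	return uuidV5(namespace, fmt.Sprintf("%s|%s|%d", fingerprint, state, unixNano))
}

// dedup remembers fingerprints that were already written so repeat
// inserts into the alerts table can be skipped without a database call.
type dedup struct{ seen map[string]struct{} }

// firstTime reports whether fp has not been seen before and marks it seen.
func (d *dedup) firstTime(fp string) bool {
	if _, ok := d.seen[fp]; ok {
		return false
	}
	d.seen[fp] = struct{}{}
	return true
}

func main() {
	d := &dedup{seen: map[string]struct{}{}}
	for _, fp := range []string{"abc123", "def456", "abc123"} {
		if d.firstTime(fp) {
			fmt.Println("insert alert", fp, uuidV5(namespace, fp))
		}
	}
	fmt.Println("state row", stateRowID("abc123", "active", 1))
}
```

With deterministic IDs, a retried write produces the same primary key, so the database can reject the duplicate even if the in-memory set was lost in a restart.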

#### 4. Storage Interface
```go
// Database defines the interface for analytics storage
type Database interface {
    Reader
    Writer
}

type Reader interface {
    GetAlertStatesByFingerprint(ctx context.Context, fingerprint model.Fingerprint) ([]*AlertState, error)
    GetAllAlertsAndTheirStates(ctx context.Context) ([]*Alert, error)
}
```

#### 5. REST API Endpoints

New endpoints:
- `GET /api/v2/alerts/states` - Get all alerts with their recent states
- `GET /api/v2/alerts/{fingerprint}/states` - Get state history for a specific alert

**Advantages:**
- Minimal code changes to core alert processing logic
- High performance (bulk insert APIs can handle millions of rows/sec)
- Embedded database (no external dependencies)
- SQL queries for flexible analysis
- Relatively straightforward implementation

**Disadvantages:**
- Tight coupling between marker and database
- Requires embedded database dependency
- Database file management (rotation, cleanup)
- Potential for write amplification in high-cardinality environments

**Performance Considerations:**
- Bulk insert APIs provide extremely fast writes
- In-memory maps reduce duplicate writes by ~90%
- Async writes don't block alert processing
- Configurable retention (default: 7 days recommended)
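One way to keep writes off the alert-processing path is a bounded queue drained by a background goroutine that flushes in batches. The sketch below assumes a drop-and-count policy when the queue is full; `asyncAppender`, `stateRow`, and the `flush` callback (which stands in for a bulk insert) are hypothetical names, and a real implementation would likely also flush on a timer.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// stateRow is a simplified stand-in (assumption) for one state-change row.
type stateRow struct {
	Fingerprint string
	State       string
}

// asyncAppender queues rows on a buffered channel; a background goroutine
// hands them to flush in batches. When the queue is full, Append drops
// the row and counts it instead of blocking the caller.
type asyncAppender struct {
	dropped   int64 // accessed atomically
	queue     chan stateRow
	batchSize int
	flush     func([]stateRow) // stand-in for a bulk insert
	wg        sync.WaitGroup
}

func newAsyncAppender(queueLen, batchSize int, flush func([]stateRow)) *asyncAppender {
	a := &asyncAppender{
		queue:     make(chan stateRow, queueLen),
		batchSize: batchSize,
		flush:     flush,
	}
	a.wg.Add(1)
	go a.loop()
	return a
}

// loop collects rows into batches and flushes each full batch, plus any
// remainder once the queue is closed.
func (a *asyncAppender) loop() {
	defer a.wg.Done()
	batch := make([]stateRow, 0, a.batchSize)
	for row := range a.queue {
		batch = append(batch, row)
		if len(batch) == a.batchSize {
			a.flush(batch)
			batch = make([]stateRow, 0, a.batchSize)
		}
	}
	if len(batch) > 0 {
		a.flush(batch)
	}
}

// Append never blocks. It returns false if the row was dropped.
func (a *asyncAppender) Append(row stateRow) bool {
	select {
	case a.queue <- row:
		return true
	default:
		atomic.AddInt64(&a.dropped, 1)
		return false
	}
}

// Dropped returns how many rows were discarded because the queue was full.
func (a *asyncAppender) Dropped() int64 { return atomic.LoadInt64(&a.dropped) }

// Close stops accepting rows and waits for the final flush.
func (a *asyncAppender) Close() {
	close(a.queue)
	a.wg.Wait()
}

func main() {
	total := 0
	a := newAsyncAppender(10, 2, func(b []stateRow) { total += len(b) })
	for i := 0; i < 5; i++ {
		a.Append(stateRow{Fingerprint: "abc123", State: "active"})
	}
	a.Close()
	fmt.Println("flushed:", total, "dropped:", a.Dropped())
}
```

Dropping under pressure trades completeness for latency; exposing the drop count as a metric would let operators see when the analytics data has gaps.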

### Option 2: Event-Based Architecture

**Architecture:**
- Introduce an event system for alert lifecycle events
- Emit events for state changes from within the marker, without coupling the marker to any storage backend
- Publish events to external message bus/queue systems (e.g., Kafka, Redis, RabbitMQ)
- No built-in storage or REST API - consumers handle data persistence and querying

**Key Components:**

#### 1. Event System
```go
type AlertEventMetadata struct {
    Alertname        string
    Labels           model.LabelSet
    Annotations      model.LabelSet
    SuppressedBy     []string  // Silence IDs or inhibiting alert fingerprints
    SuppressedReason string    // 'silence' or 'inhibition'
}

type AlertEvent struct {
    Timestamp   time.Time
    Fingerprint model.Fingerprint
    EventType   EventType  // StateChanged, Suppressed, Unsuppressed, Resolved, Deleted
    OldState    AlertState
    NewState    AlertState
    Metadata    AlertEventMetadata
}

type EventHandler interface {
    HandleEvent(ctx context.Context, event AlertEvent) error
}

type EventBus interface {
    Subscribe(handler EventHandler)
    Publish(ctx context.Context, event AlertEvent) error
}
```

#### 2. Event Emission Points
- Modify `MemMarker.SetActiveOrSilenced()` to emit events (for active/silenced transitions)
- Modify `MemMarker.SetInhibited()` to emit events (for inhibition transitions)
- Modify `MemMarker.Delete()` to emit events (for alert deletion)
- Hook into alert resolution detection (when the `EndsAt` timestamp passes)
- Emit events with full context including suppression details in metadata
- Events include timestamps for ordering; consumers can sort by timestamp to handle out-of-order delivery (or by a UUIDv7 event ID, if one is added to `AlertEvent`)
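The emission points listed above could share a small helper that publishes only real transitions. The sketch below pairs such a helper with a synchronous in-memory bus; `memBus`, `emitTransition`, and `recorder` are hypothetical, the event struct is reduced to a few fields, and a production bus would probably deliver asynchronously with buffering.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// AlertEvent is a reduced version (assumption) of the struct proposed above.
type AlertEvent struct {
	Timestamp   time.Time
	Fingerprint string
	OldState    string
	NewState    string
}

// EventHandler matches the interface proposed above.
type EventHandler interface {
	HandleEvent(ctx context.Context, event AlertEvent) error
}

// memBus is a minimal in-memory bus that delivers synchronously.
type memBus struct {
	mu       sync.RWMutex
	handlers []EventHandler
}

func (b *memBus) Subscribe(h EventHandler) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.handlers = append(b.handlers, h)
}

// Publish calls every handler and returns the first error, if any.
// A failing handler does not prevent delivery to the others.
func (b *memBus) Publish(ctx context.Context, e AlertEvent) error {
	b.mu.RLock()
	handlers := append([]EventHandler(nil), b.handlers...)
	b.mu.RUnlock()

	var firstErr error
	for _, h := range handlers {
		if err := h.HandleEvent(ctx, e); err != nil && firstErr == nil {
			firstErr = err
		}
	}
	return firstErr
}

// emitTransition is what a marker method might call after updating its
// state. It skips calls that do not change the state.
func emitTransition(b *memBus, fp, oldState, newState string) error {
	if oldState == newState {
		return nil
	}
	return b.Publish(context.Background(), AlertEvent{
		Timestamp:   time.Now(),
		Fingerprint: fp,
		OldState:    oldState,
		NewState:    newState,
	})
}

// recorder is a handler that keeps events in memory, for demonstration.
type recorder struct{ events []AlertEvent }

func (r *recorder) HandleEvent(_ context.Context, e AlertEvent) error {
	r.events = append(r.events, e)
	return nil
}

func main() {
	bus := &memBus{}
	rec := &recorder{}
	bus.Subscribe(rec)
	_ = emitTransition(bus, "abc123", "active", "suppressed")
	_ = emitTransition(bus, "abc123", "suppressed", "suppressed") // no change, not published
	fmt.Println("events recorded:", len(rec.events))
}
```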

#### 3. Event Publisher
```go
type EventPublisher interface {
    EventHandler
}

// Implementation would publish events to external message bus (Kafka, Redis, etc.)
// Examples: KafkaPublisher, RedisPublisher, RabbitMQPublisher
```
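As a rough illustration of a publisher, the sketch below serializes each event as one JSON line to an `io.Writer`. The writer stands in for a broker client, so no Kafka, Redis, or RabbitMQ API is assumed; `writerPublisher`, `encodeEvent`, and the JSON field names (taken from the event examples in this proposal) are illustrative only.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"io"
	"sync"
	"time"
)

// AlertEvent is a reduced version (assumption) of the proposed event,
// with JSON tags matching the event examples in this proposal.
type AlertEvent struct {
	Timestamp   time.Time `json:"timestamp"`
	Fingerprint string    `json:"fingerprint"`
	EventType   string    `json:"event_type"`
	OldState    string    `json:"old_state"`
	NewState    string    `json:"new_state"`
}

// writerPublisher writes each event as a single JSON line. The io.Writer
// is a stand-in for a message bus producer.
type writerPublisher struct {
	mu  sync.Mutex
	enc *json.Encoder
}

func newWriterPublisher(w io.Writer) *writerPublisher {
	return &writerPublisher{enc: json.NewEncoder(w)}
}

// HandleEvent satisfies the EventHandler shape proposed above. It refuses
// to publish once the context is cancelled.
func (p *writerPublisher) HandleEvent(ctx context.Context, e AlertEvent) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.enc.Encode(e)
}

// encodeEvent publishes one event into a buffer and returns the bytes
// that would have been sent.
func encodeEvent(e AlertEvent) (string, error) {
	var buf bytes.Buffer
	err := newWriterPublisher(&buf).HandleEvent(context.Background(), e)
	return buf.String(), err
}

func main() {
	out, err := encodeEvent(AlertEvent{
		Timestamp:   time.Date(2025, 11, 13, 14, 30, 0, 0, time.UTC),
		Fingerprint: "abc123",
		EventType:   "state_changed",
		OldState:    "active",
		NewState:    "suppressed",
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```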

**Advantages:**
- Loose coupling - analytics doesn't affect core logic
- Extensible - easy to add new event handlers
- Could be used for other features (webhooks, audit logs)
- Easier to disable or configure
- Offloads storage and querying to external systems
- Can integrate with existing event processing infrastructure

**Disadvantages:**
- More invasive changes to `MemMarker`
- Event bus adds complexity
- Potential for event loss if handlers are slow
- Need to implement event buffering/retries
- Requires external infrastructure (message bus)
- No built-in querying capability - consumers must implement their own storage/queries
- More operational overhead

## Configuration

### Option 1: Database Integration Configuration

```yaml
# alertmanager.yml
analytics:
  enabled: true
  type: database
  storage:
    path: /data/analytics.db
    retention: 168h  # 7 days
  # Optional: limit database size
  max_size_mb: 1024
  # Optional: sample rate (1.0 = 100%, 0.1 = 10%)
  sample_rate: 1.0
```

Command-line flags:
```bash
--analytics.enabled
--analytics.type=database
--analytics.storage.path=/data/analytics.db
--analytics.storage.retention=168h
```

### Option 2: Event Publisher Configuration

```yaml
# alertmanager.yml
analytics:
  enabled: true
  type: event_publisher
  publisher:
    type: kafka  # or redis, rabbitmq
    brokers:
      - kafka1:9092
      - kafka2:9092
    topic: alertmanager-state-events
    # Optional: sample rate
    sample_rate: 1.0
```

Command-line flags:
```bash
--analytics.enabled
--analytics.type=event_publisher
--analytics.publisher.type=kafka
--analytics.publisher.brokers=kafka1:9092,kafka2:9092
--analytics.publisher.topic=alertmanager-state-events
```
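Both configurations expose `sample_rate`. The sketch below shows one possible way to validate the settings and apply the rate: it hashes the fingerprint so that all transitions of a given alert are either kept or dropped together. Per-fingerprint sampling is an assumption of this example, not something the proposal specifies, and `AnalyticsConfig` is a hypothetical struct covering only a few of the fields above.

```go
package main

import (
	"errors"
	"fmt"
	"hash/fnv"
)

// AnalyticsConfig is a hypothetical subset of the settings shown above.
type AnalyticsConfig struct {
	Enabled    bool
	Type       string  // "database" or "event_publisher"
	SampleRate float64 // 1.0 = 100%, 0.1 = 10%
}

// Validate rejects unknown types and out-of-range sample rates.
// A disabled configuration is always valid.
func (c AnalyticsConfig) Validate() error {
	if !c.Enabled {
		return nil
	}
	switch c.Type {
	case "database", "event_publisher":
	default:
		return fmt.Errorf("analytics: unknown type %q", c.Type)
	}
	if c.SampleRate < 0 || c.SampleRate > 1 {
		return errors.New("analytics: sample_rate must be between 0 and 1")
	}
	return nil
}

// Sampled reports whether state changes for this fingerprint should be
// recorded. The decision is a pure function of the fingerprint, so an
// alert's history is never recorded only in part.
func (c AnalyticsConfig) Sampled(fingerprint string) bool {
	h := fnv.New64a()
	h.Write([]byte(fingerprint))
	// Map the hash onto [0, 1) in steps of 0.0001 and compare with the rate.
	return float64(h.Sum64()%10000)/10000 < c.SampleRate
}

func main() {
	cfg := AnalyticsConfig{Enabled: true, Type: "database", SampleRate: 1.0}
	if err := cfg.Validate(); err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Println("record abc123:", cfg.Sampled("abc123"))
}
```

Note that with this struct an omitted `sample_rate` would decode to 0 and record nothing, so a loader would need to default it to 1.0 explicitly.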

## API Examples (Option 1 Only)

### Get all alerts with recent state changes
```bash
GET /api/v2/alerts/states

Response:
[
  {
    "fingerprint": "abc123",
    "alertname": "HighCPU",
    "labels": {...},
    "annotations": {...},
    "states": [
      {
        "id": "uuid",
        "timestamp": "2025-11-13T14:30:00Z",
        "state": "active"
      },
      {
        "id": "uuid",
        "timestamp": "2025-11-13T14:35:00Z",
        "state": "suppressed",
        "suppressed_by": "def456",
        "suppressed_reason": "inhibition"
      }
    ]
  }
]
```

### Get state history for a specific alert
```bash
GET /api/v2/alerts/{fingerprint}/states

Response:
{
  "fingerprint": "abc123",
  "states": [
    {
      "id": "uuid",
      "timestamp": "2025-11-13T14:00:00Z",
      "state": "active"
    },
    {
      "id": "uuid",
      "timestamp": "2025-11-13T14:30:00Z",
      "state": "suppressed",
      "suppressed_by": "def456",
      "suppressed_reason": "inhibition"
    },
    {
      "id": "uuid",
      "timestamp": "2025-11-13T14:45:00Z",
      "state": "active"
    }
  ]
}
```
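A consumer of the per-alert endpoint might decode the response and derive simple figures from it, such as how long the alert spent suppressed. The sketch below assumes the response shape shown above; `StateEntry`, `StateHistory`, `parseHistory`, and `suppressedFor` are hypothetical names, and the sample payload is embedded rather than fetched over HTTP.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// StateEntry and StateHistory mirror the example response above; the Go
// type names are assumptions made for this sketch.
type StateEntry struct {
	ID               string    `json:"id"`
	Timestamp        time.Time `json:"timestamp"`
	State            string    `json:"state"`
	SuppressedBy     string    `json:"suppressed_by,omitempty"`
	SuppressedReason string    `json:"suppressed_reason,omitempty"`
}

type StateHistory struct {
	Fingerprint string       `json:"fingerprint"`
	States      []StateEntry `json:"states"`
}

// parseHistory decodes a response body into a StateHistory.
func parseHistory(data []byte) (StateHistory, error) {
	var h StateHistory
	err := json.Unmarshal(data, &h)
	return h, err
}

// suppressedFor sums the time between each "suppressed" entry and the
// entry that follows it. It assumes entries are sorted by timestamp and
// ignores a trailing suppressed entry that has no successor.
func suppressedFor(h StateHistory) time.Duration {
	var total time.Duration
	for i := 0; i+1 < len(h.States); i++ {
		if h.States[i].State == "suppressed" {
			total += h.States[i+1].Timestamp.Sub(h.States[i].Timestamp)
		}
	}
	return total
}

// sample reproduces the example response with concrete values.
const sample = `{
  "fingerprint": "abc123",
  "states": [
    {"id": "uuid", "timestamp": "2025-11-13T14:00:00Z", "state": "active"},
    {"id": "uuid", "timestamp": "2025-11-13T14:30:00Z", "state": "suppressed",
     "suppressed_by": "def456", "suppressed_reason": "inhibition"},
    {"id": "uuid", "timestamp": "2025-11-13T14:45:00Z", "state": "active"}
  ]
}`

func main() {
	h, err := parseHistory([]byte(sample))
	if err != nil {
		panic(err)
	}
	fmt.Println(h.Fingerprint, "suppressed for", suppressedFor(h))
}
```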

## Event Examples (Option 2 Only)

### Alert State Change Event
```json
{
  "timestamp": "2025-11-13T14:30:00Z",
  "fingerprint": "abc123",
  "event_type": "state_changed",
  "old_state": "active",
  "new_state": "suppressed",
  "metadata": {
    "alertname": "HighCPU",
    "labels": {...},
    "suppressed_by": ["def456"],
    "suppressed_reason": "inhibition"
  }
}
```

### Alert Deletion Event
```json
{
  "timestamp": "2025-11-13T15:00:00Z",
  "fingerprint": "abc123",
  "event_type": "deleted",
  "old_state": "resolved",
  "new_state": "deleted",
  "metadata": {
    "alertname": "HighCPU",
    "labels": {...}
  }
}
```

## Open Questions

1. **Retention** (Option 1 only): What's the right default retention period?
   - Proposal: 7 days (168 hours)
   - Rationale: Sufficient for post-mortem analysis, limited disk usage
   - Configurable for different use cases

2. **Schema Evolution**: How do we handle schema changes?
   - Option 1: Version the schema in the database, provide migration path
   - Option 2: Version events, consumers handle different event versions
   - Consider forward/backward compatibility in both cases

## References

- [#4315 - make inhibitions part of the gossip](https://github.com/prometheus/alertmanager/issues/4315)
- [#4064 - Alerts that should be inhibited fire on Alertmanager reload/restart](https://github.com/prometheus/alertmanager/issues/4064)
- [#3026 - /-/ready reports ready before cluster gossip has settled](https://github.com/prometheus/alertmanager/issues/3026)
