Instrument with OTel tracing #3673
Conversation
Force-pushed 1de5573 to 20b0565
Signed-off-by: Dave Henderson <dhenderson@gmail.com>
Force-pushed 20b0565 to b106234
Unclear why CI's failing - I wonder if that test is unrelated and just flaky?
I've been wondering about support for this for months and it's awesome to see movement on this front. Much appreciated!
@els0r you're welcome! Let me know if there are any particular things you want traced - this is really just a start. |
grobinson-grafana left a comment
I took a quick look at the code and left some comments. @hairyhenderson, can you show me how to test this next week so I can better understand how it all works?
MuteTimeIntervals []MuteTimeInterval `yaml:"mute_time_intervals,omitempty" json:"mute_time_intervals,omitempty"`
TimeIntervals []TimeInterval `yaml:"time_intervals,omitempty" json:"time_intervals,omitempty"`

TracingConfig TracingConfig `yaml:"tracing,omitempty" json:"tracing,omitempty"`
These might be better put in GlobalConfig?
That was my first instinct actually, but in prometheus/prometheus it's not under global. Should we prefer to keep things under global, or follow the pattern established in prometheus/prometheus?
Hmm. It looks like the Prometheus configuration file is structured a little differently from Alertmanager's, as the Prometheus configuration file also has options to change the behavior of the Prometheus server, rather than just containing data like rules. However, the Alertmanager configuration just contains data, and all behavioral configuration is instead done via command line flags. @gotjosh what should we do here?
I read on prometheus-users that if the configuration is expected to be reloadable at runtime it should go in the configuration file, otherwise it should go in command line args.
https://groups.google.com/g/prometheus-users/c/qYKpBiuN8y4/m/9z062yU4CQAJ
func validateHeadersForTracing(headers map[string]string) error {
	for header := range headers {
		if strings.ToLower(header) == "authorization" {
I think we need to remove whitespace? Looking at the RFC 2616:
Field names are case-insensitive. The field value MAY be preceded by any amount of LWS, though a single SP is preferred.
This is also copied from prometheus/prometheus: https://github.com/prometheus/prometheus/blob/main/config/config.go#L1083
The second quoted sentence refers to the field value, not the name. It's not permissible to have whitespace in names at all (they're just tokens, i.e. one or more non-control/non-separator characters). So even if someone had a configuration with a header name that included whitespace, it would likely be ignored or discarded later on.
Ah – I misread! 🙂
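For completeness, a sketch of what the full check could look like, extrapolated from the quoted lines (the error message is an assumption, mirroring the prometheus/prometheus original this was copied from):

func validateHeadersForTracing(headers map[string]string) error {
	for header := range headers {
		if strings.ToLower(header) == "authorization" {
			// Assumed wording - the point is only that the Authorization
			// header is rejected; header names are case-insensitive tokens,
			// so no whitespace trimming is needed.
			return errors.New("custom authorization header configuration is not yet supported")
		}
	}
	return nil
}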
func() {
	traceCtx, span := tracer.Start(d.ctx, "dispatch.Dispatcher.dispatch",
		trace.WithAttributes(
			attribute.String("alert.name", alert.Name()),
Compared to the fingerprint, alert.Name() is not used much in Alertmanager. We might be able to omit it here as it's a lot of data, and I think most operators have been fine using the fingerprint in the past.
We might be able to omit it here as it's a lot of data, and I think most operators have been fine using the fingerprint in the past.
isn't it what gets sent as the alertname label? surely that's a lot more human-readable than the fingerprint?
You could have thousands of alerts all with the same alertname. It's just a label, it doesn't have a special meaning. That's where fingerprint is useful, as it is a unique identifier for an alert.
From an operational perspective I think the alert name is still quite valuable... Here's a sample from my test setup:
alert.fingerprint | "2a7a97fbf890646a"
alert.name | "AlwaysFiring"
I wouldn't be able to relate this trace back to my alert configuration without seeing AlwaysFiring...
I think as long as we have both then 👍
trace.WithAttributes(
	attribute.String("alert.name", alert.Name()),
	attribute.String("alert.fingerprint", alert.Fingerprint().String()),
	attribute.String("alert.status", string(alert.Status())),
I think it should be mentioned that the status of an alert can change between the dispatcher and when the aggregation group is flushed. If that will affect continuation of the trace, we should remove it.
If that will affect continuation of the trace, we should remove it.
Not at all - it's just metadata. But if it can change maybe that's misleading, and if we set it at all we should maybe set it at the end when we have the final status 🤔
It can change because the Status is calculated using time.Now().
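A rough sketch of the "record the status at the end" idea (only the names that appear in the quoted diff are taken from the PR; the rest is illustrative):

// Start the span with the attributes that can't change...
_, span := tracer.Start(d.ctx, "dispatch.Dispatcher.dispatch",
	trace.WithAttributes(
		attribute.String("alert.name", alert.Name()),
		attribute.String("alert.fingerprint", alert.Fingerprint().String()),
	),
)
defer func() {
	// ...and record the status just before the span ends, since Status()
	// is derived from time.Now() and may have changed in the meantime.
	span.SetAttributes(attribute.String("alert.status", string(alert.Status())))
	span.End()
}()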
Signed-off-by: Dave Henderson <dhenderson@gmail.com>
Sorry for the delay... laptop refresh caused me to lose my test configs 🤦♂️

I tested this by setting up a local Grafana, Tempo, and Agent (configured to receive OTLP). Then I fired up the webhook echo service (…). I then ran:

$ ./alertmanager --web.listen-address=127.0.0.1:9093 --log.level=debug --config.file=am-trace-test.yml

Here's am-trace-test.yml:

global:
  smtp_smarthost: 'localhost:2525'
  smtp_from: 'alertmanager@example.org'

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'web.hook'
  routes:
    - matchers:
        - severity="page"
      receiver: web.hook
    - matchers:
        - severity="email"
      receiver: smtp

receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'
  - name: 'smtp'
    email_configs:
      - to: 'nobody@nowhere.org'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

tracing:
  client_type: "grpc"
  endpoint: 'localhost:4317'
  sampling_fraction: 1.0
  insecure: true

Then I ran:

$ ./prometheus --config.file=prom-alert-tracing-test.yml --web.listen-address=127.0.0.1:9990

Here's prom-alert-tracing-test.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 127.0.0.1:9093

rule_files:
  - prom-alert-tracing-rules.yml

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9990"]

tracing:
  client_type: "grpc"
  endpoint: 'localhost:4317'
  sampling_fraction: 1.0
  insecure: true

And here's the referenced rules file, prom-alert-tracing-rules.yml:

groups:
  - name: alerts
    rules:
      - alert: AlwaysFiring
        expr: '1'
        labels:
          severity: page
      - alert: AlsoAlwaysFiring
        expr: '1'
        labels:
          severity: email

For the … Let me know if this makes sense, otherwise let's set up some time tomorrow or next week to pair on this 😉
Signed-off-by: Dave Henderson <dhenderson@gmail.com>
Signed-off-by: Dave Henderson <dhenderson@gmail.com>
IIRC this is a flaky test, re-running should fix it
go disp.Run()
go inhibitor.Run()

err = tracingManager.ApplyConfig(conf)
I'm just documenting this here as I'm not so familiar with this code, this is not feedback or a request for change. This code here is run when the Coordinator, which is responsible for reloading configurations at runtime, calls the event handlers registered via the Subscribe method. This specific function is responsible for reading the configuration and setting up the receivers, routes, inhibition rules, time intervals, inhibitor, silences, etc.
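For reference, the hook is registered roughly like this in cmd/alertmanager/main.go (a simplified sketch - the real callback also rebuilds the receivers, routes, inhibitor, silences, and so on):

configCoordinator.Subscribe(func(conf *config.Config) error {
	// ... rebuild the notification pipeline from conf ...

	// The PR applies the tracing configuration as part of the same hook.
	if err := tracingManager.ApplyConfig(conf); err != nil {
		return fmt.Errorf("failed to apply tracing config: %w", err)
	}
	return nil
})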
go disp.Run()
go inhibitor.Run()

err = tracingManager.ApplyConfig(conf)
Suggested change:
-	err = tracingManager.ApplyConfig(conf)
+	if err = tracingManager.ApplyConfig(conf); err != nil {
 		return fmt.Errorf("failed to apply tracing config: %w", err)
 	}

 go tracingManager.Run()
This has a goroutine leak because when we reload the configuration, we don't stop the current tracing manager. Instead, we start to accumulate goroutines blocked on <-m.done. I was able to see this in /debug/pprof.
OK so the issue is that when the configuration doesn't change, ApplyConfig returns nil, but we still start another goroutine for Run.
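One possible fix, sketched with a sync.Once so that repeated reloads only apply configuration and never spawn additional Run goroutines (the variable name is illustrative; stopping the old manager on reload, as suggested below, would be an alternative):

var tracingRunOnce sync.Once

// inside the reload hook, after ApplyConfig succeeds:
tracingRunOnce.Do(func() {
	go tracingManager.Run()
})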
case <-term:
	level.Info(logger).Log("msg", "Received SIGTERM, exiting gracefully...")

	// shut down the tracing manager to flush any remaining spans.
I think we also need to shut down on reload?
return json.Marshal(result)
}

// TODO: probably move these into prometheus/common since they're copied from
Seems reasonable to me. Can we do this?
-func (n *Email) Notify(ctx context.Context, as ...*types.Alert) (bool, error) {
+func (n *Email) Notify(ctx context.Context, as ...*types.Alert) (ok bool, err error) {
+	ctx, span := tracer.Start(ctx, "email.Email.Notify", trace.WithAttributes(
+		attribute.Int("alerts", len(as)),
Let's move this attribute up into dispatch.go instead?
it's already here actually: https://github.com/prometheus/alertmanager/pull/3673/files#diff-71b3813922f4c2abc260ff5be5e31cf64eeb31eb3d7272f19f4fa889fc3b27d5R93 so it could just be dropped...
"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

// TODO: maybe move these into prometheus/common?
This looks like good common code to me. Does prometheus/prometheus use this too?
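For context, the usual shape of the otelhttp client-side instrumentation being discussed looks like this (a generic sketch, not the PR's exact wiring; ctx, url and body are assumed to be in scope):

// Wrap the default transport so every outgoing request gets its own span
// and trace headers are propagated to the receiver.
client := &http.Client{
	Transport: otelhttp.NewTransport(http.DefaultTransport),
}
req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, body)
if err != nil {
	return err
}
if _, err := client.Do(req); err != nil {
	return err
}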
}

level.Debug(d.logger).Log("msg", "Received alert", "alert", alert)
// this block is wrapped in a function to make sure that the span
Let's extract all this logic into a method like dispatch or something? This is what I was thinking:
for {
	select {
	case alert, ok := <-it.Next():
		if !ok {
			// Iterator exhausted for some reason.
			if err := it.Err(); err != nil {
				level.Error(d.logger).Log("msg", "Error on alert update", "err", err)
			}
			return
		}
		d.dispatch(alert)
)
defer span.End()

// make a link to this span - we can't make the processAlert
Do we need to be aware of constraints using links here? For example, the aggregation group run method can run for minutes/hours/days/weeks/years. What will happen if the linked trace is older than the retention period?
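For context, attaching a link when the aggregation group's span is started looks roughly like this (the span name and wiring are illustrative, not the PR's exact code):

// Link back to the dispatch span without making it the parent, since the
// aggregation group can outlive the dispatch call by a long time.
link := trace.LinkFromContext(traceCtx)
_, span := tracer.Start(context.Background(), "dispatch.aggrGroup.flush",
	trace.WithLinks(link),
)
defer span.End()

In general a link is just a recorded trace/span ID pair, so if the linked trace has aged out of the backend's retention the link simply can't be followed any more; the new trace itself is unaffected.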
}

// Stop gracefully shuts down the tracer provider and stops the tracing manager.
func (m *Manager) Stop() {
There is a race condition between ApplyConfig and Stop around m.shutdownFunc as one goroutine can be updating the variable while another goroutine can be calling the variable. I think we either need a mutex or some docs to mention that these methods are not thread-safe.
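A minimal sketch of the mutex approach (the field shapes are assumed for illustration, not copied from the PR):

type Manager struct {
	mtx          sync.Mutex
	shutdownFunc func() error
	done         chan struct{}
}

func (m *Manager) ApplyConfig(cfg *config.Config) error {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	// ... build the new tracer provider and swap m.shutdownFunc ...
	return nil
}

// Stop gracefully shuts down the tracer provider and stops the manager.
func (m *Manager) Stop() {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	if m.shutdownFunc != nil {
		_ = m.shutdownFunc()
	}
	close(m.done)
}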
stevesg left a comment
	attribute.Int("alerts", len(alerts)),
))
defer span.End()
Further down here (after ExtractGroupKey()), it would be nice to add the group key to the span. Not sure if the number of truncated alerts is valuable, but you never know.
span.SetAttributes(attribute.String("group_key", groupKey))
span.SetAttributes(attribute.Int("alerts_truncated", numTruncated))
Might be worth considering putting things such as …
@hairyhenderson can you revive this PR and address all the comments?
I assigned this to myself, will rebase and build on top of this in a new PR.
Add tracing support using otel:
- api: extract trace and span IDs from request context and add to alerts
- types: add trace and span IDs to alerts
- dispatch: add distributed tracing support
- notify: add distributed tracing support

This change borrows part of the implementation from prometheus#3673
Fixes prometheus#3670
Signed-off-by: Dave Henderson <dhenderson@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Closing this in favour of #4745
Add tracing support using otel to the following components:
- api: extract trace and span IDs from request context
- provider: mem put
- dispatch: split logic and use better naming
- inhibit: source and target traces, mutes, etc.
- silence: query and mutes
- notify: add distributed tracing support to stages and all http requests

This change borrows part of the implementation from prometheus#3673
Fixes prometheus#3670
Signed-off-by: Dave Henderson <dhenderson@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Add tracing support using otel to the following components:
- api: extract trace and span IDs from request context
- provider: mem put
- dispatch: split logic and use better naming
- inhibit: source and target traces, mutes, etc.
- silence: query, expire, mutes
- notify: add distributed tracing support to stages and all http requests

This change borrows part of the implementation from prometheus#3673
Fixes prometheus#3670
Signed-off-by: Dave Henderson <dhenderson@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Add tracing support using otel to the following components:
- api: extract trace and span IDs from request context
- provider: mem put
- dispatch: split logic and use better naming
- inhibit: source and target traces, mutes, etc.; drop metrics
- silence: query, expire, mutes
- notify: add distributed tracing support to stages and all http requests

Note: inhibitor metrics are dropped since we have tracing now and they are not needed. We have not released any version with these metrics so we can drop them safely, this is not a breaking change.

This change borrows part of the implementation from prometheus#3673
Fixes prometheus#3670
Signed-off-by: Dave Henderson <dhenderson@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Fixes #3670
I've pulled in tracing/tracing.go from prometheus/prometheus to kickstart this, and to provide a familiar (to Prometheus operators) configuration mechanism. It's possible that it'd be worth pushing some of this up into prometheus/common, but I'll defer on that one.

So far, all incoming requests are instrumented, and notifications are instrumented, with webhooks and e-mails getting slightly more detailed instrumentation, both with downward trace propagation.

The decoupled nature of how alerts are handled within Alertmanager means that there's a bunch of disjointed spans, but I've attempted to rectify some of that by using span links.