
Conversation

@hairyhenderson

Fixes #3670

I've pulled in tracing/tracing.go from prometheus/prometheus to kickstart this, and to provide a familiar (to Prometheus operators) configuration mechanism.

It's possible that it'd be worth pushing some of this up into prometheus/common, but I'll defer on that one.

So far, all incoming requests and all notifications are instrumented, with webhooks and e-mails getting slightly more detailed instrumentation, including downward trace propagation.

The decoupled nature of how alerts are handled within Alertmanager means that there's a bunch of disjointed spans, but I've attempted to rectify some of that by using span links.
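For context, a minimal sketch of the span-link idea using the OTel Go API (illustrative only - the function and tracer names are made up, not the PR's code):

package tracingexample

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/trace"
)

// startLinkedSpan starts a span in the long-lived dispatch context and links it
// to the span that originally received the alert, instead of parenting it,
// since the two lifetimes are decoupled.
func startLinkedSpan(dispatchCtx, receiveCtx context.Context, name string) (context.Context, trace.Span) {
	tracer := otel.Tracer("alertmanager-tracing-example")
	return tracer.Start(dispatchCtx, name, trace.WithLinks(trace.LinkFromContext(receiveCtx)))
}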

@hairyhenderson force-pushed the otel-tracing-support branch 2 times, most recently from 1de5573 to 20b0565 on January 16, 2024 22:21
Signed-off-by: Dave Henderson <dhenderson@gmail.com>
@hairyhenderson
Author

Unclear why CI is failing - I wonder if that test is flaky for reasons unrelated to this change?

@els0r

els0r commented Jan 19, 2024

I've been wondering about support for this for months, and it's awesome to see movement on this front. Much appreciated!

@hairyhenderson
Author

@els0r you're welcome! Let me know if there are any particular things you want traced - this is really just a start.

Collaborator

@grobinson-grafana left a comment

I took a quick look at the code and left some comments. @hairyhenderson, can you show me how to test this next week so I can better understand how it all works?

MuteTimeIntervals []MuteTimeInterval `yaml:"mute_time_intervals,omitempty" json:"mute_time_intervals,omitempty"`
TimeIntervals []TimeInterval `yaml:"time_intervals,omitempty" json:"time_intervals,omitempty"`

TracingConfig TracingConfig `yaml:"tracing,omitempty" json:"tracing,omitempty"`
Collaborator

These might be better put in GlobalConfig?

Author

That was my first instinct actually, but in prometheus/prometheus it's not under global. Should we prefer to keep things under global, or follow the pattern established in prometheus/prometheus?

Collaborator

Hmm. It looks like the Prometheus configuration file is structured a little differently from Alertmanager's: it also has options that change the behavior of the Prometheus server, rather than just containing data like rules. Alertmanager configuration, on the other hand, contains only data, and all behavioral configuration is done via command-line flags. @gotjosh what should we do here?

Collaborator

I read on prometheus-users that if the configuration is expected to be reloadable at runtime it should go in the configuration file, otherwise it should go in command line args.

https://groups.google.com/g/prometheus-users/c/qYKpBiuN8y4/m/9z062yU4CQAJ


func validateHeadersForTracing(headers map[string]string) error {
for header := range headers {
if strings.ToLower(header) == "authorization" {
Collaborator

I think we need to remove whitespace? Looking at RFC 2616:

> Field names are case-insensitive. The field value MAY be preceded by any amount of LWS, though a single SP is preferred.

Author

This is also copied from prometheus/prometheus: https://github.com/prometheus/prometheus/blob/main/config/config.go#L1083

The second quoted sentence refers to the field value, not the name. It's not permissible to have whitespace in names at all (they're just tokens, i.e. one or more non-control/non-separator characters). So even if someone had a configuration with a header name that included whitespace, it would likely be ignored or discarded later on.
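For reference, a hedged completion of the snippet quoted above (following the prometheus/prometheus approach of comparing lowercased header names; the error message is illustrative, not necessarily the upstream wording):

package config

import (
	"errors"
	"strings"
)

// validateHeadersForTracing rejects attempts to override the Authorization
// header in the tracing exporter's extra headers; names are compared
// case-insensitively.
func validateHeadersForTracing(headers map[string]string) error {
	for header := range headers {
		if strings.ToLower(header) == "authorization" {
			return errors.New("custom Authorization header configuration is not yet supported")
		}
	}
	return nil
}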

Collaborator

Ah – I misread! 🙂

func() {
traceCtx, span := tracer.Start(d.ctx, "dispatch.Dispatcher.dispatch",
trace.WithAttributes(
attribute.String("alert.name", alert.Name()),
Collaborator

Compared to the fingerprint, alert.Name() is not used much in Alertmanager. We might be able to omit it here as it's a lot of data, and I think most operators have been fine using the fingerprint in the past.

Author

> We might be able to omit it here as it's a lot of data, and I think most operators have been fine using the fingerprint in the past.

Isn't that what gets sent as the alertname label? Surely that's a lot more human-readable than the fingerprint?

Collaborator

You could have thousands of alerts all with the same alertname. It's just a label, it doesn't have a special meaning. That's where fingerprint is useful, as it is a unique identifier for an alert.

Author

From an operational perspective I think the alert name is still quite valuable... Here's a sample from my test setup:

alert.fingerprint | "2a7a97fbf890646a"
alert.name | "AlwaysFiring"

I wouldn't be able to relate this trace back to my alert configuration without seeing AlwaysFiring...

Collaborator

I think as long as we have both then 👍

trace.WithAttributes(
attribute.String("alert.name", alert.Name()),
attribute.String("alert.fingerprint", alert.Fingerprint().String()),
attribute.String("alert.status", string(alert.Status())),
Collaborator

I think it should be mentioned that the status of an alert can change between the dispatcher and when the aggregation group is flushed. If that will affect continuation of the trace, we should remove it.

Author

> If that will affect continuation of the trace, we should remove it.

Not at all - it's just metadata. But if it can change maybe that's misleading, and if we set it at all we should maybe set it at the end when we have the final status 🤔

Collaborator

It can change because the Status is calculated using time.Now().
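A possible shape of the "set it at the end" idea from above - a sketch with a hypothetical helper, not the PR's code:

package dispatch

import (
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// endWithFinalStatus is a hypothetical helper: it stamps alert.status right
// before the span ends, so the recorded value is the final one rather than an
// earlier snapshot that time.Now()-based recomputation may have invalidated.
// Usage: defer endWithFinalStatus(span, func() string { return string(alert.Status()) })
func endWithFinalStatus(span trace.Span, status func() string) {
	span.SetAttributes(attribute.String("alert.status", status()))
	span.End()
}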

Signed-off-by: Dave Henderson <dhenderson@gmail.com>
@hairyhenderson
Author

@grobinson-grafana

> @hairyhenderson, can you show me how to test this next week so I can better understand how it all works?

sorry for the delay... laptop refresh caused me to lose my test configs 🤦‍♂️

I tested this by setting up a local Grafana, Tempo, and Agent (configured to receive OTLP).

Then I fired up the webhook echo service (go run ./examples/webhook/echo.go - I actually modified that code first to print headers too so I could see the trace headers)
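For anyone reproducing this setup, here's a minimal stand-in for such an echo service (a sketch, not the examples/webhook/echo.go shipped with Alertmanager) that prints both headers and body, so the traceparent/tracestate headers are visible:

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Print all request headers - trace propagation headers show up here.
		for name, values := range r.Header {
			for _, v := range values {
				fmt.Printf("%s: %s\n", name, v)
			}
		}
		// Echo the webhook payload to stdout.
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		fmt.Println(string(body))
	})
	log.Fatal(http.ListenAndServe("127.0.0.1:5001", nil))
}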

I then ran alertmanager like this:

$ ./alertmanager --web.listen-address=127.0.0.1:9093 --log.level=debug --config.file=am-trace-test.yml

Here's am-trace-test.yml:

global:
  smtp_smarthost: 'localhost:2525'
  smtp_from: 'alertmanager@example.org'
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'web.hook'
  routes:
    - matchers:
        - severity="page"
      receiver: web.hook
    - matchers:
        - severity="email"
      receiver: smtp
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'
  - name: 'smtp'
    email_configs:
      - to: 'nobody@nowhere.org'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

tracing:
  client_type: "grpc"
  endpoint: 'localhost:4317'
  sampling_fraction: 1.0
  insecure: true

Then I ran prometheus like this:

$ ./prometheus --config.file=prom-alert-tracing-test.yml --web.listen-address=127.0.0.1:9990

Here's prom-alert-tracing-test.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - 127.0.0.1:9093

rule_files:
  - prom-alert-tracing-rules.yml

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9990"]

tracing:
  client_type: "grpc"
  endpoint: 'localhost:4317'
  sampling_fraction: 1.0
  insecure: true

And here's the referenced prom-alert-tracing-rules.yml:

groups:
  - name: alerts
    rules:
      - alert: AlwaysFiring
        expr: '1'
        labels:
          severity: page
      - alert: AlsoAlwaysFiring
        expr: '1'
        labels:
          severity: email

For the smtp receiver, I also set up an instance of smtprelay and a simple SMTP echo service. I've lost the configs for those, and can whip something up if you need...

Let me know if this makes sense, otherwise let's set up some time tomorrow or next week to pair on this 😉

Signed-off-by: Dave Henderson <dhenderson@gmail.com>
Signed-off-by: Dave Henderson <dhenderson@gmail.com>
@TheMeier
Contributor

IIRC this

--- FAIL: TestResolved (5.38s)
    acceptance.go:182: failed to start alertmanager cluster: unable to get a successful response from the Alertmanager: Get "http://127.0.0.1:40783/api/v2/status": dial tcp 127.0.0.1:40783: connect: connection refused

is a flaky test, re-running should fix it

go disp.Run()
go inhibitor.Run()

err = tracingManager.ApplyConfig(conf)
Collaborator

I'm just documenting this here as I'm not so familiar with this code; this is not feedback or a request for change. This code runs when the Coordinator, which is responsible for reloading configurations at runtime, calls the event handlers registered via the Subscribe method. This specific function reads the configuration and sets up the receivers, routes, inhibition rules, time intervals, inhibitor, silences, etc.

go disp.Run()
go inhibitor.Run()

err = tracingManager.ApplyConfig(conf)
Collaborator

Suggested change:

-	err = tracingManager.ApplyConfig(conf)
+	if err = tracingManager.ApplyConfig(conf); err != nil {
+		return fmt.Errorf("failed to apply tracing config: %w", err)
+	}

go tracingManager.Run()
Collaborator

This has a goroutine leak because when we reload the configuration, we don't stop the current tracing manager. Instead, we start to accumulate goroutines blocked on <-m.done. I was able to see this in /debug/pprof.

Collaborator

OK so the issue is that when the configuration doesn't change, ApplyConfig returns nil, but we still start another goroutine for Run.
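One possible shape of a fix, sketched with stand-in names (not the PR's code): make Run a no-op after the first call, so a reload that does `go tracingManager.Run()` again exits immediately instead of leaving another goroutine blocked on <-m.done.

package tracing

import "sync"

// manager is a stand-in with the fields from the discussion above.
type manager struct {
	mu      sync.Mutex
	running bool
	done    chan struct{}
}

// Run blocks until the manager is stopped, but only for the first caller;
// subsequent calls (e.g. on config reload) return immediately.
func (m *manager) Run() {
	m.mu.Lock()
	alreadyRunning := m.running
	m.running = true
	m.mu.Unlock()
	if alreadyRunning {
		return
	}
	<-m.done
}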

case <-term:
level.Info(logger).Log("msg", "Received SIGTERM, exiting gracefully...")

// shut down the tracing manager to flush any remaining spans.
Collaborator

I think we also need to shut down on reload?


return json.Marshal(result)
}

// TODO: probably move these into prometheus/common since they're copied from
Collaborator

Seems reasonable to me. Can we do this?



-func (n *Email) Notify(ctx context.Context, as ...*types.Alert) (bool, error) {
+func (n *Email) Notify(ctx context.Context, as ...*types.Alert) (ok bool, err error) {
	ctx, span := tracer.Start(ctx, "email.Email.Notify", trace.WithAttributes(
		attribute.Int("alerts", len(as)),
Collaborator

Let's move this attribute up into dispatch.go instead?


"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

// TODO: maybe move these into prometheus/common?
Collaborator

This looks like good common code to me. Does prometheus/prometheus use this too?

}

level.Debug(d.logger).Log("msg", "Received alert", "alert", alert)
// this block is wrapped in a function to make sure that the span
Collaborator

Let's extract all this logic into a method like dispatch or something? This is what I was thinking:

	for {
		select {
		case alert, ok := <-it.Next():
			if !ok {
				// Iterator exhausted for some reason.
				if err := it.Err(); err != nil {
					level.Error(d.logger).Log("msg", "Error on alert update", "err", err)
				}
				return
			}
			d.dispatch(alert)



)
defer span.End()

// make a link to this span - we can't make the processAlert
Collaborator

Do we need to be aware of constraints using links here? For example, the aggregation group run method can run for minutes/hours/days/weeks/years. What will happen if the linked trace is older than the retention period?

}

// Stop gracefully shuts down the tracer provider and stops the tracing manager.
func (m *Manager) Stop() {
Collaborator

There is a race condition between ApplyConfig and Stop around m.shutdownFunc, as one goroutine can be updating the variable while another is calling it. I think we either need a mutex or some docs to mention that these methods are not thread-safe.
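A sketch of the mutex option, with stand-in names mirroring the discussion (illustrative, not the PR's code):

package tracing

import "sync"

type manager struct {
	mu           sync.Mutex
	shutdownFunc func() error
}

// setShutdown is what ApplyConfig would call after building a new provider.
func (m *manager) setShutdown(fn func() error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.shutdownFunc = fn
}

// Stop shuts down the current tracer provider, if any, at most once; the
// mutex guarantees it never races with a concurrent setShutdown.
func (m *manager) Stop() {
	m.mu.Lock()
	fn := m.shutdownFunc
	m.shutdownFunc = nil
	m.mu.Unlock()
	if fn != nil {
		_ = fn()
	}
}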

stevesg added a commit to grafana/prometheus-alertmanager that referenced this pull request Sep 24, 2024
Contributor

@stevesg left a comment

I've been testing this change (via vendoring into Mimir) and it looks good to me, definitely something we'd be interested in having. I grabbed a trace of a webhook (an initial connection):

[screenshot: trace of a webhook notification]

attribute.Int("alerts", len(alerts)),
))
defer span.End()

Contributor

Further down here (after ExtractGroupKey()), it would be nice to add the group key to the span. Not sure if the number of truncated alerts is valuable, but you never know.

	span.SetAttributes(attribute.String("group_key", groupKey))
	span.SetAttributes(attribute.Int("alerts_truncated", numTruncated))

@jmichalek132

> I've been testing this change (via vendoring into Mimir) and it looks good to me, definitely something we'd be interested in having. I grabbed a trace of a webhook (an initial connection):
>
> [screenshot: trace of a webhook notification]

Might be worth considering putting things such as http.dns, http.connect and http.tls as span events instead of individual spans.
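A sketch of what that could look like with the OTel Go API (the event name and attributes are illustrative):

package notify

import (
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// recordDNSDone attaches an event to the existing client span instead of
// creating a separate http.dns child span.
func recordDNSDone(span trace.Span, addr string, err error) {
	attrs := []attribute.KeyValue{attribute.String("peer.address", addr)}
	if err != nil {
		attrs = append(attrs, attribute.String("error", err.Error()))
	}
	span.AddEvent("http.dns.done", trace.WithAttributes(attrs...))
}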

@siavashs
Contributor

@hairyhenderson can you revive this PR and address all the comments?
If not please let me know so we can look into it.

@siavashs
Contributor

I assigned this to myself, will rebase and build on top of this in a new PR.

siavashs pushed a commit to siavashs/alertmanager that referenced this pull request Nov 15, 2025
Add tracing support using otel:
- api: extract trace and span IDs from request context and add to alerts
- types: add trace and span IDs to alerts
- dispatch: add distributed tracing support
- notify: add distributed tracing support

This change borrows part of the implementation from prometheus#3673
Fixes prometheus#3670

Signed-off-by: Dave Henderson <dhenderson@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
@siavashs
Contributor

siavashs commented Nov 15, 2025

Closing this in favour of #4745
I tried to address all your suggestions there.
Credits given to @hairyhenderson for their contribution.
Thank you

@siavashs closed this Nov 15, 2025
siavashs pushed a commit to siavashs/alertmanager that referenced this pull request Nov 21, 2025
Add tracing support using otel to the following components:
- api: extract trace and span IDs from request context
- provider: mem put
- dispatch: split logic and use better naming
- inhibit: source and target traces, mutes, etc.
- silence: query and mutes
- notify: add distributed tracing support to stages and all http requests

This change borrows part of the implementation from prometheus#3673
Fixes prometheus#3670

Signed-off-by: Dave Henderson <dhenderson@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
siavashs pushed a commit to siavashs/alertmanager that referenced this pull request Nov 21, 2025
Add tracing support using otel to the following components:
- api: extract trace and span IDs from request context
- provider: mem put
- dispatch: split logic and use better naming
- inhibit: source and target traces, mutes, etc.
- silence: query, expire, mutes
- notify: add distributed tracing support to stages and all http requests

This change borrows part of the implementation from prometheus#3673
Fixes prometheus#3670

Signed-off-by: Dave Henderson <dhenderson@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
siavashs pushed a commit to siavashs/alertmanager that referenced this pull request Nov 21, 2025
Add tracing support using otel to the following components:
- api: extract trace and span IDs from request context
- provider: mem put
- dispatch: split logic and use better naming
- inhibit: source and target traces, mutes, etc. drop metrics
- silence: query, expire, mutes
- notify: add distributed tracing support to stages and all http requests

Note: inhibitor metrics are dropped since we have tracing now and they
are not needed. We have not released any version with these metrics so
we can drop them safely, this is not a breaking change.

This change borrows part of the implementation from prometheus#3673
Fixes prometheus#3670

Signed-off-by: Dave Henderson <dhenderson@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>