
Conversation


@siavashs siavashs commented Nov 15, 2025

Add tracing support using otel to the following components:

  • api: extract trace and span IDs from request context
  • provider: mem put
  • dispatch: split logic and use better naming
  • inhibit: source and target traces, mutes, etc.; drop metrics
  • silence: query, expire, mutes
  • notify: add distributed tracing support to stages and all http requests

Note: inhibitor metrics are dropped since we have tracing now and they
are not needed. We have not released any version with these metrics so
we can drop them safely, this is not a breaking change.

This change borrows part of the implementation from #3673
Fixes #3670

Signed-off-by: Dave Henderson dhenderson@gmail.com
Signed-off-by: Siavash Safi siavash@cloudflare.com

TODO list

Demo

Alertmanager receiving alerts from a "sender" and tracing them all the way to an Aggregation Group inside the Dispatcher.

Note that the Sender here is a custom Cloudflare proxy which acts as a limiter; it is included as an example to show distributed tracing in action.

Trace of Dispatcher flushing alerts and notifications failing to be sent to PagerDuty



siavashs commented Nov 15, 2025

I'll implement the Prometheus notifier changes in a new draft PR based on prometheus/prometheus#16355.
If that PR gets merged first, it will be much easier to rebase the tracing changes.


@thompson-tomo thompson-tomo left a comment


Some feedback on attribute naming, based on OpenTelemetry exposure in semconv. It might be worthwhile to add an alerting.platform.name attribute to the spans which stores "alertmanager".

Is there interest in having these signals defined in the open telemetry signal registry (semantic conventions)?

config/config.go Outdated
return json.Marshal(result)
}

// TODO: probably move these into prometheus/common since they're copied from
Member


This is probably a good idea, what do you think @ArthurSens?

Contributor Author


I can prepare a PR for common later this week.

Member


I would not be against it :)

@siavashs siavashs force-pushed the feat/tracing branch 4 times, most recently from 14b79bd to 47747ad Compare November 17, 2025 21:16
Comment on lines +92 to +94
trace.WithAttributes(attribute.String("alerting.notify.integration.name", i.name)),
trace.WithAttributes(attribute.Int("alerting.alerts.count", len(alerts))),
trace.WithSpanKind(trace.SpanKindClient),


Is there an easy way to add additional integration details, potentially some from https://opentelemetry.io/docs/specs/semconv/http/http-spans/#http-client-span, in particular server.*? Note: I assume this span represents the outbound call to a notification system, hence the client span kind.

Contributor Author

@siavashs siavashs Nov 18, 2025


otelhttp generates a bunch of spans for each HTTP request, so I think it is redundant to add more info to the parent span, unless we want to disable those and construct only one custom span:


If we are going to also have an HTTP span, then perhaps this one should be internal.

attribute.String("alerting.alert.name", alert.Name()),
attribute.String("alerting.alert.fingerprint", alert.Fingerprint().String()),
),
trace.WithSpanKind(trace.SpanKindConsumer),


check this one

Suggested change
trace.WithSpanKind(trace.SpanKindConsumer),
trace.WithSpanKind(trace.SpanKindInternal),


Missed that one. Are these outgoing calls, as per the test mentioned in #4745 (comment)?

Contributor Author


No, it is this case:

deferred execution (PRODUCER and CONSUMER spans).


Yes, but it still needs to be outgoing based on other information in the spec. Anyway, I have raised open-telemetry/opentelemetry-specification#4758 to get further clarification.

@siavashs siavashs force-pushed the feat/tracing branch 9 times, most recently from c1482c7 to 5882778 Compare November 21, 2025 16:00
@siavashs

CI failures will be fixed after #4761 is merged.

@siavashs siavashs marked this pull request as ready for review November 21, 2025 16:28
@siavashs siavashs requested a review from SuperQ November 21, 2025 16:31

@OGKevin OGKevin left a comment


Some minor ✨comments


Successfully merging this pull request may close these issues.

Feature: Instrument Alertmanager for distributed tracing
