Instrument with OTel tracing #3673
Conversation
Force-pushed 1de5573 to 20b0565
Signed-off-by: Dave Henderson <dhenderson@gmail.com>
Force-pushed 20b0565 to b106234
Unclear why CI's failing - I wonder if that test is unrelated and just flaky?
I've been wondering about support for this for months and it's awesome to see movement on this front. Much appreciated!
@els0r you're welcome! Let me know if there are any particular things you want traced - this is really just a start. |
grobinson-grafana left a comment
I took a quick look at the code and left some comments. @hairyhenderson, can you show me how to test this next week so I can better understand how it all works?
MuteTimeIntervals []MuteTimeInterval `yaml:"mute_time_intervals,omitempty" json:"mute_time_intervals,omitempty"`
TimeIntervals []TimeInterval `yaml:"time_intervals,omitempty" json:"time_intervals,omitempty"`

TracingConfig TracingConfig `yaml:"tracing,omitempty" json:"tracing,omitempty"`
These might be better put in GlobalConfig?
That was my first instinct actually, but in prometheus/prometheus it's not under global. Should we prefer to keep things under global, or follow the pattern established in prometheus/prometheus?
Hmm. It looks like the Prometheus configuration file is structured a little differently from Alertmanager's, as the Prometheus configuration file also has options to change the behavior of the Prometheus server, rather than just containing data like rules. However, the Alertmanager configuration just contains data, and all behavioral configuration is instead done via command line flags. @gotjosh what should we do here?
I read on prometheus-users that if the configuration is expected to be reloadable at runtime it should go in the configuration file, otherwise it should go in command line args.
https://groups.google.com/g/prometheus-users/c/qYKpBiuN8y4/m/9z062yU4CQAJ
func validateHeadersForTracing(headers map[string]string) error {
	for header := range headers {
		if strings.ToLower(header) == "authorization" {
I think we need to remove whitespace? Looking at the RFC 2616:
Field names are case-insensitive. The field value MAY be preceded by any amount of LWS, though a single SP is preferred.
This is also copied from prometheus/prometheus: https://github.com/prometheus/prometheus/blob/main/config/config.go#L1083
The second quoted sentence refers to the field value, not the name. It's not permissible to have whitespace in names at all (they're just tokens, i.e. one or more non-control/non-separator characters). So even if someone had a configuration with a header name that included whitespace, it would likely be ignored or discarded later on.
Ah – I misread! 🙂
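For completeness, a sketch of what the full check could look like, extrapolated from the quoted lines (the error message is an assumption, mirroring the prometheus/prometheus original this was copied from):

func validateHeadersForTracing(headers map[string]string) error {
	for header := range headers {
		if strings.ToLower(header) == "authorization" {
			// Assumed wording - the point is only that the Authorization
			// header is rejected; header names are case-insensitive tokens,
			// so no whitespace trimming is needed.
			return errors.New("custom authorization header configuration is not yet supported")
		}
	}
	return nil
}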
func() {
	traceCtx, span := tracer.Start(d.ctx, "dispatch.Dispatcher.dispatch",
		trace.WithAttributes(
			attribute.String("alert.name", alert.Name()),
Compared to the fingerprint, alert.Name() is not used much in Alertmanager. We might be able to omit it here as it's a lot of data, and I think most operators have been fine using the fingerprint in the past.
We might be able to omit it here as it's a lot of data, and I think most operators have been fine using the fingerprint in the past.
isn't it what gets sent as the alertname label? surely that's a lot more human-readable than the fingerprint?
You could have thousands of alerts all with the same alertname. It's just a label, it doesn't have a special meaning. That's where fingerprint is useful, as it is a unique identifier for an alert.
From an operational perspective I think the alert name is still quite valuable... Here's a sample from my test setup:
alert.fingerprint | "2a7a97fbf890646a"
alert.name | "AlwaysFiring"
I wouldn't be able to relate this trace back to my alert configuration without seeing AlwaysFiring...
I think as long as we have both then 👍
trace.WithAttributes(
	attribute.String("alert.name", alert.Name()),
	attribute.String("alert.fingerprint", alert.Fingerprint().String()),
	attribute.String("alert.status", string(alert.Status())),
I think it should be mentioned that the status of an alert can change between the dispatcher and when the aggregation group is flushed. If that will affect continuation of the trace, we should remove it.
If that will affect continuation of the trace, we should remove it.
Not at all - it's just metadata. But if it can change maybe that's misleading, and if we set it at all we should maybe set it at the end when we have the final status 🤔
It can change because the Status is calculated using time.Now().
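A rough sketch of the "record the status at the end" idea (only the names that appear in the quoted diff are taken from the PR; the rest is illustrative):

// Start the span with the attributes that can't change...
_, span := tracer.Start(d.ctx, "dispatch.Dispatcher.dispatch",
	trace.WithAttributes(
		attribute.String("alert.name", alert.Name()),
		attribute.String("alert.fingerprint", alert.Fingerprint().String()),
	),
)
defer func() {
	// ...and record the status just before the span ends, since Status()
	// is derived from time.Now() and may have changed in the meantime.
	span.SetAttributes(attribute.String("alert.status", string(alert.Status())))
	span.End()
}()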
Signed-off-by: Dave Henderson <dhenderson@gmail.com>
Sorry for the delay... laptop refresh caused me to lose my test configs 🤦♂️

I tested this by setting up a local Grafana, Tempo, and Agent (configured to receive OTLP). Then I fired up the webhook echo service (…). I then ran:

$ ./alertmanager --web.listen-address=127.0.0.1:9093 --log.level=debug --config.file=am-trace-test.yml

Here's am-trace-test.yml:

global:
  smtp_smarthost: 'localhost:2525'
  smtp_from: 'alertmanager@example.org'

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'web.hook'
  routes:
    - matchers:
        - severity="page"
      receiver: web.hook
    - matchers:
        - severity="email"
      receiver: smtp

receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'
  - name: 'smtp'
    email_configs:
      - to: 'nobody@nowhere.org'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

tracing:
  client_type: "grpc"
  endpoint: 'localhost:4317'
  sampling_fraction: 1.0
  insecure: true

Then I ran:

$ ./prometheus --config.file=prom-alert-tracing-test.yml --web.listen-address=127.0.0.1:9990

Here's prom-alert-tracing-test.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 127.0.0.1:9093

rule_files:
  - prom-alert-tracing-rules.yml

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9990"]

tracing:
  client_type: "grpc"
  endpoint: 'localhost:4317'
  sampling_fraction: 1.0
  insecure: true

And here's the referenced rules file, prom-alert-tracing-rules.yml:

groups:
  - name: alerts
    rules:
      - alert: AlwaysFiring
        expr: '1'
        labels:
          severity: page
      - alert: AlsoAlwaysFiring
        expr: '1'
        labels:
          severity: email

For the … Let me know if this makes sense, otherwise let's set up some time tomorrow or next week to pair on this 😉
Signed-off-by: Dave Henderson <dhenderson@gmail.com>
Signed-off-by: Dave Henderson <dhenderson@gmail.com>
IIRC this is a flaky test, re-running should fix it
go disp.Run()
go inhibitor.Run()

err = tracingManager.ApplyConfig(conf)
I'm just documenting this here as I'm not so familiar with this code, this is not feedback or a request for change. This code here is run when the Coordinator, which is responsible for reloading configurations at runtime, calls the event handlers registered via the Subscribe method. This specific function is responsible for reading the configuration and setting up the receivers, routes, inhibition rules, time intervals, inhibitor, silences, etc.
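For reference, the hook is registered roughly like this in cmd/alertmanager/main.go (a simplified sketch - the real callback also rebuilds the receivers, routes, inhibitor, silences, and so on):

configCoordinator.Subscribe(func(conf *config.Config) error {
	// ... rebuild the notification pipeline from conf ...

	// The PR applies the tracing configuration as part of the same hook.
	if err := tracingManager.ApplyConfig(conf); err != nil {
		return fmt.Errorf("failed to apply tracing config: %w", err)
	}
	return nil
})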
go disp.Run()
go inhibitor.Run()

err = tracingManager.ApplyConfig(conf)
Suggested change:
-	err = tracingManager.ApplyConfig(conf)
+	if err = tracingManager.ApplyConfig(conf); err != nil {
 		return fmt.Errorf("failed to apply tracing config: %w", err)
 	}

 go tracingManager.Run()
This has a goroutine leak because when we reload the configuration, we don't stop the current tracing manager. Instead, we start to accumulate goroutines blocked on <-m.done. I was able to see this in /debug/pprof.
OK so the issue is that when the configuration doesn't change, ApplyConfig returns nil, but we still start another goroutine for Run.
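One possible fix, sketched with a sync.Once so that repeated reloads only apply configuration and never spawn additional Run goroutines (the variable name is illustrative; stopping the old manager on reload, as suggested below, would be an alternative):

var tracingRunOnce sync.Once

// inside the reload hook, after ApplyConfig succeeds:
tracingRunOnce.Do(func() {
	go tracingManager.Run()
})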
case <-term:
	level.Info(logger).Log("msg", "Received SIGTERM, exiting gracefully...")

	// shut down the tracing manager to flush any remaining spans.
I think we also need to shut down on reload?
return json.Marshal(result)
}

// TODO: probably move these into prometheus/common since they're copied from
Seems reasonable to me. Can we do this?
-func (n *Email) Notify(ctx context.Context, as ...*types.Alert) (bool, error) {
+func (n *Email) Notify(ctx context.Context, as ...*types.Alert) (ok bool, err error) {
+	ctx, span := tracer.Start(ctx, "email.Email.Notify", trace.WithAttributes(
+		attribute.Int("alerts", len(as)),
Let's move this attribute up into dispatch.go instead?
it's already here actually: https://github.com/prometheus/alertmanager/pull/3673/files#diff-71b3813922f4c2abc260ff5be5e31cf64eeb31eb3d7272f19f4fa889fc3b27d5R93 so it could just be dropped...
"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

// TODO: maybe move these into prometheus/common?
This looks like good common code to me. Does prometheus/prometheus use this too?
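For context, the usual shape of the otelhttp client-side instrumentation being discussed looks like this (a generic sketch, not the PR's exact wiring; ctx, url and body are assumed to be in scope):

// Wrap the default transport so every outgoing request gets its own span
// and trace headers are propagated to the receiver.
client := &http.Client{
	Transport: otelhttp.NewTransport(http.DefaultTransport),
}
req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, body)
if err != nil {
	return err
}
if _, err := client.Do(req); err != nil {
	return err
}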
}

level.Debug(d.logger).Log("msg", "Received alert", "alert", alert)
// this block is wrapped in a function to make sure that the span
Let's extract all this logic into a method like dispatch or something? This is what I was thinking:
for {
	select {
	case alert, ok := <-it.Next():
		if !ok {
			// Iterator exhausted for some reason.
			if err := it.Err(); err != nil {
				level.Error(d.logger).Log("msg", "Error on alert update", "err", err)
			}
			return
		}
		d.dispatch(alert)
)
defer span.End()

// make a link to this span - we can't make the processAlert
Do we need to be aware of constraints using links here? For example, the aggregation group run method can run for minutes/hours/days/weeks/years. What will happen if the linked trace is older than the retention period?
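For context, attaching a link when the aggregation group's span is started looks roughly like this (the span name and wiring are illustrative, not the PR's exact code):

// Link back to the dispatch span without making it the parent, since the
// aggregation group can outlive the dispatch call by a long time.
link := trace.LinkFromContext(traceCtx)
_, span := tracer.Start(context.Background(), "dispatch.aggrGroup.flush",
	trace.WithLinks(link),
)
defer span.End()

In general a link is just a recorded trace/span ID pair, so if the linked trace has aged out of the backend's retention the link simply can't be followed any more; the new trace itself is unaffected.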
}

// Stop gracefully shuts down the tracer provider and stops the tracing manager.
func (m *Manager) Stop() {
There is a race condition between ApplyConfig and Stop around m.shutdownFunc as one goroutine can be updating the variable while another goroutine can be calling the variable. I think we either need a mutex or some docs to mention that these methods are not thread-safe.
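A minimal sketch of the mutex approach (the field shapes are assumed for illustration, not copied from the PR):

type Manager struct {
	mtx          sync.Mutex
	shutdownFunc func() error
	done         chan struct{}
}

func (m *Manager) ApplyConfig(cfg *config.Config) error {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	// ... build the new tracer provider and swap m.shutdownFunc ...
	return nil
}

// Stop gracefully shuts down the tracer provider and stops the manager.
func (m *Manager) Stop() {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	if m.shutdownFunc != nil {
		_ = m.shutdownFunc()
	}
	close(m.done)
}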
stevesg left a comment
	attribute.Int("alerts", len(alerts)),
))
defer span.End()
Further down here (after ExtractGroupKey()), it would be nice to add the group key to the span. Not sure if the number of truncated alerts is valuable, but you never know.
span.SetAttributes(attribute.String("group_key", groupKey))
span.SetAttributes(attribute.Int("alerts_truncated", numTruncated))
Might be worth considering putting things such as …
@hairyhenderson can you revive this PR and address all the comments?
I assigned this to myself, will rebase and build on top of this in a new PR.
Add tracing support using otel:
- api: extract trace and span IDs from request context and add to alerts
- types: add trace and span IDs to alerts
- dispatch: add distributed tracing support
- notify: add distributed tracing support

This change borrows part of the implementation from prometheus#3673
Fixes prometheus#3670
Signed-off-by: Dave Henderson <dhenderson@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Closing this in favour of #4745
Add tracing support using otel to the following components:
- api: extract trace and span IDs from request context
- provider: mem put
- dispatch: split logic and use better naming
- inhibit: source and target traces, mutes, etc.
- silence: query and mutes
- notify: add distributed tracing support to stages and all http requests

This change borrows part of the implementation from prometheus#3673
Fixes prometheus#3670
Signed-off-by: Dave Henderson <dhenderson@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Add tracing support using otel to the following components:
- api: extract trace and span IDs from request context
- provider: mem put
- dispatch: split logic and use better naming
- inhibit: source and target traces, mutes, etc.
- silence: query, expire, mutes
- notify: add distributed tracing support to stages and all http requests

This change borrows part of the implementation from prometheus#3673
Fixes prometheus#3670
Signed-off-by: Dave Henderson <dhenderson@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Add tracing support using otel to the following components:
- api: extract trace and span IDs from request context
- provider: mem put
- dispatch: split logic and use better naming
- inhibit: source and target traces, mutes, etc.; drop metrics
- silence: query, expire, mutes
- notify: add distributed tracing support to stages and all http requests

Note: inhibitor metrics are dropped since we have tracing now and they are not needed. We have not released any version with these metrics so we can drop them safely, this is not a breaking change.

This change borrows part of the implementation from prometheus#3673
Fixes prometheus#3670
Signed-off-by: Dave Henderson <dhenderson@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Fixes #3670
I've pulled in tracing/tracing.go from prometheus/prometheus to kickstart this, and to provide a familiar (to Prometheus operators) configuration mechanism. It's possible that it'd be worth pushing some of this up into prometheus/common, but I'll defer on that one.

So far, all incoming requests are instrumented, and notifications are instrumented, with webhooks and e-mails getting slightly more detailed instrumentation, both with downward trace propagation.

The decoupled nature of how alerts are handled within Alertmanager means that there's a bunch of disjointed spans, but I've attempted to rectify some of that by using span links.