-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
eventbus: add metrics #2038
eventbus: add metrics #2038
Conversation
You'll need to change
datasource.uid to your prometheus datasource uid. |
p2p/host/eventbus/basic_metrics.go
Outdated
subscriberQueueLength = prometheus.NewHistogramVec( | ||
prometheus.HistogramOpts{ | ||
Namespace: metricNamespace, | ||
Name: "subscriber_queue_length", | ||
Help: "Subscriber queue length", | ||
Buckets: prometheus.ExponentialBuckets(1.0, 2.0, 10), | ||
}, | ||
[]string{"subscriber_name"}, | ||
) | ||
subscriberQueueFull = prometheus.NewGaugeVec( | ||
prometheus.GaugeOpts{ | ||
Namespace: metricNamespace, | ||
Name: "subscriber_queue_full", | ||
Help: "Subscriber Queue completely full", | ||
}, | ||
[]string{"subscriber_name"}, | ||
) | ||
subscriberEventQueued = prometheus.NewCounterVec( | ||
prometheus.CounterOpts{ | ||
Namespace: metricNamespace, | ||
Name: "subscriber_event_queued", | ||
Help: "Event Queued for subscriber", | ||
}, | ||
[]string{"subscriber_name"}, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to avoid having subscribers pass in a name. Can we just make this per event?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the queues are per subscriber and not per event.
We can remove the subscriber metrics and just track whether for a given event type one of the subscriber queues was full. Should I do that for now?
Out of curiosity, what's the issue with subscribers passing in a name?
p2p/host/eventbus/basic.go
Outdated
@@ -57,12 +67,22 @@ func NewBus() event.Bus { | |||
} | |||
} | |||
|
|||
func NewBusWithMetrics() event.Bus { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works totally fine, but we usually use the option pattern:
type Option func(*basicBus) error
func NewBus(opts ...Option) event.Bus
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will fix. This would also be backwards compatible. nice :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what we did for the swarm: https://github.com/libp2p/go-libp2p/blob/master/p2p/net/swarm/swarm.go#L74-L79
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed. Not returning error for now since there's only one option that can't error.
p2p/host/eventbus/basic_metrics.go
Outdated
subscriberEventQueued.WithLabelValues(name).Inc() | ||
} | ||
|
||
func stripPrefix(s string) string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
@marten-seemann Addressed review comments. Removed custom passed in names from all subscribers for now. they are set to the filename. If you want I'll add them back. |
p2p/host/eventbus/basic_metrics.go
Outdated
notificationTime = prometheus.NewHistogramVec( | ||
prometheus.HistogramOpts{ | ||
Namespace: metricNamespace, | ||
Name: "notification_time_microseconds", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a convention in Prometheus to use SI units: https://prometheus.io/docs/practices/naming/. So this should be notification_time_seconds
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about this a bit more, this measurement doesn't seem as useful any more:
- If all channels don't block, this takes basically no time at all, we're just adding an event to every channel.
- If a channel blocks because the subscriber is not reading from the channel, it will block indefinitely, and then this event won't be emitted.
I suggest remote it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
p2p/host/eventbus/basic_metrics.go
Outdated
numSubscribers = prometheus.NewGaugeVec( | ||
prometheus.GaugeOpts{ | ||
Namespace: metricNamespace, | ||
Name: "num_subscribers", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Name: "num_subscribers", | |
Name: "subscribers_total", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
"uid": "${DS_PROMETHEUS}" | ||
}, | ||
"editorMode": "builder", | ||
"expr": "sum by(event) (rate(libp2p_eventbus_events_emitted_total[1m]))", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed hardcoding
bae8856
to
96ee6b2
Compare
@marten-seemann addressed review comments. |
made subscriber queue a state timeline |
p2p/host/eventbus/basic_metrics.go
Outdated
var initMetricsOnce sync.Once | ||
|
||
func initMetrics() { | ||
prometheus.MustRegister(eventsEmitted, totalSubscribers, subscriberQueueLength, subscriberQueueFull, subscriberEventQueued) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's have this accept a registerer function and not register with the default registerer. Example: https://github.com/libp2p/go-libp2p/blob/36db69d65efc54ee5e028a02c6cefd82dd985f71/p2p/host/resource-manager/obs/stats.go#LL143C6-L143C14.
By default we can still use the default registerer, but there should be a way to override that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed. I'm using
type MetricsTracerOption = func(*metricsTracerSetting)
func NewMetricsTracer(opts ...MetricsTracerOption) MetricsTracer
p2p/host/eventbus/basic_metrics.go
Outdated
}, | ||
[]string{"event"}, | ||
) | ||
subscriberQueueLength = prometheus.NewHistogramVec( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this a histogram? I think this should be a gauge. What do you think was the value in having this be a histogram?
If it's a gauge, the promql query also becomes simpler and correct. I think the current query in the grafana dashboard is wrong. It's showing you the increase in queue length. If you're full, you'll see that number go to 0 since it's not increasing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to look at things like 90%ile queue length. I guess we can ignore that since if the consumer is slow the queue will get full eventually. Having a gauge alerts us to increasing queue length which should be enough.
I didn't realise the current query would go to 0, I thought sum / count would stay constant. I just checked, it does go to 0 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added two minor changes to this PR:
- Removing the version number from the subscriber event name. This makes it more pleasant to view in Grafana.
- A number of minor tweaks to the Grafana dashboard.
I think this is ready to go now. Thanks to @sukunrt for doing all the heavy lifting here!
Related to: #2020