Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

eventbus: add metrics #2038

Merged
merged 18 commits into from
Feb 4, 2023
Merged

Conversation

sukunrt
Copy link
Member

@sukunrt sukunrt commented Feb 1, 2023

Related to: #2020

@sukunrt
Copy link
Member Author

sukunrt commented Feb 1, 2023

You'll need to change

"datasource": {
                                "type": "prometheus",
                                "uid": "aMik1H0Vz"
                        },
                        

datasource.uid to your prometheus datasource uid.

@sukunrt sukunrt mentioned this pull request Feb 1, 2023
@sukunrt sukunrt marked this pull request as ready for review February 2, 2023 14:11
Comment on lines 40 to 54
subscriberQueueLength = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Namespace: metricNamespace,
Name: "subscriber_queue_length",
Help: "Subscriber queue length",
Buckets: prometheus.ExponentialBuckets(1.0, 2.0, 10),
},
[]string{"subscriber_name"},
)
subscriberQueueFull = prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Namespace: metricNamespace,
Name: "subscriber_queue_full",
Help: "Subscriber Queue completely full",
},
[]string{"subscriber_name"},
)
subscriberEventQueued = prometheus.NewCounterVec(
prometheus.CounterOpts{
Namespace: metricNamespace,
Name: "subscriber_event_queued",
Help: "Event Queued for subscriber",
},
[]string{"subscriber_name"},
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to avoid having subscribers pass in a name. Can we just make this per event?

Copy link
Member Author

@sukunrt sukunrt Feb 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the queues are per subscriber and not per event.
We can remove the subscriber metrics and just track whether for a given event type one of the subscriber queues was full. Should I do that for now?

Out of curiosity, what's the issue with subscribers passing in a name?

@@ -57,12 +67,22 @@ func NewBus() event.Bus {
}
}

func NewBusWithMetrics() event.Bus {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works totally fine, but we usually use the option pattern:

type Option func(*basicBus) error

func NewBus(opts ...Option) event.Bus

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will fix. This would also be backwards compatible. nice :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed. Not returning error for now since there's only one option that can't error.

@p-shahi p-shahi changed the title add metrics to eventbus eventbus: add metrics Feb 2, 2023
subscriberEventQueued.WithLabelValues(name).Inc()
}

func stripPrefix(s string) string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

@sukunrt
Copy link
Member Author

sukunrt commented Feb 3, 2023

@marten-seemann Addressed review comments. Removed custom passed in names from all subscribers for now. they are set to the filename. If you want I'll add them back.

notificationTime = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Namespace: metricNamespace,
Name: "notification_time_microseconds",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a convention in Prometheus to use SI units: https://prometheus.io/docs/practices/naming/. So this should be notification_time_seconds.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this a bit more, this measurement doesn't seem as useful any more:

  • If all channels don't block, this takes basically no time at all, we're just adding an event to every channel.
  • If a channel blocks because the subscriber is not reading from the channel, it will block indefinitely, and then this event won't be emitted.

I suggest remote it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

numSubscribers = prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Namespace: metricNamespace,
Name: "num_subscribers",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Name: "num_subscribers",
Name: "subscribers_total",

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Grafana, this would look pretty cool as a Gauge:
image

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

"uid": "${DS_PROMETHEUS}"
},
"editorMode": "builder",
"expr": "sum by(event) (rate(libp2p_eventbus_events_emitted_total[1m]))",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to use the $__rate_interval variable instead of hardcoding a value. Also, I don't think you need the sum. This configuration should work:
image

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed hardcoding

@sukunrt
Copy link
Member Author

sukunrt commented Feb 3, 2023

@marten-seemann addressed review comments.

@sukunrt
Copy link
Member Author

sukunrt commented Feb 3, 2023

made subscriber queue a state timeline

var initMetricsOnce sync.Once

func initMetrics() {
prometheus.MustRegister(eventsEmitted, totalSubscribers, subscriberQueueLength, subscriberQueueFull, subscriberEventQueued)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's have this accept a registerer function and not register with the default registerer. Example: https://github.com/libp2p/go-libp2p/blob/36db69d65efc54ee5e028a02c6cefd82dd985f71/p2p/host/resource-manager/obs/stats.go#LL143C6-L143C14.

By default we can still use the default registerer, but there should be a way to override that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed. I'm using

type MetricsTracerOption = func(*metricsTracerSetting)

func NewMetricsTracer(opts ...MetricsTracerOption) MetricsTracer

},
[]string{"event"},
)
subscriberQueueLength = prometheus.NewHistogramVec(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this a histogram? I think this should be a gauge. What do you think was the value in having this be a histogram?

If it's a gauge, the promql query also becomes simpler and correct. I think the current query in the grafana dashboard is wrong. It's showing you the increase in queue length. If you're full, you'll see that number go to 0 since it's not increasing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to look at things like 90%ile queue length. I guess we can ignore that since if the consumer is slow the queue will get full eventually. Having a gauge alerts us to increasing queue length which should be enough.

I didn't realise the current query would go to 0, I thought sum / count would stay constant. I just checked, it does go to 0 😅

@sukunrt sukunrt requested review from MarcoPolo and marten-seemann and removed request for MarcoPolo and marten-seemann February 4, 2023 04:20
Copy link
Contributor

@marten-seemann marten-seemann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added two minor changes to this PR:

  1. Removing the version number from the subscriber event name. This makes it more pleasant to view in Grafana.
  2. A number of minor tweaks to the Grafana dashboard.

I think this is ready to go now. Thanks to @sukunrt for doing all the heavy lifting here!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

eventbus: expose metrics
3 participants