eventbus: log a warning if an event channel is full (and continue to block) #2361
Setting a timer is quite expensive. We discussed this a few months ago when debugging a similar case, and decided that event bus metrics are the cleaner solution. Maybe the event bus shouldn't be async? That would be a pretty big design change, but it would allow us to measure the time it takes to consume an event (subtracting two timestamps is cheap compared to arming a timer).
Btw, with the solution above, the timer is only allocated once the subscription is overflowed, which won't happen in an app that reads notifications correctly, so I don't think there is a performance penalty for the timer when there are no bugs on the app/protocol side.
How so? You always need to set the timer, because you don't know when sending on the channel will block. Really, this is a problem that should be solved by monitoring. We already have a Grafana dashboard that immediately shows you the source of the problem, including the line number. This sounds like an excellent use case for Grafana alerting.
The blocking send on the channel is identified with
@MarcoPolo What more info would the log line provide over the `libp2p_eventbus_subscriber_queue_length` metric?
Not all nodes are running with metrics enabled, while every node is running with logs on. And there can be a case where the Kubo node hosted by PL with metrics enabled does not see any issues, while some random node runner experiences them and has no data for a bug report. The point is that these two are not exclusive but complementary.
These are not mutually exclusive. I think we can do something nicer here besides asking every user to set up a Prometheus and Grafana instance, especially since this has bitten multiple people who are familiar with the stack. We can avoid the timer by tracking how long it took to place an event on the channel and, if that's over a threshold, logging an error. That won't protect against a consumer that never returns, but at least it would highlight a slow consumer.
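A minimal sketch of that idea, assuming a hypothetical `deliver` helper (the channel, subscriber `name`, and threshold here are illustrative, not the actual eventbus internals):

```go
package main

import (
	"log"
	"time"
)

// deliver blocks until evt is placed on the subscriber's channel (same
// semantics as today), but records how long the send took and logs a warning
// if it exceeded a threshold. A consumer that never reads still blocks this
// forever; the log line only appears once the send eventually completes.
func deliver(sendCh chan<- string, evt string, name string) {
	const slowThreshold = 100 * time.Millisecond // illustrative threshold

	start := time.Now() // two timestamps instead of arming a timer
	sendCh <- evt
	if took := time.Since(start); took > slowThreshold {
		log.Printf("eventbus: subscriber %q is slow, send blocked for %s", name, took)
	}
}

func main() {
	ch := make(chan string) // unbuffered, to force the send to block

	go func() {
		time.Sleep(time.Second) // simulate a slow consumer
		<-ch
	}()

	deliver(ch, "peer identified", "slow-subscriber")
}
```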
@marten-seemann when we say resetting a timer is expensive, do you mean anything other than CPU time taken?
Both of them take about 30ns on an M1 Mac and 40-50ns on an old Core i7 (i7-8550U @ 1.80GHz). Looking at these numbers, I'm fine with a pattern like this, though it is rather ugly:

```go
select {
case sendch <- evt:
default:
	t.Reset(time.Second)
	select {
	case <-t.C:
		log.Warnf("queue full for %s", name)
		sendch <- evt
	case sendch <- evt:
	}
}
```
There's a bigger difference when you reset the timer to an earlier instead of a later instant, since that will result in a syscall. The thing I'm worried about is the underlying API question: if we built an API that's so easy to misuse that it requires metrics AND logging, maybe there's something wrong with the API. I suggested switching to a sync API earlier, but of course you can always block a sync callback as well; that applies to any API that uses callbacks.
#2383 will make it possible to spin up a local Grafana instance (with all our dashboards pre-installed!) by just running |
We had another case pop up where it wasn't obvious that an event channel from the eventbus was full and things were getting blocked. We could be nice here and log something for the user with the tradeoff of paying for an extra timer and select. I'm thinking something like this here.
Before
After
(and some extra logic around cleaning up and resetting the timer; ideally keeping this timer as part of the `*node`)
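For reference, a rough sketch of that shape with a reusable timer kept on the node, assuming the emit path is not called concurrently (the `node` struct, field names, and one-second threshold are illustrative assumptions, not the actual go-libp2p code):

```go
package main

import (
	"log"
	"time"
)

// node keeps one reusable timer so the overflow path doesn't allocate a new
// timer per event. This sketch assumes emit is not called concurrently.
type node struct {
	slowTimer *time.Timer
}

func newNode() *node {
	t := time.NewTimer(time.Hour)
	if !t.Stop() {
		<-t.C // drain so the first Reset starts clean
	}
	return &node{slowTimer: t}
}

// emit tries a non-blocking send first; only if the channel is full does it
// arm the timer, log a warning after a second, and keep blocking until the
// event is finally delivered.
func (n *node) emit(sendCh chan<- interface{}, evt interface{}, name string) {
	select {
	case sendCh <- evt: // fast path: queue has room, timer never touched
		return
	default:
	}

	n.slowTimer.Reset(time.Second)
	defer func() {
		if !n.slowTimer.Stop() {
			select { // timer already fired: drain it for the next Reset
			case <-n.slowTimer.C:
			default:
			}
		}
	}()

	select {
	case <-n.slowTimer.C:
		log.Printf("eventbus: queue for %s is full, blocking until it drains", name)
		sendCh <- evt // continue to block, as the issue title suggests
	case sendCh <- evt:
	}
}

func main() {
	n := newNode()
	ch := make(chan interface{}) // unbuffered: always "full"

	go func() {
		time.Sleep(2 * time.Second) // simulate a stuck consumer
		<-ch
	}()

	n.emit(ch, "event", "identify")
}
```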