eventbus: log a warning if an event channel is full (and continue to block) #2361
Setting a timer is quite expensive. We discussed this a few months ago when debugging a similar case, and decided that event bus metrics are the cleaner solution. Maybe the event bus shouldn't be async? That would be a pretty big design change, but it would allow us to measure the time it takes to consume an event (subtracting two timestamps is cheap compared to arming a timer).
Btw, with the solution above, the timer is only allocated once the subscription is overflowed, which won't happen in an app that reads notifications correctly, so I don't think there is a performance penalty for the timer when there are no bugs on the app/protocol side.
How so? You always need to set the timer, because you don't know when sending on the channel will block. Really, this is a problem that should be solved by monitoring. We already have a Grafana dashboard that immediately shows you the source of the problem, including the line number. This sounds like an excellent use case for Grafana alerting.
The blocking send on the channel is identified with
@MarcoPolo What more info would the log line provide over the `libp2p_eventbus_subscriber_queue_length` metric?
Not all nodes are running with metrics enabled, while every node is running with logs on. And there can be a case where the Kubo node hosted by PL with metrics enabled does not see any issues, while some random node runner experiences them and has no data for a bug report. The point is that these two are not exclusive but complementary.
These are not mutually exclusive. I think we can do something nicer here besides asking every user to set up a Prometheus and Grafana instance, especially since this has bitten multiple people who are familiar with the stack. We can avoid the timer by tracking how long it took to place an event on the channel and, if that's over a threshold, logging an error. That won't protect against a consumer that never returns, but at least it would highlight a slow consumer.
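A minimal sketch of that idea, assuming a hypothetical `deliver` helper (the channel, subscriber `name`, and threshold here are illustrative, not the actual eventbus internals):

```go
package main

import (
	"log"
	"time"
)

// deliver blocks until evt is placed on the subscriber's channel (same
// semantics as today), but records how long the send took and logs a warning
// if it exceeded a threshold. A consumer that never reads still blocks this
// forever; the log line only appears once the send eventually completes.
func deliver(sendCh chan<- string, evt string, name string) {
	const slowThreshold = 100 * time.Millisecond // illustrative threshold

	start := time.Now() // two timestamps instead of arming a timer
	sendCh <- evt
	if took := time.Since(start); took > slowThreshold {
		log.Printf("eventbus: subscriber %q is slow, send blocked for %s", name, took)
	}
}

func main() {
	ch := make(chan string) // unbuffered, to force the send to block

	go func() {
		time.Sleep(time.Second) // simulate a slow consumer
		<-ch
	}()

	deliver(ch, "peer identified", "slow-subscriber")
}
```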
@marten-seemann when we say resetting a timer is expensive, do you mean anything other than CPU time taken?
Both of them take about 30ns on an M1 Mac and 40-50ns on an old Core i7 (i7-8550U @ 1.80GHz). Looking at these numbers, I'm fine with a pattern like this, though it is rather ugly:

```go
select {
case sendch <- evt:
default:
	t.Reset(time.Second)
	select {
	case <-t.C:
		log.Warnf("queue full for %s", name)
		sendch <- evt
	case sendch <- evt:
	}
}
```
There's a bigger difference when you reset the timer to an earlier instead of a later instant, since that will result in a syscall. The thing I'm worried about is the underlying API question: if we built an API that's so easy to misuse that it requires metrics AND logging, maybe there's something wrong with the API. I suggested switching to a sync API earlier, but of course you can always block a sync callback as well; that applies to any API that uses callbacks.
#2383 will make it possible to spin up a local Grafana instance (with all our dashboards pre-installed!) by just running |
We had another case pop up where it wasn't obvious that an event channel from the eventbus was full and things were getting blocked. We could be nice here and log something for the user with the tradeoff of paying for an extra timer and select. I'm thinking something like this here.
Before
After
(and some extra logic around cleaning up and resetting the timer; ideally keeping this timer as part of the `*node`)
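For reference, a rough sketch of that shape with a reusable timer kept on the node, assuming the emit path is not called concurrently (the `node` struct, field names, and one-second threshold are illustrative assumptions, not the actual go-libp2p code):

```go
package main

import (
	"log"
	"time"
)

// node keeps one reusable timer so the overflow path doesn't allocate a new
// timer per event. This sketch assumes emit is not called concurrently.
type node struct {
	slowTimer *time.Timer
}

func newNode() *node {
	t := time.NewTimer(time.Hour)
	if !t.Stop() {
		<-t.C // drain so the first Reset starts clean
	}
	return &node{slowTimer: t}
}

// emit tries a non-blocking send first; only if the channel is full does it
// arm the timer, log a warning after a second, and keep blocking until the
// event is finally delivered.
func (n *node) emit(sendCh chan<- interface{}, evt interface{}, name string) {
	select {
	case sendCh <- evt: // fast path: queue has room, timer never touched
		return
	default:
	}

	n.slowTimer.Reset(time.Second)
	defer func() {
		if !n.slowTimer.Stop() {
			select { // timer already fired: drain it for the next Reset
			case <-n.slowTimer.C:
			default:
			}
		}
	}()

	select {
	case <-n.slowTimer.C:
		log.Printf("eventbus: queue for %s is full, blocking until it drains", name)
		sendCh <- evt // continue to block, as the issue title suggests
	case sendCh <- evt:
	}
}

func main() {
	n := newNode()
	ch := make(chan interface{}) // unbuffered: always "full"

	go func() {
		time.Sleep(2 * time.Second) // simulate a stuck consumer
		<-ch
	}()

	n.emit(ch, "event", "identify")
}
```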