eventbus: log a warning if an event channel is full (and continue to block) #2361

Closed
MarcoPolo opened this issue Jun 13, 2023 · 11 comments · Fixed by #3031

Comments

@MarcoPolo
Collaborator

We had another case pop up where it wasn't obvious that an event channel from the eventbus was full and things were getting blocked. We could be nice here and log something for the user, at the cost of an extra timer and select. I'm thinking of something like this:

Before

sink.ch <- evt

After

timer.Reset(time.Second)
select {
case <-timer.C:
	log.Errorf("Event to %s is being delayed because the event channel is full (queue length %d). Are you doing blocking work? If so, spawn a goroutine.", sink.name, len(sink.ch))
	sink.ch <- evt // still block
case sink.ch <- evt:
}

(plus some extra logic for cleaning up and resetting the timer; ideally the timer would live on the *node so it can be reused)
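For illustration, here is a self-contained sketch of that pattern with the timer kept on the emitter and stopped/drained after each send. The names (emitter, slowSink, warnDelay) are made up for this sketch and are not the actual eventbus internals:

package main

import (
	"log"
	"time"
)

// slowSink is a hypothetical stand-in for an eventbus subscriber sink.
type slowSink struct {
	name string
	ch   chan interface{}
}

// emitter keeps one reusable timer so nothing is allocated per event.
type emitter struct {
	timer *time.Timer
}

func newEmitter() *emitter {
	t := time.NewTimer(0)
	if !t.Stop() {
		<-t.C // drain so the timer starts out stopped and empty
	}
	return &emitter{timer: t}
}

// send delivers evt, logging a warning if the subscriber is slow to drain its channel.
func (e *emitter) send(sink *slowSink, evt interface{}) {
	const warnDelay = time.Second
	e.timer.Reset(warnDelay)
	select {
	case sink.ch <- evt:
		// Fast path: stop the timer, draining it if it already fired,
		// so the next Reset starts from a clean state.
		if !e.timer.Stop() {
			<-e.timer.C
		}
	case <-e.timer.C:
		log.Printf("event to %s is delayed: channel full (len %d); still blocking", sink.name, len(sink.ch))
		sink.ch <- evt // keep the original blocking semantics
	}
}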

@marten-seemann
Contributor

Setting a timer is quite expensive. We discussed this a few months ago when debugging a similar case, and decided that event bus metrics are the cleaner solution.

Maybe the event bus shouldn’t be async? That would be a pretty big design change, but it would allow us to measure the time it takes to consume an event (subtracting two timestamps is cheap compared to arming a timer).
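For concreteness, a toy sketch of what that synchronous shape could look like; this is not the go-libp2p eventbus API, and bus, handlers and slowHandlerThreshold are invented names:

package main

import (
	"log"
	"time"
)

// slowHandlerThreshold is a hypothetical cutoff for flagging slow subscribers.
const slowHandlerThreshold = 100 * time.Millisecond

// bus is a toy synchronous event bus: Emit calls each handler inline.
type bus struct {
	handlers []func(evt interface{})
}

func (b *bus) Emit(evt interface{}) {
	for _, h := range b.handlers {
		start := time.Now()
		h(evt) // a blocking handler blocks Emit, but now the cost is measurable
		if d := time.Since(start); d > slowHandlerThreshold {
			log.Printf("handler took %s to consume %T; consider moving heavy work to a goroutine", d, evt)
		}
	}
}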

@Wondertan
Contributor

#2026

@marten-seemann marten-seemann changed the title Log a warning if an event channel is full (and continue to block) eventbus: log a warning if an event channel is full (and continue to block) Jun 14, 2023
@marten-seemann marten-seemann moved this to Discuss in go-libp2p Jun 14, 2023
@Wondertan
Contributor

Wondertan commented Jun 14, 2023

Btw, with the solution above the timer is only allocated once the subscription overflows, which won't happen in an app that reads notifications correctly, so I don't think there is a performance penalty for the timer when there are no bugs on the app/protocol side.

@marten-seemann
Contributor

How so? You always need to set the timer, because you don't know in advance whether the send on the channel will block.

Really, this is a problem that should be solved by monitoring. We already have a Grafana dashboard that immediately shows you the source of the problem, including line number. This sounds like an excellent use case for Grafana alerting.

@Wondertan
Contributor

A blocked send on the channel can be detected with a default case, so the timer only has to be armed on that slow path.
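In other words, something along these lines (a minimal sketch, not the actual eventbus code; sink, evt and timer are placeholders):

// Fast path: non-blocking send, no timer involved.
select {
case sink.ch <- evt:
default:
	// Slow path: the channel is full, so only now arm the timer.
	timer.Reset(time.Second)
	select {
	case sink.ch <- evt:
		if !timer.Stop() {
			<-timer.C // drain so the timer can be reused
		}
	case <-timer.C:
		// log a warning here, then keep the original blocking behaviour
		sink.ch <- evt
	}
}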

@sukunrt
Member

sukunrt commented Jun 15, 2023

@MarcoPolo What more info would the log line provide over the libp2p_eventbus_subscriber_queue_length metric?

@Wondertan
Contributor

Not all nodes run with metrics enabled, while every node runs with logs on. And there can be a case where the Kubo node hosted by PL with metrics enabled doesn't see any issues, while some random node operator experiences the problem and has no data to include in a bug report. The point is that these two are not exclusive but complementary.

@MarcoPolo
Collaborator Author

These are not mutually exclusive. I think we can do something nicer here than asking every user to set up a Prometheus and Grafana instance, especially since this has bitten multiple people who are familiar with the stack.

We can avoid the timer by tracking how long it takes to place an event on the channel and logging an error if that exceeds a threshold. That won't protect against a consumer that never returns, but at least it would highlight a slow consumer.
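A minimal sketch of that timestamp-based approach; subscriber and slowConsumerThreshold are made-up names, not the actual eventbus code:

package main

import (
	"log"
	"time"
)

// slowConsumerThreshold is a hypothetical cutoff above which we consider the
// subscriber too slow and emit a warning.
const slowConsumerThreshold = time.Second

// subscriber is a hypothetical stand-in for an eventbus subscriber sink.
type subscriber struct {
	name string
	ch   chan interface{}
}

// send blocks until the event is delivered, then checks how long delivery took.
// Note the limitation mentioned above: if the consumer never drains the channel,
// this send never returns and the warning is never logged.
func send(sub *subscriber, evt interface{}) {
	start := time.Now()
	sub.ch <- evt
	if d := time.Since(start); d > slowConsumerThreshold {
		log.Printf("event to %s took %s to enqueue because the channel was full (len %d); is the subscriber doing blocking work?", sub.name, d, len(sub.ch))
	}
}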

@sukunrt
Member

sukunrt commented Jun 19, 2023

@marten-seemann when we say resetting a timer is expensive, do you mean anything other than the CPU time taken?
These benchmarks take a similar amount of time for me:

// Assumed package name and imports:
package eventbus_test

import (
	"testing"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

func BenchmarkTimerReset(b *testing.B) {
	var durations []time.Duration
	for i := 0; i < 100; i++ {
		durations = append(durations, time.Hour+time.Duration(i)*time.Minute)
	}
	t := time.NewTimer(100 * time.Millisecond)
	for i := 0; i < b.N; i++ {
		t.Reset(durations[i%len(durations)])
	}
}

func BenchmarkCounterVecIncrement(b *testing.B) {
	counter := prometheus.NewCounterVec(prometheus.CounterOpts{
		Namespace: "lib",
		Name:      "counter",
	},
		[]string{"label1"},
	)
	label := []string{"hello"}
	for i := 0; i < b.N; i++ {
		counter.WithLabelValues(label...).Inc()
	}
}

Both of them take about 30 ns on an M1 Mac and 40-50 ns on an old Core i7 (i7-8550U @ 1.80 GHz).

Looking at these numbers, I'm fine with a pattern like this, though it is rather ugly.

select {
case sendch <- evt:
default:
	t.Reset(time.Second)
	select {
	case <-t.C:
		log.Warnf("queue full for %s", name)
		sendch <- evt
	case sendch <- evt:
	}
}
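One detail this sketch glosses over: if the inner send wins the race, the timer is left armed, so a reused timer would need a Stop-and-drain (if !t.Stop() { <-t.C }) before the next Reset, as in the earlier sketch.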

@marten-seemann
Contributor

There's a bigger difference when you reset the timer to an earlier instead of a later instant, since that will result in a syscall.

The thing I'm worried about is the underlying API question. If we built an API that's so easy to misuse that it requires metrics AND logging, maybe there's something wrong with the API. I suggested switching to a sync API earlier, but of course you can always block a sync callback as well. That applies to any API that uses callbacks.

@marten-seemann
Contributor

#2383 will make it possible to spin up a local Grafana instance (with all our dashboards pre-installed!) by just running docker-compose up in the dashboards directory. Can't get any simpler than that!
