
stop writing to closed channel panic #6763

Merged

Conversation

hexoscott
Collaborator

No description provided.

@@ -154,7 +154,11 @@ func (rl *RequestList) updateWaiting(val uint32) {
if val == 1 {
// something might be waiting to be notified of the waiting state being ready
for _, c := range rl.waiters {
c <- struct{}{}
// ensure the channel is still open before writing to it
Contributor

Hmm, I did not actually find the place where these channels are closed; I was looking to see whether the closing happens under the same lock...

Collaborator

Channels are thread-safe, but a mutex will not protect against sending to a closed channel. You need to wait until all producers/consumers have shut down before closing (or have the producer own the close).

Collaborator Author

I introduced the waiters here so can confirm they are only closed in the stageLoopIsBusy func in ethbackend.go. This scenario could occur when the wait in stageLoopIsBusy times out after a second returning to the caller and the channel is closed before the loop over the waiters in request_list.go is triggered. I didn't anticipate this originally and tests / hive were happy with the change.... until I ran up a node and spotted the panics.

This seems the cleanest solution. Although I did consider returning a new channel from the call to WaitForWaiting so the channel creation and closing is the sole responsibility of the code in request_list.go. The stageLoopIsBusy func could just wait on that channel being closed as a signal, so no code would actually be writing to the channel in the first place. Any thoughts?

Collaborator

"The stageLoopIsBusy func could just wait on that channel being closed as a signal, so no code would actually be writing to the channel in the first place" - now I understand the general idea.

My thoughts:

  1. "Create and close the channel in the producer" is the general pattern for avoiding the "write to closed channel" problem (currently you close the channel in the consumer).
  2. WaitForWaiting doesn't need to acquire the write mutex at the start of the function, because the mutex protects only the rl.waiters field; the rest of the logic doesn't need it. You could create getter/setter methods for rl.waiters and have only those methods use the mutex. "Lock the mutex and then write to a channel" is a bad pattern, because a channel write takes nondeterministic time, and "hold the mutex for as short as possible" is generally a good idea.
  3. For notifications, buffered channels like make(chan struct{}, 1) are better than make(chan struct{}): the notifier can notify all subscribers without waiting on them. With buffered channels and an at-least-once guarantee, you can also use a non-blocking select with default, because if the channel is already full there is no difference between sending the notification and dropping it; the subscriber will still be notified at least once.
  4. Maybe WaitForWaiting could be implemented as a loop with sync.Cond.Wait? But then it is hard to time out such a wait, and this is actually your main problem: sync.Cond.Wait is not compatible with channels (with select) in Go. So maybe think about not using sync.Cond.

Other:
5. "The stageLoopIsBusy func could just wait ..." - the problem sounds similar to "send a request to a server, wait for the response, but limit the server-side work with a timeout". In Go we solve that with context.WithTimeout(). I don't have clear advice here; it just sounds similar.

Collaborator Author

  1. Thanks for the tip, I always had in mind that the code that creates a channel should be responsible for closing it. But the distinction between producer and consumer makes it clearer. In this case should we just close the channel from the request_list side (producer) as we're just waiting on the signal, and it will always be to a single consumer?

  2. Great idea, I'll get that implemented!

  3. Great advice on using a buffered channel, never considered the select with default pattern there to avoid waiting.

  4. I tried to use sync.Cond when first attempting this, and ended up with deadlocks as the call to Broadcast could happen before the call to Wait, which then stops any processing happening. Also the timeout was really difficult to juggle as you mentioned. I went to a channel based implementation after finding this.

  5. I think normally this would make a lot of sense, just wait on the context. But here we're just checking if some other process is busy or not, we haven't made a request to it, we're just checking its status, and want to make sure we wait a little while before confirming that it is ready or not.

Collaborator

@AskAlexSharov AskAlexSharov Feb 2, 2023

  1. For example, WaitForWaiting can create and return a channel, or a channel + bool (as an optimization for when there is no need to wait).

Collaborator

@AskAlexSharov AskAlexSharov Feb 2, 2023

  1. Or accept a timeout as a parameter and just block inside without returning a channel. In that case the timeout can be a duration or a context object (which can then also be cancelled by Ctrl+C).

@AskAlexSharov
Collaborator

Also, I advise using the uber/atomic package instead of atomic.LoadUint32 if you have a small number of atomic variables, because it's more type-safe (the value can't be modified in a non-atomic way by mistake).
(FYI: go1.19 also has typed atomics, but we keep go1.18 compatibility for now.)

@hexoscott
Collaborator Author

Thanks for the pointers @AlexeyAkhunov & @AskAlexSharov. I'm modifying the code to use uber/atomic now, new code inbound!

@hexoscott hexoscott force-pushed the concurrency-fix-request-list branch 2 times, most recently from edf824e to 8f5fd0d Compare February 2, 2023 08:27
@hexoscott
Collaborator Author

I've pushed another implementation for this that works well locally and hive tests are happily passing as well.

All waiting, and channel creation/closing is now within request_list.go.

There are a number of locks that might seem odd at first, but I need to take a lock on the waiting atomic before changing it, and also when checking its state and creating the waiter, so that we don't end up checking the status only for it to be updated immediately afterwards with no way of notifying. So in this case we take a lock to stop updates happening, add the waiter, then release the lock so the status can be modified.

isWaiting := rl.waiting.Load()
if isWaiting == 1 {
// we are already waiting so just return
rl.waiterMtx.Unlock()
Collaborator

I advise unlocking the mutex with defer; otherwise a panic() may cause a deadlock.

Collaborator

Also: why is a mutex lock needed here? rl.waiting is already atomic.

Collaborator Author

This is the problem I've been facing: if I check the state of waiting, it can change immediately afterwards while the waiter is being created, and the waiter will never be notified. So we end up waiting the full timeout when we could have waited just 5ms or so. The lock here ensures that while the state of rl.waiting is being checked, it can't be updated until we've returned from this function or until the waiter is in place to monitor for the update.

}
}()
rl.waiterMtx.Unlock()
wg.Wait()
Collaborator

Seems you can replace this WaitGroup and goroutine with a select on 2 channels right here.

}
rl.waiters = make([]chan struct{}, 0)
Collaborator

this is not thread-safe

Collaborator Author

The function is only ever called within a locked context, but that isn't all that clear from the code. You mentioned the other day that only public functions should manage mutex locks, to avoid double-locking scenarios. Is that still desired? If not, I can move the locks closer to this code.

Collaborator

then it's ok

@AskAlexSharov AskAlexSharov merged commit 80a37eb into erigontech:devel Feb 6, 2023
@hexoscott hexoscott deleted the concurrency-fix-request-list branch February 6, 2023 11:17
AlexeyAkhunov pushed a commit that referenced this pull request Feb 7, 2023