
stop writing to closed channel panic #6763

Merged

Conversation

hexoscott
Collaborator

No description provided.

@@ -154,7 +154,11 @@ func (rl *RequestList) updateWaiting(val uint32) {
if val == 1 {
// something might be waiting to be notified of the waiting state being ready
for _, c := range rl.waiters {
c <- struct{}{}
// ensure the channel is still open before writing to it
Contributor

Hmm, I did not actually find the place where these channels are closed; I was looking to see whether the closing happens under the same lock...

Collaborator

Channels are thread-safe, but a mutex will not protect against sending to a closed channel. You need to wait until all producers/consumers have shut down before closing (or have the producer own the close).

Collaborator Author

I introduced the waiters here so can confirm they are only closed in the stageLoopIsBusy func in ethbackend.go. This scenario could occur when the wait in stageLoopIsBusy times out after a second returning to the caller and the channel is closed before the loop over the waiters in request_list.go is triggered. I didn't anticipate this originally and tests / hive were happy with the change.... until I ran up a node and spotted the panics.

This seems the cleanest solution. Although I did consider returning a new channel from the call to WaitForWaiting so the channel creation and closing is the sole responsibility of the code in request_list.go. The stageLoopIsBusy func could just wait on that channel being closed as a signal, so no code would actually be writing to the channel in the first place. Any thoughts?

Collaborator

"The stageLoopIsBusy func could just wait on that channel being closed as a signal, so no code would actually be writing to the channel in the first place" - now I understand the general idea.

My thoughts:

  1. "Create and close the channel in the producer" is the general pattern for avoiding the "write to closed channel" problem (currently you close the channel in the consumer).
  2. WaitForWaiting doesn't need to acquire the write mutex at the start of the function, because the mutex protects only the rl.waiters field; the rest of the logic doesn't need it. You could create getter/setter methods for rl.waiters and have only those methods use the mutex. "Lock the mutex and then write to a channel" is a bad pattern, because a channel write takes nondeterministic time, and "hold the mutex for as short as possible" is generally a good idea.
  3. For notifications, buffered channels like make(chan struct{}, 1) are better than make(chan struct{}): the notifier can notify all subscribers without waiting on them. With buffered channels and an at-least-once guarantee, you can also use a non-blocking select with default, because if the channel is already full there is no difference between sending the notification and dropping it; the subscriber will still be notified at least once.
  4. Maybe WaitForWaiting could be implemented as a loop with sync.Cond.Wait? But then it is hard to time out such a wait, and this is actually your main problem: sync.Cond.Wait is not compatible with channels (with select) in Go. So maybe think about not using sync.Cond.

Other:
5. "The stageLoopIsBusy func could just wait ..." - the problem sounds similar to "send a request to a server, wait for the response, but limit the server-side work with a timeout". In Go we solve that with context.WithTimeout(). I don't have clear advice here; it just sounds similar.

Collaborator Author

  1. Thanks for the tip, I always had in mind that the code that creates a channel should be responsible for closing it. But the distinction between producer and consumer makes it clearer. In this case should we just close the channel from the request_list side (producer) as we're just waiting on the signal, and it will always be to a single consumer?

  2. Great idea, I'll get that implemented!

  3. Great advice on using a buffered channel, never considered the select with default pattern there to avoid waiting.

  4. I tried to use sync.Cond when first attempting this, and ended up with deadlocks as the call to Broadcast could happen before the call to Wait, which then stops any processing happening. Also the timeout was really difficult to juggle as you mentioned. I went to a channel based implementation after finding this.

  5. I think normally this would make a lot of sense, just wait on the context. But here we're just checking if some other process is busy or not, we haven't made a request to it, we're just checking its status, and want to make sure we wait a little while before confirming that it is ready or not.

Collaborator

@AskAlexSharov AskAlexSharov Feb 2, 2023

  1. For example, WaitForWaiting can create and return a channel, or a channel + bool (as an optimization for when there is no need to wait).

Collaborator

@AskAlexSharov AskAlexSharov Feb 2, 2023

  1. Or accept a timeout as a parameter and just block inside without returning a channel. In that case the timeout can be a duration or a context object (which can then also be cancelled by Ctrl+C).

@AskAlexSharov
Collaborator

Also, I advise using the uber/atomic package instead of atomic.LoadUint32 if you have a small number of atomic variables, because it's more type-safe (the value can't be modified in a non-atomic way by mistake).
(FYI: go1.19 also has typed atomics, but we keep go1.18 compatibility for now.)

@hexoscott
Collaborator Author

Thanks for the pointers @AlexeyAkhunov & @AskAlexSharov. I'm modifying the code to use uber/atomic now, new code inbound!

@hexoscott hexoscott force-pushed the concurrency-fix-request-list branch 2 times, most recently from edf824e to 8f5fd0d Compare February 2, 2023 08:27
@hexoscott
Collaborator Author

I've pushed another implementation for this that works well locally and hive tests are happily passing as well.

All waiting, and channel creation/closing is now within request_list.go.

There are a number of locks that might seem odd at first, but I need to take a lock on the waiting atomic before changing it, and also when checking its state and creating the waiter, so that we don't end up checking the status only for it to be updated immediately afterwards with no way of notifying. So in this case we take a lock to stop updates happening, add the waiter, then release the lock so the status can be modified.

isWaiting := rl.waiting.Load()
if isWaiting == 1 {
// we are already waiting so just return
rl.waiterMtx.Unlock()
Collaborator

I advise unlocking the mutex with defer; otherwise a panic() may cause a deadlock.

Collaborator

Also: why is a mutex lock needed here? rl.waiting is already atomic.

Collaborator Author

This is the problem I've been facing: if I check the state of waiting, it can change immediately afterwards while the waiter is being created, and the waiter will never be notified. So we end up waiting the full timeout when we could have waited just 5ms or so. The lock here ensures that while the state of rl.waiting is being checked, it can't be updated until we've returned from this function or until the waiter is in place to monitor for the update.

}
}()
rl.waiterMtx.Unlock()
wg.Wait()
Collaborator

Seems you can replace this WaitGroup and goroutine with a select on 2 channels right here.

}
rl.waiters = make([]chan struct{}, 0)
Collaborator

this is not thread-safe

Collaborator Author

The function is only ever called within a locked context, but that isn't all that clear from the code. You mentioned the other day that only public functions should manage mutex locks, to avoid double-locking scenarios. Is that still desired? If not, I can move the locks closer to this code.

Collaborator

then it's ok

@AskAlexSharov AskAlexSharov merged commit 80a37eb into erigontech:devel Feb 6, 2023
@hexoscott hexoscott deleted the concurrency-fix-request-list branch February 6, 2023 11:17
AlexeyAkhunov pushed a commit that referenced this pull request Feb 7, 2023