Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VA: Fix performRemoteValidation goroutine leak #7727

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

beautifulentropy
Copy link
Member

@beautifulentropy beautifulentropy commented Sep 27, 2024

PerformValidation goroutines write to a buffered results channel to prevent blocking and terminate early when an exit condition is met.

@beautifulentropy beautifulentropy marked this pull request as ready for review September 27, 2024 16:25
@beautifulentropy beautifulentropy requested a review from a team as a code owner September 27, 2024 16:25
@pavelnikolov
Copy link

No unit tests need to be changed?

Copy link
Contributor

@aarongable aarongable left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suppose that the parent gRPC context is cancelled.

Today, the resulting sequence of events is:

  1. the child rva.PerformValidation gRPC detect the cancellation and quit, writing their cancellation errors to the results channel
  2. the for result := range results loop reads errors from the results channel until the errors surpass the threshold
  3. it increments the remoteValidationFailures metric, modifies the first error to indicate it occurred during secondary validation, and returns
  4. the remaining goroutines hang trying to write to the results channel

With this change, the resulting sequence of events becomes racy and non-deterministic, depending on what order the go runtime schedules selects. It might behave like the above (with the exception of leaking the child goroutines), or:

  1. the for/select loop detects the cancellation and immediately quits, returning a different kind of error describing the cancellation and not incrementing the metric

I don't love this non-determinism.

I think a cleaner solution might be simply to make the results channel a buffered channel with capacity len(va.remoteVAs). We know that number is always going to be small -- likely never more than 6, even with the new BRs MPIC requirements -- and the items in it are small because they're all pointers.

This way all the rest of the logic can remain untouched: the child goroutines are guaranteed to be able to write to the results channel and exit, and the loop reading from the results channel can continue to early-return as soon as it gets enough results without having to worry about checking for cancellation itself.

That said, I think creating a new context for the child gRPCs and cancelling it when this function returns is still a good idea, as it will cause them to bail out faster.

va/va.go Outdated Show resolved Hide resolved
Comment on lines +483 to +486
}:
case <-ctx.Done():
// Context canceled, exit the goroutine.
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend removing this branch. Now that the results chan is buffered to the full length of possible outputs, writing to it (on line 479) is guaranteed not to block. Putting a case <-ctx.Done() in doesn't change anything.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was added to facilitate early termination of the goroutines, not prevent blocking on the channel.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I'm saying the goroutine will terminate at the same time even without the case <-ctx.Done(), because the only other thing that's happening here is the channel write, which doesn't block.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants