fix(DeadlockTest): Handle draining of closed channel, speed up test. #1957
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Fixes DGRAPHCORE-153
Problem
We have had frequent problems with the TestPublisherDeadlockTest. Every once in a while, we would witness timeouts on the test itself. I witnessed that in at least one case, the problem could be attributed to a channel being "drained" but having incorrect handling in the case where the channel is closed. While this was not a majority root-cause for this issue, it is nevertheless something we should correctly handle. Additionally, we find that these tests can take quite a while -- most of the time simply waiting for go-routines to finish their time.Sleep(...) invocations. The use of time.Sleep to handle timing issues is typically a sign that the messaging has not been fully thought through.
Solution
For the issue with the "draining" not handling closed channels, the process is simple: we just add a check for the channel already being closed during the drain process. For the issue of the overuse of time.Sleep, and long-running invocations, I replaced the time.Sleep invocations with WaitGroup instances. After these changes, I am unable to prove that the full issue has been resolved, but I find that I have been unable to reproduce the original failures after these changes. Invoking
go test -race -v --count=1000 -run TestPublisherDeadlock
now completes in roughly 80 seconds (for the 1000 tests) with each test taking roughly 0.08 seconds as opposed to roughly 20 seconds (and multiple hours required to run 1000 tests). Moreover, on the 10 or so trials I have made to reproduced the original problem after this change, I get 100% test success.