-
Notifications
You must be signed in to change notification settings - Fork 124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tapgarden: fix races and deadlocks in caretaker #693
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good! got one Q on the batch lifecycle & handling of pending seedlings
From a peek at the coverage data, current unit tests don't cover:
These are addressed in the existing test
Those cancellation requests are rejected early because we only wait for confirmation after a TX has been broadcast, and we reject cancellation requests if the batch is any state after Committed. |
5ae0a56
to
83d5141
Compare
108861f
to
93ca166
Compare
Ready for review, coverage is up 1.6% for the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice! Just a couple of nits, otherwise LGTM 🎉
c19f2d6
to
2b6900c
Compare
@ffranr is starting review to unblock the merge queue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can imagine that this wasn't easy to debug! Nice work.
In this commit, we ensure that the planter clears the pending batch if batch finalization fails. This allows users to create a new batch and resubmit the assets from the failed batch, and ensures that caretakers are destroyed after failure.
In this commit, we remove an extra error broadcast during caretaker cancellation that could prevent graceful shutdown. If the caretaker state machine has not reached BatchStateBroadcast, it sends potential errors to the planter on a channel with capacity of one. If cancellation is requested before reaching BatchStateBroadcast and fails internally, sending that error to the planter prevents an error from being sent by the main caretaker goroutine. We also unify cancel request handling.
In this commit, we update the TX confirmation logic to continue after a failed batch cancellation. If the caretaker state machine has already reached BatchStateBroadcast, batch cancellation should fail, but we could still handle TX confirmation and complete asset minting. This fixes the flaky deadlock in the minter unit tests.
In this commit, we update the caretaker start logic to remove an unnecessary batch write. Before we start the caretaker, we write the batch with the Frozen state, but we don't update the in-memory pending batch to move from the Pending to Frozen state. This causes the caretaker to write the batch again on start. We can address this by updating the in-memory batch after a successful batch freeze.
In this commit, we update the TX confirmation handling logic to stop the caretaker if confirmation registration fails. At that point, the caretaker cannot successfully receive a confirmation for the broadcast batch, so it should shut down to allow caretaker restart for the same batch. Note that the planter will not actually delete the stopped caretaker, as the error is not sent on BroadcastErrChan.
In this commit, we update the mock ChainBridge to allow for certain calls to fail, including fee estimation, confirmation registration, and non-empty confirmation responses.
In this commit, we redefine the batch state reported in cancelResp to be a bool instead of an actual batch state. The provided batch state was not being used by any callers of Cancel(), including the planter.
In this commit, we add a new test for the minter to ensure that batch finalization errors are handled gracefully, including before and after TX broadcast.
2b6900c
to
05b82da
Compare
Fixes #360 and #562 .
I found a few problems in the caretaker cancellation error reporting and cancellation, which manifested as an intermittent deadlock in the caretaker when running the minter tests under the race detector / with the
unit-race
target. The other test that revealed issues was deliberately causing failures in the caretaker as part of testing fee estimation, but that itest will be in a separate PR.I verified these fixes locally by running 100 iterations of
TestBatchAssetIssuance
under standard conditions / withmake unit
, and under the race detector / withmake unit-race
.One possible TODO is adding extra coverage for when the caretaker is cancelled, but I think the existing unit tests already cover the main partitions of before and after
BatchStateBroadcast
.