Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tapgarden: fix races and deadlocks in caretaker #693

Merged
merged 10 commits into from
Dec 8, 2023
Merged

Conversation

jharveyb
Copy link
Contributor

@jharveyb jharveyb commented Nov 21, 2023

Fixes #360 and #562 .

I found a few problems in the caretaker cancellation error reporting and cancellation, which manifested as an intermittent deadlock in the caretaker when running the minter tests under the race detector / with the unit-race target. The other test that revealed issues was deliberately causing failures in the caretaker as part of testing fee estimation, but that itest will be in a separate PR.

I verified these fixes locally by running 100 iterations of TestBatchAssetIssuance under standard conditions / with make unit, and under the race detector / with make unit-race.

One possible TODO is adding extra coverage for when the caretaker is cancelled, but I think the existing unit tests already cover the main partitions of before and after BatchStateBroadcast.

@dstadulis dstadulis added this to the v0.3.2 milestone Nov 21, 2023
@jharveyb jharveyb modified the milestones: v0.3.2, v0.4 Nov 21, 2023
@jharveyb jharveyb self-assigned this Nov 21, 2023
Copy link
Member

@GeorgeTsagk GeorgeTsagk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good! got one Q on the batch lifecycle & handling of pending seedlings

tapgarden/planter.go Outdated Show resolved Hide resolved
@jharveyb
Copy link
Contributor Author

jharveyb commented Nov 22, 2023

From a peek at the coverage data, current unit tests don't cover:

  • Using the actual FinalizeBatch() exposed function, and instead trigger the batch ticker directly
  • Caretaker error after a planter request for batch finalization
  • error from conf registration
  • an empty confEvent

These are addressed in the existing test MintingCancelFinalize:

  • cancel request while waiting for confirmation
  • cancel request before forwarding confirmation
  • cancel request before processing confirmation

Those cancellation requests are rejected early because we only wait for confirmation after a TX has been broadcast, and we reject cancellation requests if the batch is any state after Committed.

@jharveyb
Copy link
Contributor Author

jharveyb commented Dec 1, 2023

Ready for review, coverage is up 1.6% for the tapgarden package and 6.5% for the planter. Ran this under 100 iterations of the race detector without issue.

@jharveyb jharveyb marked this pull request as ready for review December 1, 2023 19:04
Copy link
Member

@guggero guggero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! Just a couple of nits, otherwise LGTM 🎉

tapgarden/planter.go Show resolved Hide resolved
tapgarden/planter.go Outdated Show resolved Hide resolved
make/testing_flags.mk Outdated Show resolved Hide resolved
tapgarden/caretaker.go Show resolved Hide resolved
tapgarden/planter_test.go Show resolved Hide resolved
@jharveyb jharveyb force-pushed the caretaker_stop_fixes branch 2 times, most recently from c19f2d6 to 2b6900c Compare December 6, 2023 18:37
@dstadulis
Copy link
Collaborator

@ffranr is starting review to unblock the merge queue

Copy link
Contributor

@ffranr ffranr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can imagine that this wasn't easy to debug! Nice work.

tapgarden/caretaker.go Outdated Show resolved Hide resolved
tapgarden/caretaker.go Outdated Show resolved Hide resolved
In this commit, we ensure that the planter clears the pending batch if
batch finalization fails. This allows users to create a new batch and
resubmit the assets from the failed batch, and ensures that caretakers
are destroyed after failure.
In this commit, we remove an extra error broadcast during caretaker
cancellation that could prevent graceful shutdown. If the caretaker
state machine has not reached BatchStateBroadcast, it sends potential
errors to the planter on a channel with capacity of one. If cancellation
is requested before reaching BatchStateBroadcast and fails internally,
sending that error to the planter prevents an error from being sent by
the main caretaker goroutine. We also unify cancel request handling.
In this commit, we update the TX confirmation logic to continue after a
failed batch cancellation. If the caretaker state machine has already
reached BatchStateBroadcast, batch cancellation should fail, but we
could still handle TX confirmation and complete asset minting. This
fixes the flaky deadlock in the minter unit tests.
In this commit, we update the caretaker start logic to remove an
unnecessary batch write. Before we start the caretaker, we write the
batch with the Frozen state, but we don't update the in-memory pending
batch to move from the Pending to Frozen state. This causes the
caretaker to write the batch again on start. We can address this by
updating the in-memory batch after a successful batch freeze.
In this commit, we update the TX confirmation handling logic to stop the
caretaker if confirmation registration fails. At that point, the
caretaker cannot successfully receive a confirmation for the broadcast
batch, so it should shut down to allow caretaker restart for the same
batch. Note that the planter will not actually delete the stopped
caretaker, as the error is not sent on BroadcastErrChan.
In this commit, we update the mock ChainBridge to allow for certain
calls to fail, including fee estimation, confirmation registration, and
non-empty confirmation responses.
In this commit, we redefine the batch state reported in cancelResp to
be a bool instead of an actual batch state. The provided batch state
was not being used by any callers of Cancel(), including the planter.
In this commit, we add a new test for the minter to ensure that batch
finalization errors are handled gracefully, including before and after
TX broadcast.
@jharveyb jharveyb force-pushed the caretaker_stop_fixes branch from 2b6900c to 05b82da Compare December 8, 2023 19:15
@jharveyb jharveyb added this pull request to the merge queue Dec 8, 2023
Merged via the queue into main with commit 2335031 Dec 8, 2023
14 checks passed
@guggero guggero deleted the caretaker_stop_fixes branch January 8, 2024 14:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

tapgarden: race in minting tests
5 participants