tapgarden: fix races and deadlocks in caretaker #693

jharveyb · 2023-11-21T14:28:36Z

Fixes #360 and #562.

I found a few problems in the caretaker's cancellation handling and error reporting, which manifested as an intermittent deadlock in the caretaker when running the minter tests under the race detector (the unit-race target). The other test that revealed issues deliberately causes failures in the caretaker as part of testing fee estimation, but that itest will be in a separate PR.

I verified these fixes locally by running 100 iterations of TestBatchAssetIssuance under standard conditions / with make unit, and under the race detector / with make unit-race.

One possible TODO is adding extra coverage for when the caretaker is cancelled, but I think the existing unit tests already cover the main partitions of before and after BatchStateBroadcast.

GeorgeTsagk

looks good! got one Q on the batch lifecycle & handling of pending seedlings

tapgarden/planter.go

jharveyb · 2023-11-22T18:22:36Z

From a peek at the coverage data, current unit tests don't cover:

Calling the exposed FinalizeBatch() function (the tests trigger the batch ticker directly instead)
A caretaker error after a planter request for batch finalization
An error from conf registration
An empty confEvent

These are addressed in the existing test MintingCancelFinalize:

cancel request while waiting for confirmation
cancel request before forwarding confirmation
cancel request before processing confirmation

Those cancellation requests are rejected early because we only wait for confirmation after a TX has been broadcast, and we reject cancellation requests if the batch is in any state after Committed.

jharveyb · 2023-12-01T19:04:04Z

Ready for review. Coverage is up 1.6% for the tapgarden package and 6.5% for the planter. Ran 100 iterations under the race detector without issue.

guggero

Very nice! Just a couple of nits, otherwise LGTM 🎉

tapgarden/planter.go

make/testing_flags.mk

tapgarden/caretaker.go

tapgarden/planter_test.go

dstadulis · 2023-12-07T16:36:41Z

@ffranr is starting review to unblock the merge queue

ffranr

I can imagine that this wasn't easy to debug! Nice work.

tapgarden/caretaker.go

In this commit, we ensure that the planter clears the pending batch if batch finalization fails. This allows users to create a new batch and resubmit the assets from the failed batch, and ensures that caretakers are destroyed after failure.

In this commit, we remove an extra error broadcast during caretaker cancellation that could prevent graceful shutdown. If the caretaker state machine has not reached BatchStateBroadcast, it sends potential errors to the planter on a channel with capacity of one. If cancellation is requested before reaching BatchStateBroadcast and fails internally, sending that error to the planter prevents an error from being sent by the main caretaker goroutine. We also unify cancel request handling.

In this commit, we update the TX confirmation logic to continue after a failed batch cancellation. If the caretaker state machine has already reached BatchStateBroadcast, batch cancellation should fail, but we could still handle TX confirmation and complete asset minting. This fixes the flaky deadlock in the minter unit tests.

In this commit, we update the caretaker start logic to remove an unnecessary batch write. Before we start the caretaker, we write the batch with the Frozen state, but we don't update the in-memory pending batch to move from the Pending to Frozen state. This causes the caretaker to write the batch again on start. We can address this by updating the in-memory batch after a successful batch freeze.
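The redundant write can be illustrated with a toy model that counts store writes. All types and names here (`batchStore`, `planter`, `writesOnStart`) are hypothetical; the assumption is that the caretaker re-writes the Frozen state on start whenever the in-memory batch still reads as Pending.

```go
package main

import "fmt"

// BatchState is a hypothetical two-state model of the batch lifecycle.
type BatchState uint8

const (
	BatchStatePending BatchState = iota
	BatchStateFrozen
)

// batchStore counts writes so the redundant one is observable.
type batchStore struct {
	writes int
	last   BatchState
}

func (s *batchStore) commitState(st BatchState) {
	s.writes++
	s.last = st
}

type planter struct {
	store        *batchStore
	pendingState BatchState // in-memory view of the pending batch
}

// freeze persists the Frozen state and optionally syncs the in-memory batch.
func (p *planter) freeze(syncInMemory bool) {
	p.store.commitState(BatchStateFrozen)
	if syncInMemory {
		p.pendingState = BatchStateFrozen
	}
}

// startCaretaker writes the batch again if it still looks Pending in memory.
func (p *planter) startCaretaker() {
	if p.pendingState == BatchStatePending {
		p.store.commitState(BatchStateFrozen)
		p.pendingState = BatchStateFrozen
	}
}

func writesOnStart(syncInMemory bool) int {
	p := &planter{store: &batchStore{}}
	p.freeze(syncInMemory)
	p.startCaretaker()
	return p.store.writes
}

func main() {
	fmt.Println(writesOnStart(false), writesOnStart(true)) // prints "2 1"
}
```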

In this commit, we update the TX confirmation handling logic to stop the caretaker if confirmation registration fails. At that point, the caretaker cannot successfully receive a confirmation for the broadcast batch, so it should shut down to allow caretaker restart for the same batch. Note that the planter will not actually delete the stopped caretaker, as the error is not sent on BroadcastErrChan.

In this commit, we update the mock ChainBridge to allow for certain calls to fail, including fee estimation, confirmation registration, and non-empty confirmation responses.

In this commit, we redefine the batch state reported in cancelResp to be a bool instead of an actual batch state. The provided batch state was not being used by any callers of Cancel(), including the planter.

In this commit, we add a new test for the minter to ensure that batch finalization errors are handled gracefully, including before and after TX broadcast.

dstadulis added this to the v0.3.2 milestone Nov 21, 2023

jharveyb modified the milestones: v0.3.2, v0.4 Nov 21, 2023

jharveyb self-assigned this Nov 21, 2023

GeorgeTsagk reviewed Nov 22, 2023

tapgarden/planter.go (outdated, resolved)

jharveyb marked this pull request as draft November 22, 2023 18:29

jharveyb mentioned this pull request Nov 22, 2023

multi: anchor fee test coverage #605

Merged

jharveyb force-pushed the caretaker_stop_fixes branch from 5ae0a56 to 83d5141 Compare November 22, 2023 18:49

This was referenced Nov 28, 2023

Race from batch cancellation #562

Closed

[bug]: Planter cannot reply to requests while caretaker is finalizing a batch #705

Open

jharveyb force-pushed the caretaker_stop_fixes branch 3 times, most recently from 108861f to 93ca166 Compare December 1, 2023 18:59

jharveyb marked this pull request as ready for review December 1, 2023 19:04

jharveyb requested review from guggero, ffranr and GeorgeTsagk December 1, 2023 19:04

guggero approved these changes Dec 4, 2023

tapgarden/planter.go (resolved)

tapgarden/planter.go (outdated, resolved)

make/testing_flags.mk (outdated, resolved)

tapgarden/caretaker.go (resolved)

tapgarden/planter_test.go (resolved)

jharveyb force-pushed the caretaker_stop_fixes branch 2 times, most recently from c19f2d6 to 2b6900c Compare December 6, 2023 18:37

ffranr approved these changes Dec 7, 2023

tapgarden/caretaker.go (outdated, resolved)

tapgarden/caretaker.go (outdated, resolved)

jharveyb added 6 commits December 8, 2023 11:40

tapgarden: stop and delete caretaker on err

d510039

make: enable per-package race checking

24b6986

jharveyb added 4 commits December 8, 2023 14:08

tapgarden: enable fee estimate and TX conf failure

47568ee

tapgarden: drop batch state from CancelResp

d8d5087

fn: add Last helper function

e507890

tapgarden: test err handling in FinalizeBatch

05b82da

jharveyb force-pushed the caretaker_stop_fixes branch from 2b6900c to 05b82da Compare December 8, 2023 19:15

jharveyb added this pull request to the merge queue Dec 8, 2023

Merged via the queue into main with commit 2335031 Dec 8, 2023
14 checks passed

guggero deleted the caretaker_stop_fixes branch January 8, 2024 14:45
