
Replay events during restart to avoid tx missing #211

Merged — yzang2019 merged 1 commit into main on Mar 15, 2024
Conversation

@yzang2019 (Contributor) commented Mar 14, 2024

Describe your changes and provide context

Problem:
This is an edge case: a node shutdown can be triggered during ApplyBlock, and the ApplyBlock function consists of a few major steps:

  1. FinalizeBlock
  2. SaveFinalizeBlockResponses
  3. blockExec.Commit
  4. blockExec.store.Save
  5. fireEvents -> eventbus service -> subscriber -> index tx events

The process can go down during any of these five steps. If it goes down between steps 4 and 5, the events are never fired and the txs are not indexed correctly, even though the block was successfully committed. After the node restarts, when we replay blocks, we will not replay or re-fire those events, because there is no block left to be replayed.
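The sketch below illustrates this control flow in Go. The types and method names (`BlockExecutor`, `finalizeBlock`, and so on) are illustrative stand-ins, not the actual Tendermint internals; the point is the crash window between steps 4 and 5.

```go
package sketch

// Illustrative stand-ins for the real types; this shows the control flow only.
type Block struct{ Height int64 }
type FinalizeResponse struct{}

type blockStore struct{}

func (s *blockStore) Save(b *Block, appHash []byte) {}

type BlockExecutor struct{ store *blockStore }

func (ex *BlockExecutor) finalizeBlock(b *Block) (*FinalizeResponse, error) {
	return &FinalizeResponse{}, nil
}
func (ex *BlockExecutor) saveFinalizeBlockResponses(h int64, r *FinalizeResponse) error {
	return nil
}
func (ex *BlockExecutor) commit(b *Block) ([]byte, error)          { return nil, nil }
func (ex *BlockExecutor) fireEvents(b *Block, r *FinalizeResponse) {}

// ApplyBlock mirrors the five steps listed above.
func (ex *BlockExecutor) ApplyBlock(b *Block) error {
	resp, err := ex.finalizeBlock(b) // 1. FinalizeBlock
	if err != nil {
		return err
	}
	if err := ex.saveFinalizeBlockResponses(b.Height, resp); err != nil { // 2.
		return err
	}
	appHash, err := ex.commit(b) // 3. blockExec.Commit
	if err != nil {
		return err
	}
	ex.store.Save(b, appHash) // 4. the block is now durably committed
	// Crash window: if the process dies here, the block is persisted but the
	// events below are never published, so the tx indexer misses them forever.
	ex.fireEvents(b, resp) // 5. async publish to the event bus -> indexer
	return nil
}
```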

Solution:
It is hard to guarantee that events always fire correctly during shutdown, because event publishing is asynchronous: there is no reliable way to make shutdown wait until all events have been published, delivered to subscribers, and processed.

So instead of fixing the shutdown logic, we choose to reindex the events after a node restart, during the replay/recover stage. Hence this PR adds a function to replay the events even when there is no block left to replay, which ensures the events are always published correctly regardless of when the shutdown happens.
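A minimal sketch of the recovery idea, continuing the types from the sketch above. The names `replayMissedEvents`, `loadBlock`, and `loadFinalizeBlockResponses` are hypothetical, not the PR's actual function names, and the idempotency of re-indexing is an assumption here rather than something the PR states.

```go
// Loader stubs standing in for reads of the block store and of the
// FinalizeBlock responses saved in step 2.
func (ex *BlockExecutor) loadBlock(height int64) *Block { return &Block{Height: height} }
func (ex *BlockExecutor) loadFinalizeBlockResponses(height int64) (*FinalizeResponse, error) {
	return &FinalizeResponse{}, nil
}

// replayMissedEvents re-fires events for blocks that were committed (step 4)
// but whose events may never have been published (step 5). It runs during the
// replay/recover stage even when there is no block left to replay, assuming
// that re-indexing the same tx events is idempotent.
func (ex *BlockExecutor) replayMissedEvents(lastIndexedHeight, storeHeight int64) error {
	for h := lastIndexedHeight + 1; h <= storeHeight; h++ {
		resp, err := ex.loadFinalizeBlockResponses(h)
		if err != nil {
			return err // no saved responses: the crash happened before step 2
		}
		ex.fireEvents(ex.loadBlock(h), resp) // step 5, replayed
	}
	return nil
}
```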

Testing performed to validate your change

Tested on an atlantic-2 archive node: we manually added a sleep between each step to reproduce the ungraceful-shutdown bug, and confirmed that this PR does fix the edge case.

@codecov-commenter

Codecov Report

Attention: Patch coverage is 50.00000%, with 10 lines in your changes missing coverage. Please review.

Project coverage is 58.14%. Comparing base (4269298) to head (2e3b5dc).

Additional details and impacted files


@@            Coverage Diff             @@
##             main     #211      +/-   ##
==========================================
+ Coverage   57.92%   58.14%   +0.21%     
==========================================
  Files         249      249              
  Lines       33918    33935      +17     
==========================================
+ Hits        19646    19730      +84     
+ Misses      12710    12637      -73     
- Partials     1562     1568       +6     
Files                          Coverage Δ
internal/state/execution.go    61.57% <50.00%> (ø)
internal/consensus/replay.go   68.70% <50.00%> (-0.98%) ⬇️

... and 18 files with indirect coverage changes

@yzang2019 yzang2019 requested a review from stevenlanders March 14, 2024 10:56
@yzang2019 yzang2019 merged commit 66ac407 into main Mar 15, 2024
22 checks passed
udpatil pushed a commit that referenced this pull request Mar 27, 2024
* reformat logs to use simple concatenation with separators (#207)

* Use write-lock in (*TxPriorityQueue).ReapMax funcs (#209)

ReapMaxBytesMaxGas and ReapMaxTxs funcs in TxPriorityQueue claim
> Transactions returned are not removed from the mempool transaction
> store or indexes.

However, they use a priority queue to accomplish the claim
> Transaction are retrieved in priority order.

This is accomplished by popping all items out of the whole heap, and
then pushing them back in sequentially. A copy of the heap cannot be
obtained otherwise. Both of the mentioned functions use a read-lock
(RLock) when doing this. This results in a potential scenario where
multiple executions of the ReapMax can be started in parallel, and
both would be popping items out of the priority queue.

In practice, this can be abused by executing the `unconfirmed_txs` RPC
call repeatedly. Based on our observations, running it multiple times
per millisecond results in multiple threads picking it up at the same
time. Such a scenario can be obtained via the WebSocket interface, and
spamming `unconfirmed_txs` calls there. The behavior that happens is a
`Panic in WSJSONRPC handler` when a queue item unexpectedly disappears
for `mempool.(*TxPriorityQueue).Swap`.
(`runtime error: index out of range [0] with length 0`)

This can additionally lead to a `CONSENSUS FAILURE!!!` if the race
condition occurs for `internal/consensus.(*State).finalizeCommit`
when it tries to do `mempool.(*TxPriorityQueue).RemoveTx`, but
the ReapMax has already removed all elements from the underlying
heap. (`runtime error: index out of range [-1]`)

This commit switches the lock type to a write-lock (Lock) to ensure
no parallel modifications take place. This commit additionally updates
the tests to allow parallel execution of the func calls in testing,
so as to prevent regressions (in case someone wants to downgrade the
locks without considering the implications of the underlying heap
usage). A sketch of this locking change follows the commit list below.

* Fix root dir for tendermint reindex command (#210)

* Replay events during restart to avoid tx missing (#211)

---------

Co-authored-by: Denys S <150304777+dssei@users.noreply.github.com>
Co-authored-by: Valters Jansons <sigv@users.noreply.github.com>
Co-authored-by: Yiming Zang <50607998+yzang2019@users.noreply.github.com>
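For illustration, here is a minimal Go sketch of the ReapMax locking change described in the write-lock commit above (#209). `TxPriorityQueue` and `ReapMaxTxs` here are simplified stand-ins for the real mempool types; only the RLock-to-Lock switch and the pop-all/push-back traversal are meant to reflect the commit.

```go
package sketch

import (
	"container/heap"
	"sync"
)

// Tx is a stand-in for the mempool transaction type.
type Tx struct{ priority int64 }

// txHeap implements heap.Interface, ordered by descending priority.
type txHeap []*Tx

func (h txHeap) Len() int           { return len(h) }
func (h txHeap) Less(i, j int) bool { return h[i].priority > h[j].priority }
func (h txHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *txHeap) Push(x any)        { *h = append(*h, x.(*Tx)) }
func (h *txHeap) Pop() any {
	old := *h
	n := len(old)
	tx := old[n-1]
	*h = old[:n-1]
	return tx
}

// TxPriorityQueue approximates the structure described in the commit message.
type TxPriorityQueue struct {
	mtx sync.RWMutex
	txs txHeap
}

// ReapMaxTxs returns up to max transactions in priority order without
// removing them from the queue. Traversing a heap in priority order requires
// popping every item and pushing it all back, which mutates the heap, so a
// write lock is needed: under RLock, two concurrent reapers would pop from
// the same heap and corrupt it (the "index out of range" panics above).
func (pq *TxPriorityQueue) ReapMaxTxs(max int) []*Tx {
	pq.mtx.Lock() // was pq.mtx.RLock() before the fix
	defer pq.mtx.Unlock()

	popped := make([]*Tx, 0, len(pq.txs))
	for pq.txs.Len() > 0 {
		popped = append(popped, heap.Pop(&pq.txs).(*Tx))
	}
	for _, tx := range popped { // push everything back: reaping must not remove
		heap.Push(&pq.txs, tx)
	}
	if max >= 0 && len(popped) > max {
		popped = popped[:max]
	}
	return popped
}
```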