Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

voting: Optimize poll locks #2619

Merged

Conversation

jamescowens
Copy link
Member

@jamescowens jamescowens commented Dec 25, 2022

This PR implements a new locking scheme for the poll registry and related structures/methods. Closes #2618.

Voting in Gridcoin is a high reliability, high integrity subsystem that is built up from polls that are published by wallet holders in accordance with protocol rules for polls, and votes by walletholders. Polls and voting were completely redone in Fern to address integrity issues and further enhanced in major milestones since then.

One thing that hasn't changed since Fern... due to the complexity of tallying the votes and the desire to conserve memory footprint, the results for polls are built dynamically on demand by either the GUI or rpc. The GUI offers a (re)fresh of the entire polling/results based on a single button push, while the rpc splits up the poll listing and poll results into separate rpc functions, as this is more amenable to scripting. During the GUI poll update, if there are a number of polls that have high participation, the GUI can experience lockups for minutes at a time on slow computers (i.e. a slow processor and/or especially a slow disk, such as a traditional spinning HDD, especially if it is connected via USB). This is because while the poll GUI refresh was implemented on a separate thread, a lock was taken on cs_main for the entire update. This lock eventually becomes blocking to the cs_main taken at regular intervals by the core (driven by the net code), and therefore the core and GUI both block until the poll refresh is done and the lock on cs_main by the poll refresh is released. While this is a simple approach to insuring absolute integrity to the poll refresh, because it locks the entire wallet state, it suffers from poor usability.

There are a number of things that can be done to address these issues and have already been enumerated in #2618. This PR represents the implementation of the 1st improvement.

This PR removes the cs_main lock that was being taken in VotingModel::buildPollTable and instead inverts the locking scheme. A new lock to protect the poll registry and related structures has been implemented, cs_poll_registry. This lock is used for critical structures in the poll registry for polls and voting and also the GUI code that is executed by the refresh thread. This ensures protection of the registry state.

In general it is dangerous to access transaction state from the core without cs_main taken, unless you really know what you are doing. In this instance I have implemented a reorg/fork detection scheme that occurs between the contract handlers, the registry itself, and the buildPollTable GUI method. Note that during the registry traversal for the poll refresh, the cs_poll_registry lock is only taken to generate the sequence for the range loop. The BuildPollItem and associated machinery, instead of taking and maintaining the cs_poll_registry lock, or the cs_main lock, deep within the call structure in the ProcessVoteCandidate, a check of the atomic boolean reorg_occurred_during_reg_traversal is made for each vote before it is processed. This reorg_occurred_during_reg_traversal is in the poll registry, but does not require the cs_poll_registry lock, because it is atomic. It is set by the contract handler virtual function Add and Delete for polls/votes as part of a reorg/fork detection routine. Note that when the ProcessVoteCandidate method encounters that flag as set, it immediately throws InvalidDuetoReorgFork(), a new error, which bubbles up to the try around the BuildPollItem() and immediately aborts the tally. Then a waiting loop is entered that checks at one second intervals to see if the reorg/fork has cleared.

Note that this PR does not address performance issues involving the tallying of votes on slow machines, but instead makes sure that the GUI and core are responsive during this process, which may take minutes on a slow computer, and that the final tally is accurate at least at the height incurred at the start of the poll refresh or higher. For polls and voting to scale to much larger participants we will have to implement a different approach, but this will involve a tradeoff in more memory usage.

Note that static thread safety analysis keywords were implemented here which allow clang to detect thread safety issues. On a related note, a full deadlock analysis due to possible lock order issues has not been conducted yet

@jamescowens jamescowens marked this pull request as draft December 25, 2022 04:29
@jamescowens jamescowens self-assigned this Dec 25, 2022
@jamescowens jamescowens added enhancement gui refactor This is for refactoring (if also an enhancement, use that label too). labels Dec 25, 2022
@jamescowens jamescowens added this to the LaVerne milestone Dec 25, 2022
@jamescowens jamescowens force-pushed the optimize_poll_locks branch 2 times, most recently from 974699a to a91f6cf Compare December 25, 2022 19:54
@jamescowens jamescowens marked this pull request as ready for review December 28, 2022 22:59
@jamescowens jamescowens marked this pull request as draft December 29, 2022 22:19
@jamescowens
Copy link
Member Author

I am going to have to go back to the drawing board on the locking, because I got a deadlock to occur in testing.

@jamescowens jamescowens force-pushed the optimize_poll_locks branch 10 times, most recently from bbe1d79 to 9115bf3 Compare January 4, 2023 23:43
@jamescowens
Copy link
Member Author

I am getting very close on this. I have been testing on my Odroid, doing the poll refresh, which nominally takes about 5 mins. During the most stressful test, I start the poll refresh, then about 150 seconds in start a 1100 block forced reorg while the refresh is going on. The poll traversal correctly detects the reorg, aborts the run, and then waits (with 1 sec intervals) for the reorg to be finished and then restarts. The poll results displayed agree with current production.

@jamescowens jamescowens force-pushed the optimize_poll_locks branch 2 times, most recently from 09941f6 to fcd4efb Compare January 5, 2023 17:49
@jamescowens jamescowens marked this pull request as ready for review January 6, 2023 00:04
src/rpc/voting.cpp Outdated Show resolved Hide resolved
src/gridcoin/voting/registry.h Outdated Show resolved Hide resolved
src/gridcoin/voting/registry.cpp Outdated Show resolved Hide resolved
@jamescowens jamescowens force-pushed the optimize_poll_locks branch 2 times, most recently from da310f6 to 3bddc50 Compare January 26, 2023 17:14
This removes the cs_main lock in VotingModel::buildPollTable and also
implements reorg/fork detector in the Poll Registry.
The holding of the cs_poll_registry lock is only required to
retrieve the reference.
@jamescowens
Copy link
Member Author

I am putting this back in draft mode until I have a chance to retest the changes to the reorg detection.

This implements an atomic boolean that is used by the core to
indicate a reorganization is in progress. This is in turn used
by modules such as poll tallies, which use minimal locking, to
determine when core state has changed and take actions accordingly.
@jamescowens jamescowens force-pushed the optimize_poll_locks branch 3 times, most recently from 6dafbc1 to 94b4215 Compare January 29, 2023 18:48
This addresses the insertion of extra votes in poll references during
a contract replay after reorg in the middle of an active poll.

These can be removed when the rest of the contracts are converted to state machines
similar to the beacon db.
@jamescowens jamescowens marked this pull request as ready for review January 30, 2023 00:24
@jamescowens
Copy link
Member Author

jamescowens commented Jan 30, 2023

Final PR testing

Pressure testing this PR was quite difficult and deserves some documentation in its own right.

Note that the conditions you need here are polls with a lot of votes and a slow computer with a relatively slow disk to ensure that the poll vote tally takes a long time. Given that recently we had a number of polls on mainnet with relatively large vote counts (over 600 for the just the Folding@home poll), this provides an ideal set of conditions for the test when using my Odroid XU4, which is relatively slow, with limited memory and a relatively slow disk. (I am using an NFS mount for the wallet.) Testnet does not have enough votes to make a meaningful test for this PR. Note that mainnet testing for this is safe, because no votes are being cast or polls being created.

The goal of the test is to force a deep reorg at the same time a poll tally refresh is going on in the GUI to force the activation of the abort and wait code and see if it resumes properly and the final displayed results are correct. Note that a force reorganization to a hash that is much lower in the chain will result in less trust than before, and the wallet will then almost immediately self reorganize back to the original height. The issue with this method of testing a reorganization is that it is in two segments. The initial reorg is considered ended when the force reorganization gets to the provided block hash. This resets the global atomic g_reorg_in_progress boolean and ends the GUI wait and the tally proceeds. This is different than an automatic reorg which allows disconnects and reconnects to switch to a block of higher trust in one action. So in an automatic reorg, the GUI poll tally code will wait until the entire reorg is finished and block trust is higher than before. The reason I mention this is that the method of testing here with the reorg in two distinct segments (one manual, the other automatic), will allow the tally to resume earlier than in a totally automatic reorg, and since the tally will "outstrip" the readdition of the contracts via the block connection during the automatic reorg, some votes will be missed as a result. This should be seen as a slightly lower than normal tally result for that first test.

This PR also adds quite a bit of LogPrint based instrumentation to facilitate testing, which is only turned on when the VOTE logging category is enabled.

Testing Procedure

  1. Remove the blockchain database and sync to 2795228, which is the end of the Folding@home project listing poll.
  2. Initiate a poll refresh in the GUI while the wallet is syncing to head. Verify neither blocks each other.
  3. Reorg to 2794220 ceb2e1cd47400f299a403d71f465226984f47cfdbd77b4f886004cca5d47471e during poll refresh (approximately 1000 blocks). This is a much higher than normal reorg depth. (The usual depth of reorgs is 2 blocks, and the biggest we have seen in two years is about 50.)
  4. Verify the poll tally aborts and waits for reorg completion. (Remember unlike a totally automatic reorg, this will be when the rollback successfully reaches ceb2e1cd47400f299a403d71f465226984f47cfdbd77b4f886004cca5d47471e .)
  5. Verify the poll tally resumes when the reorg rollback is finished. Verity poll tally is not blocked by the core or blocking the core.
  6. Check the poll tally results in the GUI. They should be lower than what is reported in a wallet at the current head, since under this scenario some votes will not have been counted.
  7. Refresh again after the wallet completes the phase 2 reorg (automatic portion) and finishes syncing to the head of the chain without restarting the wallet. Check the poll tally results are in agreement with other mainnet wallets that did not have a reorg during the GUI poll refresh.
  8. Reboot and refresh again and verify poll tally results match

@jamescowens
Copy link
Member Author

The wallet passed the test above.

Copy link
Member

@div72 div72 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 8441bda.

@jamescowens jamescowens merged commit 62316c2 into gridcoin-community:development Feb 2, 2023
@jamescowens jamescowens deleted the optimize_poll_locks branch March 3, 2023 21:01
jamescowens added a commit to jamescowens/Gridcoin-Research that referenced this pull request Mar 26, 2023
Added
 - key, wallet: HD wallets gridcoin-community#2540 (@div72)
 - ARMv8 SHA2 Intrinsics gridcoin-community#2612 (@barton2526)
 - build: vendor bdb 5.3 gridcoin-community#2620 (@div72)
 - scraper, gui: Add external adapter projects indication gridcoin-community#2625 (@jamescowens)
 - gui: implement INSUFFICIENT_MATURE_FUNDS status for the mrcmodel gridcoin-community#2628 (@jamescowens)
 - gui, accrual: Implement accrual limit warning gridcoin-community#2636 (@jamescowens)
 - rpc: add `getnodeaddresses` gridcoin-community#2646 (@Pythonix)
 - consensus: Add new checkpoints gridcoin-community#2651 (@barton2526)

Changed
 - voting: Optimize poll locks gridcoin-community#2619 (@jamescowens)
 - util: move threadinterrupt.{cpp,h} to util gridcoin-community#2613 (@Pythonix)
 - gui, voting: Update pool cpids and avw rules gridcoin-community#2624 (@jamescowens)
 - ci: bump python and setup-python action version gridcoin-community#2626 (@div72)
 - gui: Change text from username to name (real name or nickname) gridcoin-community#2633 (@jamescowens)
 - locale: Translation update, phase 1 gridcoin-community#2637 (@jamescowens)
 - gui: Change MRC too soon to submit error to be less confusing gridcoin-community#2645 (@jamescowens)
 - locale: Update translations prior to release (phase 2/2) gridcoin-community#2658 (@jamescowens)
 - gui: Enhance MRC request form to avoid fee boost field confusion gridcoin-community#2659 (@jamescowens)

Removed
none

Fixed
 - net: Turn net structures into dumb storage classes (backport) gridcoin-community#2561 (@Pythonix)
 - build: Include native_X.mk before X.mk gridcoin-community#2609 (@barton2526)
 - depends: fix OpenSSL for Darwin builds gridcoin-community#2610 (@div72)
 - build: Change actions runner image to Focal, Force Lint to use 22.04, Change cd runner version gridcoin-community#2611 (@barton2526)
 - gui: don't show datadir error msgbox if arg isn't specified gridcoin-community#2617 (@div72)
 - rpc: Repair auditsnapshotaccrual rpc function gridcoin-community#2621 (@jamescowens)
 - gui: Correct updateBeaconIcon() function in bitcoingui.cpp gridcoin-community#2622 (@jamescowens)
 - wallet: Strengthen CWalletTx::RevalidateTransactions gridcoin-community#2627 (@jamescowens)
 - test: Fix Wambiguous-reversed-operator compiler warning, drop boost::assign gridcoin-community#2632 (@barton2526)
 - gui: Fix wallet overview displaying lower-case poll name gridcoin-community#2640 (@delta1513)
 - Fix and optimize ResendWalletTransactions gridcoin-community#2642 (@jamescowens)
 - build(nsis): Write registry keys to HKLM instead of HKCU, Install shortcuts for all users, Fix INSTALLDIR removal bug gridcoin-community#2643 (@sitiom)
 - gui: Fix TransactionRecord::decomposeTransaction to properly display self-sidestake gridcoin-community#2647 (@jamescowens)
 - rpc: Fixed the RPC error when running `help voting` while syncing gridcoin-community#2649 (@delta1513)
 - build: Fix compilation with GCC 13 gridcoin-community#2653 (@theMarix)
 - rpc: Formatting - typo correction rpc help for listresearcheraccounts gridcoin-community#2654 (@PrestackI)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug enhancement gui refactor This is for refactoring (if also an enhancement, use that label too).
Projects
None yet
Development

Successfully merging this pull request may close these issues.

voting: Poll tab locks up GUI and takes unacceptably long on slower machines
2 participants