-
Notifications
You must be signed in to change notification settings - Fork 172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
voting: Optimize poll locks #2619
voting: Optimize poll locks #2619
Conversation
974699a
to
a91f6cf
Compare
I am going to have to go back to the drawing board on the locking, because I got a deadlock to occur in testing. |
bbe1d79
to
9115bf3
Compare
I am getting very close on this. I have been testing on my Odroid, doing the poll refresh, which nominally takes about 5 mins. During the most stressful test, I start the poll refresh, then about 150 seconds in start a 1100 block forced reorg while the refresh is going on. The poll traversal correctly detects the reorg, aborts the run, and then waits (with 1 sec intervals) for the reorg to be finished and then restarts. The poll results displayed agree with current production. |
09941f6
to
fcd4efb
Compare
da310f6
to
3bddc50
Compare
This removes the cs_main lock in VotingModel::buildPollTable and also implements reorg/fork detector in the Poll Registry.
The holding of the cs_poll_registry lock is only required to retrieve the reference.
e2f7e83
to
fda492b
Compare
I am putting this back in draft mode until I have a chance to retest the changes to the reorg detection. |
This implements an atomic boolean that is used by the core to indicate a reorganization is in progress. This is in turn used by modules such as poll tallies, which use minimal locking, to determine when core state has changed and take actions accordingly.
6dafbc1
to
94b4215
Compare
This addresses the insertion of extra votes in poll references during a contract replay after reorg in the middle of an active poll. These can be removed when the rest of the contracts are converted to state machines similar to the beacon db.
94b4215
to
abd5816
Compare
Final PR testing Pressure testing this PR was quite difficult and deserves some documentation in its own right. Note that the conditions you need here are polls with a lot of votes and a slow computer with a relatively slow disk to ensure that the poll vote tally takes a long time. Given that recently we had a number of polls on mainnet with relatively large vote counts (over 600 for the just the Folding@home poll), this provides an ideal set of conditions for the test when using my Odroid XU4, which is relatively slow, with limited memory and a relatively slow disk. (I am using an NFS mount for the wallet.) Testnet does not have enough votes to make a meaningful test for this PR. Note that mainnet testing for this is safe, because no votes are being cast or polls being created. The goal of the test is to force a deep reorg at the same time a poll tally refresh is going on in the GUI to force the activation of the abort and wait code and see if it resumes properly and the final displayed results are correct. Note that a force reorganization to a hash that is much lower in the chain will result in less trust than before, and the wallet will then almost immediately self reorganize back to the original height. The issue with this method of testing a reorganization is that it is in two segments. The initial reorg is considered ended when the force reorganization gets to the provided block hash. This resets the global atomic g_reorg_in_progress boolean and ends the GUI wait and the tally proceeds. This is different than an automatic reorg which allows disconnects and reconnects to switch to a block of higher trust in one action. So in an automatic reorg, the GUI poll tally code will wait until the entire reorg is finished and block trust is higher than before. The reason I mention this is that the method of testing here with the reorg in two distinct segments (one manual, the other automatic), will allow the tally to resume earlier than in a totally automatic reorg, and since the tally will "outstrip" the readdition of the contracts via the block connection during the automatic reorg, some votes will be missed as a result. This should be seen as a slightly lower than normal tally result for that first test. This PR also adds quite a bit of LogPrint based instrumentation to facilitate testing, which is only turned on when the VOTE logging category is enabled. Testing Procedure
|
The wallet passed the test above. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK 8441bda.
Added - key, wallet: HD wallets gridcoin-community#2540 (@div72) - ARMv8 SHA2 Intrinsics gridcoin-community#2612 (@barton2526) - build: vendor bdb 5.3 gridcoin-community#2620 (@div72) - scraper, gui: Add external adapter projects indication gridcoin-community#2625 (@jamescowens) - gui: implement INSUFFICIENT_MATURE_FUNDS status for the mrcmodel gridcoin-community#2628 (@jamescowens) - gui, accrual: Implement accrual limit warning gridcoin-community#2636 (@jamescowens) - rpc: add `getnodeaddresses` gridcoin-community#2646 (@Pythonix) - consensus: Add new checkpoints gridcoin-community#2651 (@barton2526) Changed - voting: Optimize poll locks gridcoin-community#2619 (@jamescowens) - util: move threadinterrupt.{cpp,h} to util gridcoin-community#2613 (@Pythonix) - gui, voting: Update pool cpids and avw rules gridcoin-community#2624 (@jamescowens) - ci: bump python and setup-python action version gridcoin-community#2626 (@div72) - gui: Change text from username to name (real name or nickname) gridcoin-community#2633 (@jamescowens) - locale: Translation update, phase 1 gridcoin-community#2637 (@jamescowens) - gui: Change MRC too soon to submit error to be less confusing gridcoin-community#2645 (@jamescowens) - locale: Update translations prior to release (phase 2/2) gridcoin-community#2658 (@jamescowens) - gui: Enhance MRC request form to avoid fee boost field confusion gridcoin-community#2659 (@jamescowens) Removed none Fixed - net: Turn net structures into dumb storage classes (backport) gridcoin-community#2561 (@Pythonix) - build: Include native_X.mk before X.mk gridcoin-community#2609 (@barton2526) - depends: fix OpenSSL for Darwin builds gridcoin-community#2610 (@div72) - build: Change actions runner image to Focal, Force Lint to use 22.04, Change cd runner version gridcoin-community#2611 (@barton2526) - gui: don't show datadir error msgbox if arg isn't specified gridcoin-community#2617 (@div72) - rpc: Repair auditsnapshotaccrual rpc function gridcoin-community#2621 (@jamescowens) - gui: Correct updateBeaconIcon() function in bitcoingui.cpp gridcoin-community#2622 (@jamescowens) - wallet: Strengthen CWalletTx::RevalidateTransactions gridcoin-community#2627 (@jamescowens) - test: Fix Wambiguous-reversed-operator compiler warning, drop boost::assign gridcoin-community#2632 (@barton2526) - gui: Fix wallet overview displaying lower-case poll name gridcoin-community#2640 (@delta1513) - Fix and optimize ResendWalletTransactions gridcoin-community#2642 (@jamescowens) - build(nsis): Write registry keys to HKLM instead of HKCU, Install shortcuts for all users, Fix INSTALLDIR removal bug gridcoin-community#2643 (@sitiom) - gui: Fix TransactionRecord::decomposeTransaction to properly display self-sidestake gridcoin-community#2647 (@jamescowens) - rpc: Fixed the RPC error when running `help voting` while syncing gridcoin-community#2649 (@delta1513) - build: Fix compilation with GCC 13 gridcoin-community#2653 (@theMarix) - rpc: Formatting - typo correction rpc help for listresearcheraccounts gridcoin-community#2654 (@PrestackI)
This PR implements a new locking scheme for the poll registry and related structures/methods. Closes #2618.
Voting in Gridcoin is a high reliability, high integrity subsystem that is built up from polls that are published by wallet holders in accordance with protocol rules for polls, and votes by walletholders. Polls and voting were completely redone in Fern to address integrity issues and further enhanced in major milestones since then.
One thing that hasn't changed since Fern... due to the complexity of tallying the votes and the desire to conserve memory footprint, the results for polls are built dynamically on demand by either the GUI or rpc. The GUI offers a (re)fresh of the entire polling/results based on a single button push, while the rpc splits up the poll listing and poll results into separate rpc functions, as this is more amenable to scripting. During the GUI poll update, if there are a number of polls that have high participation, the GUI can experience lockups for minutes at a time on slow computers (i.e. a slow processor and/or especially a slow disk, such as a traditional spinning HDD, especially if it is connected via USB). This is because while the poll GUI refresh was implemented on a separate thread, a lock was taken on cs_main for the entire update. This lock eventually becomes blocking to the cs_main taken at regular intervals by the core (driven by the net code), and therefore the core and GUI both block until the poll refresh is done and the lock on cs_main by the poll refresh is released. While this is a simple approach to insuring absolute integrity to the poll refresh, because it locks the entire wallet state, it suffers from poor usability.
There are a number of things that can be done to address these issues and have already been enumerated in #2618. This PR represents the implementation of the 1st improvement.
This PR removes the cs_main lock that was being taken in VotingModel::buildPollTable and instead inverts the locking scheme. A new lock to protect the poll registry and related structures has been implemented, cs_poll_registry. This lock is used for critical structures in the poll registry for polls and voting and also the GUI code that is executed by the refresh thread. This ensures protection of the registry state.
In general it is dangerous to access transaction state from the core without cs_main taken, unless you really know what you are doing. In this instance I have implemented a reorg/fork detection scheme that occurs between the contract handlers, the registry itself, and the buildPollTable GUI method. Note that during the registry traversal for the poll refresh, the cs_poll_registry lock is only taken to generate the sequence for the range loop. The BuildPollItem and associated machinery, instead of taking and maintaining the cs_poll_registry lock, or the cs_main lock, deep within the call structure in the ProcessVoteCandidate, a check of the atomic boolean reorg_occurred_during_reg_traversal is made for each vote before it is processed. This reorg_occurred_during_reg_traversal is in the poll registry, but does not require the cs_poll_registry lock, because it is atomic. It is set by the contract handler virtual function Add and Delete for polls/votes as part of a reorg/fork detection routine. Note that when the ProcessVoteCandidate method encounters that flag as set, it immediately throws InvalidDuetoReorgFork(), a new error, which bubbles up to the try around the BuildPollItem() and immediately aborts the tally. Then a waiting loop is entered that checks at one second intervals to see if the reorg/fork has cleared.
Note that this PR does not address performance issues involving the tallying of votes on slow machines, but instead makes sure that the GUI and core are responsive during this process, which may take minutes on a slow computer, and that the final tally is accurate at least at the height incurred at the start of the poll refresh or higher. For polls and voting to scale to much larger participants we will have to implement a different approach, but this will involve a tradeoff in more memory usage.
Note that static thread safety analysis keywords were implemented here which allow clang to detect thread safety issues. On a related note, a full deadlock analysis due to possible lock order issues has not been conducted yet