Resolve harmony-one/bounties#90: Add revert mechanism for UpdateValidatorWrapper #3939
Conversation
Force-pushed from 3d3a5de to 6e8c849
The problem is that, if you simply disable the cache, RLP-decoding the validatorWrapper from the code field will take a lot of CPU and memory resources. That's why a benchmark comparison is expected. The revert shall revert the cache as well.
Your assessment is correct; the memory usage had more than doubled. I will modify the code to revert the cache as well, and get back to you.
Force-pushed from 6e8c849 to c2db259
Closes harmony-one/bounties#90.
(1) Use an LRU cache for ValidatorWrapper objects in stateDB to plug a potential memory leak.
(2) Merge ValidatorWrapper and ValidatorWrapperCopy to let callers ask for either a copy or a pointer to the cached object. Additionally, give callers the option to not deep-copy delegations (which is a heavy process). Copies need to be explicitly committed (and thus can be reverted), while the pointers are committed when Finalise is called.
(3) Add an UpdateValidatorWrapperWithRevert function, which is used by the staking txs `Delegate`, `Undelegate`, and `CollectRewards`. The other two types of staking txs and `db.Finalize` continue to use UpdateValidatorWrapper without revert, again, to save memory.
(4) Add unit tests which check that (a) the revert goes through, (b) the wrapper is as expected after the revert, and (c) the state is as expected after the revert.
Force-pushed from c2db259 to 0a32a00
core/state/statedb.go
Outdated
@@ -78,7 +83,7 @@ type DB struct {
 	stateObjects        map[common.Address]*Object
 	stateObjectsPending map[common.Address]struct{} // State objects finalized but not yet written to the trie
 	stateObjectsDirty   map[common.Address]struct{}
-	stateValidators     map[common.Address]*stk.ValidatorWrapper
+	stateValidators     *lru.Cache
This stateValidators map is like a staged representation of the validatorWrappers in memory, allowing easy modification without having to serialize/deserialize from the account's code field every time a wrapper is modified. The map keeps track of all the modifications, which need to be committed to stateDB eventually. Changing it to a cache with a size limit could potentially lose modification data for the validatorWrappers.
Good point. I propose setting validatorWrapperCacheLimit to 4,000 (currently the PR has it set to 1,000) to work around this issue. The rationale is that the block gas limit is 80,000,000 and the minimum cost of a staking transaction is ~20,000 (conservatively; it is often higher), so at most 4,000 validator wrapper modifications can occur in a block. Alternatively, I am happy to change it back to a dictionary format. Let me know what you think.
Could you please change stateValidators to a data structure with something like
type validatorCache struct {
dirty map[common.Address]*ValidatorWrapper
cache *lru.Cache
}
This will make it more readable and will not add unnecessary caches to the data structure.
Based on the new profiling, it seems there isn't much benefit in using the LRU. Blockchain systems require strong deterministic guarantees, so please revert to the old way of using maps. Since each block's state is cleared after the block is processed, there shouldn't be a memory leak issue with the old way.
I saw you added the dirty/cache struct. Since it's not improving memory performance, let's not complicate the existing code (it may introduce new bugs and is risky without extensive testing). Sorry for making you change the code back and forth. (Also, it's too complicated to have three booleans in the ValidatorWrapper() method.)
Since the memory/CPU usage saved by the LRU + map structure is not significantly different, go back to the original dictionary structure to keep the code easy to read and the modifications limited.
Force-pushed from 944b001 to fb88e3b
Please see the profiling below (memory and CPU charts).
core/state/statedb_test.go
Outdated
@@ -984,3 +988,72 @@ func makeBLSPubSigPair() blsPubSigPair {
 	return blsPubSigPair{shardPub, shardSig}
 }
+
+func TestValidatorRevert(t *testing.T) {
Can you add more test cases to cover more situations, such as modify, modify, revert; or modify, modify, revert, revert? Or update other fields rather than just the delegations. The test coverage right now is not high.
I have added more tests; please see TestValidatorMultipleReverts in particular. The coverage for new code in statedb.go has now increased.
Please also run a mainnet node with this new code; you can use the rcloned blockchain and let the node run to make sure it can synchronize all the new blocks without problems.
As requested by @rlan35, I added tests beyond just adding and reverting a delegation. The tests are successive in the sense that we make multiple modifications to the wrapper, save a snapshot before each modification, and revert to each snapshot to confirm everything works well. This change improves test coverage of statedb.go to 66.7% from 64.8% and that of core/state to 71.9% from 70.8%, and covers all the code modified by this PR in statedb.go. For clarity, the modifications to the wrapper include (1) creation of the wrapper in state, (2) adding a delegation to the wrapper, (3) increasing the blocks signed, and (4) a change in the validator Name and the BlockReward. Two additional tests have been added to cover the `panic` and the `GetCode` cases.
The results with the memory and CPU usage are from a mainnet node, as specified in the bounty requirements. Do you need anything else from these runs?
Ok, that's good. Are the nodes able to sync to the latest block and stay in sync all the time?
Yes, although I had to merge the main branch and #3976 into my build to get the sync to catch up from the rclone base. The "catching up" lasted ~5.5 hours to cover a difference of ~18,250 blocks between the rcloned database and the mainnet. The node stayed in sync afterwards (according to the block number as well as …). For the record, I used a storage-optimized Digital Ocean droplet (not dedicated) with 8 cores, 64 GB RAM, and 1.17 TB SSD. This choice was made because Harmony's requirements recommend using an 8-core server if it's shared, and I needed at least 750 GB for the rclone.
}
// a copy of the existing store can be used for revert
// since we are replacing the existing with the new anyway
prev, err := db.ValidatorWrapper(addr, true, false)
Should this be db.ValidatorWrapper(addr, false, true)? Since you want a copy to be stored in the change journal?
This is because the caller is sending a copy, which is being added to the db, while the original is in the journal.
I see, makes sense: the original is replaced with the new one, so it's safe to use it directly without copying. Thanks.
`Delegate`, `Undelegate`, and `CollectRewards`. The other two types of staking txs and `db.Finalize` continue to use UpdateValidatorWrapper without revert, again, to save memory.
Issue
harmony-one/bounties#90
Test
Unit Test Coverage
Before:
After:
Test/Run Logs
Operational Checklist
Does this PR introduce backward-incompatible changes to the on-disk data structure and/or the over-the-wire protocol? (If no, skip to question 8.)
No.
Describe the migration plan. For each flag epoch, describe what changes take place at the flag epoch, the anticipated interactions between upgraded/non-upgraded nodes, and any special operational considerations for the migration.
Describe how the plan was tested.
How much minimum baking period after the last flag epoch should we allow on Pangaea before promotion onto mainnet?
What are the planned flag epoch numbers and their ETAs on Pangaea?
What are the planned flag epoch numbers and their ETAs on mainnet?
Note that this must be enough to cover baking period on Pangaea.
What should node operators know about this planned change?
Does this PR introduce backward-incompatible changes NOT related to on-disk data structure and/or over-the-wire protocol? (If no, continue to question 11.)
No.
Does the existing node.sh continue to work with this change?
What should node operators know about this change?
Does this PR introduce significant changes to the operational requirements of the node software, such as >20% increase in CPU, memory, and/or disk usage?
No. See comment.