
Switch to lazy state balance cache #9822

Merged: 34 commits merged into develop, Nov 19, 2021

Conversation

@kasey (Contributor) commented Oct 27, 2021

What type of PR is this?

Other

(enhancement / code health, eventually part of WSS)

What does this PR do? Why is it needed?
This PR adds a modified state balance cache implementation. Instead of requiring every code path that modifies the copy of the justified root recorded by blockchain.Service to also update the state balance cache, the cache now updates itself on read, based on the requested block root.

This reduces the need for code to explicitly update the cache when updating the copy of the justified state. This simplifies several areas of the code, consolidates responsibility for the cache into the cache type, and decouples the cache from the service somewhat, enabling further refactoring of service initialization in support of weak subjectivity sync.
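For intuition, here is a minimal sketch of the lazy read-through pattern described above; the names (balanceCache, stateLookup) and the simplified shape are illustrative assumptions, not the actual Prysm types.

```go
package lazycache

import (
	"context"
	"sync"
)

// stateLookup stands in for the stategen lookup: given a block root,
// produce the validator balances for that state.
type stateLookup func(ctx context.Context, root [32]byte) ([]uint64, error)

type balanceCache struct {
	sync.Mutex
	root     [32]byte
	balances []uint64
	lookup   stateLookup
}

// get serves cached balances when the requested root matches the cached
// root, and otherwise lazily refreshes the cache via the lookup. Callers
// never update the cache explicitly; reads keep it current.
func (c *balanceCache) get(ctx context.Context, root [32]byte) ([]uint64, error) {
	c.Lock()
	defer c.Unlock()
	if root == c.root && c.balances != nil {
		return c.balances, nil
	}
	balances, err := c.lookup(ctx, root)
	if err != nil {
		return nil, err
	}
	c.root = root
	c.balances = balances
	return balances, nil
}
```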

Other notes for review

Previously, the cache update method also triggered a sync of the init-sync blocks: s.cfg.BeaconDB.SaveBlocks(ctx, s.getInitSyncBlocks()). This has been removed based on the assumption that justified states are always saved to the database, because they lie on epoch boundaries (our minimum state persistence interval). To strengthen this invariant, we also changed the order of operations in updateJustifiedInitSync, ensuring that SaveJustifiedCheckpoint completes successfully before changing Service.justifiedCheckpoint. SaveJustifiedCheckpoint will fail if the corresponding state hasn't been saved to the db, so doing the swap after the save makes the update of the db and the runtime variable closer to atomic.
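A minimal sketch of that save-then-swap ordering, assuming simplified stand-in types (checkpoint, service); the real updateJustifiedInitSync does more than this.

```go
package justified

import "context"

type checkpoint struct {
	Epoch uint64
	Root  [32]byte
}

type service struct {
	justifiedCheckpt checkpoint
	// saveJustifiedCheckpoint stands in for BeaconDB.SaveJustifiedCheckpoint.
	saveJustifiedCheckpoint func(context.Context, checkpoint) error
}

// updateJustifiedInitSync persists first: the save fails if the checkpoint's
// state isn't in the db, so a failure leaves the runtime value untouched.
// Only after the db write succeeds is the runtime variable swapped, keeping
// the db and in-memory views of the justified checkpoint consistent.
func (s *service) updateJustifiedInitSync(ctx context.Context, cp checkpoint) error {
	if err := s.saveJustifiedCheckpoint(ctx, cp); err != nil {
		return err
	}
	s.justifiedCheckpt = cp
	return nil
}
```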

Marking this as WIP for now because I may want to add some additional test coverage to this type; currently it is only tested transitively through service tests. (Edit: added some test coverage directly on the cache.) On the topic of tests, this PR also contains updates to every test that calls NewService, because NewService has been modified to initialize the cache. Previously many tests set the .cfg field on the Service after calling NewService, but since the cache is initialized inside NewService, stategen needs to be set first. This was accomplished by updating all these call sites to use the new functional opts (the []Options param to NewService), as sketched below. Some helpers were added to cover common test initialization patterns, and some care was taken not to add test db initialization where it isn't necessary.
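For readers unfamiliar with the pattern, a compressed sketch of functional options as described; WithStateGen and the stand-in types are hypothetical names, and the real NewService constructs the balance cache where this sketch only shows the guard.

```go
package blockchain

import "errors"

type stateGen struct{} // stand-in for *stategen.State

type config struct {
	StateGen *stateGen
}

type Service struct {
	cfg *config
}

// Option mutates a Service during construction; tests pass these to
// NewService instead of poking .cfg after the fact.
type Option func(*Service) error

func WithStateGen(sg *stateGen) Option {
	return func(s *Service) error {
		s.cfg.StateGen = sg
		return nil
	}
}

func NewService(opts ...Option) (*Service, error) {
	srv := &Service{cfg: &config{}}
	for _, opt := range opts {
		if err := opt(srv); err != nil {
			return nil, err
		}
	}
	// The balance cache is initialized inside NewService, so stategen must
	// already have been set by an option at this point.
	if srv.cfg.StateGen == nil {
		return nil, errors.New("stategen is required to initialize the balance cache")
	}
	return srv, nil
}
```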

@kasey requested a review from a team as a code owner October 27, 2021 02:28
beacon-chain/blockchain/process_block.go (review thread, outdated, resolved)

var justifiedState state.BeaconState
var err error
if justifiedRoot == s.genesisRoot {
Member:

Is this condition necessary in the new scheme?

Contributor Author (kasey):

When we have a cache miss and go back to stategen for the lookup, it will attempt to read the state from the db. If the root is equal to genesisRoot, I believe we can assume it will be present in the database, correct? In other words, I believe that getting the state by root using the genesis root is equivalent to calling db.GenesisState.

Member:

My instinct says this exists because for some reason at genesis both justifiedRoot and s.genesisRoot can be 0x000....

To be safe, maybe we should apply ensureRootNotZeros to c.stateGen.StateByRoot(ctx, justifiedRoot) in get?

Contributor Author (kasey):

I forgot this detail -- stategen.StateByRoot internally checks for the zero hash and in that case loads from the DB:
https://github.com/prysmaticlabs/prysm/blob/develop/beacon-chain/state/stategen/getter.go#L45
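A hedged paraphrase of the guard being referenced, with stand-in types; the zero-root branch is shown calling GenesisState to match the equivalence kasey asserts, though the actual getter.go at the time differed in detail (see the commit note later in this thread about a follow-up PR making that switch).

```go
package stategen

import "context"

type BeaconState interface{} // stand-in for the real state type

type beaconDB interface {
	GenesisState(ctx context.Context) (BeaconState, error)
}

type State struct {
	db beaconDB
}

var zeroHash [32]byte // the all-zero root

// StateByRoot treats the zero root as a request for the genesis state and
// reads it from the db, rather than attempting a lookup by root.
func (s *State) StateByRoot(ctx context.Context, blockRoot [32]byte) (BeaconState, error) {
	if blockRoot == zeroHash {
		return s.db.GenesisState(ctx)
	}
	return s.loadStateByRoot(ctx, blockRoot)
}

func (s *State) loadStateByRoot(ctx context.Context, blockRoot [32]byte) (BeaconState, error) {
	// hot-cache, epoch-boundary cache, and db lookups elided
	return nil, nil
}
```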

Member:

Ah perfect. Thanks!

@@ -359,9 +359,6 @@ func (s *Service) finalizedImpliesNewJustified(ctx context.Context, state state.
	}
	if !bytes.Equal(anc, s.finalizedCheckpt.Root) {
		s.justifiedCheckpt = state.CurrentJustifiedCheckpoint()
		if err := s.cacheJustifiedStateBalances(ctx, bytesutil.ToBytes32(s.justifiedCheckpt.Root)); err != nil {
Member:

Are we sure it's ok to remove this? Some of these are in place to defend against fork choice bouncing attacks?

Contributor Author (kasey):

I'm not 100% confident. Would appreciate your help formulating a test case to increase our confidence that this is not an issue.

Member:

Looking into this today

Member:

All these deletions are fine because we still update s.justifiedCheckpt.

kasey and others added 2 commits October 27, 2021 08:10

beacon-chain/blockchain/state_balance_cache.go (review thread, outdated, resolved)
@kasey changed the title from WIP: Switch to lazy state balance cache to Switch to lazy state balance cache on Nov 2, 2021
kasey added 9 commits November 3, 2021 21:37
service_test brings in a ton of dependencies that make bazel rules
for blockchain complex, so just sticking these mocks in their own
file simplifies things.
this test established that the zero root can't be used to look up the
state, resulting in a change in another PR to update stategen to use the
GenesisState db method instead when the zero root is detected.
@@ -130,6 +130,14 @@ var (
		Name: "sync_head_state_hit",
		Help: "The number of sync head state requests that are present in the cache.",
	})
	stateBalanceCacheHit = promauto.NewCounter(prometheus.CounterOpts{
		Name: "balance_cache_hit",
Member:

These names are too similar to other metrics, like total_effective_balance_cache_hit.

Member:

Why don't we define these metrics closer to their usage, inside state_balance_cache.go?

Contributor Author (@kasey) Nov 17, 2021:

Re: "why don't we define these metrics closer to their usage"

I'm following the conventions of the blockchain package here; all the prom metric values are defined in metrics.go.

Re: "these names are too similar to the other metrics"

How about state_balance_cache_(hit|miss)?

Member:

For some reason I thought the cache resided in the cache pkg; it makes sense that the metric is here. I don't want to get into another debate about where the cache should live, ha!

state_balance_cache is fine; justified_state_balance_cache is also fine.
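Following that convention, the renamed counters might look like this in metrics.go; the Help strings are illustrative, not taken from the PR.

```go
package blockchain

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// stateBalanceCacheHit counts reads served from the cached justified state.
	stateBalanceCacheHit = promauto.NewCounter(prometheus.CounterOpts{
		Name: "state_balance_cache_hit",
		Help: "Count of balance cache reads served from the cached justified state.",
	})
	// stateBalanceCacheMiss counts reads that required a stategen lookup.
	stateBalanceCacheMiss = promauto.NewCounter(prometheus.CounterOpts{
		Name: "state_balance_cache_miss",
		Help: "Count of balance cache reads that required a stategen lookup.",
	})
)
```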

		cfg: &config{},
	}
	for _, opt := range opts {
		if err := opt(srv); err != nil {
			return nil, err
		}
	}
	if srv.justifiedBalances == nil {
		if srv.cfg.StateGen == nil {
Member:

Any reason not to put the nil check inside newStateBalanceCache and change the signature to:

newStateBalanceCache(sg *stategen.State) (*stateBalanceCache, error)

Contributor Author (kasey):

I like that idea. The whole point of the constructor is to hint that the zero value for the cache doesn't work, so it makes sense to make it more defensive.
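A sketch of the constructor built from the agreed signature; the struct fields and error message are assumptions about the eventual implementation, and stategenState is a stand-in for the real *stategen.State.

```go
package blockchain

import "errors"

// stategenState is a stand-in for *stategen.State from the conversation.
type stategenState struct{}

type stateBalanceCache struct {
	stateGen *stategenState
	// cached root/balances fields elided
}

// newStateBalanceCache is the more defensive constructor discussed above:
// the zero value of stateBalanceCache is unusable because every cache miss
// needs stategen to load the justified state, so the nil check lives here.
func newStateBalanceCache(sg *stategenState) (*stateBalanceCache, error) {
	if sg == nil {
		return nil, errors.New("can't initialize state balance cache without stategen")
	}
	return &stateBalanceCache{stateGen: sg}, nil
}
```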

// read path can connect to the upstream cancellation/timeout chain.
func (c *stateBalanceCache) get(ctx context.Context, justifiedRoot [32]byte) ([]uint64, error) {
	c.Lock()
	defer c.Unlock()
Member:

The locks could be further optimized by taking a read lock around this condition and only taking the write lock on a miss.

It's a trade-off with readability; I'm not sure it's worth it, just wanted to point out the option.

Contributor Author (kasey):

I agree that would be fine, and I actually wrote it that way at first. The problem was that when the same thread holds a read lock and then needs the write lock, it has to briefly release the read lock before acquiring the write lock, and in that window another thread could get a stale read and try to concurrently update. I like the single mutex because it removes that edge case and makes this simpler to reason about.
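To illustrate the edge case, a hypothetical RWMutex variant of get (not code from the PR) showing where the window opens:

```go
package blockchain

import (
	"context"
	"sync"
)

type rwBalanceCache struct {
	sync.RWMutex
	root     [32]byte
	balances []uint64
}

// getRacy shows why the read-lock-then-write-lock version was rejected.
func (c *rwBalanceCache) getRacy(ctx context.Context, root [32]byte) ([]uint64, error) {
	c.RLock()
	if root == c.root && c.balances != nil {
		defer c.RUnlock()
		return c.balances, nil
	}
	c.RUnlock() // the read lock must be released before taking the write lock...

	c.Lock()
	defer c.Unlock()
	// ...and in that window another goroutine may also have missed, so both
	// redo the stategen lookup. The single sync.Mutex in the merged version
	// closes this window at the cost of serializing cache hits.
	return c.refresh(ctx, root)
}

func (c *rwBalanceCache) refresh(ctx context.Context, root [32]byte) ([]uint64, error) {
	// stategen lookup elided; a real implementation would fetch balances here.
	c.root = root
	return c.balances, nil
}
```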

@prylabs-bulldozer (bot) merged commit 39c33b8 into develop on Nov 19, 2021.
@delete-merged-branch (bot) deleted the lazy-cache-poc branch on November 19, 2021 at 15:59.