-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Switch to lazy state balance cache #9822
Conversation
reduces the number of test cases that require db setup
|
||
var justifiedState state.BeaconState | ||
var err error | ||
if justifiedRoot == s.genesisRoot { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this condition necessary in the new scheme?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when we have a cache miss and go back to stategen for the lookup, it will attempt to read the state from the db. If the root is equal to genesisRoot, I believe that we can assume it will be present in the database. correct? In other words I believe that get state by root using the genesis root is equivalent to calling db.GenesisState.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My instinct says this exists because for some reason at genesis both justifiedRoot
and s.genesisRoot
can be 0x000...
.
To be safe, maybe we should apply ensureRootNotZeros
to c.stateGen.StateByRoot(ctx, justifiedRoot)
in get
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forgot this detail -- stategen.StateByRoot
internally checks for the zero hash and in that case loads from the DB:
https://github.com/prysmaticlabs/prysm/blob/develop/beacon-chain/state/stategen/getter.go#L45
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah perfect. Thanks!
@@ -359,9 +359,6 @@ func (s *Service) finalizedImpliesNewJustified(ctx context.Context, state state. | |||
} | |||
if !bytes.Equal(anc, s.finalizedCheckpt.Root) { | |||
s.justifiedCheckpt = state.CurrentJustifiedCheckpoint() | |||
if err := s.cacheJustifiedStateBalances(ctx, bytesutil.ToBytes32(s.justifiedCheckpt.Root)); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we sure it's ok to remove this? Some of these are in place to defend against fork choice bouncing attacks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not 100% confident. Would appreciate your help formulating a test case to increase our confidence that this is not an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking into this today
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All these deletions are fine because we still updated s.justifiedCheckpt
Co-authored-by: terence tsao <terence@prysmaticlabs.com>
@@ -359,9 +359,6 @@ func (s *Service) finalizedImpliesNewJustified(ctx context.Context, state state. | |||
} | |||
if !bytes.Equal(anc, s.finalizedCheckpt.Root) { | |||
s.justifiedCheckpt = state.CurrentJustifiedCheckpoint() | |||
if err := s.cacheJustifiedStateBalances(ctx, bytesutil.ToBytes32(s.justifiedCheckpt.Root)); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All these deletions are fine because we still updated s.justifiedCheckpt
service_test brings in a ton of dependencies that make bazel rules for blockchain complex, so just sticking these mocks in their own file simplifies things.
this test established that the zero root can't be used to look up the state, resulting in a change in another PR to update stategen to use the GenesisState db method instead when the zero root is detected.
beacon-chain/blockchain/metrics.go
Outdated
@@ -130,6 +130,14 @@ var ( | |||
Name: "sync_head_state_hit", | |||
Help: "The number of sync head state requests that are present in the cache.", | |||
}) | |||
stateBalanceCacheHit = promauto.NewCounter(prometheus.CounterOpts{ | |||
Name: "balance_cache_hit", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there names are too similar to the other metrics like total_effective_balance_cache_hit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why dont define these metrics closer to its usages inside state_balance_cache.go
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why dont define these metrics closer to its usages
I'm following the conventions of the blockchain package here, all the prom metric values are defined in metrics.go
.
there names are too similar to the other metrics
how about state_balance_cache_(hit|miss)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for some reason i thought the cache resides in cache
pkg, it makes sense that the metric is here, i dont want to go into another debate where the cache should live ha!
state_balance_cache
is fine, justified_state_balance_cache
is also fine
beacon-chain/blockchain/service.go
Outdated
cfg: &config{}, | ||
} | ||
for _, opt := range opts { | ||
if err := opt(srv); err != nil { | ||
return nil, err | ||
} | ||
} | ||
if srv.justifiedBalances == nil { | ||
if srv.cfg.StateGen == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reasons not having the nil check inside newStateBalanceCache
and change the signatures to
newStateBalanceCache(sg *stategen.State) (*stateBalanceCache, error)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i like that idea, the whole point of the constructor is to hint that the zero-value for the cache doesn't work, so it makes sense to make it more defensive
// read path can connect to the upstream cancellation/timeout chain. | ||
func (c *stateBalanceCache) get(ctx context.Context, justifiedRoot [32]byte) ([]uint64, error) { | ||
c.Lock() | ||
defer c.Unlock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The locks can be further optimized by having read locks around this condition and write locks when miss
It's a trade off between readability, not sure if it's worth it, just wanted to point out this option
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that would be fine, and actually wrote it that way at first. Problem was, when you have the same thread holding a read lock, then a write lock, you have to briefly let go of the read lock before acquiring the write lock, and then another thread could get a stale read and try to concurrently update. I like the single mutex because it gets rid of that edge case and makes this simpler to reason about.
What type of PR is this?
Other
(enhancement / code health, eventually part of WSS)
What does this PR do? Why is it needed?
This PR adds a modified state balance cache implementation. Instead of any code that modifies the copy of the justified root recorded by
blockchain.Service
needing to update the balance state cache, the cache is updated on read based on the requested block root.This reduces the need for code to explicitly update the cache when updating the copy of the justified state. This simplifies several areas of the code, consolidates responsibility for the cache into the cache type, and decouples the cache from the service somewhat, enabling further refactoring of service initialization in support of weak subjectivity sync.
Other notes for review
Previously the update cache method also triggered a sync of the init sync blocks:
s.cfg.BeaconDB.SaveBlocks(ctx, s.getInitSyncBlocks())
. This has been removed based on the assumption that justified states are always saved to the database because they lie on epoch boundaries (which is our minimum state persistence interval). To strengthen this invariant we also changed the order of operations inupdateJustifiedInitSync
, ensuring thatSaveJustifiedCheckpoint
completes successfully before changingService.justifiedCheckpoint
.SaveJustifiedCheckpoint
will fail if the corresponding state hasn't been saved to the db, so doing the swap after the save ensures a more atomic update of both db and runtime variable.Marking this as WIP for now because I may want to add some additional test coverage to this type. Currently it is only tested transitively through service tests. On the topic of tests,edit: added some test coverage directly on the cache. this PR also contains updates to every test that callsNewService
, because NewService has been modified to initialize the cache. Previously many tests set the.cfg
field on the Service after calling NewService, but since the cache is initialized inside NewService, it needs stategen to be set first. This was accomplished by updating all these call sites to use the new functional opts ([]Options
param to NewService). Some helpers were added to cover common test initialization patterns and some care was taken to not add test db initialization where not necessary.