Review regen state cache strategy #2846

Closed
dapplion opened this issue Jul 15, 2021 · 7 comments

Labels
prio-high Resolve issues as soon as possible. scope-performance Performance issue and ideas to improve performance.

Comments

dapplion commented Jul 15, 2021
From Paul (Lighthouse), in normal network conditions:

  • The state cache should hold only 1 state (the head state), attached to beacon_chain.canonical_head
  • To process attestations, the shuffling cache should contain only 3 entries: previous, current and next epochs. They take 64 bits * VALIDATOR_COUNT (with 250,000 validators that's roughly 8 bytes × 250,000 ≈ 2 MB per shuffling), so not much either.

Our state cache can reference up to 96 states. Thanks to structural sharing the total memory required is not 96 * state size, though it is still very high.

Review the strategy to ensure that, at least in normal conditions, we keep just one state in memory.

dapplion commented Jul 15, 2021
For lightclient proof serving we could use a dedicated cache, enabled with a flag --enable-state-proof-cache, which tracks one state per block in the canonical chain up to some configurable limit.
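A minimal sketch of such a cache, modelled as a blockRoot → state map bounded by a configurable limit (the class and option names here are hypothetical, not existing Lodestar APIs):

type RootHex = string;

class StateProofCache<TState> {
  private readonly states = new Map<RootHex, TState>();

  constructor(private readonly maxStates = 128) {}

  /** Track the post-state of a canonical block */
  add(blockRoot: RootHex, state: TState): void {
    this.states.set(blockRoot, state);
    // Evict the oldest entry once over the limit. A JS Map iterates keys in
    // insertion order, so the first key is the oldest tracked block.
    if (this.states.size > this.maxStates) {
      const oldest = this.states.keys().next().value;
      if (oldest !== undefined) this.states.delete(oldest);
    }
  }

  get(blockRoot: RootHex): TState | null {
    return this.states.get(blockRoot) ?? null;
  }
}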

dapplion commented Jul 15, 2021

Current regen usage (the regen interface is sketched after this list):

  • getPreState: (block: allForks.BeaconBlock): Return a valid pre-state for a beacon block
    • Used in chain state transition to process blocks, processBlock() and processChainSegment()
      • processBlock(): Used in range sync, API publishBlock, gossip beacon_block handler. The last two will need a head state, of which there could be many.
      • processChainSegment(): Used in range sync. Will start from the latest finalized block, so no need for cache?
  • getCheckpointState: (cp: phase0.Checkpoint): Return a valid checkpoint state
    • Used in onForkChoiceFinalized by lightclientUpdater. Could be refactored to use the state directly after the state transition
    • Used in getPreState. See above.
    • Used in validateGossipAggregateAndProof. Can be replaced by an attester shufflings cache. We can fetch the attestation target block from the forkChoice and use it (or its parent) to get the dependentRoot.
    • Used in validateGossipAttestation. Can be replaced by an attester shufflings cache. Same as above.
    • Used in validateGossipVoluntaryExit. Can we just use the HEAD state? Yes. If performance is good, dial it forward to the current epoch. Note: Lighthouse uses the clock state.
  • getBlockSlotState: (blockRoot: Root, slot: Slot): Return the state of blockRoot processed to slot slot
    • Used in produceAttestationData. Can we just use the HEAD state? Lighthouse uses the head and may do a partial state advance. We should just allow the HEAD to be dialed forward with the exclusive purpose of maybe updating the current justified checkpoint (consider having a light state transition that computes just that).
    • Used in getHeadStateAtCurrentEpoch:
      • getProposerDuties: Use block proposer shuffling cache, from the HEAD.
      • getAttesterDuties: Use attester shufflings cache, from the HEAD.
    • Used in getHeadStateAtCurrentSlot: NOT USED
    • Used in assembleBlock. We need the HEAD state at the slot we are proposing. May trigger an epoch transition. This is why we use the regen for its cache.
    • Used in getCheckpointState. See above.
    • Used in validateGossipBlock. We may just use a block proposer shuffling cache, and use the forkChoice block summary target to get the dependent root quickly.
  • getState: (stateRoot: Root): Return the exact state with stateRoot
    • Used in beacon state + debug API. Use the HEAD, throw for everything else. Use a separate cache if this node wants to back a block explorer or proof server.
    • Used in getStateByBlockRoot: NOT USED
    • Used in getPreState. See above.
    • Used in getBlockSlotState. See above.
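
For reference, this is roughly the regen interface those call sites use, reconstructed from the signatures quoted above (simplified: in the actual Lodestar code these methods return tree-backed cached states, so the return types here are an approximation):

import {allForks, phase0, Root, Slot} from "@chainsafe/lodestar-types";

interface IStateRegenerator {
  /** Return a valid pre-state for a beacon block */
  getPreState(block: allForks.BeaconBlock): Promise<allForks.BeaconState>;
  /** Return a valid checkpoint state */
  getCheckpointState(cp: phase0.Checkpoint): Promise<allForks.BeaconState>;
  /** Return the state of blockRoot processed to slot `slot` */
  getBlockSlotState(blockRoot: Root, slot: Slot): Promise<allForks.BeaconState>;
  /** Return the exact state with stateRoot */
  getState(stateRoot: Root): Promise<allForks.BeaconState>;
}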

twoeths commented Jul 16, 2021

I agree this is a good time to review our cache and regen module; these were designed at a time when testnets were unstable, so they're quite conservative.

Regen module

It's there to deal with abnormal network conditions. Currently the worst case is to generate a state from a finalized checkpoint, which never happens under the current good network conditions. Hence I propose we generate from the justified checkpoint instead:

  • Per LMD GHOST (fork choice), the HEAD is the best descendant of the justified checkpoint
  • If a peer has a HEAD conflicting with ours, we'll disconnect from each other

so I think it's safe enough to generate state from the justified checkpoint instead of the finalized checkpoint. Also we should review every use case of it, as detailed above.
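
A minimal sketch of that idea, with forkChoice, checkpointStateCache and replayBlocksFrom standing in for the actual Lodestar components (their names and shapes here are assumptions):

type CheckpointHex = {epoch: number; rootHex: string};

async function regenFromJustified<TState>(
  forkChoice: {getJustifiedCheckpoint(): CheckpointHex},
  checkpointStateCache: {get(cp: CheckpointHex): TState | null},
  replayBlocksFrom: (baseState: TState, targetBlockRoot: string) => Promise<TState>,
  targetBlockRoot: string
): Promise<TState> {
  // Per LMD GHOST the head is a descendant of the justified checkpoint, so any
  // state on the canonical chain is reachable by replaying blocks from it.
  const justified = forkChoice.getJustifiedCheckpoint();
  const baseState = checkpointStateCache.get(justified);
  if (baseState === null) {
    throw new Error("Justified checkpoint state must always be kept in cache");
  }
  return replayBlocksFrom(baseState, targetBlockRoot);
}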

The cache

  • CheckpointStateCache: MAX_EPOCHS = 10, which is big, but right now we do pruneFinalized(). Under the current good network conditions it keeps only 3 checkpoint states (1 finalized, 1 justified, 1 latest) in cache.

[Screenshot from 2021-07-16 10:26:31]

  • StateContextCache: the 96 cached states span 4 epochs most of the time, not 3 epochs.

Since states of the same epoch are shared in memory, I propose to cache exactly 2 epochs (to be aligned with the regen module) for both caches, leveraging the shared memory of persistent-merkle-tree:

  • CheckpointStateCache: we could reduce MAX_EPOCHS; normally it should only store 1 justified checkpoint state and 1 latest checkpoint state
  • StateContextCache: reduce to 32 cached states, which would span 2 epochs, to be aligned with CheckpointStateCache
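
To make the proposed bounds concrete, a minimal sketch (constant names follow the discussion above, not necessarily the actual Lodestar config; the cache is modelled as an insertion-ordered map):

const MAX_STATES = 32; // StateContextCache: ~2 epochs of block post-states
const MAX_EPOCHS = 2;  // CheckpointStateCache: justified + latest checkpoint

// Drop oldest entries first; a JS Map iterates keys in insertion order, which
// approximates slot order for states added as blocks arrive.
function pruneStateCache<TState>(cache: Map<string, TState>, maxStates = MAX_STATES): void {
  while (cache.size > maxStates) {
    const oldestKey = cache.keys().next().value;
    if (oldestKey === undefined) break;
    cache.delete(oldestKey);
  }
}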

dapplion commented Jul 16, 2021

I've added some quick metrics to regen and ran altair-devnet-1.

The StateContextCache is usually full all the time. During sync it's consistently at 96 states; when synced it varies between 70 and 96 depending on the epoch's slot. CheckpointStateCache has 4 states, which makes sense too. Node heap is ~660MB.

Then I changed the cache params to MAX_STATES = 2 and MAX_EPOCHS = 1 (14:30 on the chart below). Some things break, but I just wanted a quick take on memory. It's ~615MB, a 7% reduction, which seems tiny. I guess the structural sharing of states is very powerful.

[Screenshot from 2021-07-16 14:38:50]
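
The "quick metrics" could be plain prom-client gauges set on each cache mutation; a sketch (the metric names are illustrative, not Lodestar's actual metric names):

import {Gauge} from "prom-client";

const stateCacheSize = new Gauge({
  name: "lodestar_state_context_cache_size",
  help: "Number of states in the StateContextCache",
});
const checkpointStateCacheSize = new Gauge({
  name: "lodestar_checkpoint_state_cache_size",
  help: "Number of checkpoint states in the CheckpointStateCache",
});

// Called after every add/prune so the chart reflects cache occupancy over time
function onCacheUpdate(stateCount: number, checkpointStateCount: number): void {
  stateCacheSize.set(stateCount);
  checkpointStateCacheSize.set(checkpointStateCount);
}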

dapplion commented Jul 16, 2021

A small analysis of state sizes. I'm using the performance states, which are maxed-out states with 250_000 validators.

start                rss  0 B             heapTotal  0 B             heapUsed +4.8 KB          external +40 B            arrayBuffers  0 B            
getPubkeys()         rss +73.39 MB        heapTotal +38.91 MB        heapUsed +41.32 MB        external +11.87 MB        arrayBuffers +11.87 MB       
.defaultValue()      rss +11.83 MB        heapTotal +11.5 MB         heapUsed +12.61 MB        external -520.44 KB       arrayBuffers -520.44 KB      
build raw state      rss +100.47 MB       heapTotal +100.82 MB       heapUsed +100.43 MB       external  0 B             arrayBuffers  0 B            
addPendingAtt        rss +123.2 MB        heapTotal +128.25 MB       heapUsed +128.3 MB        external  0 B             arrayBuffers  0 B            
toTreeBacked         rss +684.8 MB        heapTotal +675.74 MB       heapUsed +675.85 MB       external -13.3 KB         arrayBuffers -12.69 KB       
CachedBeaconState    rss +624.05 MB       heapTotal +537.52 MB       heapUsed +529.14 MB       external +7.65 MB         arrayBuffers +7.65 MB

Source code:

// Imports are a best-effort reconstruction for the Lodestar packages of the time
// (exact paths may differ by version); getPubkeys, buildPerformanceStateAllForks
// and addPendingAttestations are local perf-test helpers.
import {init} from "@chainsafe/bls";
import {ssz, phase0} from "@chainsafe/lodestar-types";
import {allForks} from "@chainsafe/lodestar-beacon-state-transition";
import {config} from "@chainsafe/lodestar-config/default";
import {getPubkeys, buildPerformanceStateAllForks, addPendingAttestations} from "./util";

async function analyzeStateMemory(): Promise<void> {
  await init("blst-native");

  const tracker = new MemoryTracker();
  tracker.logDiff("start");

  // Deserializing 250_000 pubkeys dominates the first jump (~73 MB rss)
  const pubkeys = getPubkeys().pubkeys;
  tracker.logDiff("getPubkeys()");

  const defaultState = ssz.phase0.BeaconState.defaultValue();
  tracker.logDiff(".defaultValue()");

  // Fill the struct state with the full validator set
  const state = buildPerformanceStateAllForks(defaultState, pubkeys);
  tracker.logDiff("build raw state");

  addPendingAttestations(state as phase0.BeaconState);
  tracker.logDiff("addPendingAtt");

  // Converting to a tree-backed state is the largest allocation (~675 MB)
  const stateTB = ssz.phase0.BeaconState.createTreeBackedFromStruct(state as phase0.BeaconState);
  tracker.logDiff("toTreeBacked");

  // Building the epoch caches roughly doubles the cost again (~624 MB rss)
  const cached = allForks.createCachedBeaconState(config, stateTB);
  tracker.logDiff("CachedBeaconState");
}

/** Logs the diff of process.memoryUsage() since the previous call */
class MemoryTracker {
  prev = process.memoryUsage();

  logDiff(id: string): void {
    const curr = process.memoryUsage();
    const parts: string[] = [];
    for (const key of Object.keys(this.prev) as (keyof NodeJS.MemoryUsage)[]) {
      const prevVal = this.prev[key];
      const currVal = curr[key];
      const bytesDiff = currVal - prevVal;
      const sign = bytesDiff < 0 ? "-" : bytesDiff > 0 ? "+" : " ";
      parts.push(`${key} ${sign}${formatBytes(Math.abs(bytesDiff)).padEnd(15)}`);
    }
    this.prev = curr;
    console.log(id.padEnd(20), parts.join(" "));
  }
}

/** Minimal byte formatter assumed by MemoryTracker (not shown in the original snippet) */
function formatBytes(bytes: number): string {
  const units = ["B", "KB", "MB", "GB"];
  let value = bytes;
  let i = 0;
  while (value >= 1024 && i < units.length - 1) {
    value /= 1024;
    i++;
  }
  return `${parseFloat(value.toFixed(2))} ${units[i]}`;
}

stale bot commented Sep 19, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

dapplion commented May 10, 2022
