Intermittently, EN loaded an older-than-expected checkpoint file during startup, so 40 extra WAL segments had to be replayed, which caused a long delay before execution.
Startup was perceived to take 16-20 minutes (for this activity); fixing this bug can bring that down to 9-10 minutes.
On Jan 14, EN startup duration included:
- 5m20s to load checkpoint.1912
- 10m34s to replay 67 WAL segments (because an old checkpoint file was loaded instead of the latest)

Total: 15m54s
Expected:
- 5m20s (approx) to load checkpoint.1953
- 4m15s (instead of 10m34s) to replay 40 fewer WAL segments after loading checkpoint.1953

Total: 9m35s (approx)
Logs show creation of checkpoint.1953 began on Jan 13 but it was not used on Jan 14.
Recent logs indicate that an older-than-expected checkpoint file was loaded on Dec 22, Jan 7, and Jan 14.
If the checkpoint files were not deleted, some upfront tasks for this issue overlap with those for #1750. On Friday, Jan 21, I suggested some ways to reduce the number of WAL segments, along with some near-term optimizations to checkpoint creation that would let us reduce WAL segments further by allowing more frequent checkpoint creation.
fxamacker changed the title from "EN startup was sometimes delayed because older checkpoint file was loaded, causing 40 extra WAL segments to replay" to "[Execution State] EN startup was sometimes delayed because older checkpoint file was loaded, causing 40 extra WAL segments to replay" on Feb 28, 2022.
The root cause of this problem is that checkpoint creation takes 12-15+ hours, during which enough WAL segments can accumulate to trigger another checkpoint creation immediately after the current one finishes.
Given how long checkpoint creation takes, a shutdown is likely to interrupt it. More than 40 WAL segments then accumulate without a finished checkpoint, and they need to be replayed during EN startup.
Additionally, there wasn't sufficient logging so the root cause wasn't as easy to identify as it should've been.
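To illustrate the kind of logging that would have made this easier to diagnose, here is a minimal sketch that picks the highest-numbered checkpoint from a list of file names and reports which entries were ignored. It assumes only the `checkpoint.N` naming seen in this issue; `latestCheckpoint`, the `.tmp` name, and the sample listing are hypothetical and do not reflect flow-go's actual implementation.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// latestCheckpoint returns the highest N among names of the form
// "checkpoint.N", plus every name it skipped, so the caller can log
// why a newer-looking file was not chosen. Hypothetical helper.
func latestCheckpoint(names []string) (best int, skipped []string, found bool) {
	const prefix = "checkpoint."
	for _, name := range names {
		if !strings.HasPrefix(name, prefix) {
			skipped = append(skipped, name)
			continue
		}
		n, err := strconv.Atoi(strings.TrimPrefix(name, prefix))
		if err != nil {
			// Not a plain "checkpoint.N" name, e.g. a partially written file.
			skipped = append(skipped, name)
			continue
		}
		if !found || n > best {
			best, found = n, true
		}
	}
	return best, skipped, found
}

func main() {
	// Illustrative directory listing; the ".tmp" name is an assumption.
	names := []string{"checkpoint.1912", "checkpoint.1953.tmp", "00001979"}

	n, skipped, ok := latestCheckpoint(names)
	if !ok {
		fmt.Println("no usable checkpoint found; skipped:", skipped)
		return
	}
	fmt.Printf("loading checkpoint.%d; skipped entries: %v\n", n, skipped)
}
```

Logging the skipped entries next to the chosen checkpoint would show at a glance when startup falls back to an older file.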
🐞 Bug Report
Updates epic #1744 because it uses more memory.
Additional context
See https://github.com/dapperlabs/flow-go/issues/6114.
I think some possible non-startup causes can include scenarios like: