Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Execution State] EN startup was sometimes delayed because older checkpoint file was loaded, causing 40 extra WAL segments to replay #1884

Closed
fxamacker opened this issue Jan 23, 2022 · 2 comments · Fixed by #1944
Assignees
Labels
Bug Something isn't working Execution Cadence Execution Team

Comments

@fxamacker
Copy link
Member

fxamacker commented Jan 23, 2022

🐞 Bug Report

Intermittently, EN used an older-than-expected checkpoint file during startup which caused 40 extra WAL segments to be replayed, causing a long delay before execution.

The perception was 16-20 minute startup times (for this activity) and fixing this bug can bring that down to 9-10 minutes.

On Jan 14, EN startup duration included:

  •   5m20s - to load checkpoint.1912
  • 10m34s - to replay 67 WAL segments (because old checkpoint file was loaded instead of latest)
    ----------
    15m54s

Expected:

  • 5m20s (approx) - to load checkpoint.1953
  • 4m15s (instead of 10m34s) - to replay 40 fewer WAL segments after loading checkpoint.1953.
    ---------
    9m35s (approx)

Logs show creation of checkpoint.1953 began on Jan 13 but it was not used on Jan 14.

Recent logs indicate the loading of older-than-expected checkpoint file happened on Dec 22, Jan 7, and Jan 14.

Updates epic #1744 because it uses more memory.

Additional context

See https://github.com/dapperlabs/flow-go/issues/6114.

I think some possible non-startup causes can include scenarios like:

  • latest checkpoint file was deleted before startup begins looking for latest checkpoint file, or
  • latest checkpoint file didn't successfully finish being written prior to startup
@fxamacker fxamacker added the Bug Something isn't working label Jan 24, 2022
@fxamacker
Copy link
Member Author

If the checkpoint files were not deleted, there's overlap in some upfront tasks for this issue and #1750. On Friday, Jan 21, I suggested some ways to reduce number of WAL segments and some near-term optimizations to checkpoint creation that can allow us to further reduce WAL segments by allowing more frequent checkpoint creation.

@fxamacker fxamacker changed the title EN startup was sometimes delayed because older checkpoint file was loaded, causing 40 extra WAL segments to replay [Execution State] EN startup was sometimes delayed because older checkpoint file was loaded, causing 40 extra WAL segments to replay Feb 28, 2022
@fxamacker
Copy link
Member Author

fxamacker commented Mar 25, 2022

Root cause of this problem is checkpoint creation taking 12-15+ hours, which can accumulate enough WAL segments to trigger another checkpoint creation immediately after the current checkpoint finishes.

Given the long duration of checkpoint creation, shutdowns would likely interrupt checkpoint creation which causes more than 40 WAL segments to accumulate and they need to be replayed during EN startup.

Additionally, there wasn't sufficient logging so the root cause wasn't as easy to identify as it should've been.

This is closed by PR #1944 #2126 #2165 #2119 #2187

@fxamacker fxamacker self-assigned this Jul 15, 2022
@fxamacker fxamacker added the Execution Cadence Execution Team label Jul 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Something isn't working Execution Cadence Execution Team
Projects
None yet
1 participant