Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Beacon node exits uncontrolled (too early) during graceful shutdown #5317

Closed
nflaig opened this issue Mar 29, 2023 · 3 comments · Fixed by #5330
Closed

Beacon node exits uncontrolled (too early) during graceful shutdown #5317

nflaig opened this issue Mar 29, 2023 · 3 comments · Fixed by #5330
Labels
prio-high Resolve issues as soon as possible.
Milestone

Comments

@nflaig
Copy link
Member

nflaig commented Mar 29, 2023

Problem

Lodestar BN does not wait for all jobs to be finished during graceful shutdown.
This is problematic as some operations are not executed such as writing BLSToExecutionChanges to disk (see discord discussion)

Root cause

Lodestar uses threads package which installs SIGTERM signal listeners that exit the main process once all workers are terminated, see terminateWorkersAndMaster.

Related

@nflaig nflaig added the prio-high Resolve issues as soon as possible. label Mar 29, 2023
@g11tech
Copy link
Contributor

g11tech commented Mar 29, 2023

yes i confirm this to be issue as pointed by @nflaig : https://github.com/andywer/threads.js/blob/0d1a882f09273cb8482902ae2601fe4ffed6fbab/src/master/implementation.node.ts#L142

as soon as i comment out the lines in the local nodemodules, node close follow all close steps (but obviously workers keep going on still), else it seems to be processing this event on the first true async await call (await network.close)

So we might need to export the terminateWorkerAndMaster and call it explicity and remove these

process.on("SIGINT", () => terminateWorkersAndMaster())
  process.on("SIGTERM", () => terminateWorkersAndMaster())

``

@nflaig
Copy link
Member Author

nflaig commented Apr 4, 2023

As discussed in Apr 4 standup we want to push a fix for this before the Shapella hardfork to address the BLSToExecutionChanges issue and other possible issues related to not shutting down properly.

For the quickfix we will have to

  • release a new version of @chainsafe/threads @wemeetagain
  • explicitly call process.exit after node.close succeeds (code: 0) or fails (code: 1) @nflaig
  • retest that BLSToExecutionChanges are correctly persistet, best if @g11tech can confirm this once patch is implemented

In #5330 a cleaner solution will be implemented to make sure all active handlers are actually terminated and we don't need to rely on process.exit. I documented how to work with the branch in case anyone is bored and wants to go on an active handles hunt

@philknows
Copy link
Member

Due to not being able to make the timeline for the Shapella hard fork (wasn't super urgent since other clients were able to handle the load as well), we can wait for #5330 and fix this optimally rather than implementing a workaround.

@nflaig nflaig modified the milestones: v1.8.0, v1.9.0 May 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
prio-high Resolve issues as soon as possible.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants