rework server bootstrap completion #800

Merged (1 commit) on Oct 30, 2023

Conversation

MichaelBrim
Collaborator

Description

  • replace the pthread mutex/cond guarding the server pids with their Argobots (ABT) equivalents
  • add margo_state_dump() calls on client-server or server-server failures (currently commented out)
  • add a 'bootstrap complete' broadcast RPC that rank 0 sends once it sees that all servers have reported
  • fix function declaration for unifyfs_invoke_broadcast_extents()

Motivation and Context

Flash-X on OLCF Summit was consistently failing to bootstrap at 12 nodes. Some of the servers were ready for client connections while others were still in the bootstrap phase.

Previously, servers assumed a successful "server bootstrap" phase when they received a response to the RPC used to report their pid to rank 0. These changes ensure that every server reaches consensus on bootstrap completion before we generate the unifyfs_server.pids file that is used by the command-line utility to report success in a user job.
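The ordering described above can be sketched as a small single-process simulation. This is only an illustration of the two-step handshake (report pid, then wait for the broadcast), not the UnifyFS implementation: the `bootstrap_sim` type and the `sim_*` functions are hypothetical names, and plain function calls stand in for the real Margo RPCs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SERVERS 64

/* Hypothetical bookkeeping for the bootstrap phase; not UnifyFS's real types. */
typedef struct {
    size_t n_servers;              /* total servers in the job */
    size_t n_reported;             /* distinct pid reports seen by rank 0 */
    bool   reported[MAX_SERVERS];  /* rank 0's view: has rank i reported? */
    bool   ready[MAX_SERVERS];     /* server i's view: is bootstrap complete? */
} bootstrap_sim;

void sim_init(bootstrap_sim *s, size_t n_servers)
{
    assert(n_servers > 0 && n_servers <= MAX_SERVERS);
    s->n_servers = n_servers;
    s->n_reported = 0;
    for (size_t i = 0; i < MAX_SERVERS; i++) {
        s->reported[i] = false;
        s->ready[i] = false;
    }
}

/* Rank 0 handles a pid report from `rank`. Returns true once every server
 * has reported. Note that the sender is NOT marked ready here: getting a
 * response to the report no longer implies bootstrap is complete. */
bool sim_report_pid(bootstrap_sim *s, size_t rank)
{
    assert(rank < s->n_servers);
    if (!s->reported[rank]) {      /* ignore duplicate reports */
        s->reported[rank] = true;
        s->n_reported++;
    }
    return s->n_reported == s->n_servers;
}

/* Rank 0 broadcasts 'bootstrap complete'; every server marks itself ready.
 * Returns false (and changes nothing) if some server has not reported yet. */
bool sim_broadcast_complete(bootstrap_sim *s)
{
    if (s->n_reported != s->n_servers) {
        return false;
    }
    for (size_t i = 0; i < s->n_servers; i++) {
        s->ready[i] = true;
    }
    return true;
}

/* Number of servers that consider bootstrap complete. */
size_t sim_count_ready(const bootstrap_sim *s)
{
    size_t n = 0;
    for (size_t i = 0; i < s->n_servers; i++) {
        if (s->ready[i]) {
            n++;
        }
    }
    return n;
}
```

In this sketch, no server counts as ready until `sim_broadcast_complete()` succeeds, so a state in which some servers accept clients while others are still bootstrapping cannot occur. The pids file would be written only after that point.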

How Has This Been Tested?

Tested on OLCF Summit. Getting to a working configuration required margo 0.13.1, mercury 2.2, argobots 1.1, and libfabric 1.14.1.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Testing (addition of new tests or update to current tests)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the UnifyFS code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted.

@adammoody adammoody merged commit 2627be4 into LLNL:dev Oct 30, 2023
6 checks passed
@MichaelBrim MichaelBrim deleted the bootstrap-bug branch October 30, 2023 19:39