[Core] Head HA #503

Open
weiquanlee opened this issue Feb 14, 2025 · 0 comments
Labels
enhancement New feature or request

Comments

weiquanlee (Collaborator) commented Feb 14, 2025

Description

Head high-availability (HA) feature, which reduces the impact of head node failover (FO) in Ray clusters.

Implementation:

  1. Start two or more head nodes at the same time.
  2. Before initializing the node and starting the head node processes, the startup process connects to Redis and competes for leadership through a Redis distributed lock.
  3. Only the node that wins the leadership competition goes on to start the gcs_server and dashboard processes as usual.
  4. The standby node stays blocked in the competition step until the original leader node fails.
  5. After a normal startup, the leader's startup process periodically renews the Redis distributed lock to keep its leader status. The startup process then keeps running as a daemon that checks this head node's leadership.
  6. If the leader's entire pod fails or the lease renewal fails, the node considers itself a standby node, kills all of its processes, and exits the startup process. Exiting the startup process causes the pod to restart, which is handled by KubeRay.
  7. When a standby node finds that it has become the leader, it leaves the competition step and starts the GCS, dashboard, and other head processes.
  8. The process restarted in step 6 then blocks in the competition step as a standby node until the current leader from step 7 fails.
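
The steps above can be sketched as a lease-based leader election. This is only a minimal illustration, not the proposed implementation: `InMemoryLockStore` is a hypothetical stand-in for Redis that mimics the semantics of `SET key value NX PX <ttl>`, and the names `HeadCandidate`, `ray_head_leader`, and the 10-second lease are assumptions made for the example. A real deployment would also need the renewal to be an atomic compare-and-extend on the Redis side.

```python
import time
import uuid


class InMemoryLockStore:
    """Hypothetical stand-in for Redis; mimics `SET key value NX PX ttl`."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (owner, expiry_time)

    def _live_owner(self, key):
        entry = self._data.get(key)
        if entry is None or entry[1] <= self._clock():
            return None  # no lock, or the lease has expired
        return entry[0]

    def acquire(self, key, owner, ttl_s):
        """Take the lock only if nobody holds a live lease (steps 2-3)."""
        if self._live_owner(key) is not None:
            return False
        self._data[key] = (owner, self._clock() + ttl_s)
        return True

    def renew(self, key, owner, ttl_s):
        """Extend the lease only if `owner` still holds it (step 5)."""
        if self._live_owner(key) != owner:
            return False
        self._data[key] = (owner, self._clock() + ttl_s)
        return True


class HeadCandidate:
    """One head node's startup process (all names here are illustrative)."""

    def __init__(self, store, key="ray_head_leader", ttl_s=10.0):
        self.store = store
        self.key = key
        self.ttl_s = ttl_s
        self.node_id = uuid.uuid4().hex
        self.is_leader = False

    def try_become_leader(self):
        # A standby would call this in a loop until it succeeds (steps 4, 7).
        self.is_leader = self.store.acquire(self.key, self.node_id, self.ttl_s)
        return self.is_leader

    def renew_or_step_down(self):
        # On failure the real process would kill its children and exit (step 6).
        self.is_leader = self.is_leader and self.store.renew(
            self.key, self.node_id, self.ttl_s
        )
        return self.is_leader


class FakeClock:
    """Deterministic clock so the demo does not need to sleep."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


def demo():
    clock = FakeClock()
    store = InMemoryLockStore(clock)
    a, b = HeadCandidate(store), HeadCandidate(store)
    events = [a.try_become_leader(), b.try_become_leader()]
    clock.advance(5)
    events.append(a.renew_or_step_down())  # lease extended to t=15
    clock.advance(11)                      # t=16: the lease has lapsed
    events.append(b.try_become_leader())   # standby takes over
    events.append(a.renew_or_step_down())  # old leader must step down
    return events
```

In this sketch `demo()` walks through one failover: node `a` wins, node `b` waits, `a` misses a renewal, `b` takes the lock, and `a` detects that it is no longer the leader.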

Use case

Set the environment variable `RAY_ENABLE_HEAD_HA` to `True` to enable the feature.
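
As a rough sketch of how a startup script might read this flag: the helper name `head_ha_enabled` and the case-insensitive comparison are assumptions for illustration, and the issue does not say how Ray actually parses the value.

```python
import os


def head_ha_enabled(environ=None):
    """Return True if RAY_ENABLE_HEAD_HA is set to "True" (hypothetical helper).

    Assumption: the comparison ignores case and surrounding whitespace.
    """
    env = os.environ if environ is None else environ
    return env.get("RAY_ENABLE_HEAD_HA", "").strip().lower() == "true"
```

For example, `head_ha_enabled({"RAY_ENABLE_HEAD_HA": "True"})` returns `True`, while an unset variable leaves the feature disabled.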

Dependency

Requires a related KubeRay modification to create multiple head nodes.
Worker nodes must access the head node through the domain name provided by KubeRay.

weiquanlee added the enhancement label Feb 14, 2025
weiquanlee mentioned this issue Feb 14, 2025 (8 tasks)