[XLA:GPU] Cancel launched GPU executables in PJRT #16801

Open
yliu120 opened this issue Sep 4, 2024 · 2 comments

Comments

yliu120 commented Sep 4, 2024

Hi,

After chatting with @hawkinsp offline, we would like to track this feature request with an issue.

The main goal of this feature request is to have an API exposed in JAX and the PJRT client that does the following:

  1. Cancel all executables/thunks scheduled or launched on the device quickly. For instance, a very strong version could be issuing a cudaDeviceReset() call.
  2. While cancelling, don't kill the nccl communicators we initialized previously when we first executed the executables. Initializing those nccl communicators is expensive. The devices are still good and we just want to cancel the running computations.
  3. Don't kill the executables; keep them in memory.
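
To make the three requirements concrete, here is a toy Python sketch of the requested semantics. `ToyDevice` and `cancel_all` are hypothetical names used only for illustration; they are not existing JAX or PJRT APIs, and nothing here touches a real GPU.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDevice:
    """Toy stand-in for one GPU's runtime state (hypothetical, not a PJRT type)."""
    communicators: set = field(default_factory=set)   # expensive to set up
    executables: dict = field(default_factory=dict)   # compiled, held in memory
    in_flight: list = field(default_factory=list)     # scheduled/launched work

    def launch(self, name):
        # Only already-loaded executables can be launched.
        if name not in self.executables:
            raise KeyError(f"{name} is not loaded")
        self.in_flight.append(name)

    def cancel_all(self):
        """Hypothetical cancel: drop launched work, leave everything else alone."""
        cancelled = list(self.in_flight)
        self.in_flight.clear()
        return cancelled


dev = ToyDevice(communicators={"comm0"}, executables={"train_step": object()})
dev.launch("train_step")
dev.launch("train_step")
cancelled = dev.cancel_all()
```

After `cancel_all()`, the in-flight list is empty while the communicator set and the loaded executables are unchanged, so `dev.launch("train_step")` can be called again without recompiling or re-initializing anything.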

This feature could be very useful in the following scenario:
(Copied over from Peter's summary offline)

  • you detect a failure somehow elsewhere in the cluster (you have some runtime of your own here?)
  • you cancel all the other workers that are still up
  • you form new nccl communicators once you sub in a new machine
  • you resume training.
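
The four steps above can be sketched as a toy recovery routine. This is only an illustration under stated assumptions: `recover`, the per-worker state dictionaries, and the `(generation, members)` tuple standing in for a communicator are all hypothetical, not part of JAX, PJRT, or NCCL.

```python
def recover(workers, failed, replacement, generation):
    """Toy model of the scenario; `workers` maps worker name -> state dict."""
    # Step 1: the failure was detected elsewhere and is passed in as `failed`.
    survivors = {name: state for name, state in workers.items() if name != failed}

    # Step 2: cancel whatever is still running on the workers that are up.
    for state in survivors.values():
        state["in_flight"].clear()

    # Step 3: sub in the new machine and form a communicator over the new membership.
    survivors[replacement] = {"in_flight": [], "comm": None}
    members = tuple(sorted(survivors))
    for state in survivors.values():
        state["comm"] = (generation + 1, members)

    # Step 4: the caller resumes training on the returned set of workers.
    return survivors


workers = {
    "w0": {"in_flight": ["train_step"], "comm": (0, ("w0", "w1"))},
    "w1": {"in_flight": ["train_step"], "comm": (0, ("w0", "w1"))},
}
new_workers = recover(workers, failed="w1", replacement="w2", generation=0)
```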

yliu120 commented Sep 4, 2024

Also looping in @ezhulenev.

nouiz commented Sep 4, 2024

The description says:
"While cancelling, don't kill the nccl communicators"
But also:
"you form new nccl communicators once you sub in a new machine"
That seems to contradict itself, or am I missing something?
