Part 1: Introduce multi-node SPMD support for Neuron #8204


rpsilva-aws
Contributor

This PR adds a new initialization path that supports multi-node SPMD on Neuron. To keep the change small, we retain the xla.init() API and introduce a reinitialization, for PJRT only, that runs once SPMD is enabled. Because SPMD is enabled after the initial Neuron initialization, the environment must be reconfigured at that point if the user did not explicitly set XLA_USE_SPMD (via is_spmd(), as is currently recommended). Under the hood, both APIs guarantee that the environment is correctly configured when SPMD is enabled.

In a follow-up, the reinitialization path will be moved to torch-xla.
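
The flow described above can be modeled in a few lines of plain Python. This is only an illustrative sketch, not the code in this PR: `SketchRuntime`, its methods, and its counters are hypothetical names and are not torch_xla or Neuron APIs. The only identifier taken from the description is the `XLA_USE_SPMD` environment variable, and the sketch assumes that a value of "1" means enabled.

```python
class SketchRuntime:
    """Hypothetical stand-in for a PJRT-backed runtime (not a torch_xla API)."""

    def __init__(self, env):
        # A private copy of the environment, so the sketch never touches os.environ.
        self.env = dict(env)
        self.initialized = False
        self.spmd_enabled = False
        self.reinit_count = 0

    def init(self):
        # Initial initialization. If the user set XLA_USE_SPMD explicitly,
        # the runtime is configured for SPMD from the start.
        self.spmd_enabled = self.env.get("XLA_USE_SPMD") == "1"
        self.initialized = True

    def enable_spmd(self):
        # SPMD is turned on after init(). Reconfigure only when the
        # environment was not already set up for SPMD.
        if not self.initialized:
            raise RuntimeError("init() must be called before enable_spmd()")
        if self.spmd_enabled:
            return
        self.env["XLA_USE_SPMD"] = "1"
        self.spmd_enabled = True
        self._reinitialize_pjrt()

    def _reinitialize_pjrt(self):
        # Placeholder for the PJRT-only reinitialization step.
        self.reinit_count += 1


# Case 1: the user did not set XLA_USE_SPMD, so enabling SPMD reconfigures.
late = SketchRuntime(env={})
late.init()
late.enable_spmd()

# Case 2: the user set XLA_USE_SPMD up front, so no reconfiguration is needed.
early = SketchRuntime(env={"XLA_USE_SPMD": "1"})
early.init()
early.enable_spmd()
```

In both cases the runtime ends up configured for SPMD; only the first case pays for a reinitialization, which mirrors the condition stated in the description.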

@rpsilva-aws changed the title from "Part 1: Introduce multi-node SPMD support in 2.5" to "Part 1: Introduce multi-node SPMD support" on Oct 2, 2024
@rpsilva-aws
Contributor Author

Note, same as: #8046 (review)
Using #8046 as a follow-up for 2.6.

@rpsilva-aws changed the title from "Part 1: Introduce multi-node SPMD support" to "Part 1: Introduce multi-node SPMD support for Neuron" on Oct 2, 2024
@rpsilva-aws marked this pull request as ready for review on October 2, 2024, 19:39
@rpsilva-aws
Contributor Author

@will-cromar, fyi: #8204 (comment)

torch_xla/runtime.py (review thread: outdated, resolved)
@rpsilva-aws force-pushed the rpsilva-aws_neuron_multi_node_spmd_pt25 branch from 7934779 to f221f59 on October 2, 2024, 19:58
@JackCaoG
Collaborator

JackCaoG commented Oct 2, 2024

please fix the linter

@rpsilva-aws force-pushed the rpsilva-aws_neuron_multi_node_spmd_pt25 branch from f221f59 to 0571c47 on October 2, 2024, 20:53
@jeffhataws merged commit 0b19374 into pytorch:master on Oct 4, 2024
23 checks passed
3 participants