Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raise if fsdp plugin is unset #30

Conversation

alex-jw-brooks
Copy link
Collaborator

@alex-jw-brooks alex-jw-brooks commented Feb 1, 2024

When we try to run multigpu training, we explicitly set:

trainer.accelerator.state.fsdp_plugin.auto_wrap_policy = fsdp_auto_wrap_policy(model)

before launching the trainer. However, the Accelerator state doesn't always define the fsdp_plugin attribute, so this can crash in a cryptic way if we launch with Torchrun for multigpu with the wrong settings.

The fsdp_plugin of the Accelerator state for multigpu configurations is set here in the Accelerate source. If you don't set ACCELERATE_USE_FSDP, it won't be defined.

This PR explicitly checks to see if the accelerator state has an fsdp_plugin before trying to update the auto wrap policy; if it doesn't, it raises an attribute error with a hint that you probably need to set ACCELERATE_USE_FSDP to run with a multigpu configuration.

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
@alex-jw-brooks alex-jw-brooks marked this pull request as ready for review February 1, 2024 15:21
@Ssukriti Ssukriti requested a review from lchu6 February 1, 2024 17:20
@fabianlim
Copy link
Collaborator

@alex-jw-brooks @Ssukriti I dont think his change is needed if #53 is merged. This is because we now explicitly check trainer.is_fsdp_enabled, which will be True only if FSDP is enabled.

@alex-jw-brooks
Copy link
Collaborator Author

Hey @fabianlim, that sounds good - I'm not super familiar with the Accelerate codebase around the plugin, but I assume that will handle the current attribute error and that the fsdp_plugin will always be defined.

I know that a lot of default args are set to None in Accelerate, including the fsdp_plugin in the AcceleratorState, so it might be a good idea to take a closer look at some point to make sure that we can't end up in a situation where fsdp_plugin is a defined state attribute that is set to None, since that would break similarly if it were possible. For now, let's close this PR in favor of the other one though!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants