
[fix] Fix the activation checkpointing when using SwiGLUPackedFusedOp #1127

Open
wants to merge 2 commits into main
Conversation

@warpuv commented Oct 11, 2024

According to the docs (https://pytorch.org/docs/stable/autograd.html#torch.autograd.Function), the forward() method should not be called directly; the apply() method has to be used instead. After removing the direct forward() call, activation checkpointing starts working.
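As a minimal sketch of the distinction (a toy Function named Square standing in for the fused op, not the actual xformers code):

```python
import torch

class Square(torch.autograd.Function):
    """Toy custom op standing in for SwiGLUPackedFusedOp."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return 2 * x * grad_out


x = torch.randn(4, requires_grad=True)

# Correct: apply() records a node in the autograd graph, so backward()
# (and utilities built on autograd, such as activation checkpointing) work.
y = Square.apply(x)
y.sum().backward()

# Incorrect: calling forward() directly just runs the code. No autograd
# node is created for Square, its custom backward() is never used, and the
# caller has to fake the ctx argument.
# y = Square.forward(None, x)
```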

What does this PR do?

Fixes #1126

Before submitting

  • Did you have fun?
    • Make sure you had fun coding 🙃
  • Did you read the contributor guideline?
  • Was this discussed/approved via a GitHub issue? (not needed for typos or doc improvements)
    • N/A
  • Did you make sure to update the docs?
    • N/A
  • Did you write any new necessary tests?
    • N/A
  • Did you update the changelog? (if needed)
    • N/A

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

@facebook-github-bot added the CLA Signed label (authors need to sign the CLA before a PR can be reviewed) on Oct 11, 2024
@pansershrek commented

@danthe3rd Can you review this PR, please? This change fixes the integration with FSDP + activation_checkpointing

@lw (Contributor) left a comment
Thanks for the PR!

It seems you're basically trying to revert the change introduced in #706. I don't have full context on that old PR, but I wonder whether there are ways to achieve both goals at once.

For example, we could take your PR and also replace the check if (ctx == nullptr) above with if (x.requires_grad). I believe this should get us everything we want.
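For illustration, here is a hypothetical Python analogue of that suggestion. SwiGLULike, its shapes, and the silu(x @ w) math are made up (the real op is a packed C++ extension with more weights and biases); the point is only the placement of the requires_grad check: save activations only when a backward pass could actually need them.

```python
import torch
import torch.nn.functional as F

class SwiGLULike(torch.autograd.Function):
    """Toy stand-in for the fused op (shapes and math are illustrative)."""

    @staticmethod
    def forward(ctx, x, w):
        out = F.silu(x @ w)
        # Python analogue of replacing `if (ctx == nullptr)` with a
        # requires_grad check: skip saving activations in pure inference.
        if x.requires_grad:
            ctx.save_for_backward(x, w)
        return out

    @staticmethod
    def backward(ctx, grad_out):
        x, w = ctx.saved_tensors
        z = x @ w
        s = torch.sigmoid(z)
        dz = grad_out * (s + z * s * (1 - s))  # derivative of silu(z)
        return dz @ w.t(), x.t() @ dz
```

The force-push and commit message below explain why gating on requires_grad interacts badly with activation checkpointing.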

@warpuv force-pushed the main branch 2 times, most recently from 6033c6c to bb5ef21 on October 17, 2024 at 17:27
An if conditional on the x.requires_grad state (used to switch behavior between inference and training modes) changes how forward() behaves during recomputation, which breaks activation checkpointing: in the recomputation phase x is detached with requires_grad == False, so a different number of tensors is saved via save_for_backward().
@warpuv (Author) commented Oct 17, 2024


@lw thank you for your review and suggestions.
The if conditional on x.requires_grad changes the behavior of forward() during recomputation: x is detached in the recomputation phase, so x.requires_grad has a different value and save_for_backward() is not called.
I have pushed an alternative solution using torch::GradMode::is_enabled(); I believe both goals are achieved this way.
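A hypothetical Python analogue of that alternative, with torch.is_grad_enabled() playing the role of torch::GradMode::is_enabled(); the names are invented and the exact placement of the check in the C++ extension may differ. Here forward() saves unconditionally (unlike the requires_grad-gated sketch above) and all inference/training gating happens in a dispatcher, where grad mode reflects the caller's context.

```python
import torch
import torch.nn.functional as F

class SwiGLULike(torch.autograd.Function):
    """Toy stand-in for the fused op; forward() always saves, and the
    inference/training decision is made outside, in swiglu_like()."""

    @staticmethod
    def forward(ctx, x, w):
        ctx.save_for_backward(x, w)
        return F.silu(x @ w)

    @staticmethod
    def backward(ctx, grad_out):
        x, w = ctx.saved_tensors
        z = x @ w
        s = torch.sigmoid(z)
        dz = grad_out * (s + z * s * (1 - s))  # derivative of silu(z)
        return dz @ w.t(), x.t() @ dz


def swiglu_like(x, w):
    # Grad mode, unlike x.requires_grad, looks the same during normal
    # training and during checkpoint recomputation (torch.utils.checkpoint
    # re-runs the forward under torch.enable_grad()), so the same branch is
    # taken in both cases and save_for_backward() stores the same tensors.
    if torch.is_grad_enabled():
        return SwiGLULike.apply(x, w)
    # Under torch.no_grad() (pure inference) skip the autograd path entirely.
    return F.silu(x @ w)
```

The intent matches the discussion above: pure inference still skips the autograd bookkeeping, while ordinary training and checkpoint recomputation take the same branch.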

Successfully merging this pull request may close this issue: Activation checkpointing is not working on SwiGLU