Fix excessive CPU memory usage with FSDP and cpu_ram_efficient_loading #33154

matthewdouglas · 2024-08-27T20:47:24Z

What does this PR do?

This PR fixes an issue with FSDP + CPU_RAM_EFFICIENT_LOADING where a copy of the parameters is loaded into CPU memory on every rank. With this change, only rank 0 loads the parameters onto the CPU; the remaining ranks place them on the meta device. On a typical 8-GPU node, this dramatically decreases the system RAM required to load a large model.
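To make the saving concrete, here is a minimal sketch of the placement rule described above. It is not the code from this PR: `choose_param_device` and `host_ram_for_weights_gib` are hypothetical helpers, the 140 GiB model size is an illustrative number, and treating "rank 0" as the first rank on each node is an assumption.

```python
def choose_param_device(local_rank: int, rank0_only: bool) -> str:
    """Return the device a rank should materialize checkpoint weights on.

    Hypothetical helper: with rank0_only=True, only the first rank on the
    node holds real CPU tensors; every other rank uses the meta device,
    which records shapes and dtypes but allocates no storage.
    """
    if rank0_only and local_rank != 0:
        return "meta"
    return "cpu"


def host_ram_for_weights_gib(model_gib: float, ranks_per_node: int, rank0_only: bool) -> float:
    """Estimate host RAM used by weight copies on one node during loading."""
    cpu_copies = sum(
        choose_param_device(rank, rank0_only) == "cpu"
        for rank in range(ranks_per_node)
    )
    return model_gib * cpu_copies


# Illustrative numbers only: a 140 GiB checkpoint on a node running 8 ranks.
before = host_ram_for_weights_gib(140, ranks_per_node=8, rank0_only=False)
after = host_ram_for_weights_gib(140, ranks_per_node=8, rank0_only=True)
print(f"every rank on CPU: {before:.0f} GiB, rank 0 only: {after:.0f} GiB")
```

Under these assumptions the host RAM needed for weights scales with the model size once per node rather than once per rank.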

This is split from a previously reverted PR #32276 originally contributed by @winglian. The revert was due to issues we had with validating the change that have since been resolved.

The issue we encountered was specific to our cluster environment on AWS. With the AWS EFA plugin for NCCL, we encountered consistent hangs. Upgrading NCCL from the version bundled with PyTorch (2.20.5) to 2.22.3 via pip install nvidia-nccl-cu12==2.22.3 resolves the issue. (Internal discussion)
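For anyone reproducing this environment, a small sketch like the following can confirm which `nvidia-nccl-cu12` wheel is installed. It only reads pip package metadata through the standard library; `parse_version`, `meets_minimum`, and `installed_nccl_wheel` are hypothetical helper names, and the 2.22.3 threshold is simply the version reported above as working in our environment, not a general requirement.

```python
from importlib import metadata
from typing import Optional, Tuple

Version = Tuple[int, ...]

# Version reported in this PR as resolving the hangs on our AWS cluster.
KNOWN_GOOD: Version = (2, 22, 3)


def parse_version(text: str) -> Version:
    """Turn a plain dotted version such as '2.22.3' into a tuple of ints."""
    return tuple(int(part) for part in text.split(".")[:3])


def meets_minimum(version: Version, minimum: Version = KNOWN_GOOD) -> bool:
    """Compare versions component by component."""
    return version >= minimum


def installed_nccl_wheel(package: str = "nvidia-nccl-cu12") -> Optional[Version]:
    """Return the installed wheel's version, or None if it is not installed."""
    try:
        return parse_version(metadata.version(package))
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_nccl_wheel()
    if found is None:
        print("nvidia-nccl-cu12 is not installed via pip")
    else:
        status = "ok" if meets_minimum(found) else "older than 2.22.3"
        print("nvidia-nccl-cu12", ".".join(map(str, found)), status)
```

This checks the pip wheel only; it does not prove which NCCL library a running process actually loaded.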

Fixes #31721, #31577

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@ArthurZucker @LysandreJik

HuggingFaceDocBuilderDev · 2024-08-27T21:06:19Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ArthurZucker · 2024-08-28T09:45:05Z

Ping me when this is ready for review!

matthewdouglas · 2024-08-28T14:03:55Z

@ArthurZucker Ready!

winglian · 2024-08-28T15:21:10Z

thanks @matthewdouglas !

ArthurZucker

Sorry, since the changes are exactly the same as what we had in #32276, could you explain what was resolved on main so that this no longer fails?

matthewdouglas · 2024-09-04T16:30:12Z

@ArthurZucker I've added more background to the description.

The issue we encountered was specific to our cluster environment on AWS. With the AWS EFA plugin for NCCL, we encountered consistent hangs. Upgrading NCCL from the version bundled with PyTorch (2.20.5) to 2.22.3 via pip install nvidia-nccl-cu12==2.22.3 resolves the issue. (Internal discussion)

ArthurZucker

Thanks a lot all for clarifying!

fabianlim · 2024-09-13T01:23:26Z

@ArthurZucker @matthewdouglas I tried this fix, but I'm having NCCL issues similar to the ones you had. Unfortunately, your suggestion to upgrade to the latest version is not working for me. I understand you had some internal debugging discussions on this topic. Would it be possible to share your NCCL env settings and other package versions? That might shed light on the root cause.

Update: I found the root cause, and it was not an NCCL issue. I have submitted a fix to TRL for it.

Fix excessive CPU memory usage with FSDP and cpu_ram_efficient_loading

1b07bd3

matthewdouglas added the PyTorch FSDP label Aug 27, 2024

matthewdouglas mentioned this pull request Aug 27, 2024

OOM when loading 300B models with AutoModelForCausalLM.from_pretrained and BitsAndBytesConfig quantization. #31577

Closed

4 tasks

matthewdouglas requested a review from ArthurZucker August 28, 2024 14:02

ArthurZucker reviewed Sep 4, 2024

ArthurZucker approved these changes Sep 4, 2024

ArthurZucker merged commit b390998 into main Sep 4, 2024
22 checks passed

ArthurZucker deleted the restore-fsdp-meta-sharding branch September 4, 2024 16:37

achew010 mentioned this pull request Sep 11, 2024

Distributed Training Problems for QLoRA models with Transformers pre-release 4.45 foundation-model-stack/fms-acceleration#83

Open

fabianlim mentioned this pull request Sep 20, 2024

Fix Inconsistency with IsShardedQLoRA Setting huggingface/trl#2089

Merged

5 tasks

itazap pushed a commit to NielsRogge/transformers that referenced this pull request Sep 20, 2024

Fix excessive CPU memory usage with FSDP and cpu_ram_efficient_loading (huggingface#33154)

c107096
