Introduce shard-merging util for FSDP #2772
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thank you, @muellerzr, for this nice utility function to merge the sharded FSDP state dicts! It would be great to have an accompanying test for this.
@pacman100 hope the tests are to your liking :)
Thank you @muellerzr for iterating! Very useful feature.
Thanks for adding this and creating a great test for it! Left a few minor comments.
```python
def test_merge_weights_safetensors(model, path):
    # Should now be saved at `path/merged.safetensors`
    merge_fsdp_weights(path / "pytorch_model_fsdp_0", path, use_safetensors=True)

    safe_state_dict = load_file(path / "merged.safetensors")
    safe_loaded_model = TinyModel()
    check_weights("diff", model.state_dict(), safe_loaded_model.state_dict())
    safe_loaded_model.load_state_dict(safe_state_dict)
    check_weights("same", model.state_dict(), safe_loaded_model.state_dict())
```
nice
What does this PR do?
This PR brings a shard-merging util from PyTorch into accelerate. The reason for doing so is that we can specifically save these weights as `safetensors` rather than `.bin`. It is applicable to users who use `SHARDED_STATE_DICT` during FSDP.

Note that this is a CPU-bound process, so you just need enough RAM to load the model you want to merge into memory.
New API

Command Line:
- The `checkpoint_dir` can optionally be removed after merging via `--remove_checkpoint_dir`
- Pass `--use_pytorch` to save as `.bin` instead of `safetensors`

Python API
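To make the end-to-end workflow concrete, here is a self-contained sketch of what the Python API does. It uses plain pickle files in place of real distributed checkpoint shards, and `merge_fsdp_weights_sketch` is a hypothetical stand-in, not accelerate's actual function:

```python
import pickle
from pathlib import Path
from tempfile import TemporaryDirectory

def merge_fsdp_weights_sketch(checkpoint_dir, output_dir):
    """Hypothetical stand-in for the Python API: read every per-rank
    shard file in `checkpoint_dir`, concatenate per-parameter slices in
    rank order, and write one merged file to `output_dir`."""
    merged = {}
    for shard_file in sorted(Path(checkpoint_dir).glob("rank*.pkl")):
        shard = pickle.loads(shard_file.read_bytes())
        for name, chunk in shard.items():
            merged.setdefault(name, []).extend(chunk)
    out = Path(output_dir) / "merged.pkl"
    out.write_bytes(pickle.dumps(merged))
    return out

with TemporaryDirectory() as tmp:
    # Fake a sharded checkpoint directory with two ranks
    ckpt = Path(tmp) / "pytorch_model_fsdp_0"
    ckpt.mkdir()
    (ckpt / "rank0.pkl").write_bytes(pickle.dumps({"w": [1.0, 2.0]}))
    (ckpt / "rank1.pkl").write_bytes(pickle.dumps({"w": [3.0, 4.0]}))
    merged_path = merge_fsdp_weights_sketch(ckpt, tmp)
    print(pickle.loads(merged_path.read_bytes()))  # {'w': [1.0, 2.0, 3.0, 4.0]}
```

The real API additionally chooses the output format (`safetensors` vs `.bin`) and, as the test above shows, writes the result as `merged.safetensors` when safetensors output is requested.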
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@pacman100 @SunMarc
TODO: