
Prevent OOM during checkpoint save on colab for llama3-8b qlora recipe #1315

Merged: 36 commits, Sep 10, 2024

Conversation

@mikaylagawarecki (Contributor) commented Aug 12, 2024

Context

This PR prevents OOM during checkpoint save on Colab for the following recipe:

tune run recipes/lora_finetune_single_device.py --config recipes/configs/llama3/8B_qlora_single_device.yaml

I believe it is possible to do better (perf improvements for this Colab case) if we refactor the checkpointing logic, but that would be a more invasive refactor. For now, this is the most minimally invasive change that unblocks this use case.

Changelog

Old changelog (keeping around for posterity)

  • Set a seed in 8B_qlora_single_device.yaml to make dataloader samples (and hence weights) deterministic
  • [for testing velocity purposes] Changed the number of epochs to 8 and reduced max_steps_per_epoch to 20 and gradient_accumulation_steps to 2 so that save_checkpoint is called sooner
  • [for loss curves] Made some changes to lora_finetune_single_device.py to mimic a user doing resume_from_checkpoint after each epoch (while preserving the logger)
  • [Not to be landed in torchtune, for PoC purposes] Patched FakeTensor.__reduce_ex__, which is needed so that the write_record_metadata utility for creating empty checkpoints is called (requires the changes in pytorch#133272, "Prototype changes to create fake checkpoints with empty storages"). I need to figure out how to land this piece :)
  • Skipped registration of reparametrize_as_dtype_state_dict_post_hook for llama3
    - Showed an example of how to do the corresponding reparametrization (using mmap to prevent OOM) in lora_finetune_single_device.py:save_checkpoint

Test plan

Sanity check

Ran the tune run command on devgpu (with only the changes in 8B_qlora_single_device.yaml to set the seed and decrease steps per epoch) and verified with a small snippet that the generated meta_model_0.pt is the same before and after the changes in this PR.
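For reference, a verification snippet along these lines could be used (a minimal sketch; the paths are illustrative, not the exact script that was run):

```python
import torch

# Hypothetical paths to the checkpoint produced before vs. after this PR's changes.
sd_before = torch.load("before/meta_model_0.pt", map_location="cpu")
sd_after = torch.load("after/meta_model_0.pt", map_location="cpu")

assert sd_before.keys() == sd_after.keys()
for k in sd_before:
    # With the seed fixed, the runs are deterministic, so we expect bitwise equality.
    assert torch.equal(sd_before[k], sd_after[k]), f"mismatch in {k}"
print("checkpoints match")
```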

         Before ckpt save time (s)   After ckpt save time (s)
devgpu   178                         ~200 (ranged from 190 to ~220)
colab    OOM (no baseline)           510 (ranged from 421 to 610)

Verified that colab does not OOM
https://colab.research.google.com/drive/1y7Az78ATauK7gkewZkcMO3cNgVWm1233?usp=sharing

Loss Curves

The validation was run on commit abdbd7, which has special logic to mimic resume_from_checkpoint for each epoch.

Config is per the changes in recipes/configs/llama3/8B_qlora_single_device.yaml (8 epochs with 20 steps per epoch and gradient accumulation every 2 steps) on 6cf31b6. For the checkpoint-reloading case, I modified lora_finetune_single_device.py:recipe_main to mimic resume_from_checkpoint after each epoch.

Devgpu: (screenshot of loss curves, 2024-09-05)

Colab: (screenshot of loss curves, 2024-09-05)

pytorch-bot bot commented Aug 12, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1315

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit dc8c6b0 with merge base 66590b4:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the "CLA Signed" label on Aug 12, 2024
@ebsmothers (Contributor) left a comment:

This is awesome! Left a few basic questions, but overall really excited to see that we'll be able to provide proper end-to-end Colab support with these changes. One high-level comment is that we should think about writing a utility for the portion in ~L520-L550. Then we can gate behind a config like low_memory_save or something like that. But that's more of a UX thing, happy to help out there if needed.


# Construct the full state dict with LoRA weights merged into base LLM weights
merged_state_dict = get_merged_lora_ckpt(
    state_dict,
    rank=self._lora_rank,
    alpha=self._lora_alpha,
    dest_state_dict=dest_state_dict,
)
Contributor:

(another noob q:) so after this, we can save the checkpoint normally (e.g. in the call to save_checkpoint on L580), even though we are now saving something like {"model": dest_state_dict, ... (other stuff)} to a separate file from the mmapped one?

@mikaylagawarecki (Contributor, Author) replied Aug 22, 2024:

We should think of dest_state_dict as being backed by disk (and yes we are writing it once to disk in a torch.save format in this whole bit)

The final save_checkpoint on L580 re-saves a new checkpoint with {"model": merged_state_dict, ... (other stuff)} to a new checkpoint file. So we are saving dest_state_dict twice.

We could potentially do something smarter to avoid this "re-save", but given that the checkpointers seem to do some remapping, the code changes would be more invasive than they are now, so I refrained from doing that for v0. wdyt?
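To make the "saving twice" point concrete, here is a toy sketch (stand-in tensors, shapes, and paths; not the recipe's actual code):

```python
import torch

# Toy stand-ins: in the recipe, dest_state_dict is the mmap-backed (disk-backed)
# dict and state_dict holds the NF4 model weights plus LoRA params.
state_dict = {"w": torch.randn(4, 4)}
dest_state_dict = {"w": torch.empty(4, 4)}

# In the recipe this copy lands in mmap'd, disk-backed storage (the first write
# to disk); here it is just an in-RAM stand-in for the merge step.
dest_state_dict["w"].copy_(state_dict["w"])

# The checkpointer then re-serializes the merged weights into the final
# checkpoint file, so the same bytes are written to disk a second time.
torch.save({"model": dest_state_dict}, "/tmp/meta_model_0.pt")
```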

Comment on lines +511 to +539
# Do this using the state_dict to avoid running upcast and H2D in state_dict post hook twice
# Must be before get_merged_lora_ckpt because get_merged_lora_ckpt will remove lora keys
adapter_key_filter = lambda x: x in self.adapter_params
adapter_state_dict = {
    k: v for k, v in state_dict.items() if adapter_key_filter(k)
}
@mikaylagawarecki (Contributor, Author) commented Aug 29, 2024:

Ideally we want to only run state_dict post hooks once, so we should reuse the state_dict

This does change the semantics slightly, though: before, adapter_*.pt contained weights tagged with CUDA, but now it contains weights tagged with CPU.

I'm not sure whether the old behavior was intended or whether this change is OK (but no CI seems to fail :D). Also, when loading, we use map_location="cpu" regardless.
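For reference, a minimal sketch of the load path mentioned above (the adapter file name is illustrative):

```python
import torch

# On resume, adapter weights are loaded onto CPU regardless of the device tag
# they were saved with, so CPU-tagged tensors in adapter_*.pt change nothing here.
adapter_state_dict = torch.load("adapter_0.pt", map_location="cpu")
```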

Contributor:

Sorry I missed this comment before. But yeah I think this change makes sense, I don't think there's any reason we need to require CUDA weights. And as you point out since we load on CPU when resuming it was probably never really an issue. Plus not re-running the state dict post hooks is a nice bonus.

@mikaylagawarecki marked this pull request as ready for review August 29, 2024 17:11
@mikaylagawarecki changed the title from "[PoC] Prevent OOM during checkpoint save on colab" to "Prevent OOM during checkpoint save on colab for llama38b qlora recipe" on Aug 29, 2024
@mikaylagawarecki changed the title from "Prevent OOM during checkpoint save on colab for llama38b qlora recipe" to "Prevent OOM during checkpoint save on colab for llama3-8b qlora recipe" on Aug 29, 2024
@@ -23,6 +23,7 @@ model:
   apply_lora_to_output: False
   lora_rank: 8
   lora_alpha: 16
+  low_cpu_ram: False
Contributor:

Given that we have to modify the state dict hook, I see why it makes sense to put this in the model config. But it does feel a bit weird to me, since it's not really a property of the model (it's more that the model is a convenient place for us to know that we're going to have to upcast NF4 tensors).

I wonder if we can instead define a standalone config, parse it in the recipe with e.g. low_cpu_ram = cfg.get("low_cpu_ram", False), then use that to overwrite the reparametrize_as_dtype_state_dict_post_hook. Maybe a bit hacky, but we can at least assert that the expected state dict hook is there before replacing it to ensure that we aren't adding this onto any old non-QLoRA model. Is that obviously worse? (Mainly I want to avoid our model classes having to know or care about low-level details of how they're gonna be checkpointed)
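A rough sketch of that alternative (hypothetical recipe-side code; swap_reparametrize_hook_for_low_ram is an illustrative helper name, and the PR ultimately patches the hook via a flag in common_utils instead):

```python
from omegaconf import DictConfig
from torchtune import config


def swap_reparametrize_hook_for_low_ram(model) -> None:
    """Hypothetical helper: assert that the expected QLoRA reparametrize state dict
    post hook is registered, then replace it with a low-CPU-RAM variant."""
    ...


def setup_model(cfg: DictConfig):
    # Standalone flag, parsed in the recipe instead of living in the model config.
    low_cpu_ram = cfg.get("low_cpu_ram", False)
    model = config.instantiate(cfg.model)
    if low_cpu_ram:
        swap_reparametrize_hook_for_low_ram(model)
    return model
```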

Contributor (Author):

Hmm, I think that to manually remove and re-register the hook we would need the handle returned by _register_state_dict_hook.

Is the way I updated this to patch the hook OK with you?

Comment on lines 261 to 266
if sys.platform == "win32":
    raise RuntimeError(
        "low_cpu_ram=True not supported on Windows."
    )
else:
    raise RuntimeError(
        "low_cpu_ram=True requires torch.__version__ >= 2.5.0.dev20240830."
    )
Contributor:

Similar comment here: ideally we can do these checks in the recipe (or in a utility) rather than in the builder of the model



# mmap.MAP_SHARED is not supported on Windows but this change targets colab.
if hasattr(torch.serialization, "skip_data") and not sys.platform == "win32":
Contributor:

Is the first half of this check just a proxy for a particular torch version? If so maybe better to just directly gate on that (with torch_version_ge or something)

@mikaylagawarecki (Contributor, Author) replied Aug 30, 2024:

torch_version_ge seems to cause a circular import when imported in this file :/ so I'm just using torch.__version__ directly.

Contributor:

Oh yeah, we are working to fix that; using torch.__version__ is good too.

@ebsmothers (Contributor) left a comment:

A few more small comments and questions, but otherwise this looks good to go!

Comment on lines 724 to 725
if cfg.get("low_cpu_ram", False):
    common_utils._use_low_cpu_ram = True
Contributor:

Does this have to be in recipe_main? Can we instead do it somewhere inside the recipe class (before the model gets instantiated)? I would also add a one-line comment explaining this.
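For illustration, moving the flag into the recipe class might look roughly like this (a sketch; only the flag handling is shown, and its placement inside setup is an assumption):

```python
from omegaconf import DictConfig
from torchtune.modules import common_utils


class LoRAFinetuneRecipeSingleDevice:
    """Sketch only; everything except the flag handling is elided."""

    def setup(self, cfg: DictConfig) -> None:
        # Set the flag before the model (and its state dict hooks) is built,
        # so that hook registration can pick up the low-CPU-RAM path.
        if cfg.get("low_cpu_ram", False):
            common_utils._use_low_cpu_ram = True
        # ... model / optimizer / dataloader setup continues as in the recipe ...
```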



# mmap.MAP_SHARED is not supported on Windows but this change targets colab.
if torch.__version__ >= "2.5.0.dev20240906" and not sys.platform == "win32":
Contributor:

Just a sanity check here: is this if/else just to ensure that no one tries to directly import the _low_ram_reparametrize_as_dtype_state_dict_post_hook API on an unsupported environment? Mainly asking because we have the equivalent checks in _register_reparametrize_state_dict_hooks now

Contributor (Author):

You're right, removing this if else

-    model._register_state_dict_hook(
-        partial(reparametrize_as_dtype_state_dict_post_hook, offload_to_cpu=True)
-    )
+    _register_reparametrize_state_dict_hooks(model)
Contributor:

Doesn't have to be done in this PR, but we can think about adding this for other models that would have a similar memory situation when running QLoRA (Llama 3.1 8B is an obvious choice, but there are a handful of other similarly-sized models supported in our repo that could benefit from this)

Contributor (Author):

Will do in a follow-up.

# Create a state_dict on disk with space reserved for storage bytes
# Then load with mmap and MAP_SHARED (can writeback to disk file)
dest_state_dict_path = "/tmp/fake_state_dict.pt"
with torch.serialization.skip_data(materialize_fake_tensors=True):
Contributor:

noob q: what does materialize_fake_tensors mean in this context?

@mikaylagawarecki (Contributor, Author) replied Sep 9, 2024:

It means that FakeTensors in the object passed to torch.save will be treated as if they were real tensors. The implication is that torch.load will load a tensor (not FakeTensor) on the FakeTensor's device with storage allocated but uninitialized (0s)
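A minimal end-to-end sketch of the mechanism being described (assumes a recent torch nightly that has torch.serialization.skip_data, per the version gate discussed above; the tensor shape and the path are illustrative):

```python
import mmap

import torch
from torch._subclasses.fake_tensor import FakeTensorMode

dest_state_dict_path = "/tmp/fake_state_dict.pt"

# 1) Save a "fake" state dict: checkpoint metadata plus reserved (uninitialized)
#    storage bytes, without ever allocating the real data in RAM.
with FakeTensorMode():
    fake_sd = {"weight": torch.empty(4096, 4096, dtype=torch.bfloat16)}
with torch.serialization.skip_data(materialize_fake_tensors=True):
    torch.save(fake_sd, dest_state_dict_path)

# 2) Load it back with mmap + MAP_SHARED so that writes to the tensors are
#    paged back to the file on disk instead of consuming CPU RAM.
torch.serialization.set_default_mmap_options(mmap.MAP_SHARED)
dest_state_dict = torch.load(dest_state_dict_path, mmap=True, map_location="cpu")

# 3) Filling dest_state_dict["weight"] in place (e.g. via .copy_()) now writes
#    through to the file, keeping resident memory low during checkpoint save.
```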

Comment on lines 127 to 131
# In-place update the original state_dict object. Although the private state dict
# post hook supports out-of-place behavior, the semantics are actually buggy. We eventually
# want to use the public state_dict post hook, which does not support out-of-place behavior.
for k in state_dict.keys():
    state_dict[k] = dest_state_dict[k]
Contributor:

I think I have some misunderstanding here. If we inplace update the state dict to the upcasted version of the weights, why won't it cause an OOM?

Contributor (Author):

When we do sd = torch.load(..., mmap=True), the storages of the tensors in sd are mmap-backed

state_dict[k] = dest_state_dict[k] does not access any pages of the storage of the tensor given by dest_state_dict[k], so the storage is not materialized, and no OOM will happen
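As a toy illustration of why the in-place update is cheap (reuses the mmap-backed file from the sketch above; purely illustrative):

```python
import torch

# Tensors loaded with mmap=True are lazily backed by the file; no data is read yet.
dest_state_dict = torch.load("/tmp/fake_state_dict.pt", mmap=True, map_location="cpu")

state_dict = {}
for k in dest_state_dict.keys():
    # Rebinding a dict entry only copies tensor metadata and a storage handle;
    # no storage pages are faulted in, so CPU RAM usage stays flat.
    state_dict[k] = dest_state_dict[k]

# Pages are materialized only when the data is actually touched, e.g.
# state_dict["weight"].float().sum()
```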

Contributor:

Oh yeah nvm I get it now, I was not thinking it through carefully enough. Thanks for the explanation!

codecov-commenter commented Sep 9, 2024

Codecov Report

Attention: Patch coverage is 17.64706% with 42 lines in your changes missing coverage. Please review.

Project coverage is 27.18%. Comparing base (66590b4) to head (dc8c6b0).

Files with missing lines                          Patch %   Lines
torchtune/modules/common_utils.py                 22.22%    28 Missing ⚠️
recipes/lora_finetune_single_device.py             0.00%     7 Missing ⚠️
torchtune/modules/peft/_utils.py                    0.00%     6 Missing ⚠️
torchtune/models/llama3/_component_builders.py     50.00%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1315      +/-   ##
==========================================
- Coverage   27.22%   27.18%   -0.04%     
==========================================
  Files         286      286              
  Lines       13828    13869      +41     
==========================================
+ Hits         3764     3770       +6     
- Misses      10064    10099      +35     

☔ View full report in Codecov by Sentry.

@ebsmothers (Contributor) left a comment:

Thank you for enabling this, can't wait to put together some torchtune Colab notebooks for our users!

@mikaylagawarecki (Contributor, Author) commented Sep 10, 2024:

Added the changes from #1535, plus one more round of loss curves comparing the fake resume_from_checkpoint in eba1ffa with the same config on base.

(Two screenshots of loss curves, 2024-09-10)

@ebsmothers ebsmothers merged commit 515efbe into pytorch:main Sep 10, 2024
17 checks passed
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
4 participants