
Llama3-8b memory efficient full finetune #990

Merged 1 commit into main on May 17, 2024

Conversation

@rohan-varma (Member) commented May 16, 2024

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

TL;DR: This PR saves ~46% peak memory for the llama3-8b single-device full finetune while keeping performance at parity with the current offering, simply by switching AdamW8bit -> PagedAdamW8bit. @ebsmothers reminded me that this exists after I had taken a much more complicated approach.

Changelog

  • Previous experiments using PagedOptimizer from bnb for llama3 workloads resulted in prohibitively slow QPS (> 6 s/it, whereas the paged optimizer in the llama2 workload still provided > 1 it/s). After some debugging, this turned out to be primarily due to paging the large optimizer states associated with the embedding and output projection in and out.
  • After a chat with @ebsmothers, we realized we could just try PagedAdamW8bit, which shrinks the optimizer states for the output projection and embedding, and experiments showed this benefits memory usage at no cost to QPS (see the sketch after this list). A previous version of this PR used AdamW8bit for the output projection and embedding and PagedOptimizer for everything else, but this approach is much simpler.
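
For illustration, the change amounts to swapping the optimizer class. A minimal standalone sketch with bitsandbytes (the tiny model and learning rate here are placeholders, not the actual recipe code):

```python
import torch
import bitsandbytes as bnb

# Tiny stand-in module; the real recipe builds Llama3-8B.
model = torch.nn.Linear(1024, 1024).cuda()

# Before: 8-bit AdamW keeps all optimizer state resident in GPU memory.
# optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=2e-5)

# After: paged 8-bit AdamW can evict optimizer state to CPU under memory
# pressure, and the 8-bit states are small enough that paging them in and
# out does not hurt QPS the way the paged 32-bit states did.
optimizer = bnb.optim.PagedAdamW8bit(model.parameters(), lr=2e-5)

loss = model(torch.randn(8, 1024, device="cuda")).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```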

Test plan

Current 8B full single device:

Step 132 | loss:0.8253529071807861 lr:2e-05 tokens_per_second_per_gpu:335.27384473126125 peak_memory_active:33.644823552 peak_memory_alloc:33.644823552 peak_memory_reserved:36.320575488
1.27it/s

8B_full_single_device using PagedAdamW:

Step 44 | loss:0.7166110873222351 lr:2e-05 tokens_per_second_per_gpu:23.2328866347064 peak_memory_active:17.456531968 peak_memory_alloc:17.456531968 peak_memory_reserved:19.302187008
7.64s/it

8B full single device with this PR (PagedAdamW8bit):

Step 44 | loss:0.7101777195930481 lr:2e-05 tokens_per_second_per_gpu:223.38999360355982 peak_memory_active:17.486964224 peak_memory_alloc:17.486964224 peak_memory_reserved:19.333644288

For comparison, current llama2-7b 7B_full_low_memory:

Step 44 | loss:0.6592501997947693 lr:2e-05 tokens_per_second_per_gpu:326.61038570142404 peak_memory_active:13.924915712 peak_memory_alloc:13.924915712 peak_memory_reserved:14.845739008
1.41it/s

TL;DR: This PR reduces peak memory by ~46% while maintaining approximately the same perf, getting us to a < 24 GB full finetune.
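
For context on the peak_memory_* fields in the logs above, here is a generic PyTorch sketch of how such metrics are typically gathered (an assumption for illustration, not the recipe's exact logging code); values are reported in GB:

```python
import torch

torch.cuda.reset_peak_memory_stats()

# ... run one or more training steps here ...

stats = torch.cuda.memory_stats()
peak_memory_active = stats["active_bytes.all.peak"] / 1e9
peak_memory_alloc = torch.cuda.max_memory_allocated() / 1e9
peak_memory_reserved = torch.cuda.max_memory_reserved() / 1e9
print(
    f"peak_memory_active:{peak_memory_active} "
    f"peak_memory_alloc:{peak_memory_alloc} "
    f"peak_memory_reserved:{peak_memory_reserved}"
)
```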

Loss curves are the same (comparing today's baseline against these changes):

[Image: loss curves, baseline vs. this PR]

Follow-ups

  • Documentation for optimizer-in-backward and for using bnb optimizers: this documentation is sparse; AFAIK we barely mention bitsandbytes in our docs and don't explain running the optimizer in the backward pass at all. We should add comprehensive documentation around these full-finetuning memory optimizations (a minimal sketch of the pattern follows this list).
  • Update the numbers in the README table (though I think these are for llama2 at the moment).
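
For reference, a minimal sketch of the optimizer-in-backward pattern using PyTorch's Tensor.register_post_accumulate_grad_hook; the toy model, optimizer choice, and learning rate below are placeholders rather than what the recipe ships:

```python
import torch

# Toy stand-in; the recipe would pair this with Llama3-8B and a bnb optimizer.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.Linear(64, 8))

# One optimizer per parameter, so each parameter can be stepped (and its
# gradient freed) as soon as its gradient is accumulated during backward,
# instead of holding all gradients until a single optimizer.step().
optim_dict = {p: torch.optim.AdamW([p], lr=2e-5) for p in model.parameters()}

def step_in_backward(param: torch.Tensor) -> None:
    optim_dict[param].step()
    optim_dict[param].zero_grad()

for p in model.parameters():
    p.register_post_accumulate_grad_hook(step_in_backward)

# Training step: note there is no explicit optimizer.step()/zero_grad()
# after backward -- the hooks already did it per parameter.
loss = model(torch.randn(4, 64)).sum()
loss.backward()
```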


pytorch-bot bot commented May 16, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/990

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c99b4d7 with merge base 3883081:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on May 16, 2024.
@RdoubleA (Contributor)

A few high-level questions:

  • These memory tricks are awesome, but they come with the tradeoff of making the recipe harder to understand. We should still support more of these and have a place for them. What do you think about having separate configs/recipes for low-memory optimizations? We have this for llama2. Or is that the intention for the single-device recipes/configs?
  • I am not too keen on having three different optimizer fields in the config. If we have a separate low-memory recipe, maybe this is OK, but if we use the existing recipe, what are your thoughts on hardcoding the AdamW8bit optimizers for the embedding and output projection and using the same learning rate as the main optimizer? Do you think we need to expose these as configurable flags to the user? That way you don't have to do all the config gymnastics in the recipe.

@rohan-varma (Member, Author)

> what are your thoughts on hardcoding the AdamW8bit optimizers for embedding and output projection and using the same learning rate as the main optimizer

I'd like to avoid this, especially if we stick with the current recipe. Since this recipe is used for other workloads, users who change the optimizer via config for those workloads would be surprised that their changes don't take effect because we hardcode it here.

@rohan-varma (Member, Author)

@RdoubleA Yeah the UX concerns definitely make sense. I think this config can just be renamed appropriately to match what we have for llama2.

Open to authoring a separate recipe or moving this to a helper function - feel free to let me know what you and @ebsmothers think or if any additional input is needed from me, thanks!

@rohan-varma (Member, Author)

Refactored to simply use PagedAdamW8bit after @ebsmothers' suggestion!

@ebsmothers (Contributor) left a comment:

This is great! Really happy to see we were able to get good memory with PagedAdamW8bit. We should see if it helps on the Llama2 memory-efficient config too (of course not as urgent, since we already have reasonable peak memory there).

@RdoubleA (Contributor) left a comment:

dang how did all that become one line

@RdoubleA merged commit 46d7c83 into main on May 17, 2024
29 checks passed
@joecummings deleted the 8b_lowmem branch on May 17, 2024 at 13:54
weifengpy pushed a commit to weifengpy/torchtune that referenced this pull request Jun 4, 2024
maximegmd pushed a commit to maximegmd/torchtune that referenced this pull request Jul 13, 2024