[Low-bit optim] Add Llama2-7B finetune benchmarks #746

Merged: 5 commits into pytorch:main on Sep 2, 2024

Conversation

@gau-nernst (Collaborator) commented Aug 25, 2024

Update: change Llama3.1-8B-instruct to Llama2-7B

Fine-tune Llama2-7B on the Alpaca dataset: full BF16, 1 epoch, single A100, fixed random seed. Benchmarks are run with torchtune.

Summary

| AdamW impl | Max memory (GB) | toks/s | truthfulqa_mc2 acc | Compile time |
|---|---|---|---|---|
| Not fine-tuned | - | - | 38.95 | - |
| PyTorch (fused) | 52 | ~4500 | 42.12 | ~4 min |
| bnb 8-bit | 39 | ~4000 | 41.98 | ~4 min |
| ao 8-bit | 39 | ~4000 | 42.41 | ~12 min |
| ao 4-bit | 33 | ~3600 | 42.34 | ~4 min |

NOTE:

  • lpmm's 4-bit AdamW does not support BF16 weights -> not included in the benchmark
  • A100 does not support FP8 -> FP8 AdamW not included

Observations

  • The reduction in peak memory looks correct: going from 16-bit to 8-bit saves 52 - 39 = 13 GB; going from 8-bit to 4-bit saves 39 - 33 = 6 GB (see the back-of-envelope check after this list).
  • Our 8-bit AdamW is only slightly slower than bnb, which is nice.
  • The compile time for our 8-bit AdamW is huge (~12 min). We might need to find ways to mitigate this.
  • Our 4-bit AdamW is quite slow, but it compiles fast. This is expected because we compile with dynamic shapes for each param, while for 8-bit AdamW we compile with static shapes for all params. We do it this way because 4-bit AdamW hits a memory bug when compiling over all params.
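
As a rough sanity check of those deltas (my own back-of-envelope arithmetic, not part of the benchmark): AdamW keeps two state tensors (exp_avg and exp_avg_sq) per parameter, so for a ~7B-param model each bit dropped per state element saves about 7e9 × 2 / 8 bytes.

```python
# Back-of-envelope check of the optimizer-state savings (assumptions: ~7B params,
# 2 AdamW states per param; ignores quantization metadata such as per-block scales).
n_params = 7e9
n_states = 2  # exp_avg and exp_avg_sq

def state_gib(bits_per_element: float) -> float:
    return n_params * n_states * bits_per_element / 8 / 1024**3

print(state_gib(16) - state_gib(8))  # ~13 GiB, matches the 52 GB -> 39 GB drop
print(state_gib(8) - state_gib(4))   # ~6.5 GiB, close to the 39 GB -> 33 GB drop
```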

Command used (change `optimizer` and `checkpointer.output_dir` across runs):

```
tune run full_finetune_single_device --config llama2/7B_full optimizer=torch.optim.AdamW optimizer.fused=True optimizer_in_bwd=False compile=True metric_logger=torchtune.utils.metric_logging.WandBLogger log_peak_memory_stats=True batch_size=16 log_every_n_steps=10 epochs=1 seed=2024 checkpointer.output_dir=experiments/adamw_baseline
```
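
For reference, a minimal sketch (mine, not part of the PR) of what the "ao 8-bit" row corresponds to outside torchtune; the class path assumes torchao's prototype low-bit optim module, and the bnb row would use bitsandbytes.optim.AdamW8bit instead:

```python
# Sketch only: plug the low-bit optimizer into a plain PyTorch training step.
# Assumes torchao.prototype.low_bit_optim.AdamW8bit is available; the benchmark
# itself runs through torchtune with the optimizer= override shown above.
import torch
from torchao.prototype.low_bit_optim import AdamW8bit

model = torch.nn.Linear(4096, 4096, device="cuda", dtype=torch.bfloat16)
optim = AdamW8bit(model.parameters(), lr=1e-5)

x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)
model(x).sum().backward()
optim.step()       # optimizer states are stored in 8-bit; weights stay BF16
optim.zero_grad()
```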

Fancy graphs!

Compare across different n-bit optimizers

[image: comparison across different n-bit optimizers]

Compare 8-bit AdamW between ao and bnb. The fact that the two graphs overlap shows that our implementation is correct and competitive in speed (except for compile time 😭)!

[image: ao vs bnb 8-bit AdamW comparison]

pytorch-bot bot commented Aug 25, 2024

🔗 Helpful links: see artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/746 (links to docs will display an error until the docs builds have been completed).

❌ 1 new failure, 5 unrelated failures as of commit 2de6df0 with merge base ba2d3b1. The unrelated jobs failed but were likely due to flakiness present on trunk. This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Aug 25, 2024
@msaroufim mentioned this pull request Aug 27, 2024
@gau-nernst changed the title from "[Low-bit optim] Add Llama3.1-8B finetune benchmarks" to "[Low-bit optim] Add Llama2-7B finetune benchmarks" Aug 27, 2024
@gau-nernst marked this pull request as ready for review August 27, 2024 15:36
@msaroufim (Member) left a comment

cc @mlazos who was looking at large compile times

@gau-nernst (Collaborator, Author) commented

@msaroufim Any blockers to merging this? The failing CPU test is unrelated, though I'm probably in charge of it since it's FP6-LLM 🌚. Seems like something changed with CPU inductor.

Some thoughts on reducing compile time. There are two approaches to compiling the optimizer step in low-bit optim:

  1. Compile the optim step for a single param, i.e. torch.compile(single_param_adam)
  2. Compile the optim step for all params, i.e. torch.compile(param_groups_adam)

Currently Adam8bit and AdamFp8 use approach (2) (with static shapes) since it is faster (but compiles much more slowly), while Adam4bit uses approach (1) (with dynamic shapes) since "Adam4bit + approach (2)" causes excessive memory usage. Approach (1) requires dynamic shapes to avoid hitting the recompile limit.
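
For illustration, a minimal sketch of the two approaches (my own simplification: plain BF16 states, no weight decay, and function names mirroring the description above; not the actual torchao code):

```python
import torch

def adam_update(p, grad, exp_avg, exp_avg_sq, step, lr, beta1, beta2, eps):
    # Plain Adam math; a real low-bit optimizer would dequantize/requantize
    # the optimizer states around this update.
    exp_avg.lerp_(grad, 1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    bias_corr1 = 1 - beta1 ** step
    bias_corr2 = 1 - beta2 ** step
    denom = (exp_avg_sq / bias_corr2).sqrt_().add_(eps)
    p.addcdiv_(exp_avg, denom, value=-lr / bias_corr1)

# Approach (1): compile the single-param update with dynamic shapes and loop
# over params in Python -> fast to compile, one graph covers all param shapes.
single_param_adam = torch.compile(adam_update, fullgraph=True, dynamic=True)

# Approach (2): compile one function that walks every param with static shapes
# -> faster per step, but compile time grows with the number of parameters.
@torch.compile(fullgraph=True)
def param_groups_adam(params, grads, exp_avgs, exp_avg_sqs, step, lr, beta1, beta2, eps):
    for p, g, m, v in zip(params, grads, exp_avgs, exp_avg_sqs):
        adam_update(p, g, m, v, step, lr, beta1, beta2, eps)
```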

Now looking back, perhaps we can do approach (1) with static shapes + temporarily remove the recompile limit? I have seen FlexAttention doing this:

https://github.com/pytorch/pytorch/blob/76710d4f95d1f920bdf56e4db4d6d71ef6c9aea2/torch/nn/attention/flex_attention.py#L989

It's probably safe to do so, since for a given model the number of recompiles for single_param_adam() is fixed, though some models may have more recompiles than others (e.g. ViT vs LLM).
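
A sketch of what "temporarily remove the recompile limit" could look like (the cache_size_limit knob and the save/restore pattern are my assumption about the mechanism; FlexAttention and an eventual torchao fix may use a different one):

```python
import contextlib
import torch

@contextlib.contextmanager
def unlimited_recompiles():
    # Temporarily raise torch.compile's recompile cache limit so a static-shape
    # single-param step can specialize once per distinct parameter shape without
    # tripping the limit. Restored afterwards so the rest of the program keeps
    # the default guard against runaway recompiles.
    prev = torch._dynamo.config.cache_size_limit
    torch._dynamo.config.cache_size_limit = 1 << 30  # effectively unlimited
    try:
        yield
    finally:
        torch._dynamo.config.cache_size_limit = prev

# Usage sketch: wrap the per-param optimizer loop; each new param shape compiles
# once, later steps hit the compile cache.
# with unlimited_recompiles():
#     for p in params:
#         single_param_adam(p, ...)  # compiled with dynamic=False in this scheme
```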

@msaroufim (Member) commented

I'm gonna add some of your comments here to the README since they're helpful

@msaroufim merged commit e5246fc into pytorch:main Sep 2, 2024
8 of 14 checks passed
@gau-nernst deleted the update_optim_bench branch September 2, 2024 19:25
jerryzh168 pushed a commit to jerryzh168/ao that referenced this pull request Sep 4, 2024
* add Llama3.1-8B finetune bench

* update doc

* Update README.md

---------

Co-authored-by: Mark Saroufim <marksaroufim@gmail.com>