[Draft]Support Optimizer-in-the-backward #1530

mori360 · 2024-09-10T00:57:11Z

Context

What is the purpose of this PR? Is it to

add a new feature
fix a bug
update tests and/or documentation
other (please add here)

Please link to any issues this PR addresses.

Changelog

What are the changes made in this PR?
In FullFinetuneDistributed, switch from self._optimizer to self._optimizer_in_bwd, which runs the optimizer step during the backward pass.
Stepping the optimizer during the backward pass can lower peak memory during loss.backward(), because each gradient can be freed as soon as its parameter has been updated.
In local testing with llama2/7B_full, peak memory dropped from 31.5 GB to 21.2 GB, a saving of about 32.7%.

Test plan

Please make sure to do each of the following if applicable to your PR. (If you're not sure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.)

run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
add unit tests for any new functionality
update docstrings for any new or updated methods or classes
run unit tests via pytest tests
run recipe tests via pytest tests -m integration_test
manually run any new or modified recipes with sufficient proof of correctness
include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Example of docstring:

torchtune/torchtune/modules/vision_transformer.py

Line 285 in 6a7951f

Examples:

Example in our docs: https://pytorch.org/torchtune/main/tutorials/qat_finetune.html#applying-qat-to-llama3-models

I did not change any public API;
I have added an example to docs or docstrings;

pytorch-bot · 2024-09-10T00:57:15Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1530

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 0a3762d with merge base 0a3762d:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

Regression Tests / regression_test (3.11, nightly) (gh) (trunk failure)
tests/regression_tests/test_llama2_7b.py::TestLoRA7BDistributedFinetuneEval::test_finetune_and_eval
Regression Tests / regression_test (3.11, stable) (gh) (trunk failure)
tests/regression_tests/test_llama2_7b.py::TestLoRA7BDistributedFinetuneEval::test_finetune_and_eval

This comment was automatically generated by Dr. CI and updates every 15 minutes.

codecov-commenter · 2024-09-19T02:19:42Z

Codecov Report

Attention: Patch coverage is 0% with 32 lines in your changes missing coverage. Please review.

Project coverage is 72.13%. Comparing base (dd348ce) to head (5ec6f68).
Report is 5 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| recipes/full_finetune_distributed.py | 0.00% | 32 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1530      +/-   ##
==========================================
- Coverage   72.26%   72.13%   -0.13%     
==========================================
  Files         290      290              
  Lines       14554    14576      +22     
==========================================
- Hits        10517    10515       -2     
- Misses       4037     4061      +24

☔ View full report in Codecov by Sentry.

mori360 · 2024-09-20T18:10:23Z

recipes/full_finetune_distributed.py

@@ -670,7 +714,13 @@ def train(self) -> None:
                        time_per_step = time.perf_counter() - t0
                        log_dict = {
                            "loss": loss_to_log,
-                            "lr": self._optimizer.param_groups[0]["lr"],
+                            "lr": (

The "lr" values here are all the same.

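The review comment above can be illustrated with a small sketch. When every parameter has its own optimizer, the logged learning rate can be read from any one of them as long as they share the same value. The PR's actual expression is truncated in the diff and is not reproduced here; the helper `current_lr`, the dictionary `optim_per_param`, and the learning rate `2e-5` are hypothetical.

```python
import torch
from torch import nn

model = nn.Linear(4, 2)  # toy model for illustration

# Hypothetical stand-in for per-parameter optimizers sharing one lr.
optim_per_param = {
    name: torch.optim.SGD([p], lr=2e-5)
    for name, p in model.named_parameters()
}


def current_lr(optimizers: dict) -> float:
    """Return the shared learning rate, checking that all optimizers agree."""
    lrs = {
        group["lr"]
        for opt in optimizers.values()
        for group in opt.param_groups
    }
    if len(lrs) != 1:
        raise ValueError(f"expected one shared lr, found {sorted(lrs)}")
    return lrs.pop()


log_dict = {"lr": current_lr(optim_per_param)}
```

The agreement check is optional. If the learning rates are known to be identical, reading `param_groups[0]["lr"]` from a single optimizer is enough.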
facebook-github-bot added the CLA Signed label Sep 10, 2024

mori360 changed the title ~~Support Optimizer-in-the-backward~~ [Draft]Support Optimizer-in-the-backward Sep 10, 2024

mori360 closed this Oct 2, 2024

mori360 force-pushed the main branch from 622c965 to 0a3762d October 2, 2024 23:31
