Remove lr_scheduler requirement in lora_dpo_single_device #1991

Open · wants to merge 3 commits into main
Conversation

@thomasjpfan (Contributor) commented Nov 12, 2024

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Towards #1751

For testing, I initially wanted to monkeypatch the recipes module, but it was not designed to be imported from tests:

# This file mainly exists because we want to ensure that `recipes` aren't
# importable *from the tests*.
So for this PR, I added a new integration test that creates a new config without an lr_scheduler and runs tune run for one epoch.
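
For reference, here is a minimal sketch of that integration test, following the existing recipe-test pattern of monkeypatching sys.argv and invoking the tune CLI via runpy. The TUNE_PATH import, the config-writing helper, and the overrides are assumptions for illustration; the real test lives alongside the other lora_dpo_single_device recipe tests.

import runpy
import sys

import pytest

from tests.common import TUNE_PATH  # assumed helper pointing at the tune CLI entrypoint


class TestLoRADPOSingleDeviceRecipe:
    @pytest.mark.integration_test
    def test_lr_scheduler_optional(self, tmpdir, monkeypatch):
        # Write out a copy of the recipe config with the lr_scheduler section
        # removed; this helper is hypothetical and stands in for the config
        # construction done inside the real test.
        config_path = write_lora_dpo_config_without_lr_scheduler(tmpdir)
        cmd = f"""
        tune run lora_dpo_single_device
            --config {config_path}
            output_dir={tmpdir}
            epochs=1
            max_steps_per_epoch=2
        """.split()
        monkeypatch.setattr(sys, "argv", cmd)
        # The tune CLI exits via SystemExit when the run completes, as in the
        # other recipe integration tests.
        with pytest.raises(SystemExit, match=""):
            runpy.run_path(TUNE_PATH, run_name="__main__")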

Changelog

What are the changes made in this PR?

  • Removes lr_scheduler requirement in lora_dpo_single_device

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example
and a tutorial example

  • I did not change any public API
  • I have added an example to docs or docstrings

@RdoubleA (Contributor) left a comment

Thanks for making this update; I had a few comments on the implementation.

@@ -243,7 +243,7 @@ def setup(self, cfg: DictConfig) -> None:
         # Learning rate scheduler can only be set up after number of steps
         # has been computed
         self._lr_scheduler = self._setup_lr_scheduler(
-            cfg_lr_scheduler=cfg.lr_scheduler,
+            cfg_lr_scheduler=cfg.get("lr_scheduler", None),
Contributor:

It may be less indirection to just check directly whether lr_scheduler exists and set it to None here, instead of calling the setup method only to have it return None:

cfg_lr_scheduler = cfg.get("lr_scheduler", None)
self._lr_scheduler = self._setup_lr_scheduler(...) if cfg_lr_scheduler is not None else None

Contributor:

That makes sense to me, Rafi, but on second thought I like the idea of handling everything inside _setup_lr_scheduler, including the log_info. What do you think?
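
For concreteness, a rough sketch of that shape, keeping both the scheduler setup and the log_info inside _setup_lr_scheduler. The signature mirrors the existing method; the log messages and type hints are illustrative, and `log` / `config` are assumed to be the recipe's existing logger and torchtune.config.

    def _setup_lr_scheduler(
        self,
        cfg_lr_scheduler: Optional[DictConfig],
        num_training_steps: int,
        last_epoch: int,
    ) -> Optional[LRScheduler]:
        # With no scheduler configured, log it here and fall back to the
        # optimizer's constant learning rate.
        if cfg_lr_scheduler is None:
            log.info("No learning rate scheduler configured; using constant learning rate.")
            return None
        lr_scheduler = config.instantiate(
            cfg_lr_scheduler,
            self._optimizer,
            num_training_steps=num_training_steps,
            last_epoch=last_epoch,
        )
        log.info("Learning rate scheduler is initialized.")
        return lr_scheduler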

@RdoubleA (Contributor) Nov 14, 2024

Sure, either works, no strong opinions.

@@ -191,3 +191,49 @@ def test_save_and_load_merged_weights(self, tmpdir, monkeypatch):
         llama2_model.load_state_dict(sd)
         merged_ckpt_out = llama2_model(inputs)
         torch.testing.assert_close(baseline_out, merged_ckpt_out, rtol=1e-5, atol=1e-5)
+
+    @pytest.mark.integration_test
+    def test_lr_scheduler_optional(self, tmpdir, monkeypatch):
Contributor:

This is a great test, but it seems expensive just to check that the lr_scheduler is not created. Maybe you could just call the recipe's setup() method with a toy model and check that self._lr_scheduler == None?

@thomasjpfan (Contributor, Author) Nov 13, 2024

I initially wanted to go with this approach, but I do not think we are allowed to import anything in recipes (i.e. LoRADPORecipeSingleDevice) from tests. Specifically, running from recipes.lora_dpo_single_device import LoRADPORecipeSingleDevice raises this error:

raise ModuleNotFoundError(
    "The torchtune recipes directory isn't a package and you should not import anything from here. "
    "Refer to our docs for detailed instructions on how to use recipes: "
    "https://pytorch.org/torchtune/main/deep_dives/recipe_deepdive.html"
)

Contributor:

I am OK with not adding a test for it. Any thoughts, @ebsmothers?

Contributor:

Yeah, let's just remove it.

@felipemello1 (Contributor)

Thanks for the PR, @thomasjpfan! I left a couple of comments. After Rafi and Evan reply, I am good with merging it. PS: after the changes for lora_dpo_single_device are done, would you be willing to add this to the other recipes too? It's OK if the answer is no. We can do it as a follow-up :)

@thomasjpfan (Contributor, Author)

> PS: after the changes for lora_dpo_single_device are done, would you be willing to add this to the other recipes too? It's OK if the answer is no. We can do it as a follow-up :)

I'll be happy to add it to the other recipes as a follow-up. I started off with one recipe to see how much test coverage is required.

@RdoubleA (Contributor)

@thomasjpfan let's go ahead and remove the test and this is good to go!

@felipemello1 (Contributor) commented Nov 14, 2024

Just before we approve it, do you mind running it with the wandb logger (or any other logger) and capturing the learning rate with and without the scheduler, as a sanity check?

You can append this to your CLI command:

tune run <...> \
    metric_logger=torchtune.training.metric_logging.WandBLogger \
    max_steps_per_epoch=10 \
    epochs=1 \
    lr_scheduler._component_=torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup \
    lr_scheduler.num_warmup_steps=5

tune run <...> \
    metric_logger=torchtune.training.metric_logging.WandBLogger \
    max_steps_per_epoch=10 \
    epochs=1 \
    ~lr_scheduler._component_ \
    ~lr_scheduler.num_warmup_steps
