Add tune run and refactor CLI #586

Conversation

🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/586. Note: links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit 67bde68 with merge base f6f6855. This comment was automatically generated by Dr. CI and updates every 15 minutes.
```diff
@@ -135,8 +135,7 @@ def setup(self) -> None:
         self._limit = self._cfg.limit
         self._tasks = list(self._cfg.tasks)

-        seed = utils.set_seed(seed=self._cfg.seed)
-        logger.info(f"Random seed set to {seed}.")
+        utils.set_seed(seed=self._cfg.seed)
```
`utils.set_seed` already logs the resulting seed, didn't need to do it twice :)
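For context, a minimal sketch of the kind of helper being referenced, assuming `set_seed` both seeds the RNGs and logs (torchtune's actual implementation may differ):

```python
import logging
import random

import torch

logger = logging.getLogger(__name__)


def set_seed(seed=None):
    """Seed the RNGs and log the seed that was actually used."""
    if seed is None:
        # Assumption for this sketch: draw a random seed when none is given.
        seed = random.randint(0, 2**32 - 1)
    random.seed(seed)
    torch.manual_seed(seed)
    # The helper logs here, which is why the caller's extra log line was redundant.
    logger.info(f"Random seed set to {seed}.")
    return seed
```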
```bash
$ tune cp full_finetune_distributed.py ./my_custom_full_finetune.py
$ tune cp full_finetune_distributed.py ./new_dir/my_custom_full_finetune.py --make-parents
```

```python
# Attach proper suffix if needed
if destination.name != "" and destination.suffix != proper_suffix:
```
@ebsmothers thanks for the discussion here
Wanna make sure I understand: `destination.name == ""` implies we are copying to a directory and so the filename will be inherited from the original config, right?
Bingo
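To illustrate the pathlib semantics this check leans on, here is a standalone demo (not the PR's code):

```python
from pathlib import Path

print(Path("my_custom_full_finetune.py").name)    # 'my_custom_full_finetune.py'
print(Path("my_custom_full_finetune.py").suffix)  # '.py'
print(Path("my_custom_full_finetune").suffix)     # ''  -> proper suffix gets attached
print(Path(".").name)                             # ''  -> destination is a directory,
                                                  #        so the original filename is kept
```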
```python
        resume_download=True,
        token=args.hf_token,
    )
except GatedRepoError:
```
Better error handling!
Changed the title: "tune run and big refactoring of CLI" → "tune run and refactor CLI"
> Once again, I fail to limit the scope of a PR.
I legitimately loled at this
torchtune/__init__.py (Outdated)
"lora_finetune_single_device.py", | ||
"lora_finetune_distributed.py", | ||
"eleuther_eval.py", | ||
|
o god
torchtune/__init__.py (Outdated)
```python
@dataclass
class Config:
    name: str
    file_path: str


@dataclass
class Recipe:
```
Actually I do like these though
torchtune/__init__.py (Outdated)
```python
def get_configs(self) -> List[Config]:
    return self.configs
```
This I don't like so much. Why is the `configs` field not sufficient? Or if we do need this, please define it as a separate function, not on the dataclass.
This is just good OOP design, no?
If `configs` was a private attribute this getter would make sense, but as it is, just calling `Recipe.configs` is preferable to `Recipe.get_configs()` imo.
torchtune/__init__.py (Outdated)
```python
        return self.configs


_ALL_RECIPES = [
```
It is a lot of LOC for our root init file, but I actually think the structure makes sense here. Nitpicking a bit I wonder if it's better to have all these in a separate file now though
Yeah, `__init__.py` is the last place I'd look for this. Why not just define this in a recipes.py or something like that and then just import ALL_RECIPES? Also it's a bit confusing that this is a private dict in `__init__.py`.
These should all be hidden from the end user. In fact, maybe we should make `Config` and `Recipe` private, since this is primarily for us developers to track built-in recipe attributes and their associated configs... an internal registry if you will (@ebsmothers shudders).

Although, I'm not entirely convinced that we really need this. The primary attribute you need to check is `supports_distributed`, so you can run with torchrun or runpy on a single device. Why the distinction? Why can't we just always use torchrun? And do you anticipate many more recipe attributes like this being added? Because if we remove `supports_distributed`, we can still do just a simple mapping as we had before, which is easier to maintain than the dataclasses.
I agree that these should all be hidden from the user, but I will push back on the second part.

I originally wanted to run everything through torchrun, but adding a multiprocessing layer even for simple recipes proved troublesome: 1) it clogs up the logs, 2) it was much harder to track down errors even for simple recipes, and 3) testing became much more difficult because we can't directly observe output (it must be logged to a file first).

I'm definitely open to revisiting this issue in the future, but it should be noted that it's not a trivial change.

I don't imagine a ton of new recipes will be added to the core library, and I would argue that it's actually easier to maintain this than a separate list and dictionary for recipes and recipes -> configs, which is what we had before.
Ok I see your point about distributed logs. That's fair, I can get on board with that
```diff
@@ -168,12 +168,12 @@ def test_training_state_on_resume(self, tmpdir, monkeypatch):

         cmd_1 = cmd_1 + self._get_test_config_overrides() + model_config
         monkeypatch.setattr(sys, "argv", cmd_1)
-        with pytest.raises(SystemExit):
+        with pytest.raises(SystemExit, match=""):
```
Out of curiosity, is the point here just to check that we're not raising some unexpected exception?
Yeah, so a SystemExit carries a code of 0 or None for a successful exit and a positive int for an errored exit. This is essentially a sanity check that we have a successful exit.
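A small standalone illustration of that convention, assuming a hypothetical `recipe_main` entry point (pytest's `match` checks against `str()` of the raised `SystemExit`, while `.value.code` holds the exit status):

```python
import sys

import pytest


def recipe_main(ok: bool) -> None:
    # sys.exit(0) raises SystemExit(0); sys.exit(1) raises SystemExit(1).
    sys.exit(0 if ok else 1)


def test_successful_exit():
    with pytest.raises(SystemExit) as excinfo:
        recipe_main(ok=True)
    assert excinfo.value.code in (0, None)  # successful exit


def test_errored_exit():
    # match="1" asserts str(SystemExit) matches "1", i.e. an errored exit.
    with pytest.raises(SystemExit, match="1"):
        recipe_main(ok=False)
```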
```python
# TODO (rohan-varma): Add check that nproc_per_node <= cuda device count. Currently,
# we don't do this since we test on CPUs for distributed. Will update once multi GPU CI is supported.
```
Time to update this?
Yep @rohan-varma
```python
    output = capsys.readouterr()
    assert "The '--config' argument is required" in output.err


def test_run_succeeds_with_local_recipe_file_and_default_config(
```
I completely missed that you were able to do this in my first pass over run.py. Very sneaky... and very nice. We should also think about the case of copying a config only (I don't think that will work?). But given CLI overrides are an option, I think it shouldn't be as hi-pri anyways.
Not sure what you mean - a user can definitely copy a config, modify it, and run it with a default recipe.
Oh yeah sorry, I was looking at this too late at night. I understand now. I like the way you've set it up, but the behavior of `_get_recipe` and `_get_config` and their subsequent usage in `_run_cmd` is a bit non-obvious. Would be nice to document a bit: specifically that the first two methods return None when the recipe/config is not registered with TorchTune, and in that case we assume they refer to a local path.
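Roughly, the documented behavior could read like this sketch (registry contents and exact signatures are assumptions here; only the None-means-local-path convention comes from the discussion):

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Config:
    name: str
    file_path: str


@dataclass
class Recipe:
    name: str
    file_path: str
    configs: List[Config] = field(default_factory=list)
    supports_distributed: bool = False


_ALL_RECIPES = [
    Recipe(
        name="full_finetune_distributed",
        file_path="full_finetune_distributed.py",
        configs=[Config("llama2/7B_full", "llama2/7B_full.yaml")],
        supports_distributed=True,
    ),
]


def _get_recipe(recipe_str: str) -> Optional[Recipe]:
    # None means "not a built-in recipe": the caller treats recipe_str as a local path.
    return next((r for r in _ALL_RECIPES if r.name == recipe_str), None)


def _get_config(config_str: str, recipe: Optional[Recipe]) -> Optional[Config]:
    # Same convention: None means "not registered", i.e. a local config path.
    configs = recipe.configs if recipe is not None else []
    return next((c for c in configs if c.name == config_str), None)
```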
```python
    runpy.run_path(TUNE_PATH, run_name="__main__")
    local_file_run.assert_called_once()


def test_run_calls_local_file_run_for_local_file_recipe(
```
nit: it's not really calling a separate run for the local file here, right? Like you are just patching `_run_single_device` and checking that gets called. Unless I am missing the point here
: ( Stop calling me out
```python
for action in torchrun_argparser._actions:
    if action.dest == "training_script":
        action.dest = "recipe"
        action.help = """
Name or path to recipe to be launched followed by args.
For a list of all possible recipes, run `tune ls`."""
    elif action.dest == "training_script_args":
        action.dest = "recipe_args"
        action.help = "Args to be passed to the recipe."
    elif action.dest == "help":
        continue
```
I think I understand this? But still, might be nice to give a bit more detail in the docstring here
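For anyone else puzzling over this, a tiny standalone demo of why rewriting `action.dest` works: argparse stores each parsed value under the action's `dest`, so renaming it changes the attribute name on the resulting namespace.

```python
import argparse

parser = argparse.ArgumentParser()
action = parser.add_argument("training_script")
action.dest = "recipe"  # the parsed positional now lands on args.recipe

args = parser.parse_args(["full_finetune_single_device"])
print(args.recipe)  # -> full_finetune_single_device
```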
torchtune/_cli/run.py (Outdated)
args.__dict__["training_script"] = args.__dict__.pop("recipe") | ||
args.__dict__["training_script_args"] = args.__dict__.pop("recipe_args") |
Do we actually need to use `__dict__` on the LHS of these two lines?
Nah, fixed.
```diff
-tune --nnodes 1 --nproc_per_node 2 \
-    lora_finetune_distributed \
-    --config llama2/13B_lora
+tune run --nproc_per_node 2 full_finetune_distributed --config llama2/7B_full_distributed
```
where did --nnodes go? Are we not supporting it anymore or is it just hidden?
`--nnodes` defaults to 1 already, and we don't technically support multi-node anyway, so removing it for now.
```diff
@@ -39,5 +31,5 @@ def test_alpaca_generate(self, tmpdir, monkeypatch):
         cmd += model_config

         monkeypatch.setattr(sys, "argv", cmd)
-        with pytest.raises(SystemExit):
+        with pytest.raises(SystemExit, match=""):
```
Why is this needed?
See comment above. This ensures the SystemExit came from a successfully run command rather than an error.
tests/recipes/test_eleuther_eval.py (Outdated)
```diff
-    log_out = caplog.messages[-1]
-    assert "'acc,none': 0.3" in log_out
+    err_log = caplog.messages[-1]
+    assert "'acc,none': 0.346" in err_log
```
Hmm, sorry to be a bit of a buzz kill, but I suspect there will be some variance between runs. Is this level of precision needed? Can we add some slack here? Maybe a +/- 0.5 to the value? Would love to not have to re-run these tests while landing a PR
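One possible way to add slack, assuming the accuracy can be parsed back out of the captured log message (the tolerance below is illustrative, not a value from this PR):

```python
import re

import pytest

log_line = "results: {'acc,none': 0.346}"
acc = float(re.search(r"'acc,none': ([\d.]+)", log_line).group(1))
assert acc == pytest.approx(0.3, abs=0.05)  # passes for any accuracy in [0.25, 0.35]
```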
```diff
@@ -83,8 +83,8 @@ def test_eval_recipe_errors_without_lm_eval(self, caplog, monkeypatch, tmpdir):
         """.split()

         monkeypatch.setattr(sys, "argv", cmd)
-        with pytest.raises(SystemExit):
+        with pytest.raises(SystemExit, match="1"):
```
I can just look this up, but what are these magic values?
Denotes SystemExits that contain errors, i.e. a nonzero exit code.
```diff
@@ -61,7 +57,7 @@ def test_loss(self, config, tmpdir, monkeypatch):
         log_file = gen_log_file_name(tmpdir)

         cmd = f"""
-        tune full_finetune_single_device
+        tune run full_finetune_single_device
```
probably add a '' to conform with what you've done above
help="Copy a built-in recipe or config to a local path.", | ||
description="Copy a built-in recipe or config to a local path.", | ||
epilog=textwrap.dedent( | ||
"""\ |
This indent is killing me
?
"""Downloads a model from the HuggingFace Hub.""" | ||
# Download the tokenizer and PyTorch model files | ||
try: | ||
true_output_dir = snapshot_download( |
Can we add "ignore_patterns" and set it to by default ignore "*.safetensors"? I need to make this change everytime I run on runpod and it's really annoying. If a model ONLY has safetensors, the user can manually set it to an empty string. Just print out that you're ignoring this for space. It's also really annoying because the saved files are cached by default and I need to dig into how to get rid of something I don't want (unless I'm just being stupid).
Here's the reference: https://huggingface.co/docs/huggingface_hub/en/guides/download
Sure, I was gonna do this in a follow-up, but can do it here.

In the future, I actually think we might want to support safetensors by default, but that's a discussion for another time.
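A sketch of what that might look like, assuming a single pattern option on the CLI (`ignore_patterns` is a real `huggingface_hub.snapshot_download` parameter; the wrapper and its defaults are hypothetical):

```python
from huggingface_hub import snapshot_download


def download(repo_id: str, output_dir: str, hf_token: str = None,
             ignore_patterns: str = "*.safetensors") -> str:
    if ignore_patterns:
        # Tell the user we're skipping these files to save space; repos that
        # only ship safetensors can pass an empty pattern to disable this.
        print(f"Ignoring files matching {ignore_patterns!r}.")
    return snapshot_download(
        repo_id,
        local_dir=output_dir,
        ignore_patterns=[ignore_patterns] if ignore_patterns else None,
        token=hf_token,
    )
```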
torchtune/_cli/download.py (Outdated)
```python
true_output_dir = snapshot_download(
    args.repo_id,
    local_dir=args.output_dir,
    resume_download=True,
```
Sorry, I don't understand what resume_download=True is doing? I also didn't find any info on this in https://huggingface.co/docs/huggingface_hub/en/guides/download
torchtune/_cli/download.py (Outdated)
"You need to provide a HuggingFace API token to download gated models." | ||
"You can find your token by visiting https://huggingface.co/settings/tokens" |
Is this the only reason that download from a gated repo can fail? What if the user has not signed the consent form or something like that? In that case is this error misleading?
Yeah, I can update this comment to be more descriptive.
```python
ROOT = Path(torchtune.__file__).parent.parent


class Run(Subcommand):
```
I'm staring at this and still not sure what `tune run` is getting us? Better UX than torchrun?

BTW, can we add some examples of how I'd run multiple finetune jobs on the same host on different cuda devices? I know this needs some wrangling of the underlying torchrun params, but it would be a huge QoL improvement if we can just call this out in the examples.
It's a more consistent UX (everything runs with subcommands of argparse) and in general, this PR adds more testing for running of recipes.
On the second point sure, will reach out to you directly on this.
```bash
tune run --nproc_per_node 2 --rdzv-backend=c10d --rdzv-endpoint=localhost:20000 full_finetune_distributed --config llama2/7B_full
# Then you can launch
tune run --nproc_per_node 2 --rdzv-backend=c10d --rdzv-endpoint=localhost:<OTHER-PORT> full_finetune_distributed --config llama2/7B_full
```

@joecummings I didn't follow the comment around
```python
@classmethod
def create(cls, *args, **kwargs):
    return cls(*args, **kwargs)
```
interesting interesting... do you mind commenting what this is doing? IIUC it is adding the parsers to the tune command to enable a subcommand. Maybe you could rename this to something like `create_parsers` to be more clear?
```python
List.create(subparsers)
Copy.create(subparsers)
Run.create(subparsers)
Validate.create(subparsers)
```
so these are class methods, which will create the subcommand instances for you... but then how does `tune` have access to the `self._parser` in each of these if you're not assigning the created instance to anything?
Yeah, I hate this too. I wish there was a way in which I could add a full subparser to a top-level argparse parser, but that's not how the add_subparsers API is designed. I finally gave up on trying to find a way around this when I saw that accelerate adopted this approach, too: https://github.com/huggingface/accelerate/blob/main/src/accelerate/commands/accelerate_cli.py
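A stripped-down sketch of the pattern (class and attribute names follow the PR's conventions; the internals are assumptions): argparse keeps its own reference to each subparser, and `set_defaults(func=...)` wires up the callback, so the created instance never needs to be bound to a name.

```python
import argparse


class Run:
    def __init__(self, subparsers):
        # The subparser is registered on (and kept alive by) the parent parser.
        self._parser = subparsers.add_parser("run", help="Run a recipe.")
        self._parser.set_defaults(func=self._run_cmd)

    @classmethod
    def create(cls, *args, **kwargs):
        return cls(*args, **kwargs)

    def _run_cmd(self, args):
        print("launching recipe...")


parser = argparse.ArgumentParser(prog="tune")
subparsers = parser.add_subparsers(title="subcommands")
Run.create(subparsers)  # instance is discarded; argparse holds the subparser

args = parser.parse_args(["run"])
args.func(args)
```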
```diff
-parser.description = "Torch Tune Recipe Launcher"
-parser.usage = "tune [options] <recipe> [recipe_args]"
-parser.formatter_class = argparse.RawDescriptionHelpFormatter
+class TuneCLIParser:
```
wonder if we should rename `TuneArgumentParser` as well to make a clearer distinction between these two?
Yep, would you like me to add this change to this PR?
`TuneRecipeParser`?
if you can that would be nice, but it's non-blocking
Typo on my part. Meant to call out the removal of

Kill with fire!
```diff
@@ -0,0 +1,95 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
```
@ebsmothers and @RdoubleA how's this?
Beautiful
Great work
Context

Once again, I fail to limit the scope of a PR. As I was adding the `tune run` CLI command, which we will use to launch all our recipes, I noticed how clunky our current CLI code was and needed to take on the refactor. Please take a look at all the code, as there are some changes to the UX of the CLI as well. This should help us in the long run, as it makes our code more debuggable, adds better error handling, adds more tests, and makes the CLI more extensible. Look through the changelog and the code for all the information, and see some FAQs right below.
Why is the CLI so slow?

Startup time for the tune CLI is slowwww:

Compare this to HuggingFace's `accelerate`:

And the classic `git`:

Okay, we have no chance of being as fast as `git`, which is written in C and has been optimized over years of development. Python itself is slow, and argparse, as part of the stdlib, is no exception. We are roughly on par with accelerate, which also uses argparse under the hood, with the added overhead likely coming from incorporating all the commands from torchrun, which also uses argparse. There are some small optimizations, like copying over all the torchrun commands rather than loading them from torchrun's argparser, but the best way to speed up our CLI would be to take a dependency on another library like click orrrrrrrr write our CLI in a language like Go. Neither of those options seems likely in the short term, so really I'm just calling this out as a known issue in case users are concerned.

Why did you add new dataclasses to represent recipes and configs? Isn't that overkill?
Attributes of our recipes are getting more and more complex as the days progress, so I think it's natural to start loosely formalizing this information. Specifically for this PR, `tune run` needed to know when a recipe was able to be run in a distributed fashion. Now, I could check the string for "distributed", but that's a weak assertion. With recipes having an attribute called `supports_distributed`, there's no guesswork. This adds very little to the overhead needed to add a new "core" recipe and gives us stronger reasoning about the recipes we're running.
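For illustration, this is the kind of dispatch the attribute enables (a runnable sketch; the real CLI's `_run_distributed`/`_run_single_device` helpers are mentioned elsewhere in this PR, but everything else here is assumed):

```python
from dataclasses import dataclass


@dataclass
class Recipe:
    name: str
    supports_distributed: bool


def pick_launcher(recipe: Recipe, nproc_per_node: int) -> str:
    # supports_distributed gates the launcher: torchrun-style multiprocessing
    # for distributed recipes, plain in-process runpy otherwise.
    if recipe.supports_distributed and nproc_per_node > 1:
        return "torchrun"
    return "runpy"


print(pick_launcher(Recipe("full_finetune_distributed", True), 2))     # torchrun
print(pick_launcher(Recipe("full_finetune_single_device", False), 1))  # runpy
```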
Changelog

- `convert_checkpoint` command. cc @kartikayk and @ebsmothers for the approval here
- `download` and `cp`
- `tune run` command
- `tune run` command tests

Testing
```
Tokenizer is initialized from file.
Optimizer and loss are initialized.
Loss is initialized.
Dataset and Sampler are initialized.
Learning rate scheduler is initialized.
1|11|Loss: 1.597842812538147:   0%|
Tokenizer is initialized from file.
Optimizer is initialized.
Loss is initialized.
```