Recipe Redesign #251
Conversation
❌ Deploy Preview for torchtune-preview failed.
recipes/full_finetune.py (Outdated)

lr: float,
output_dir: str,
metric_logger_type: str,
project: str,
Are we always going to have a bunch of args like this or will it eventually be converted into configs? cc @RdoubleA
Definitely a stopgap here. We should read from config/args and convert to a dataclass or something which has been validated.
@ebsmothers, @pbontrager made some changes, including supporting resuming from checkpoint in a much cleaner way.
torchtune/utils/checkpoint.py (Outdated)

NOTE: `load_checkpoint` does NOT load model and optimizer states into the model and optimizer respectively.
`load_checkpoint` handles the appropriate transformations (i.e. related to FSDP), but user is expected to
call `load_state_dict` on the returned results.
NOTE: `load_checkpoint` is a wrapper around ```torch.load``` and does not handle loading the state dict or other state
This should hopefully automatically add a link to the torch docs
Suggested change:
- NOTE: `load_checkpoint` is a wrapper around ```torch.load``` and does not handle loading the state dict or other state
+ NOTE: `load_checkpoint` is a wrapper around :func:`torch.load` and does not handle loading the state dict or other state
Great point! Thank you for sharing!
from torchtune import datasets, models, utils


def create_full_finetune_args(parser) -> argparse.ArgumentParser:
This is a stopgap until the config system is revamped, yes?
yeah, wondering what the order of merging the config and recipe redesign should be
Synced up with @RdoubleA offline on this.
@joecummings sorry for the strong opinion, but I vehemently disagree that structured recipes and training frameworks are the same thing. I have tons of very strong opinions around this. Happy to chat offline. But definitely read through the context section of this PR!
Thanks for opening this. As we've discussed, I'm generally supportive of the change and personally think the improved UX is worth adding a bit of loose, non-binding structure to make recipe authoring easier.
Are we going back on the principle of "No Inheritance"?
Follow-up to this FAQ: I agree that we aren't going back on the no inheritance principle, but wonder if we aren't implicitly encouraging our users to go back on it. As I mentioned in our discussion, my mental model for writing e.g. my own recipe with full determinism would be to take the FullFinetune recipe class, extend it, and override the setup_environment and setup_data methods. We have to find a way to very strongly discourage this as otherwise recipe shareability can easily start to go out the window.
recipes/llm_recipe_interface.py (Outdated)

def setup_environment(self, **kwargs) -> None:
    """
    Setup the environment needed for this recipe. This includes
    initializing distributed, setting the right types, devices and seed,
    and initializing the logger.
    """
    ...

def setup_model(self, **kwargs) -> None:
    """
    Instantiate the model, including additional capabilities such as
    FSDP and activation checkpointing. This does not include loading the
    checkpoint. All checkpointing should be done in ```load_checkpoint```.
    """
    ...

def setup_tokenizer(self, **kwargs) -> None:
    """
    Instantiate the tokenizer. Unfortunately, given the current implementation
    for SentencePiece, this does include loading the tokenizer checkpoint.
    """
    ...

def setup_optimizer_and_loss(self, **kwargs) -> None:
    """
    Instantiates the optimizer and loss functions correctly. This includes making sure
    learning-rate schedulers are correctly set up.
    """
    ...

def setup_data(self, **kwargs) -> None:
    """
    Logic associated with setting up datasets, samplers and associated dataloaders should
    be added here.
    """
    ...
Personally I would consolidate all of these into a single setup function, then leave it to our recipes to define individual `setup_*` methods. This is for a couple of reasons:
(1) Right now setup methods are more granular than train methods. One natural proposal to reduce this imbalance is "add more structure around the training loop", which I think we definitely do not want.
(2) I think there are some assumptions baked into some of these. E.g. inference-only may not want optimizer + loss, multimodal may want more than a tokenizer, etc. I think there is not a universal split.
(3) We should distinguish between hard constraints (i.e. methods defined in this protocol) and guidance/best practices based on the recipes we provide. Imo the splitting of setup methods should fall into the latter category.
Also (completely arbitrary): a setup + run + teardown partition is inherently more pleasing somehow (maybe it's the rule of three?)
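A rough sketch of what this consolidation could look like (illustrative names only, not the actual torchtune interface; the protocol exposes a single `setup`, while a concrete recipe keeps its own private helpers):

```python
# Illustrative sketch only; not the actual torchtune interface.
from typing import Protocol


class RecipeProtocol(Protocol):
    def setup(self, **kwargs) -> None: ...
    def train(self, **kwargs) -> None: ...
    def cleanup(self, **kwargs) -> None: ...


class ExampleFullFinetune:
    """Satisfies the protocol; the setup_* split is a recipe-level choice, not a hard constraint."""

    def setup(self, **kwargs) -> None:
        # Recipe-specific split, not mandated by the protocol.
        self._setup_model(**kwargs)
        self._setup_data(**kwargs)

    def _setup_model(self, **kwargs) -> None: ...
    def _setup_data(self, **kwargs) -> None: ...
    def train(self, **kwargs) -> None: ...
    def cleanup(self, **kwargs) -> None: ...
```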
@ebsmothers great point! Let me share some of my thoughts on these, but I'll definitely admit this isn't the perfect split and we should evolve it where needed.

> setup methods are more granular than train methods

This is primarily because setup is also more standard than the train loop.
- You set up your env in similar-ish ways for distributed training on GPUs.
- For model fetch, we read the config and call the getter + have some FSDP logic which only applies to the model.
- The tokenizer currently loads a checkpoint, but we can consolidate this with the model once we update the tokenizer.
- Optimizer and loss can be split up, but I want to keep these separate from the model.

> I think there are some assumptions baked into some of these

Yes, this assumes fine-tuning LLMs. Maybe I should make this clearer in the interface. Inference-only should not make use of this. It'll also likely be much simpler (we aren't going to support multi-GPU inference any time soon, I think).

> We should distinguish between hard constraints (i.e. methods defined in this protocol) and guidance/best practices based on the recipes we provide

Completely agree. But I'd rather err on the side of splitting stuff up while we're in the feel-out process. Taking everything and dumping it into a single function is much easier than splitting stuff up later (esp. if logic is deeply intertwined). Does this make sense?
I think what you're saying makes sense here, just wanna clarify the last point.

> Completely agree. But I'd rather err on the side of splitting stuff up while we're in the feel-out process. Taking everything and dumping it into a single function is much easier than splitting stuff up later (esp. if logic is deeply intertwined). Does this make sense?

My comment in (3) is not that we should use only a setup method in the FullFinetune class; I like the way that's currently written. I am more wondering if we should change just the protocol to remove the other setup methods. The point about ease of removing methods in the future is valid, but it is also kind of just a consequence of the fact that a protocol with more methods is inherently more constrained (for better or for worse).
That being said, if this protocol is tailored specifically towards LLM fine-tuning use cases, then at least my point (2) definitely becomes less of a concern. So I am good to take a "let's try it out and see" approach on this.
recipes/llm_recipe_interface.py (Outdated)

def load_checkpoint(self, **kwargs) -> None:
    """
    This method is responsible for loading ALL of the state for the recipe from the
    checkpoint, including state for the model, optimizer, dataloader and training
    parameters such as the epoch and seed.
    """
    ...

def save_checkpoint(self, **kwargs) -> None:
    """
    This method is responsible for saving ALL of the state for the recipe,
    including state for the model, optimizer, dataloader and training
    parameters such as the epoch and seed.
    """
    ...
Q on these: why do we assume that the user wants a single load/save checkpoint API? E.g. when I run a fine-tune I may want to save optimizer and dataloader state at intermediate points, but at the end I probably only wanna save model weights. Do we encapsulate all of that in save_checkpoint?
> when I run a fine-tune I may want to save optimizer and dataloader state at intermediate points

Why would you save these and not the model weights? You also don't want to output too many checkpoint files; basically everything should be written together.
Right now the ability to checkpoint the optimizer depends on whether you're in the middle of training or not. I need to add logic to make sure we don't checkpoint opt state at the end of training. This is too expensive, and continued fine-tuning isn't a thing right now.
> Why would you save these and not the model weights?

Oh sorry, I mean these in addition to model weights.

> Right now the ability to checkpoint the optimizer depends on whether you're in the middle of training or not. I need to add logic to make sure we don't checkpoint opt state at the end of training.

This is basically my question: where are we doing this? Does save_checkpoint take an epoch arg or something?
yeah, the function in the FullFinetune class has the following signature: `def save_checkpoint(self, epoch, rank)`
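For reference, a minimal sketch of how that method can gate the optimizer state on training progress (simplified from the snippet quoted later in this thread; the exact save utility and output path are assumptions):

```python
# Simplified sketch; the save utility call and output path below are assumptions.
def save_checkpoint(self, epoch: int, rank: int) -> None:
    # Always checkpoint the model weights.
    ckpt_dict = {"model": self.model}
    # Only include optimizer state mid-training; the final checkpoint
    # keeps just the weights to avoid an unnecessarily large file.
    if epoch + 1 < self.total_epochs:
        ckpt_dict.update({"optimizer": self.optimizer})
    utils.save_checkpoint(ckpt_dict, f"{self.output_dir}/model_{epoch}.ckpt")
```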
@ebsmothers great question. I think we need to tackle this on several fronts. Basically we can only do this through a COMBINATION of existing examples, educating users through documentation, and explicitly blocking PRs from merging which don't follow our principles.
…eckpoint and add the associated tests. Creating separate utility for now to not block other PRs
@dataclass
class FullFinetuneParams:
can we break up this PR and add the dataclasses separately? then we can merge in the dataclass update, the config refactor, and finally this recipe redesign. what would you suggest is the best order?
Does @RdoubleA's work replace this?
dtype: str

# Reproducibility
seed: int
this should be Optional[int] = None, right?
That's right, good catch
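i.e., a sketch of the corrected field (other fields omitted):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FullFinetuneParams:
    dtype: str
    # Reproducibility: seeding is optional, so None must be a valid value.
    seed: Optional[int] = None
```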
- This interface is not meant to add constraints. If the interface comes in the
  way of doing stuff, it needs to be updated or a new interface should be
  written to support what might be a new "family" of recipes.
"""
You do all the setup in the init, which seems like it would be a standard operation for any recipe; wondering if you should add a setup method to the interface as well?
I had setup as a function of the interface, but moved this out since I don't expect this to be a public API. Happy to think through the alternative though. What would be the case for making setup public on this class?
Imo there can be a distinction between the init of the recipe class and the actual "setup" stage of a finetune. E.g. I could imagine a world where someone wants to parse and export the config to a file as a form of validation when the recipe gets called. To me this is a bit different from setting up the model, tokenizer, dataloader, etc.
Thinking about this redesign a lot, and the pros and cons that go with it. My main concern with the direction of this change is that it's moving things in a risky direction: towards making the recipes folder another library instead of a home for scripts that use the torchtune library folder. I'm generally in favor of a lot of the structure suggested here, but do agree with others that the class structure will de facto have users inheriting from it and abusing it as much or more than the script-style recipes we're currently working with.

I heavily prefer script recipes because I think they're much less heavy-handed and more comfortable to get in and edit. We don't have to bias our users towards particular use cases (setup, train, etc.) but instead can let the types of recipes evolve naturally. For example, we already have a generation recipe that we might never have made if this structure had been here. Finally, I feel the structure adds one additional layer of abstraction that the user has to look up and learn to work with a recipe. If they want to do something slightly outside of the interface signatures, they have to decide to abandon it entirely or hack their way around it.

To encourage homogeneous recipe designs and good practice, I'd be much more in favor of providing template structures that users can base their recipes on but that don't directly restrict what users can do. These would be more like "documentation" for contributing. Below I provide an example structure; a real template could include many more lines, I shortened it for brevity.
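(The example structure referenced above is not reproduced here; the following is a hypothetical sketch of the kind of script-style template being described, with purely illustrative names.)

```python
# Hypothetical script-style recipe template (illustrative only; this is not
# the example originally posted in the thread).
import argparse


def setup(params: argparse.Namespace) -> dict:
    """Build model, tokenizer, dataloader and optimizer as plain objects (no class state)."""
    return {}


def train(state: dict, params: argparse.Namespace) -> None:
    """Plain training loop that users edit directly."""
    pass


def main() -> None:
    parser = argparse.ArgumentParser(description="Template recipe")
    parser.add_argument("--epochs", type=int, default=1)  # illustrative arg
    params = parser.parse_args()
    state = setup(params)
    train(state, params)


if __name__ == "__main__":
    main()
```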
@pbontrager Thanks for the thoughtful discussion on this. I'm glad you spent some time thinking through this. Looking through your proposal and your concerns, I think the primary difference between our proposals is the ability for the recipe to maintain state. I don't buy some of the other differences (see my comments below). Happy to chat a bit more about this. Being able to store state is important IMO for a couple of reasons (there are others, but I'm sticking to user-facing):
See my comment to Evan above on this. I don't think this is a valid concern, at least not yet. I can easily clamp down on inheritance in the code base through examples, reviews and user education. Users inherit because lib maintainers allow it.
What does this mean? Why is this any harder to edit? It's essentially the same structure.
See the context section of this PR on why this is not a good idea - our current recipe is a mess.
This interface is only for LLM fine-tuning. In the next few changes, we can make this concrete.
Again, what does this mean? The structure of the recipe is pretty much the same; the difference comes from how the state is maintained.
Thank you for breaking out these points; I'll try to give a better description of my reasoning and my thoughts on recipe state.
I say this just because I don't believe those measures will be enough. I think the affordances of the class and the biases Python developers bring into this will be too hard to fight back against.
It's not harder to edit an existing recipe, but it's harder to write a new recipe from scratch, as you have to figure out the correct place to put your code. With scripts this is more of a suggestion, so it's not as much of a concern. Also, you can freely modify function signatures in scripts.
Part of the disagreement here is over the product goal of our recipes. I don't believe that the current recipes are "a mess", but they do need to be further refined and cleaned up. What I believe we've been building here is a code platform to make it easy to share reproducible and robust PyTorch scripts. To that end, we should let users' imagination take over and let them define what is most useful for PyTorch scripts to cover. Contributions to our library, though, shouldn't be any user script but a rewrite of one of our existing recipes with minimal changes to support the user's recipe. E.g. is there a new feature that we need a utility/loss/optimizer/model for? Are these just a better config? If the recipe diverges dramatically from an existing strategy, then likely our classes wouldn't have supported it to begin with.
This inherently means that we'll make a small library of recipe types we support, and we'll have to approve new recipe types instead of letting the community organically determine what recipes we support. Finally, on the discussion of state: I agree it's very convenient to have an init, and I like this innovation from this proposal. But I think there are other ways we could get these same benefits. One example: we could have a recipe util that's essentially a smart namespace, similar to the output of argparse. (Just one idea; we could work on the name, e.g. args, state, ...)
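A rough sketch of what such a namespace utility could look like (purely illustrative; not an existing torchtune API):

```python
# Purely illustrative sketch of the "smart namespace" idea; not an existing torchtune utility.
from types import SimpleNamespace


class RecipeState(SimpleNamespace):
    """Argparse-style namespace that recipe functions can read and mutate,
    giving script recipes a single place to keep shared state."""

    def update(self, **kwargs) -> None:
        for key, value in kwargs.items():
            setattr(self, key, value)


# Usage: functions pass the state around instead of storing it on a recipe class.
state = RecipeState(seed=0, epochs_run=0)
state.update(epochs_run=state.epochs_run + 1)
```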
@pbontrager a few thoughts:
Again, we're proposing the same structure, so I'm not buying this. I think what you're coming down to is: classes are harder to write than scripts.
The interface isn't explicitly forcing a contract, though maybe it should. You're viewing this flexibility as a pro. Given where we are, the value propositions we're claiming to provide (stability and extensibility), and the users we're currently planning on supporting (hobbyists and then hackers), I don't view this flexibility as a pro at all. I'm happy to constrain new use cases and force users to have some design discussions with the core team before they make these changes. As an alpha, I think that's the right expectation to set. I'd rather do few things but do them well than boil the ocean.
I think we both agree that recipes are the product and our success will depend on us ensuring users can easily (and reliably) train the LLMs they want to. What I don't agree with is that users have the best understanding of how to maintain a high bar with regards to all of this. We need to have a maintainable, well-tested and well-documented code base. As a team of 4 people we will need to scale ourselves for user support. I can't imagine opening this up to writing arbitrary scripts is the way that can happen (not that you're saying that either, but I want to set the expectation here). Again, I'd be ok with doing less but doing it well. I'd be ok with constraining user imagination in the early stages of the lib.
Why reinvent a programming concept that is well understood and reasoned about? We have enough things to innovate on :)
…_checkpoint comment
One high-level comment I want to clarify. My understanding is that we would directly support a fairly limited number of very high quality recipes. Anything else, we would encourage forks. Actually, a direct quote from Soumith was that we need to "bless" forks, making it easier for people to go wild with how they want to extend our utilities and recipes. This design seems to bake in the assumption that users will frequently be contributing recipes directly to our repository, which is new to me. To add to this, this model of thinking would heavily increase our load as maintainers and not necessarily in a way that helps further the community. Please correct me if I'm not understanding.
@joecummings good question:
This makes sense AFTER folks buy into the value of this library. It's not a given that you'll get there, i.e. that folks will buy what you sell and start using/contributing code to the repo. The immediate priority is to highlight our design principles, make sure what we provide works well, and ensure users can see a direct line from there to extending capabilities. Let me know if that makes sense?
help="Tensor dtype used for finetuning, lower precision types result in mixed precision training.",
)

group = parser.add_mutually_exclusive_group()
Curious about the rationale here. Personally I haven't seen `--resume-from-checkpoint` split out into two separate args like this before.
I just followed what seemed to be the convention for booleans in this list. Should I be doing this differently?
Hmmm yeah I see the shuffle/no-shuffle above as well, tbh I don't love that either. Basically I don't think we should be exposing two separate CLI args for True/False, especially if one of them is not found anywhere else (since `--no-resume` is just equivalent to setting `resume_from_checkpoint=False`). In this case, why not just have a single `--resume-from-checkpoint` arg defaulting to False?
@ebsmothers this is also what I was complaining about the other day. But IIUC, @pbontrager and @NicolasHug mentioned it's the only way to set this up. I'll let @RdoubleA clean this up in one of his PRs.
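For reference, a sketch of the single-flag alternative being suggested, using standard argparse (the argument name mirrors the discussion; everything else is illustrative):

```python
# Sketch of the single-flag alternative: one boolean arg, default False.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--resume-from-checkpoint",
    action="store_true",  # passing the flag sets it to True; omitting it leaves it False
    default=False,
    help="Resume model, optimizer and training state from a previous checkpoint.",
)
args = parser.parse_args([])  # empty list so the sketch runs without CLI input
assert args.resume_from_checkpoint is False
```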
self.device = utils.get_device(device=params.device)
self.dtype = utils.get_dtype(dtype=params.dtype)
self.seed = utils.set_seed(seed=params.seed)
self.output_dir = params.output_dir

_, rank = utils.get_world_size_and_rank()
self.is_rank_zero = rank == 0

self.model = self._setup_model(model=params.model)
self.tokenizer = self._setup_tokenizer(
    tokenizer=params.tokenizer, tokenizer_checkpoint=params.tokenizer_checkpoint
)
self.optimizer = self._setup_optimizer(optimizer=params.optimizer, lr=params.lr)
self.loss_fn = self._setup_loss(loss=params.loss)
One thing that's a bit unclear to me here is the split between calling utils functions and FullFinetune methods here. I know utils methods get called in the `self.*` methods, but personally I feel it's nicer to keep them out of the raw `__init__` as much as possible. I think you had a `_setup_env` method previously, why not bring that back? (Just in this class, not in the interface)
This is a good point. So what I was optimizing for was to ensure all of the attributes are set in the `__init__` instead of in helper functions, otherwise tracking these down becomes hard. The `setup_environment` function just had a bunch of attributes which needed to be returned and also was a cluster of utility calls. After I removed `init_process_group` it wasn't clear why that needed to be a different function. Does that make sense? Let me know if you disagree.
Yeah as you point out idk if there's a 100% perfect way to do this. I am fine with the approach you have here
def _setup_model(self, model: str) -> nn.Module:
    """
    This assumes FSDP and activation checkpointing are enabled by default.
Do we plan to parametrize this later on?
I think we can have a discussion about AC (cc: @rohan-varma) but not sure if disabling FSDP makes sense?
I can imagine cases where one would want to run without FSDP (though doesn't necessarily have to be in this recipe). But this is something we can decouple from this PR anyways
@@ -127,3 +127,79 @@ def load_checkpoint(
    ckpt_dict["optimizer"] = optim_state_dict_to_load

    return ckpt_dict


def load_checkpoint_updated(
nit: add a todo here (or file an issue or something) so we actually change this name
should follow. Please note that the interface itself should not be a vehicle for
code reuse. TorchTune strictly prohibits implementation inheritance in the codebase.

TODO: Add link to design principle README
Is this happening as part of this PR?
What does "this" refer to here?
Sorry, referring to the TODO. Are we gonna link to design principles?
I'll follow up.
- This interface is meant to help recipe-writers organize their code in a way
  which is easy to read, understand and extend.
Should probably also add a link to FullFinetune as an example of how users should implement this interface
def _setup_data(
    self, dataset: str, shuffle: bool, batch_size: int
) -> Tuple[DistributedSampler, DataLoader]:
Is this a standard way to handle sampler and dataloader? The return type of this method feels unintuitive.
I think this is a constraint of our setup i.e. we only support DistributedSampler and Map-style datasets.
Oh that makes sense, I just mean a return type tuple of sampler + dataloader is a bit weird
recipes/full_finetune.py (Outdated)

# if training is in-progress, checkpoint the optimizer state as well
if epoch + 1 < self.total_epochs:
    ckpt_dict.update({"optimzer": self.optimizer})
Suggested change:
- ckpt_dict.update({"optimzer": self.optimizer})
+ ckpt_dict.update({"optimizer": self.optimizer})
init_process_group(backend="nccl")

recipe = FullFinetune(params=recipe_params)
recipe.load_checkpoint(model_checkpoint=recipe_params.model_checkpoint)
Any particular reason this gets its own line in main, but other pre-training steps do not?
Oh, this is because `__init__` does all of the instantiation and `load_checkpoint` explicitly mutates its state. In the near future we need to add functionality for `eval` as well. What else is missing here?

Just wanna confirm based on the test plan: have you done some sanity check on the generations of trained checkpoints before and after the refactor? (Or some other way to build confidence that there are no model quality regressions.)
@@ -0,0 +1,43 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
Another quick note - are you planning on housing all recipe dataclasses here? I would vote for a folder called args or params where each recipe has its own file, so the dataclass for full_finetune would be at `recipes/args/full_finetune.py`. Otherwise a single file will get very bloated.
The original goal was to keep everything in a single file to make recipes easy to share. Even the configs could be set as defaults to make a shareable version of a file. If the dataclass is kept as a separate file now, that'll require more places to track what parameters are in a recipe.
@RdoubleA feel free to suggest the change in your PR. Introducing dataclasses has been on my mind for a while and I finally got around to adding it here. Open to what this looks like longer term.
@pbontrager I think that requirement is already broken by the move to more structured config system (see Rafi's PR). I'm ok with treating (config, dataclass, recipe class and checkpoint) as the complete source of truth for the recipe.
Please make sure that open comments get addressed (OK to tackle minor ones in a fast follow). Given that (a) we are not replacing the current recipe in this change, (b) we have reasonable parity documented in the test plan, and (c) this will unblock some other dependencies, I am gonna stamp.
There are some good questions raised in this PR. I'll make sure to put out a detailed doc explaining the rationale behind every change I've made and we can discuss this in one of the deepdive sessions. Not going to do that this week given our goal to land everything by EOW.
Context
The current recipe structure is the wild wild west.
We don't have any structure, which makes the recipe hard to understand, read and extend. Not to mention, the current implementation is a lot more bloated than it needs to be. Our design principles are no longer shining through. In the current form the recipe (our main product) is no longer compelling for new users. Specifically:
In its current form, users will dump code in arbitrary ways into our recipes folder, which we'll need to review and in the worst case debug. We need to add some structure.
What our recipe code should not do
In this change, I take a different view of the recipe and try to better organize not just the components but the actual script itself. At the end we have the full recipe in 7 lines of code, which is nice. Specifically:
- `__init__` is broken up into several `setup_` functions. Most users will only care about a subset of these, for example `setup_data` for changing the dataset.
- `load_checkpoint` is updated to depend on a `resume_from_checkpoint` flag so we can unify the API for the cases where we're resuming training or starting afresh.
- The structure of the recipe is clean: initialize, train, cleanup (a sketch of this flow is shown below).
I don't expect this to be the perfect structure. But this is the first step towards organizing the recipe better. We'll stress test this more as we go through the LoRA recipe and add more variants to full-finetune (adding determinism) for example.
FAQs
Are we building a Training Framework?
NO. Adding structure doesn't mean we're building a trainer or a training framework. We won't have "one" way to do anything. We do provide well-designed utilities. But if a given structure doesn't work, we should just update it.
Are we going back on the principle of "No Inheritance"?
NO. Inheritance for code-reuse is a no-go. At all. We're enforcing a loose interface. And inheriting from an interface is just good design.
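For instance, with a Protocol-style interface a recipe conforms structurally without inheriting any implementation (illustrative sketch; the class names here are not the actual torchtune ones):

```python
# Illustrative sketch: structural typing means no implementation inheritance.
from typing import Protocol, runtime_checkable


@runtime_checkable
class LLMFinetuneRecipe(Protocol):
    def train(self, **kwargs) -> None: ...


class MyFinetune:  # no base class, no inherited code
    def train(self, **kwargs) -> None:
        print("one training step")


recipe: LLMFinetuneRecipe = MyFinetune()  # type-checks because the shape matches
assert isinstance(recipe, LLMFinetuneRecipe)
```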
Test plan
Unit Tests pass (screenshot omitted).
Distributed Training with and without resume runs (screenshot omitted).
For recipe correctness, I ran both `finetune_llm.py` and `full_finetune.py` with the same config (FSDP and AC enabled) with `epochs=2` and `max_steps_per_epoch=10`, and the losses line up. I'll follow up with a unit test in a separate PR. Here's a gist with the losses.