Recipe Redesign #251
Conversation
❌ Deploy Preview for torchtune-preview failed.
recipes/full_finetune.py (Outdated)

lr: float,
output_dir: str,
metric_logger_type: str,
project: str,
Are we always going to have a bunch of args like this or will it eventually be converted into configs? cc @RdoubleA
Definitely a stopgap here. We should read from config/args and convert to a dataclass or something which has been validated.
@ebsmothers, @pbontrager made some changes, including supporting resuming from checkpoint in a much cleaner way.
torchtune/utils/checkpoint.py (Outdated)

NOTE: `load_checkpoint` does NOT load model and optimizer states into the model and optimizer respectively.
`load_checkpoint` handles the appropriate transformations (i.e. related to FSDP), but user is expected to
call `load_state_dict` on the returned results.
NOTE: `load_checkpoint` is a wrapper around ```torch.load``` and does not handle loading the state dict or other state
This should hopefully automatically add a link to the torch docs
Suggested change:
- NOTE: `load_checkpoint` is a wrapper around ```torch.load``` and does not handle loading the state dict or other state
+ NOTE: `load_checkpoint` is a wrapper around :func:`torch.load` and does not handle loading the state dict or other state
Great point! Thank you for sharing!
from torchtune import datasets, models, utils


def create_full_finetune_args(parser) -> argparse.ArgumentParser:
This is a stopgap until the config system is revamped, yes?
yeah, wondering what the order of merging the config and recipe redesign should be
Synced up with @RdoubleA offline on this.
@joecummings sorry for the strong opinion, but I vehemently disagree that structured recipes and training frameworks are the same thing. I have tons of very strong opinions around this. Happy to chat offline. But definitely read through the context section of this PR!
Thanks for opening this. As we've discussed, I'm generally supportive of the change and personally think the improved UX is worth adding a bit of loose, non-binding structure to make recipe authoring easier.
Are we going back on the principle of "No Inheritance"?
Follow-up to this FAQ: I agree that we aren't going back on the no inheritance principle, but wonder if we aren't implicitly encouraging our users to go back on it. As I mentioned in our discussion, my mental model for writing e.g. my own recipe with full determinism would be to take the FullFinetune recipe class, extend it, and override the setup_environment and setup_data methods. We have to find a way to very strongly discourage this as otherwise recipe shareability can easily start to go out the window.
recipes/llm_recipe_interface.py (Outdated)

def setup_environment(self, **kwargs) -> None:
    """
    Setup the environment needed for this recipe. This includes
    initializing distributed, setting the right types, devices and seed,
    and initializing the logger.
    """
    ...

def setup_model(self, **kwargs) -> None:
    """
    Instantiate the model, including additional capabilities such as
    FSDP and activation checkpointing. This does not include loading the
    checkpoint. All checkpointing should be done in ```load_checkpoint```.
    """
    ...

def setup_tokenizer(self, **kwargs) -> None:
    """
    Instantiate the tokenizer. Unfortunately, given the current implementation
    for SentencePiece, this does include loading the tokenizer checkpoint.
    """
    ...

def setup_optimizer_and_loss(self, **kwargs) -> None:
    """
    Instantiates the optimizer and loss functions correctly. This includes making sure
    learning-rate schedulers are correctly set up.
    """
    ...

def setup_data(self, **kwargs) -> None:
    """
    Logic associated with setting up datasets, samplers and associated dataloaders should
    be added here.
    """
    ...
Personally I would consolidate all of these into a single setup function, then leave it to our recipes to define individual `setup_*` methods. This is for a couple of reasons:
(1) Right now setup methods are more granular than train methods. One natural proposal to reduce this imbalance is "add more structure around the training loop", which I think we definitely do not want.
(2) I think there are some assumptions baked into some of these. E.g. inference-only may not want optimizer + loss, multimodal may want more than a tokenizer, etc. I think there is not a universal split.
(3) We should distinguish between hard constraints (i.e. methods defined in this protocol) and guidance/best practices based on the recipes we provide. Imo the splitting of setup methods should fall into the latter category.
Also (completely arbitrary): a setup + run + teardown partition is inherently more pleasing somehow (maybe it's the rule of three?)
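A rough sketch of what this consolidation could look like (illustrative names only, not the actual torchtune interface; the protocol exposes a single `setup`, while a concrete recipe keeps its own private helpers):

```python
# Illustrative sketch only; not the actual torchtune interface.
from typing import Protocol


class RecipeProtocol(Protocol):
    def setup(self, **kwargs) -> None: ...
    def train(self, **kwargs) -> None: ...
    def cleanup(self, **kwargs) -> None: ...


class ExampleFullFinetune:
    """Satisfies the protocol; the setup_* split is a recipe-level choice, not a hard constraint."""

    def setup(self, **kwargs) -> None:
        # Recipe-specific split, not mandated by the protocol.
        self._setup_model(**kwargs)
        self._setup_data(**kwargs)

    def _setup_model(self, **kwargs) -> None: ...
    def _setup_data(self, **kwargs) -> None: ...
    def train(self, **kwargs) -> None: ...
    def cleanup(self, **kwargs) -> None: ...
```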
@ebsmothers great point! Let me share some of my thoughts on these, but I'll definitely admit this isn't the perfect split and we should evolve it where needed.

> setup methods are more granular than train methods

This is primarily because setup is also more standard than the train loop.
- You set up your env in similar-ish ways for distributed training on GPUs.
- For model fetch, we read the config and call the getter + have some FSDP logic which only applies to the model.
- The tokenizer currently loads a checkpoint, but we can consolidate this with the model once we update the tokenizer.
- Optimizer and loss can be split up, but I want to keep these separate from the model.

> I think there are some assumptions baked into some of these

Yes, this assumes fine-tuning LLMs. Maybe I should make this clearer in the interface. Inference-only should not make use of this. It'll also likely be much simpler (we aren't going to support multi-GPU inference any time soon, I think).

> We should distinguish between hard constraints (i.e. methods defined in this protocol) and guidance/best practices based on the recipes we provide

Completely agree. But I'd rather err on the side of splitting stuff up while we're in the feel-out process. Taking everything and dumping it into a single function is much easier than splitting stuff up later (esp. if logic is deeply intertwined). Does this make sense?
I think what you're saying makes sense here, just wanna clarify the last point.

> Completely agree. But I'd rather err on the side of splitting stuff up while we're in the feel-out process. Taking everything and dumping it into a single function is much easier than splitting stuff up later (esp. if logic is deeply intertwined). Does this make sense?

My comment in (3) is not that we should use only a setup method in the FullFinetune class; I like the way that's currently written. I am more wondering if we should change just the protocol to remove the other setup methods. The point about ease of removing methods in the future is valid, but it is also kind of just a consequence of the fact that a protocol with more methods is inherently more constrained (for better or for worse).
That being said, if this protocol is tailored specifically towards LLM fine-tuning use cases, then at least my point (2) definitely becomes less of a concern. So I am good to take a "let's try it out and see" approach on this.
recipes/llm_recipe_interface.py (Outdated)

def load_checkpoint(self, **kwargs) -> None:
    """
    This method is responsible for loading ALL of the state for the recipe from the
    checkpoint, including state for the model, optimizer, dataloader and training
    parameters such as the epoch and seed.
    """
    ...

def save_checkpoint(self, **kwargs) -> None:
    """
    This method is responsible for saving ALL of the state for the recipe,
    including state for the model, optimizer, dataloader and training
    parameters such as the epoch and seed.
    """
    ...
Q on these: why do we assume that the user wants a single load/save checkpoint API? E.g. when I run a fine-tune I may want to save optimizer and dataloader state at intermediate points, but at the end I probably only wanna save model weights. Do we encapsulate all of that in save_checkpoint?
> when I run a fine-tune I may want to save optimizer and dataloader state at intermediate points

Why would you save these and not the model weights? You also don't want to output too many checkpoint files; basically everything should be written together.
Right now the ability to checkpoint the optimizer depends on whether you're in the middle of training or not. I need to add logic to make sure we don't checkpoint opt state at the end of training. This is too expensive, and continued fine-tuning isn't a thing right now.
> Why would you save these and not the model weights?

Oh sorry, I mean these in addition to model weights.

> Right now the ability to checkpoint the optimizer depends on whether you're in the middle of training or not. I need to add logic to make sure we don't checkpoint opt state at the end of training.

This is basically my question: where are we doing this? Does save_checkpoint take an epoch arg or something?
yeah, the function in the FullFinetune class has the following signature: `def save_checkpoint(self, epoch, rank)`
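For reference, a minimal sketch of how that method can gate the optimizer state on training progress (simplified from the snippet quoted later in this thread; the exact save utility and output path are assumptions):

```python
# Simplified sketch; the save utility call and output path below are assumptions.
def save_checkpoint(self, epoch: int, rank: int) -> None:
    # Always checkpoint the model weights.
    ckpt_dict = {"model": self.model}
    # Only include optimizer state mid-training; the final checkpoint
    # keeps just the weights to avoid an unnecessarily large file.
    if epoch + 1 < self.total_epochs:
        ckpt_dict.update({"optimizer": self.optimizer})
    utils.save_checkpoint(ckpt_dict, f"{self.output_dir}/model_{epoch}.ckpt")
```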
@ebsmothers great question. I think we need to tackle this on several fronts. Basically we can only do this through a COMBINATION of existing examples, educating users through documentation, and explicitly blocking PRs from merging which don't follow our principles.
…eckpoint and add the associated tests. Creating separate utility for now to not block other PRs
@dataclass
class FullFinetuneParams:
can we break up this PR and add the dataclasses separately? then we can merge in the dataclass update, the config refactor, and finally this recipe redesign. what would you suggest is the best order?
Does @RdoubleA's work replace this?
dtype: str

# Reproducibility
seed: int
this should be Optional[int] = None, right?
That's right, good catch
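i.e., a sketch of the corrected field (other fields omitted):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FullFinetuneParams:
    dtype: str
    # Reproducibility: seeding is optional, so None must be a valid value.
    seed: Optional[int] = None
```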
- This interface is not meant to add constraints. If the interface comes in the
  way of doing stuff, it needs to be updated or a new interface should be
  written to support what might be a new "family" of recipes.
"""
You do all the setup in the init, which seems like it would be a standard operation for any recipe; wondering if you should add a setup method to the interface as well?
I had setup as a function of the interface, but moved this out since I don't expect this to be a public API. Happy to think through the alternative though. What would be the case for making setup public on this class?
Imo there can be a distinction between the init of the recipe class and the actual "setup" stage of a finetune. E.g. I could imagine a world where someone wants to parse and export the config to a file as a form of validation when the recipe gets called. To me this is a bit different from setting up the model, tokenizer, dataloader, etc.
Thinking about this redesign a lot, and the pros and cons that go with it. My main concern with the direction of this change is that it's moving things in a risky direction: towards making the recipes folder another library instead of a home for scripts that use the torchtune library folder. I'm generally in favor of a lot of the structure suggested here, but do agree with others that the class structure will de facto have users inheriting from it and abusing it as much or more than the script-style recipes we're currently working with.

I heavily prefer script recipes because I think they're much less heavy-handed and more comfortable to get in and edit. We don't have to bias our users towards particular use cases (setup, train, etc.) but instead can let the types of recipes evolve naturally. For example, we already have a generation recipe that we might never have made if this structure had been here. Finally, I feel the structure adds one additional layer of abstraction that the user has to look up and learn to work with a recipe. If they want to do something slightly outside of the interface signatures, they have to decide to abandon it entirely or hack their way around it.

To encourage homogeneous recipe designs and good practice, I'd be much more in favor of providing template structures that users can base their recipes on but that don't directly restrict what users can do. These would be more like "documentation" for contributing. Below I provide an example structure; a real template could include many more lines, I shortened it for brevity.
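(The example structure referenced above is not reproduced here; the following is a hypothetical sketch of the kind of script-style template being described, with purely illustrative names.)

```python
# Hypothetical script-style recipe template (illustrative only; this is not
# the example originally posted in the thread).
import argparse


def setup(params: argparse.Namespace) -> dict:
    """Build model, tokenizer, dataloader and optimizer as plain objects (no class state)."""
    return {}


def train(state: dict, params: argparse.Namespace) -> None:
    """Plain training loop that users edit directly."""
    pass


def main() -> None:
    parser = argparse.ArgumentParser(description="Template recipe")
    parser.add_argument("--epochs", type=int, default=1)  # illustrative arg
    params = parser.parse_args()
    state = setup(params)
    train(state, params)


if __name__ == "__main__":
    main()
```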
@pbontrager Thanks for the thoughtful discussion on this. I'm glad you spent some time thinking through this. Looking through your proposal and your concerns, I think the primary difference between our proposals is the ability for the recipe to maintain state. I don't buy some of the other differences (see my comments below). Happy to chat a bit more about this. Being able to store state is important IMO for a couple of reasons (there are others, but I'm sticking to user-facing):
See my comment to Evan above on this. I don't think this is a valid concern, at least not yet. I can easily clamp down on inheritance in the code base through examples, reviews and user education. Users inherit because lib maintainers allow it.
What does this mean? Why is this any harder to edit? It's essentially the same structure.
See the context section of this PR on why this is not a good idea - our current recipe is a mess.
This interface is only for LLM fine-tuning. In the next few changes, we can make this concrete.
Again, what does this mean? The structure of the recipe is pretty much the same; the difference comes from how the state is maintained.
Thank you for breaking out these points; I'll try to give a better description of my reasoning and my thoughts on recipe state.
I say this just because I don't believe those measures will be enough. I think the affordances of the class and the biases Python developers bring into this will be too hard to fight back against.
It's not harder to edit an existing recipe, but it's harder to write a new recipe from scratch, as you have to figure out the correct place to put your code. With scripts this is more of a suggestion, so it's not as much of a concern. Also, you can freely modify function signatures in scripts.
Part of the disagreement here is over the product goal of our recipes. I don't believe that the current recipes are "a mess", but they do need to be further refined and cleaned up. What I believe we've been building here is a code platform to make it easy to share reproducible and robust PyTorch scripts. To that end, we should let users' imagination take over and let them define what is most useful for PyTorch scripts to cover. Contributions to our library, though, shouldn't be any user script but a rewrite of one of our existing recipes with minimal changes to support the user's recipe. E.g. is there a new feature that we need a utility/loss/optimizer/model for? Are these just a better config? If the recipe diverges dramatically from an existing strategy, then likely our classes wouldn't have supported it to begin with.
This inherently means that we'll make a small library of recipe types we support, and we'll have to approve new recipe types instead of letting the community organically determine what recipes we support. Finally, on the discussion of state: I agree it's very convenient to have an init, and I like this innovation from this proposal. But I think there are other ways we could get these same benefits. One example: we could have a recipe util that's essentially a smart namespace, similar to the output of argparse. (Just one idea; we could work on the name, e.g. args, state, ...)
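A rough sketch of what such a namespace utility could look like (purely illustrative; not an existing torchtune API):

```python
# Purely illustrative sketch of the "smart namespace" idea; not an existing torchtune utility.
from types import SimpleNamespace


class RecipeState(SimpleNamespace):
    """Argparse-style namespace that recipe functions can read and mutate,
    giving script recipes a single place to keep shared state."""

    def update(self, **kwargs) -> None:
        for key, value in kwargs.items():
            setattr(self, key, value)


# Usage: functions pass the state around instead of storing it on a recipe class.
state = RecipeState(seed=0, epochs_run=0)
state.update(epochs_run=state.epochs_run + 1)
```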
@pbontrager a few thoughts:
Again, we're proposing the same structure, so I'm not buying this. I think what you're coming down to is: classes are harder to write than scripts.
The interface isn't explicitly forcing a contract, though maybe it should. You're viewing this flexibility as a pro. Given where we are, the value propositions we're claiming to provide (stability and extensibility), and the users we're currently planning on supporting (hobbyists and then hackers), I don't view this flexibility as a pro at all. I'm happy to constrain new use cases and force users to have some design discussions with the core team before they make these changes. As an alpha, I think that's the right expectation to set. I'd rather do few things but do them well than boil the ocean.
I think we both agree that recipes are the product and our success will depend on us ensuring users can easily (and reliably) train the LLMs they want to. What I don't agree with is that users have the best understanding of how to maintain a high bar with regards to all of this. We need to have a maintainable, well-tested and well-documented code base. As a team of 4 people we will need to scale ourselves for user support. I can't imagine opening this up to writing arbitrary scripts is the way that can happen (not that you're saying that either, but I want to set the expectation here). Again, I'd be ok with doing less but doing it well. I'd be ok with constraining user imagination in the early stages of the lib.
Why reinvent a programming concept that is well understood and reasoned about? We have enough things to innovate on :)
…_checkpoint comment
One high-level comment I want to clarify. My understanding is that we would directly support a fairly limited number of very high quality recipes. Anything else, we would encourage forks. Actually, a direct quote from Soumith was that we need to "bless" forks, making it easier for people to go wild with how they want to extend our utilities and recipes. This design seems to bake in the assumption that users will frequently be contributing recipes directly to our repository, which is new to me. To add to this, this model of thinking would heavily increase our load as maintainers and not necessarily in a way that helps further the community. Please correct me if I'm not understanding.
@joecummings good question:
This makes sense AFTER folks buy into the value of this library. It's not a given that you'll get there, i.e. that folks will buy what you sell and start using/contributing code to the repo. The immediate priority is to highlight our design principles, make sure what we provide works well, and ensure users can see a direct line from there to extending capabilities. Let me know if that makes sense?
help="Tensor dtype used for finetuning, lower precision types result in mixed precision training.",
)

group = parser.add_mutually_exclusive_group()
Curious about the rationale here. Personally I haven't seen `--resume-from-checkpoint` split out into two separate args like this before.
I just followed what seemed to be the convention for booleans in this list. Should I be doing this differently?
Hmmm yeah I see the shuffle/no-shuffle above as well, tbh I don't love that either. Basically I don't think we should be exposing two separate CLI args for True/False, especially if one of them is not found anywhere else (since `--no-resume` is just equivalent to setting `resume_from_checkpoint=False`). In this case, why not just have a single `--resume-from-checkpoint` arg defaulting to False?
@ebsmothers this is also what I was complaining about the other day. But IIUC, @pbontrager and @NicolasHug mentioned it's the only way to set this up. I'll let @RdoubleA clean this up in one of his PRs.
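For reference, a sketch of the single-flag alternative being suggested, using standard argparse (the argument name mirrors the discussion; everything else is illustrative):

```python
# Sketch of the single-flag alternative: one boolean arg, default False.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--resume-from-checkpoint",
    action="store_true",  # passing the flag sets it to True; omitting it leaves it False
    default=False,
    help="Resume model, optimizer and training state from a previous checkpoint.",
)
args = parser.parse_args([])  # empty list so the sketch runs without CLI input
assert args.resume_from_checkpoint is False
```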
self.device = utils.get_device(device=params.device)
self.dtype = utils.get_dtype(dtype=params.dtype)
self.seed = utils.set_seed(seed=params.seed)
self.output_dir = params.output_dir

_, rank = utils.get_world_size_and_rank()
self.is_rank_zero = rank == 0

self.model = self._setup_model(model=params.model)
self.tokenizer = self._setup_tokenizer(
    tokenizer=params.tokenizer, tokenizer_checkpoint=params.tokenizer_checkpoint
)
self.optimizer = self._setup_optimizer(optimizer=params.optimizer, lr=params.lr)
self.loss_fn = self._setup_loss(loss=params.loss)
One thing that's a bit unclear to me here is the split between calling utils functions and FullFinetune methods here. I know utils methods get called in the `self.*` methods, but personally I feel it's nicer to keep them out of the raw `__init__` as much as possible. I think you had a `_setup_env` method previously, why not bring that back? (Just in this class, not in the interface)
This is a good point. So what I was optimizing for was to ensure all of the attributes are set in the `__init__` instead of in helper functions, otherwise tracking these down becomes hard. The `setup_environment` function just had a bunch of attributes which needed to be returned and also was a cluster of utility calls. After I removed `init_process_group` it wasn't clear why that needed to be a different function. Does that make sense? Let me know if you disagree.
Yeah as you point out idk if there's a 100% perfect way to do this. I am fine with the approach you have here
def _setup_model(self, model: str) -> nn.Module:
    """
    This assumes FSDP and activation checkpointing are enabled by default.
Do we plan to parametrize this later on?
I think we can have a discussion about AC (cc: @rohan-varma) but not sure if disabling FSDP makes sense?
I can imagine cases where one would want to run without FSDP (though doesn't necessarily have to be in this recipe). But this is something we can decouple from this PR anyways
@@ -127,3 +127,79 @@ def load_checkpoint(
    ckpt_dict["optimizer"] = optim_state_dict_to_load

    return ckpt_dict


def load_checkpoint_updated(
nit: add a todo here (or file an issue or something) so we actually change this name
should follow. Please note that the interface itself should not be a vehicle for
code reuse. TorchTune strictly prohibits implementation inheritance in the codebase.

TODO: Add link to design principle README
Is this happening as part of this PR?
What does "this" refer to here?
Sorry, referring to the TODO. Are we gonna link to design principles?
I'll follow up.
- This interface is meant to help recipe-writers organize their code in a way
  which is easy to read, understand and extend.
Should probably also add a link to FullFinetune as an example of how users should implement this interface
def _setup_data(
    self, dataset: str, shuffle: bool, batch_size: int
) -> Tuple[DistributedSampler, DataLoader]:
Is this a standard way to handle sampler and dataloader? The return type of this method feels unintuitive.
I think this is a constraint of our setup i.e. we only support DistributedSampler and Map-style datasets.
Oh that makes sense, I just mean a return type tuple of sampler + dataloader is a bit weird
recipes/full_finetune.py (Outdated)

# if training is in-progress, checkpoint the optimizer state as well
if epoch + 1 < self.total_epochs:
    ckpt_dict.update({"optimzer": self.optimizer})
Suggested change:
- ckpt_dict.update({"optimzer": self.optimizer})
+ ckpt_dict.update({"optimizer": self.optimizer})
init_process_group(backend="nccl")

recipe = FullFinetune(params=recipe_params)
recipe.load_checkpoint(model_checkpoint=recipe_params.model_checkpoint)
Any particular reason this gets its own line in main, but other pre-training steps do not?
Oh, this is because `__init__` does all of the instantiation and `load_checkpoint` explicitly mutates its state. In the near future we need to add functionality for `eval` as well. What else is missing here?

Just wanna confirm based on the test plan: have you done some sanity check on the generations of trained checkpoints before and after the refactor? (Or some other way to build confidence that there are no model quality regressions.)
@@ -0,0 +1,43 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
Another quick note - are you planning on housing all recipe dataclasses here? I would vote for a folder called args or params where each recipe has its own file, so the dataclass for full_finetune would be at `recipes/args/full_finetune.py`. Otherwise a single file will get very bloated.
The original goal was to keep everything in a single file to make recipes easy to share. Even the configs could be set as defaults to make a shareable version of a file. If the dataclass is kept as a separate file now, that'll require more places to track what parameters are in a recipe.
@RdoubleA feel free to suggest the change in your PR. Introducing dataclasses has been on my mind for a while and I finally got around to adding it here. Open to what this looks like longer term.
@pbontrager I think that requirement is already broken by the move to more structured config system (see Rafi's PR). I'm ok with treating (config, dataclass, recipe class and checkpoint) as the complete source of truth for the recipe.
Please make sure that open comments get addressed (OK to tackle minor ones in a fast follow). Given that (a) we are not replacing the current recipe in this change, (b) we have reasonable parity documented in the test plan, and (c) this will unblock some other dependencies, I am gonna stamp.
There are some good questions raised in this PR. I'll make sure to put out a detailed doc explaining the rationale behind every change I've made and we can discuss this in one of the deepdive sessions. Not going to do that this week given our goal to land everything by EOW.
Context
The current recipe structure is the wild wild west.
We don't have any structure, which makes the recipe hard to understand, read and extend. Not to mention, the current implementation is a lot more bloated than it needs to be. Our design principles are no longer shining through. In the current form the recipe (our main product) is no longer compelling for new users. Specifically:
In its current form, users will dump code in arbitrary ways into our recipes folder, which we'll need to review and in the worst case debug. We need to add some structure.
What our recipe code should not do
In this change, I take a different view of the recipe and try to better organize not just the components but the actual script itself. At the end we have the full recipe in 7 lines of code, which is nice. Specifically:
- `__init__` is broken up into several `setup_` functions. Most users will only care about a subset of these, for example `setup_data` for changing the dataset.
- `load_checkpoint` is updated to depend on a `resume_from_checkpoint` flag so we can unify the API for the cases where we're resuming training or starting afresh.
- The structure of the recipe is clean: initialize, train, cleanup (a sketch of this flow is shown below).
I don't expect this to be the perfect structure. But this is the first step towards organizing the recipe better. We'll stress test this more as we go through the LoRA recipe and add more variants to full-finetune (adding determinism) for example.
FAQs
Are we building a Training Framework?
NO. Adding structure doesn't mean we're building a trainer or a training framework. We won't have "one" way to do anything. We do provide well-designed utilities. But if a given structure doesn't work, we should just update it.
Are we going back on the principle of "No Inheritance"?
NO. Inheritance for code-reuse is a no-go. At all. We're enforcing a loose interface. And inheriting from an interface is just good design.
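For instance, with a Protocol-style interface a recipe conforms structurally without inheriting any implementation (illustrative sketch; the class names here are not the actual torchtune ones):

```python
# Illustrative sketch: structural typing means no implementation inheritance.
from typing import Protocol, runtime_checkable


@runtime_checkable
class LLMFinetuneRecipe(Protocol):
    def train(self, **kwargs) -> None: ...


class MyFinetune:  # no base class, no inherited code
    def train(self, **kwargs) -> None:
        print("one training step")


recipe: LLMFinetuneRecipe = MyFinetune()  # type-checks because the shape matches
assert isinstance(recipe, LLMFinetuneRecipe)
```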
Test plan
Unit Tests pass (screenshot omitted).
Distributed Training with and without resume runs (screenshot omitted).
For recipe correctness, I ran both `finetune_llm.py` and `full_finetune.py` with the same config (FSDP and AC enabled) with `epochs=2` and `max_steps_per_epoch=10`, and the losses line up. I'll follow up with a unit test in a separate PR. Here's a gist with the losses.