
Recipe Redesign #251

Merged: 13 commits merged into main from r_redesign on Jan 29, 2024

Conversation

@kartikayk (Contributor) commented on Jan 25, 2024

Context

The current recipe structure is the wild wild west.

We don't have any structure, which makes the recipe hard to understand, read, and extend. Not to mention, the current implementation is a lot more bloated than it needs to be. Our design principles no longer shine through. In its current form, the recipe (our main product) is no longer compelling for new users. Specifically:

  • Logic needed to set up different components is interleaved with core functionality.
  • We make a number of assumptions, but these aren't clearly documented. For example, we ALWAYS enable FSDP. Unless you understand the code deeply, this won't be obvious to you.
  • It's unclear where and how the user should add new functionality.

In its current form, users will dump code in arbitrary ways into our recipes folder, which we'll need to review and in the worst case debug. We need to add some structure.

What our recipe code should not do

  • Increase the cognitive load for a user by asking them to understand functionality they don't care about. Simply removing boilerplate code without paying attention to why it's being removed is the easiest way to add this cognitive load.
  • Cater to only a single user-demographic: hobbyist or hero. We need to support THE FULL SPECTRUM of users.
  • Hide functionality users need to do their job. See the first bullet.

In this change, I take a different view of the recipe and try to better organize not just the components but the actual script itself. At the end, the full recipe script is about 7 lines of code, which is nice. Specifically:

  • I move the arguments being added to the parser to a different file.
  • I create a "loose" structure for the recipe by defining an interface. This is "loose" because the methods themselves don't have a signature enforced (not possible if we want this to be generally applicable). This interface should be viewed more as a tool for organizing the recipe and making it easier to parse.
  • __init__ is broken up into several setup_ functions. Most users will only care about a subset of these. For example, setup_data for changing the dataset.
  • I update load_checkpoint to depend on a resume_from_checkpoint flag so we can unify the API for the cases where we're resuming training or starting afresh.

The structure of the recipe is clean: initialize, train, cleanup (see the sketch below).
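For concreteness, a minimal sketch of the resulting script-level flow; train() and cleanup() are inferred from the initialize/train/cleanup structure above, and the exact signatures may differ:

from torch.distributed import init_process_group


def recipe_main(recipe_params: FullFinetuneParams) -> None:
    # distributed init stays at the script level
    init_process_group(backend="nccl")

    # __init__ runs the setup_* steps and stores state as attributes
    recipe = FullFinetune(params=recipe_params)
    # explicitly mutate state from the checkpoint (resume or fresh start)
    recipe.load_checkpoint(model_checkpoint=recipe_params.model_checkpoint)
    recipe.train()
    recipe.cleanup()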

I don't expect this to be the perfect structure, but it's the first step towards organizing the recipe better. We'll stress-test this more as we go through the LoRA recipe and add more variants to full-finetune (adding determinism, for example).

FAQs

  • Are we building a Training Framework?
    NO. Adding structure doesn't mean we're building a trainer or a training framework. We won't have "one" way to do anything. We do provide well-designed utilities. But if a given structure doesn't work, we should just update it.

  • Are we going back on the principle of "No Inheritance"?
    NO. Inheritance for code-reuse is a no-go. At all. We're enforcing a loose interface. And inheriting from an interface is just good design.

Test plan

Unit tests pass (screenshot omitted):

pytest

Distributed training with and without resume runs (screenshot omitted):

torchrun --nnodes 1 --nproc_per_node 8 recipes/full_finetune.py --config recipes/configs/alpaca_llama2_full_finetune.yaml --model-checkpoint /home/kartikayk/cpts/llama2-7b-01242024 --seed 18 --tokenizer-checkpoint /home/kartikayk/cpts/tokenizer.model --epochs 1 --max-steps-per-epoch 10

For recipe correctness, I ran both finetune_llm.py and full_finetune.py with the same config (FSDP and AC enabled) with epochs=2 and max_steps_per_epoch=10 and losses line up. I'll follow up with a unit test in a separate PR.

Here's a gist with the losses.

netlify bot commented on Jan 25, 2024

Deploy Preview for torchtune-preview failed.

🔨 Latest commit: 04ea4b0
🔍 Latest deploy log: https://app.netlify.com/sites/torchtune-preview/deploys/65b7fe6d6cad200008f57786

@facebook-github-bot added the "CLA Signed" label on Jan 25, 2024
@kartikayk changed the title from "Recipe Redesign" to "[WIP] Recipe Redesign" on Jan 25, 2024
lr: float,
output_dir: str,
metric_logger_type: str,
project: str,
Member:

Are we always going to have a bunch of args like this or will it eventually be converted into configs? cc @RdoubleA

Contributor Author:

Definitely a stopgap here. We should read from config/args and convert to a validated dataclass or something along those lines.
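Something along these lines, sketched with the handful of fields visible above (the validation logic is purely illustrative):

import argparse
from dataclasses import dataclass, fields


@dataclass
class FullFinetuneParams:
    lr: float
    output_dir: str
    metric_logger_type: str
    project: str


def params_from_args(args: argparse.Namespace) -> FullFinetuneParams:
    # pick out exactly the fields the dataclass declares, then validate
    kwargs = {f.name: getattr(args, f.name) for f in fields(FullFinetuneParams)}
    params = FullFinetuneParams(**kwargs)
    if params.lr <= 0:
        raise ValueError(f"lr must be positive, got {params.lr}")
    return params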

@kartikayk (Contributor Author):

@ebsmothers, @pbontrager made some changes, including supporting resume-from-checkpoint in a much cleaner way.

NOTE: `load_checkpoint` does NOT load model and optimizer states into the model and optimizer respectively.
`load_checkpoint` handles the appropriate transformations (i.e. related to FSDP), but user is expected to
call `load_state_dict` on the returned results.
NOTE: `load_checkpoint` is a wrapper around ```torch.load``` and does not handle loading the state dict or other state
Member:

This should hopefully automatically add a link to the torch docs

Suggested change
NOTE: `load_checkpoint` is a wrapper around ```torch.load``` and does not handle loading the state dict or other state
NOTE: `load_checkpoint` is a wrapper around :func:`torch.load` and does not handle loading the state dict or other state

Contributor Author:

Great point! Thank you for sharing!
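For reference, the calling pattern the note above implies looks roughly like this (the dict keys and the argument list are assumptions, not copied from the diff):

ckpt_dict = load_checkpoint(ckpt_path, model, optimizer)  # thin wrapper around torch.load
model.load_state_dict(ckpt_dict["model"])                 # caller applies the model state
optimizer.load_state_dict(ckpt_dict["optimizer"])         # caller applies the optimizer state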

from torchtune import datasets, models, utils


def create_full_finetune_args(parser) -> argparse.ArgumentParser:
Contributor:

This is a stopgap until the config system is revamped, yes?

Contributor:

yeah, wondering what the order of merging the config and recipe redesign should be

Contributor Author:

Synced up with @RdoubleA offline on this.

@joecummings (Contributor) left a comment:

While I agree this makes the recipe more structured and easier to read at a glance, I would be remiss if I didn't point out that this is starting to look like PyTorch Lightning's hooks, which have very mixed reactions.

(screenshot of PyTorch Lightning's hook API omitted)

I want to be sure we acknowledge this is the direction we're heading.

@kartikayk (Contributor Author):

@joecummings sorry for the strong opinion, but I vehemently disagree that structured recipes and training frameworks are the same thing. I have tons of very strong opinions around this. Happy to chat offline, but definitely read through the context section of this PR!

@ebsmothers (Contributor) left a comment:

Thanks for opening this. As we've discussed, I'm generally supportive of the change and personally think the improved UX is worth adding a bit of loose, non-binding structure to make recipe authoring easier.

Are we going back on the principle of "No Inheritance"?

Follow-up to this FAQ: I agree that we aren't going back on the no inheritance principle, but wonder if we aren't implicitly encouraging our users to go back on it. As I mentioned in our discussion, my mental model for writing e.g. my own recipe with full determinism would be to take the FullFinetune recipe class, extend it, and override the setup_environment and setup_data methods. We have to find a way to very strongly discourage this as otherwise recipe shareability can easily start to go out the window.

Comment on lines 16 to 51
def setup_environment(self, **kwargs) -> None:
"""
Setup the environment needed for this recipe. This includes
initializing distributed, setting the right types, devices and seed,
and initializing the logger.
"""
...

def setup_model(self, **kwargs) -> None:
"""
Instantiate the model, including additional capabilities such as
FSDP and activation checkpointing. This does not include loading the
checkpoint. All checkpointing should be done in ```load_checkpoint```.
"""
...

def setup_tokenizer(self, **kwargs) -> None:
"""
Instantiate the tokenizer. Unfortunately, given the current implementation
for SentencePiece, this does include loading the tokenizer checkpoint.
"""
...

def setup_optimizer_and_loss(self, **kwargs) -> None:
"""
Instantiates the optimizer and loss functions correctly. This includes making sure
learning-rate schedulers are correctly set up.
"""
...

def setup_data(self, **kwargs) -> None:
"""
Logic associated with setting up datasets, samplers and associated dataloaders should
be added here.
"""
...
Contributor:

Personally I would consolidate all of these into a single setup function, then leave it to our recipes to define individual setup_* methods. This is for a couple of reasons:

(1) Right now the setup methods are more granular than the train methods. One natural proposal to reduce this imbalance is "add more structure around the training loop", which I think we definitely do not want.
(2) I think there are some assumptions baked into some of these. E.g. inference-only may not want optimizer+loss, multimodal may want more than a tokenizer, etc. I think there is not a universal split.
(3) We should distinguish between hard constraints (i.e. methods defined in this protocol) and guidance/best practices based on the recipes we provide. Imo the splitting of setup methods should fall into the latter category.

Also (completely arbitrary): a setup + run + teardown partition is inherently more pleasing somehow (maybe it's the rule of three?).
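Concretely, the consolidated protocol would be something like this minimal sketch (the interface name here is a placeholder, not the one in the PR):

from typing import Protocol


class RecipeInterface(Protocol):
    def setup(self, **kwargs) -> None:
        """One-time initialization: env, model, tokenizer, optimizer, data.
        Recipes are free to split this into private _setup_* helpers."""
        ...

    def train(self, **kwargs) -> None:
        """The training loop."""
        ...

    def cleanup(self, **kwargs) -> None:
        """Teardown, e.g. destroying the process group."""
        ...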

Contributor Author:

@ebsmothers great point! Let me share some of my thoughts on these, but I'll definitely admit these aren't the perfect split and we should evolve where needed.

setup methods are more granular than train methods

This is primarily because setup is also more standard than the train loop.

  • You set up your env in similar-ish ways for distributed training on GPUs.
  • For model fetch, we read the config and call the getter + have some FSDP logic which only applies to the model.
  • The tokenizer currently loads its checkpoint, but we can consolidate this with the model once we update the tokenizer.
  • Optimizer and loss can be split up but I want to keep these separate from the model.

I think there are some assumptions baked in to some of these

Yes, this assumes fine-tuning LLMs. Maybe I should make this clearer in the interface. Inference-only should not make use of this. It'll also likely be much simpler (we aren't going to support multi-GPU inference any time soon I think).

We should distinguish between hard constraints (i.e. methods defined in this protocol) and guidance/best practices based on the recipes we provide

Completely agree. But I'd rather err on the side of splitting stuff up while we're in the feel-out process. Taking everything and dumping into a single function is much easier than splitting stuff up later (esp if logic is deeply intertwined). Does this make sense?

Contributor:

I think what you're saying makes sense here, just wanna clarify the last point.

Completely agree. But I'd rather err on the side of splitting stuff up while we're in the feel-out process. Taking everything and dumping into a single function is much easier than splitting stuff up later (esp if logic is deeply intertwined). Does this make sense?

My comment in (3) is not that we should use only a setup method in the FullFinetune class, I like the way that's currently written. I am more wondering if we should change just the protocol to remove the other setup methods. The point about ease of removing methods in the future is valid, but is also kind of just a consequence of the fact that a protocol with more methods is inherently more constrained (for better or for worse).

That being said, if this protocol is tailored specifically towards LLM fine-tuning use cases, then at least my point (2) definitely becomes less of a concern. So I am good to take a "let's try it out and see" approach on this.

Comment on lines 53 to 67
def load_checkpoint(self, **kwargs) -> None:
"""
This method is responsible for loading ALL of the state for the recipe from the
checkpoint, including state for the model, optimizer, dataloader and training
parameters such as the epoch and seed.
"""
...

def save_checkpoint(self, **kwargs) -> None:
"""
This method is responsible for saving ALL of the state for the recipe,
including state for the model, optimizer, dataloader and training
parameters such as the epoch and seed.
"""
...
Contributor:

Q on these: why do we assume that the user wants a single load/save checkpoint API? E.g. when I run a fine-tune I may want to save optimizer and dataloader state at intermediate points, but at the end I probably only wanna save model weights. Do we encapsulate all of that in save_checkpoint?

Contributor Author:

when I run a fine-tune I may want to save optimizer and dataloader state at intermediate points

Why would you save these and not the model weights? You also don't want to output too many checkpoint files; basically everything should be written together.

Right now the ability to checkpoint the optimizer depends on whether you're in the middle of training or not. I need to add logic to make sure we don't checkpoint opt state at the end of training. That's too expensive, and continued fine-tuning isn't a thing right now.

Contributor:

Why would you save these and not the model weights?

Oh sorry, I mean these in addition to model weights.

Right now the ability to checkpoint the optimizer depends on whether you're in the middle of the training or not. I need to add logic to make sure we don't checkpoint opt state at the end of training.

This is basically my question.. where are we doing this? Does save_checkpoint take an epoch arg or something?

Contributor Author:

yeah, the function in the FullFinetune class has the following signature:

def save_checkpoint(self, epoch, rank)
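Filled out, that looks roughly like the following; this is a sketch built around the snippet reviewed further down in this PR, with the actual write step elided:

def save_checkpoint(self, epoch: int, rank: int) -> None:
    ckpt_dict = {"model": self.model}
    # if training is in-progress, checkpoint the optimizer state as well;
    # at the end of training only the model weights are written
    if epoch + 1 < self.total_epochs:
        ckpt_dict.update({"optimizer": self.optimizer})
    # write ckpt_dict to self.output_dir (the real helper handles FSDP gathering)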

@kartikayk (Contributor Author):

@ebsmothers great question.

As I mentioned in our discussion, my mental model for writing e.g. my own recipe with full determinism would be to take the FullFinetune recipe class, extend it, and override the setup_environment and setup_data methods

I think we need to

  • Provide enough examples to set a precedent
  • Explain in our readme and documentation WHY we dissuade inheritance
  • Clamp down hard in design/code reviews

Basically we can only do this through a COMBINATION of existing examples, educating users through documentation, and explicitly blocking PRs that don't follow our principles from merging.



@dataclass
class FullFinetuneParams:
Contributor:

Can we break up this PR and add the dataclasses separately? Then we can merge the dataclass update, the config refactor, and finally this recipe redesign. What would you suggest is the best order?

Contributor:

Does @RdoubleA's work replace this?

dtype: str

# Reproducibility
seed: int
Contributor:

this should be Optional[int] = None, right?

Contributor Author:

That's right, good catch
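i.e. the field becomes something like this (a sketch trimmed to the fields visible above):

from dataclasses import dataclass
from typing import Optional


@dataclass
class FullFinetuneParams:
    dtype: str
    # Reproducibility: None means the seed is not fixed
    seed: Optional[int] = None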

- This interface is not meant to add constraints. If the interface comes in the
way of doing stuff, it needs to be updated or a new interface should be
written to support what might be a new "family" of recipes.
"""
Contributor:

You do all the setup in the init, which seems like it would be a standard operation for any recipe; wondering if you should add a setup method to the interface as well?

Contributor Author:

I had setup as a function on the interface, but moved this out since I don't expect it to be a public API. Happy to think through the alternative though. What would be the case for making setup public on this class?

Contributor:

Imo there can be a distinction between the init of the recipe class and the actual "setup" stage of a finetune. E.g. I could imagine a world where someone wants to parse and export the config to a file as a form of validation when the recipe gets called. To me this is a bit different from setting up the model, tokenizer, dataloader, etc.
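A hypothetical illustration of that distinction; nothing here is from the PR, the names are made up to make the point concrete:

import json
from dataclasses import asdict


class MyRecipe:
    def __init__(self, params: FullFinetuneParams) -> None:
        # cheap: record and validate the config so the run is reproducible
        self.params = params
        with open(f"{params.output_dir}/config.json", "w") as f:
            json.dump(asdict(params), f, indent=2)

    def setup(self) -> None:
        # expensive: model, tokenizer, optimizer and dataloader get built here
        ...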


@pbontrager (Contributor):

Thinking about this redesign a lot, and the pros and cons that go with it. My main concern with the direction of this change is that it moves us in a risky direction: towards making the recipes folder another library instead of a home for scripts that use the torchtune library. I'm generally in favor of a lot of the structure suggested here, but do agree with others that the class structure will de facto have users inheriting from it and abusing it as much as or more than the script-style recipes we're currently working with.

I heavily prefer script recipes because I think it's much less heavy-handed and more comfortable to get in and edit. We don't have to bias our users towards particular use cases (setup, train, etc.) but can instead let the types of recipes evolve naturally. For example, we already have a generation recipe that we might never have made if this structure wasn't here. Finally, I feel the structure adds one additional layer of abstraction that the user has to look up and learn to work with a recipe. If they want to do something slightly outside of the interface signatures, they have to decide to abandon it entirely or hack their way around it.

To encourage homogeneous recipe designs and good practice, I'd be much more in favor of providing template structures that users can base their recipes on but that don't directly restrict what users can do. These would be more like "documentation" for contributing. Below I provide an example structure; a real template could include many more lines, but I shortened it for brevity.

def setup_model(...):
    # Define model definition here
    return model, tokenizer


def setup_optimization(...):
    # Define optimizer modules here
    return opt, lr_schedule, autocast, grad_scaler


def setup_dataloader(...):
    # Define dataset and dataloader here
    return dataloader


def setup_logging(...):
    # Setup all logging tools here
    return metric_logger, logger, tqdm


def recipe(...):
    # Setup objects and training loop here

@kartikayk (Contributor Author):

@pbontrager Thanks for the thoughtful discussion on this. I'm glad you spent some time thinking through this.

Looking through your proposal and your concerns, I think the primary difference between our proposals is the ability for the recipe to maintain state. I don't buy some of the other differences (see my comments below). Happy to chat a bit more about this.

Being able to store state is important IMO for a couple reasons (there are others, but I'm sticking to user facing):

  • You don't have to keep passing attributes like dtype and device into every method (annoying) or keep computing them on the fly from the utilities (inefficient). It's also more intuitive.
  • You very cleanly decouple the methods that mutate this state (init and load_checkpoint) from the rest of the recipe, i.e. you decouple functionality from the setup. In the best case this makes it easier to read, and in the worst case it makes things easier to debug.
  • Adding new functionality is easy through composition, and you don't need to modify the signature of every method that needs this component. For example, if I add a new utility for monitoring activations, I can just initialize it as a component in init and store it as a class attribute. I don't have to update the signature of every method in the recipe to be able to use it, which also wouldn't make sense (see the sketch below).
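A sketch of that last point; the activation-monitoring utility is hypothetical:

class FullFinetune:
    def __init__(self, params: FullFinetuneParams) -> None:
        ...
        # new functionality composes in as one more attribute; no other
        # method signatures in the recipe need to change
        self.act_monitor = ActivationMonitor(model=self.model)  # hypothetical utility

    def train(self) -> None:
        for batch in self.dataloader:
            ...
            self.act_monitor.log()  # usable anywhere the recipe's state is visible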

that the class structure will de-facto have users inheriting from it

See my comment to Evan above on this. I don't think this is a valid concern, at least not yet. I can easily clamp down on inheritance in the code base through examples, reviews and user education. Users inherit because lib maintainers allow it.

I heavily prefer script recipes because I think it's much less heavy handed and comfortable to get in and edit

What does this mean? Why is this any harder to edit? It's essentially the same structure.

but instead can let the type of recipes evolve naturally

See the context section of this PR on why this is not a good idea - our current recipe is a mess.

we already have a generation recipe that we might have never made if this structure wasn't here.

This interface is only for LLM fine-tuning. In the next few changes, we can make this concrete.

Finally I feel the structure adds one additional layer of abstraction that the user has to look up and learn to work with a recipe

Again, what does this mean? The structure of the recipe is pretty much the same; the difference comes from how the state is maintained.

@pbontrager (Contributor):

Thank you for breaking out these points, I'll try to give a better description of my reasoning and my thoughts on recipe state.

that the class structure will de-facto have users inheriting from it

I say this just because I don't believe those measures will be enough. I think the affordances of the class and the biases Python developers bring to this will be too hard to fight against.

"I heavily prefer script recipes because I think it's much less heavy handed and comfortable to get in and edit" and "Finally I feel the structure ads one additional layer of abstraction that the user has to lookup and learn to work with a recipe"

It's not harder to edit an existing recipe, but it's harder to write a new recipe from scratch, as you have to figure out the correct place to put your code. With scripts this is more of a suggestion, so it's not as much of a concern. Also, you can freely modify function signatures in scripts.

but instead can let the type of recipes evolve naturally

Part of the disagreement here is over the product goal of our recipes. I don't believe the current recipes are "a mess", but they do need to be further refined and cleaned up. What I believe we've been building here is a code platform that makes it easy to share reproducible and robust PyTorch scripts. To that end, we should let users' imagination take over and let them define what is most useful for PyTorch scripts to cover. Contributions to our library, though, shouldn't be any user script but a rewrite of one of our existing recipes with minimal changes to support the user's recipe. E.g. is there a new feature that we need a utility/loss/optimizer/model for? Is it just a better config? If the recipe diverges dramatically from an existing strategy, then likely our classes wouldn't have supported it to begin with.

This interface is only for LLM fine-tuning. In the next few changes, we can make this concrete.

This inherently means that we'll maintain a small library of recipe types we support and will have to approve new recipe types, instead of letting the community organically determine which recipes we support.

Finally, for the discussion on state: I agree it's very convenient to have an init, and I like this innovation in the proposal. But I think there are other ways we could get these same benefits. As one example, we could have a recipe util that's essentially a smart namespace, similar to the output of argparse. (Just one idea; we could work on the name, e.g. args, state, ...)

def recipe(**kwargs):
    args = get_args(**kwargs)
    args.device = get_device(args.device)
    
    ...
    
    save_checkpoint(args.output_dir, args)

@kartikayk (Contributor Author):

@pbontrager a few thoughts:

It's not harder to edit an existing recipe, but it's harder to write a new recipe from scratch as you have to figure out what's the correct place to put your code

Again, we're proposing the same structure, so I'm not buying this. I think what you're coming down to is: classes are harder to write than scripts.

With scripts this is more of a suggestion so it's not as much of a concern. Also you can freely modify function signatures in scripts.

The interface isn't explicitly forcing a contract, though maybe it should. You're viewing this flexibility as a pro. Given where we are, the value propositions we're claiming to provide (stability and extensibility), and the users we're currently planning on supporting (hobbyists and then hackers), I don't view this flexibility as a pro at all. I'm happy to constrain new use cases and force users to have some design discussions with the core team before they make these changes. As an alpha, I think that's the right expectation to set. I'd rather do a few things and do them well than boil the ocean.

What I believe we've been building here is a code platform to make it easy to share reproducible and robust PyTorch scripts

I think we both agree that recipes is the product and our success will depend on us ensuring users can easily (and reliably) train the LLMs they want to.

What I don't agree with is that users have the best understanding of how to maintain a high bar with regard to all of this. We need to have a maintainable, well-tested and well-documented code base. As a team of 4 people, we will need to scale ourselves to handle user support. I can't imagine opening this up to writing arbitrary scripts is the way that can happen (not that you're saying that either, but I want to set the expectation here). Again, I'd be OK with doing less but doing it well. I'd be OK with constraining user imagination in the early stages of the lib.

Finally for the discussion on state, I agree it's very convenient to have an init and I like this innovation from this proposal. But I think there are other ways we could get this same benefits

Why reinvent a programming concept that is well understood and reasoned about? We have enough things to innovate on :)

@joecummings (Contributor) commented on Jan 29, 2024:

One high-level comment I want to clarify. My understanding is that we would directly support a fairly limited number of very high quality recipes. Anything else, we would encourage forks. Actually a direct quote from Soumith was that we need to "bless" forks, making it easier for people to go wild with how they want to extend our utilities and recipes.

This design seems to bake in the assumption that users will frequently be contributing recipes directly to our repository, which is new to me. To add to this, this model of thinking would heavily increase our load as maintainers, and not necessarily in a way that helps further the community.

Please correct me if I'm not understanding.

@kartikayk (Contributor Author):

@joecummings good question:

My understanding is that we would directly support a fairly limited number of very high quality recipes. Anything else, we would encourage forks

This makes sense AFTER folks buy into the value of this library. It's not a given that you'll get there, i.e. that folks will buy what you sell and start using/contributing code to the repo. The immediate priority is to highlight our design principles, make sure what we provide works well, and ensure users can see a direct line from there to extending capabilities. Let me know if that makes sense.

help="Tensor dtype used for finetuning, lower precision types result in mixed precision training.",
)

group = parser.add_mutually_exclusive_group()
Contributor:

Curious about the rationale here. Personally I haven't seen --resume-from-checkpoint split out into two separate args like this before

Contributor Author:

I just followed what seemed to be the convention for booleans in this list. Should I be doing this differently?

Contributor:

Hmmm yeah I see the shuffle/no-shuffle above as well, tbh I don't love that either. Basically I don't think we should be exposing two separate CLI args for True/False, especially if one of them is not found anywhere else (since --no-resume is just equivalent to setting resume_from_checkpoint=False). In this case, why not just have a single --resume-from-checkpoint arg defaulting to False?

Contributor Author:

@ebsmothers this is also what I was complaining about the other day. But IIUC, @pbontrager and @NicolasHug mentioned it's the only way to set this up. I'll let @RdoubleA clean this up in one of his PRs.
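For reference, a single flag can cover both spellings via argparse.BooleanOptionalAction (Python 3.9+); a sketch of what the cleanup could look like, not necessarily what will land:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--resume-from-checkpoint",
    action=argparse.BooleanOptionalAction,
    default=False,
    help="Resume training from a previously saved checkpoint.",
)
# exposes both --resume-from-checkpoint and --no-resume-from-checkpoint
args = parser.parse_args(["--no-resume-from-checkpoint"])
print(args.resume_from_checkpoint)  # False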

Comment on lines +59 to +72
self.device = utils.get_device(device=params.device)
self.dtype = utils.get_dtype(dtype=params.dtype)
self.seed = utils.set_seed(seed=params.seed)
self.output_dir = params.output_dir

_, rank = utils.get_world_size_and_rank()
self.is_rank_zero = rank == 0

self.model = self._setup_model(model=params.model)
self.tokenizer = self._setup_tokenizer(
tokenizer=params.tokenizer, tokenizer_checkpoint=params.tokenizer_checkpoint
)
self.optimizer = self._setup_optimizer(optimizer=params.optimizer, lr=params.lr)
self.loss_fn = self._setup_loss(loss=params.loss)
Contributor:

One thing that's a bit unclear to me is the split between calling utils functions and FullFinetune methods here. I know utils methods get called in the self.* methods, but personally I feel it's nicer to keep them out of the raw __init__ as much as possible. I think you had a _setup_env method previously; why not bring that back? (Just in this class, not in the interface.)

Contributor Author:

This is a good point. What I was optimizing for was ensuring all of the attributes are set in __init__ instead of in helper functions; otherwise tracking them down becomes hard. The setup_environment function just returned a bunch of attributes and was otherwise a cluster of utility calls. After I removed init_process_group, it wasn't clear why that needed to be a separate function. Does that make sense? Let me know if you disagree.

Contributor:

Yeah, as you point out, idk if there's a 100% perfect way to do this. I'm fine with the approach you have here.


def _setup_model(self, model: str) -> nn.Module:
"""
This assumes FSDP and activation checkpointing are enabled by default.
Contributor:

Do we plan to parametrize this later on?

Contributor Author:

I think we can have a discussion about AC (cc: @rohan-varma), but I'm not sure disabling FSDP makes sense?

Contributor:

I can imagine cases where one would want to run without FSDP (though it doesn't necessarily have to be in this recipe). But this is something we can decouple from this PR anyway.
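If it ever gets parametrized, the gate is small; a sketch in which the flag and the model getter are hypothetical, not part of this PR:

import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def _setup_model(self, model: str, enable_fsdp: bool = True) -> nn.Module:
    m = get_model(model)  # placeholder for however the recipe builds the model
    if enable_fsdp:
        # wrap only when sharded distributed training is wanted
        m = FSDP(m)
    return m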

@@ -127,3 +127,79 @@ def load_checkpoint(
ckpt_dict["optimizer"] = optim_state_dict_to_load

return ckpt_dict


def load_checkpoint_updated(
Contributor:

nit: add a todo here (or file an issue or something) so we actually change this name

should follow. Please note that the interface itself should not be a vehicle for
code reuse. TorchTune strictly prohibits implementation inheritance in the codebase.

TODO: Add link to design principle README
Contributor:

Is this happening as part of this PR?

Contributor Author:

What does "this" refer to here?

Contributor:

Sorry, referring to the TODO. Are we gonna link to design principles?

Contributor Author:

I'll follow up.

Comment on lines +19 to +20
- This interface is meant to help recipe-writers organize their code in a way
which is easy to read, understand and extend.
Contributor:

Should probably also add a link to FullFinetune as an example of how users should implement this interface



def _setup_data(
self, dataset: str, shuffle: bool, batch_size: int
) -> Tuple[DistributedSampler, DataLoader]:
Contributor:

Is this a standard way to handle sampler and dataloader? The return type of this method feels unintuitive.

Contributor Author:

I think this is a constraint of our setup, i.e. we only support DistributedSampler and map-style datasets.

Contributor:

Oh that makes sense, I just mean a return type tuple of sampler + dataloader is a bit weird
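For context, the standard map-style pattern behind that return type (the dataset getter here is a placeholder):

from torch.utils.data import DataLoader, DistributedSampler


def _setup_data(self, dataset: str, shuffle: bool, batch_size: int):
    ds = get_dataset(dataset, tokenizer=self.tokenizer)  # placeholder getter
    # DistributedSampler requires torch.distributed to be initialized
    sampler = DistributedSampler(ds, shuffle=shuffle)
    dataloader = DataLoader(ds, batch_size=batch_size, sampler=sampler)
    # the sampler is returned too so train() can call sampler.set_epoch(epoch)
    return sampler, dataloader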


# if training is in-progress, checkpoint the optimizer state as well
if epoch + 1 < self.total_epochs:
ckpt_dict.update({"optimzer": self.optimizer})
Contributor:

Suggested change
ckpt_dict.update({"optimzer": self.optimizer})
ckpt_dict.update({"optimizer": self.optimizer})

init_process_group(backend="nccl")

recipe = FullFinetune(params=recipe_params)
recipe.load_checkpoint(model_checkpoint=recipe_params.model_checkpoint)
Contributor:

Any particular reason this gets its own line in main, but other pre-training steps do not?

Contributor Author:

Oh, this is because init does all of the instantiation and load_checkpoint explicitly mutates state. In the near future, we need to add functionality for eval as well. What else is missing here?

@ebsmothers (Contributor):

Just wanna confirm based on the test plan: have you done some sanity check on the generations of trained checkpoints before and after the refactor? (Or some other way to build confidence that there are no model quality regressions)

@@ -0,0 +1,43 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
Contributor:

Another quick note: are you planning on housing all recipe dataclasses here? I would vote for a folder called args or params, with each recipe having its own file. So the dataclass for full_finetune would be at recipes/args/full_finetune.py. Otherwise a single file will get very bloated.

Contributor:

The original goal was to keep everything in a single file to make recipes easy to share. Even the configs could be set as defaults to make a shareable version of a file. If the dataclass is kept as a separate file now, that'll require more places to track what parameters are in a recipe.

Contributor Author:

@RdoubleA feel free to suggest the change in your PR. Introducing dataclasses has been on my mind for a while and I finally got around to adding it here. Open to what this looks like longer term.

@pbontrager I think that requirement is already broken by the move to a more structured config system (see Rafi's PR). I'm OK with treating (config, dataclass, recipe class, checkpoint) as the complete source of truth for the recipe.

@ebsmothers (Contributor) left a comment:

Please make sure that open comments get addressed (OK to tackle minor ones in a fast follow). Given that (a) we are not replacing the current recipe in this change, (b) we have reasonable parity documented in the test plan, and (c) this will unblock some other dependencies, I am gonna stamp.

@kartikayk (Contributor Author):

There are some good questions raised in this PR. I'll make sure to put out a detailed doc explaining the rationale behind every change I've made, and we can discuss it in one of the deep-dive sessions. Not going to do that this week given our goal to land everything by EOW.

@kartikayk changed the title from "[WIP] Recipe Redesign" to "Recipe Redesign" on Jan 29, 2024
@kartikayk merged commit ccd6424 into main on Jan 29, 2024 (11 of 15 checks passed)
@kartikayk deleted the r_redesign branch on Jan 29, 2024
@kartikayk mentioned this pull request on Feb 8, 2024