feat: add huggingface post_training impl #2132
Conversation
This is great!
This is very promising overall; once we have it in-tree, we should be able to revive the integration test suite for the post-training API that was blocked because llama consolidated model builds were not available for fetch. #1786
```python
finally:
    # Clean up resources
    if hasattr(trainer, "model"):
        if device.type != "cpu":
```
lines 408-414 are implemented in torchtune. Please move the code to a common function (evacuate_model_from_device?)
created common/utils.py for post_training and clear_model
I am lazy importing torch so we don't hit issues within this method.
I like the name @booxter suggested more than clear_model
sure I can change it
changed
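For reference, a minimal sketch of what the shared helper discussed above could look like; the module path, signature, and body are assumptions based on this thread, not the merged code:

```python
# llama_stack/providers/inline/post_training/common/utils.py (sketch)
import gc


def evacuate_model_from_device(model, device: str) -> None:
    """Move a trained model off the accelerator and release its memory.

    Drops this function's reference to the model; callers should delete
    their own reference as well so gc can actually reclaim it.
    """
    if device == "cpu":
        return
    # Lazy import, per the thread above, so importing this module never
    # requires torch to be installed.
    import torch

    model.to("cpu")
    del model
    gc.collect()
    if device == "cuda":
        torch.cuda.empty_cache()
    elif device == "mps":
        torch.mps.empty_cache()
```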
I've reviewed the code; it looks reasonable enough to merge after some basic cleanup. I don't expect you to fix the blocking problem as part of this PR, but making a note in the code about the problem would be nice. Thanks a lot for enabling (some) integration tests even! I will follow up with enabling other test cases in #1786 once this PR lands. 🚀
One question about proper procedure for model evacuation to CPU.
Thank you for the contribution!
rebased
```python
assert job_artifacts.checkpoints[0].epoch == 0
assert "/.llama/checkpoints/Llama3.2-3B-Instruct-sft-0" in job_artifacts.checkpoints[0].path

while True:
```
we should add a timeout here so the CI action can die appropriately
added a timeout using pytest's pytest-timeout package. This can be applied to other tests in the future as well for a succinct timeout
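A minimal sketch of that approach; the timeout value, fixture names, and status-call shape are illustrative assumptions, not the test's actual code:

```python
import time

import pytest


# Requires the pytest-timeout plugin; the marker aborts the test (and with
# it the CI step) if it runs longer than the given number of seconds.
@pytest.mark.timeout(600)
def test_job_completes(client, job_uuid):
    # Poll until the job finishes; without the marker above, a hung training
    # job would stall this loop (and the CI action) forever.
    while True:
        status = client.post_training.job.status(job_uuid=job_uuid)
        if status.status in ("completed", "failed"):
            break
        time.sleep(5)
    assert status.status == "completed"
```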
rebased
adds an inline HF SFTTrainer provider. Alongside torchtune, this is a super popular option for running training jobs. The config allows a user to specify some key fields such as a model, chat_template, device, etc. The provider comes with one recipe, `finetune_single_device`, which works both with and without LoRA. Any model that is a valid HF identifier can be given, and the model will be pulled. This has been tested so far with CPU and MPS device types, but should be compatible with CUDA out of the box. The provider processes the given dataset into the proper format, establishes the various steps per epoch, steps per save, and steps per eval, sets a sane SFTConfig, and runs n_epochs of training. If checkpoint_dir is none, no model is saved; if there is a checkpoint dir, a model is saved every `save_steps` and at the end of training. Signed-off-by: Charlie Doern <cdoern@redhat.com>
the experimental_post_training template now uses HF post_training and dataset providers Signed-off-by: Charlie Doern <cdoern@redhat.com>
set inline::huggingface as the default post_training provider for the ollama distribution and add integration tests for post_training Signed-off-by: Charlie Doern <cdoern@redhat.com>
currently this impl hangs because of `trainer.train()` blocking. Re-write the implementation to kick off the model download, device instantiation, dataset processing, and training in a monitored subprocess. All of these steps need to be in a subprocess or else different devices are used which causes torch errors. Signed-off-by: Charlie Doern <cdoern@redhat.com>
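A rough sketch of the subprocess pattern this commit describes; the function names and polling interval are illustrative:

```python
import asyncio
import multiprocessing


def run_sft(model_id: str, device_name: str) -> None:
    # Everything torch-related happens in here: model download, device
    # instantiation, dataset processing, and the blocking trainer.train()
    # call. Keeping these steps in one process avoids the mixed-device
    # torch errors the commit message mentions.
    ...


async def launch_and_monitor(model_id: str, device_name: str) -> None:
    proc = multiprocessing.Process(target=run_sft, args=(model_id, device_name))
    proc.start()
    # The parent only polls, so trainer.train() never blocks the event loop.
    while proc.is_alive():
        await asyncio.sleep(1)
    if proc.exitcode != 0:
        raise RuntimeError(f"training subprocess exited with code {proc.exitcode}")
```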
let's go 🚀
What does this PR do?
adds an inline HF SFTTrainer provider. Alongside torchtune, this is a super popular option for running training jobs. The config allows a user to specify some key fields such as a model, chat_template, device, etc.
the provider comes with one recipe, `finetune_single_device`, which works both with and without LoRA. Any model that is a valid HF identifier can be given and the model will be pulled.
this has been tested so far with CPU and MPS device types, but should be compatible with CUDA out of the box
The provider processes the given dataset into the proper format, establishes the various steps per epoch, steps per save, and steps per eval, sets a sane SFTConfig, and runs n_epochs of training (see the sketch below).
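As a rough illustration of that step bookkeeping, here is a sketch of wiring derived step counts into TRL's SFTConfig; the field choices and values are assumptions, not the provider's exact configuration:

```python
from trl import SFTConfig

# Illustrative values; the provider derives these from its config and the dataset.
dataset_size, batch_size, n_epochs = 1_000, 4, 1
steps_per_epoch = max(1, dataset_size // batch_size)

sft_config = SFTConfig(
    output_dir="./sft-output",            # only used when a checkpoint_dir is set
    num_train_epochs=n_epochs,
    per_device_train_batch_size=batch_size,
    save_strategy="steps",
    save_steps=steps_per_epoch,           # checkpoint once per epoch
    report_to="none",                     # no external experiment trackers
)
```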
if checkpoint_dir is none, no model is saved. If there is a checkpoint dir, a model is saved every `save_steps` and at the end of training.
Test Plan
re-enabled the post_training integration test suite with a single test that loads the simpleqa dataset: https://huggingface.co/datasets/llamastack/simpleqa and a tiny granite model: https://huggingface.co/ibm-granite/granite-3.3-2b-instruct. The test now uses the llama stack client and the proper post_training API
runs one step with a batch_size of 1. This test runs on CPU on the Ubuntu runner so it needs to be a small batch and a single step.
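For context, a hedged sketch of what driving the post_training API through the client could look like; the config shapes and values below are assumptions drawn from this description, not copied from the test:

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# All config values below are illustrative; consult the post_training API
# spec for the exact TrainingConfig / AlgorithmConfig shapes.
job = client.post_training.supervised_fine_tune(
    job_uuid="sft-smoke-test",
    model="ibm-granite/granite-3.3-2b-instruct",
    training_config={
        "n_epochs": 1,
        "max_steps_per_epoch": 1,          # one step, per the test plan
        "gradient_accumulation_steps": 1,
        "data_config": {
            "dataset_id": "simpleqa",
            "batch_size": 1,               # small batch for the CPU CI runner
            "shuffle": False,
            "data_format": "instruct",
        },
    },
    algorithm_config=None,                 # full fine-tune; LoRA is also supported
    hyperparam_search_config={},
    logger_config={},
    checkpoint_dir=None,                   # no checkpoint dir -> no model saved
)
```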