Conversation

@cdoern (Contributor) commented May 9, 2025

What does this PR do?

Adds an inline HF SFTTrainer provider. Alongside torchtune, this is a super popular option for running training jobs. The config allows a user to specify some key fields such as a model, chat_template, device, etc.

The provider comes with one recipe, finetune_single_device, which works both with and without LoRA.

Any model that is a valid HF identifier can be given and the model will be pulled.

This has been tested so far with CPU and MPS device types, but it should be compatible with CUDA out of the box.

The provider processes the given dataset into the proper format, establishes the various steps per epoch, steps per save, and steps per eval, sets a sane SFTConfig, and runs n_epochs of training.

If checkpoint_dir is none, no model is saved. If there is a checkpoint dir, a model is saved every save_steps steps and at the end of training.
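
For illustration, the recipe roughly amounts to the following (a sketch, not the provider's exact code; the SFTConfig values and the tiny inline dataset are assumptions for this example):

```python
# Rough sketch of the single-device SFT flow described above -- not the provider's
# exact code. Config values and the inline dataset are illustrative assumptions.
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

model_id = "ibm-granite/granite-3.3-2b-instruct"  # any valid HF identifier gets pulled
checkpoint_dir = "./checkpoints"                  # if None, nothing is saved

# The real provider converts the configured dataset into the trainer's expected format;
# here we just use a tiny in-memory dataset with a "text" column for illustration.
train_dataset = Dataset.from_list(
    [{"text": "Q: What is the capital of France? A: Paris."}]
)

config = SFTConfig(
    output_dir=checkpoint_dir or "/tmp/sft-scratch",
    num_train_epochs=1,                  # n_epochs
    per_device_train_batch_size=1,       # batch_size from the data config
    save_strategy="steps" if checkpoint_dir else "no",
    save_steps=100,                      # steps per save, derived from the dataset size
)

trainer = SFTTrainer(model=model_id, args=config, train_dataset=train_dataset)
trainer.train()

if checkpoint_dir:
    trainer.save_model(checkpoint_dir)   # final save at the end of training
```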

Test Plan

Re-enabled the post_training integration test suite with a single test that loads the simpleqa dataset (https://huggingface.co/datasets/llamastack/simpleqa) and a tiny Granite model (https://huggingface.co/ibm-granite/granite-3.3-2b-instruct). The test now uses the llama stack client and the proper post_training API.

It runs one step with a batch_size of 1. The test runs on CPU on the Ubuntu runner, so it needs a small batch and a single step.
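
Roughly, the test flow looks like this (the client method names and the training_config shape below are assumptions about the client surface, not the exact test code):

```python
# Hedged sketch of the CPU-friendly integration test described above.
# The methods (post_training.supervised_fine_tune, post_training.job.status)
# and the training_config schema are assumptions, not the exact API.
import time
import uuid

def run_tiny_sft_job(client):
    job_uuid = f"test-sft-{uuid.uuid4()}"
    client.post_training.supervised_fine_tune(
        job_uuid=job_uuid,
        model="ibm-granite/granite-3.3-2b-instruct",
        algorithm_config=None,
        checkpoint_dir=None,           # no checkpoints needed for a one-step run
        training_config={
            "n_epochs": 1,
            "max_steps_per_epoch": 1,  # a single step keeps the CPU runner happy
            "data_config": {"dataset_id": "simpleqa", "batch_size": 1, "shuffle": False},
        },
        hyperparam_search_config={},
        logger_config={},
    )
    # Poll until the job reaches a terminal status.
    while True:
        status = client.post_training.job.status(job_uuid=job_uuid)
        if status.status in ("completed", "failed"):
            return status
        time.sleep(5)
```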

@facebook-github-bot added the CLA Signed label May 9, 2025
@cdoern force-pushed the hf branch 2 times, most recently from 42e0c06 to c8fe49c on May 12, 2025 01:24
@cdoern force-pushed the hf branch 3 times, most recently from e323876 to 380b4b2 on May 12, 2025 21:27
@ashwinb (Contributor) commented May 13, 2025

This is great!

@booxter (Contributor) commented May 14, 2025

This is very promising overall; once we have it in-tree, we should be able to revive the integration test suite for the post-training API, which was blocked because llama consolidated model builds were not available for fetch. #1786

@cdoern force-pushed the hf branch 8 times, most recently from c7e8fd4 to 6cd0a75 on May 14, 2025 15:49
finally:
    # Clean up resources
    if hasattr(trainer, "model"):
        if device.type != "cpu":
Contributor:

lines 408-414 are implemented in torchtune. Please move the code to a common function (evacuate_model_from_device?)

Contributor Author:

created common/utils.py for post_training and clear_model

Contributor Author:

I am lazy importing torch so we don't hit issues within this method.

Contributor:

I like the name @booxter suggested more than clear_model

Contributor Author:

sure I can change it

Contributor Author:

changed
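
For reference, a minimal sketch of what such a shared helper could look like, assuming the lazy torch import discussed above (not necessarily the exact code that landed in common/utils.py):

```python
# Hedged sketch of a shared cleanup helper for post_training providers.
# The signature and cache-clearing details are assumptions; the actual
# implementation in common/utils.py may differ.
import gc

def evacuate_model_from_device(model, device: str) -> None:
    """Move the model off an accelerator and release its memory."""
    if device == "cpu":
        return
    # Lazy import so importing this module never requires torch at load time.
    import torch

    model.to("cpu")
    del model
    gc.collect()
    if device == "cuda":
        torch.cuda.empty_cache()
    elif device == "mps":
        torch.mps.empty_cache()
```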

@booxter (Contributor) commented May 14, 2025

I've reviewed the code; it looks reasonable enough to merge after some basic cleanup. I don't expect you to fix the blocking problem as part of this PR, but making a note in the code about it would be nice. Thanks a lot for enabling (some) integration tests! I will follow up with enabling other test cases in #1786 once this PR lands.

🚀

@cdoern changed the title from "[WIP] feat: add huggingface post_training impl" to "feat: add huggingface post_training impl" May 14, 2025
@cdoern marked this pull request as ready for review May 14, 2025 20:02
@cdoern requested a review from ashwinb May 15, 2025 19:59
@cdoern force-pushed the hf branch 3 times, most recently from 787c440 to 9d7b77c on May 15, 2025 21:11
@cdoern requested a review from booxter May 16, 2025 12:42
@booxter (Contributor) commented May 16, 2025

One question about proper procedure for model evacuation to CPU.

@booxter (Contributor) left a comment:

Thank you for the contribution!

@booxter (Contributor) commented May 16, 2025

@ashwinb I think this is ready to go. I'd like to get this in so that we can enable the rest of the tests for the API: #1786 🙏

@cdoern (Contributor Author) commented May 16, 2025

rebased

assert job_artifacts.checkpoints[0].epoch == 0
assert "/.llama/checkpoints/Llama3.2-3B-Instruct-sft-0" in job_artifacts.checkpoints[0].path

while True:
Contributor:

we should add a timeout here so the CI action can die appropriately

@cdoern (Contributor Author) commented May 16, 2025:

Added a timeout using pytest's pytest-timeout package. This can be applied to other tests as well in the future for a succinct timeout.
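
For example (the timeout value and fixture name here are illustrative, not the exact values used in the test):

```python
# Hedged example of pytest-timeout usage; the actual timeout and fixture name may differ.
import pytest

@pytest.mark.timeout(600)  # fail the test (and let the CI job die) after 10 minutes
def test_supervised_fine_tune_single_step(llama_stack_client):
    ...
```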

@cdoern requested a review from ashwinb May 16, 2025 19:16
@cdoern force-pushed the hf branch 3 times, most recently from 9d1739d to c63efdc on May 16, 2025 20:00
@cdoern (Contributor Author) commented May 16, 2025

rebased

cdoern added 3 commits May 16, 2025 16:37
adds an inline HF SFTTrainer provider. Alongside torchtune -- this is a super popular option for running training jobs. The config allows a user to specify some key fields such as a model, chat_template, device, etc

the provider comes with one recipe `finetune_single_device` which works both with and without LoRA.

any model that is a valid HF identifier can be given and the model will be pulled.

this has been tested so far with CPU and MPS device types, but should be compatible with CUDA out of the box

The provider processes the given dataset into the proper format, establishes the various steps per epoch, steps per save, steps per eval, sets a sane SFTConfig, and runs n_epochs of training

if checkpoint_dir is none, no model is saved. If there is a checkpoint dir, a model is saved every `save_steps` and at the end of training.

Signed-off-by: Charlie Doern <cdoern@redhat.com>
the experimental_post_training template now uses HF post_training and dataset providers

Signed-off-by: Charlie Doern <cdoern@redhat.com>
set inline::huggingface as the default post_training provider for the ollama distribution and add integration tests for post_training

Signed-off-by: Charlie Doern <cdoern@redhat.com>
currently this impl hangs because of `trainer.train()` blocking.

Re-write the implementation to kick off the model download, device instantiation, dataset processing, and training in a monitored subprocess.

All of these steps need to be in a subprocess or else different devices are used which causes torch errors.

Signed-off-by: Charlie Doern <cdoern@redhat.com>
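
A minimal sketch of the subprocess pattern this commit describes (the spawn context and queue-based error reporting are assumptions about the shape of the fix, not the exact code):

```python
# Hedged sketch: run the blocking training work in a monitored subprocess so the
# post_training API stays responsive. Details (spawn context, error queue) are
# assumptions about the approach, not the exact implementation.
import asyncio
import multiprocessing

def _run_training(config, error_queue):
    try:
        # Model download, device setup, dataset processing, and trainer.train()
        # all happen here, inside one subprocess, so they share a single device.
        ...
    except Exception as exc:  # report failures back to the parent process
        error_queue.put(str(exc))

async def launch_training(config):
    ctx = multiprocessing.get_context("spawn")
    error_queue = ctx.Queue()
    proc = ctx.Process(target=_run_training, args=(config, error_queue))
    proc.start()
    # Monitor the child without blocking the event loop.
    while proc.is_alive():
        await asyncio.sleep(1)
    proc.join()
    if not error_queue.empty():
        raise RuntimeError(f"Training subprocess failed: {error_queue.get()}")
```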

@ashwinb (Contributor) left a comment:

let's go 🚀

@ashwinb merged commit f02f7b2 into llamastack:main May 16, 2025 (24 checks passed)
@cdoern mentioned this pull request May 28, 2025
@reluctantfuturist mentioned this pull request Jun 3, 2025
Labels: CLA Signed, new-in-tree-provider