
Conversation

@booxter
Contributor

booxter commented Mar 25, 2025

What does this PR do?

Fix and enable post-training integration tests.

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) Mar 25, 2025
Contributor Author

I don't like this; will refactor

Contributor Author

copy-pasta, revisit

@booxter force-pushed the post-training-tests branch from d23fd58 to e6272eb on March 25, 2025 21:51
@booxter
Contributor Author

booxter commented Mar 25, 2025

@ashwinb @raghotham for these tests to work in CI, I'd need to download the model (either via Meta or HF). AFAIU both require some kind of token. My understanding is we'll need to add a GitHub secret for the HF token, then use it to run `llama download` (or an equivalent for the Meta checkpoint). This is probably something that CI owners will have to do. Please advise.

llama_stack/providers/inline/post_training/torchtune/post_training.py:73: in supervised_fine_tune
    recipe = LoraFinetuningSingleDevice(
llama_stack/providers/inline/post_training/torchtune/recipes/lora_finetuning_single_device.py:120: in __init__
    self.checkpoint_dir = model_checkpoint_dir(model_obj)
llama_stack/providers/inline/post_training/torchtune/recipes/lora_finetuning_single_device.py:108: in model_checkpoint_dir
    assert checkpoint_dir.exists(), (
E   AssertionError: Could not find checkpoints in: /home/runner/.llama/checkpoints/Llama3.2-3B-Instruct. Please download model using `llama download --model-id Llama3.2-3B-Instruct`
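
For concreteness, the kind of CI step I have in mind would look roughly like this (a sketch only; the secret name and the --source huggingface flag are assumptions for the CI owners to confirm, not a finished proposal):

- name: Download training model
  env:
    HF_TOKEN: ${{ secrets.HF_TOKEN }} # hypothetical repo secret, to be added by CI owners
  run: uv run llama download --source huggingface --model-id Llama3.2-3B-Instruct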

@courtneypacheco
Contributor

courtneypacheco commented Mar 26, 2025

@booxter If you went the HuggingFace token route, how would you plan to leverage a HuggingFace token in the workflow file you're editing?

If you need HF_TOKEN accessible in the integration-tests.yml file as an env var, just be aware there's a potential security concern given the current structure of that workflow file. Whether it becomes an actual security concern mostly depends on how you plan to use HF_TOKEN. :)

If you want the maintainers to add HF_TOKEN to their repo/org secrets as ${{ secrets.HF_TOKEN }} so your integration test can pull models from HuggingFace, the maintainers should first evaluate whether they need to update the workflow's trigger conditions. Right now, the integration-tests.yml workflow triggers automatically on PR builds for non-first-time contributors. See here:
https://github.com/meta-llama/llama-stack/blob/934de0a28126c187eb09dbdd5833db792ec6610f/.github/workflows/integration-tests.yml#L6-L7

I've copy-pasted the exact trigger condition below:

  pull_request:
    branches: [ main ]

To illustrate why these trigger conditions could cause an issue in this situation: let's say you leave this file's trigger conditions alone, but you add an innocent-looking workflow step like this...

- name: Run integration tests
  env:
    HF_TOKEN: "${{ secrets.HF_TOKEN }}" #<------ exposed secret here
    INFERENCE_MODEL: "meta-llama/Llama-3.2-3B-Instruct"
  run: uv run pytest -v tests/integration/${{ matrix.test-type }} --stack-config=experimental-post-training --text-model="meta-llama/Llama-3.2-3B-Instruct" --embedding-model=all-MiniLM-L6-v2
  if: matrix.test-type == 'post_training'

Next, let's say your PR is approved and merged into this repo's main (default) branch. Once that change is in main, a bad actor can create a PR against main that manipulates one of the existing Python integration tests invoked by your uv run pytest command in order to exfiltrate the maintainers' Hugging Face token, or do something else malicious with it, as long as that workflow is still automatically triggering on PR builds. Heck, someone could even do this from a very old PR, as long as they rebase against main to pick up these changes.

So if you do need HF_TOKEN set as an env var, you either need to structure your changes so that a bad actor cannot perform that sort of manipulation, *or* the maintainers should make sure this workflow only runs on PRs after manual approval so they can catch attempts to manipulate the code. Typically, the latter approach is easier (and safer), but the downside is that PRs could pile up waiting for manual approval.
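
For example, one way to implement the manual-approval gate (a sketch only, assuming the maintainers create a GitHub environment with required reviewers; the name integration-secrets is hypothetical):

jobs:
  post-training:
    # A job tied to an environment with required reviewers pauses until a
    # maintainer approves, so secrets are only released to reviewed code.
    environment: integration-secrets
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run integration tests
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }} # only available after approval
        run: uv run pytest -v tests/integration/post_training --stack-config=experimental-post-training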

Technically, a third option would be to create a new workflow file for these integration tests you're proposing and leave the existing integration-tests.yml workflow alone. Then we can continue to let integration-tests.yml auto-trigger on PR builds without security concerns, while your new workflow file can be one that maintainers manually trigger if/when desired.
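
A minimal sketch of what that separate workflow could look like (file name, workflow name, and job layout are illustrative assumptions; the pytest command is the one from your step above):

name: Post-training Integration Tests

on:
  workflow_dispatch: # manual trigger only; never runs automatically on PR builds

jobs:
  post-training:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run post-training tests
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
          INFERENCE_MODEL: "meta-llama/Llama-3.2-3B-Instruct"
        run: uv run pytest -v tests/integration/post_training --stack-config=experimental-post-training --text-model="meta-llama/Llama-3.2-3B-Instruct" --embedding-model=all-MiniLM-L6-v2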

WDYT?

@booxter
Contributor Author

booxter commented Mar 26, 2025

I think ultimately, we should run our tests against models that are not gated on HuggingFace. If the Llama models are gated, then we may instead run tests using some other small model like Granite. The problem is that, as of right now, the inline torchtune provider doesn't support Granite. (I think even torchtune itself lacks some model-specific changes needed to enable Granite tuning.) Until this is fixed, it seems we may not actually be able to enable these tests.

For the record, I tried to hack something together for Granite here but got stuck: main...booxter:llama-stack:post-training-granite, mostly because I am not well versed in tuning.

@booxter
Contributor Author

booxter commented Apr 28, 2025

@raghotham @ashwinb is there any guidance from Meta on how to pull Llama models in a CI environment without tokens? Is it possible / desirable?

@booxter
Contributor Author

booxter commented May 16, 2025

failure in pre-commit?

Error processing together: deadlock detected by _ModuleLock('llama_stack.providers.remote.inference.together.config') at 140349475893456
Traceback (most recent call last):
  File "/home/runner/work/llama-stack/llama-stack/./scripts/distro_codegen.py", line 159, in <module>
    main()
  File "/home/runner/work/llama-stack/llama-stack/./scripts/distro_codegen.py", line 143, in main
    list(executor.map(process_func, template_dirs))
  File "/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/concurrent/futures/_base.py", line 619, in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/concurrent/futures/_base.py", line 317, in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/llama-stack/llama-stack/./scripts/distro_codegen.py", line 76, in process_template
    raise e
  File "/home/runner/work/llama-stack/llama-stack/./scripts/distro_codegen.py", line 58, in process_template
    module = importlib.import_module(module_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1204, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/runner/work/llama-stack/llama-stack/llama_stack/templates/together/__init__.py", line 7, in <module>
    from .together import get_distribution_template  # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/llama-stack/llama-stack/llama_stack/templates/together/together.py", line 20, in <module>
    from llama_stack.providers.remote.inference.together import TogetherImplConfig
  File "/home/runner/work/llama-stack/llama-stack/llama_stack/providers/remote/inference/together/__init__.py", line 9, in <module>
    from .config import TogetherImplConfig
  File "<frozen importlib._bootstrap>", line 1173, in _find_and_load
  File "<frozen importlib._bootstrap>", line 171, in __enter__
  File "<frozen importlib._bootstrap>", line 116, in acquire
_frozen_importlib._DeadlockError: deadlock detected by _ModuleLock('llama_stack.providers.remote.inference.together.config') at 140349475893456

hm...

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
@booxter force-pushed the post-training-tests branch from 667a05f to 35dcfff on May 18, 2025 22:08
@booxter
Contributor Author

booxter commented May 18, 2025

The disk is overflowing because we download the inference models and the training model and write checkpoints... Need to clean up the runner, or avoid downloading models for inference.
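
One common workaround (a sketch, not something this PR settles on) is to delete the large preinstalled toolchains on the hosted runner before downloading any models:

- name: Free disk space on runner
  run: |
    # GitHub-hosted runners ship large toolchains these tests don't need
    sudo rm -rf /usr/share/dotnet /usr/local/lib/android /opt/ghc
    df -h # log remaining space for debugging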

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
@booxter force-pushed the post-training-tests branch 4 times, most recently from 41621de to 62ba825 on May 20, 2025 17:22
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
@booxter force-pushed the post-training-tests branch from 62ba825 to 01fdae0 on May 20, 2025 17:30
@booxter closed this Jun 23, 2025