
Conversation

@booxter
Contributor

booxter commented Mar 25, 2025

What does this PR do?

Fix and enable post-training integration tests.

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) Mar 25, 2025
Contributor Author

I don't like this; will refactor

Contributor Author

copy-pasta, revisit

@booxter force-pushed the post-training-tests branch from d23fd58 to e6272eb on March 25, 2025 21:51
@booxter
Contributor Author

booxter commented Mar 25, 2025

@ashwinb @raghotham for these tests to work in CI, I'd need to download the model (either via Meta or HF). AFAIU both require some kind of token. My understanding is we'll need to add a GitHub secret for the HF token, then use it to run `llama download` (or an equivalent for the Meta checkpoint). This is probably something that CI owners will have to do. Please advise.

llama_stack/providers/inline/post_training/torchtune/post_training.py:73: in supervised_fine_tune
    recipe = LoraFinetuningSingleDevice(
llama_stack/providers/inline/post_training/torchtune/recipes/lora_finetuning_single_device.py:120: in __init__
    self.checkpoint_dir = model_checkpoint_dir(model_obj)
llama_stack/providers/inline/post_training/torchtune/recipes/lora_finetuning_single_device.py:108: in model_checkpoint_dir
    assert checkpoint_dir.exists(), (
E   AssertionError: Could not find checkpoints in: /home/runner/.llama/checkpoints/Llama3.2-3B-Instruct. Please download model using `llama download --model-id Llama3.2-3B-Instruct`
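
For concreteness, the kind of CI step I have in mind would look roughly like this (a sketch only; the secret name and the --source huggingface flag are assumptions for the CI owners to confirm, not a finished proposal):

- name: Download training model
  env:
    HF_TOKEN: ${{ secrets.HF_TOKEN }} # hypothetical repo secret, to be added by CI owners
  run: uv run llama download --source huggingface --model-id Llama3.2-3B-Instruct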

@courtneypacheco
Contributor

courtneypacheco commented Mar 26, 2025

@booxter If you went the HuggingFace token route, how would you plan to leverage a HuggingFace token in the workflow file you're editing?

If you need HF_TOKEN accessible in the integration-tests.yml file as an env var, just be aware there's a potential security concern given the current structure of that workflow file. Whether it becomes an actual security concern mostly depends on how you plan to use HF_TOKEN. :)

If you want the maintainers to add HF_TOKEN to their repo/org secrets as ${{ secrets.HF_TOKEN }} so your integration test can pull models from HuggingFace, the maintainers should first evaluate whether they need to update the workflow's trigger conditions. Right now, the integration-tests.yml workflow triggers automatically on PR builds for non-first-time contributors. See here:
https://github.com/meta-llama/llama-stack/blob/934de0a28126c187eb09dbdd5833db792ec6610f/.github/workflows/integration-tests.yml#L6-L7

I've copy-pasted the exact trigger condition below:

  pull_request:
    branches: [ main ]

To illustrate why these trigger conditions could cause an issue in this situation: let's say you leave this file's trigger conditions alone, but you add an innocent-looking workflow step like this...

- name: Run integration tests
  env:
    HF_TOKEN: "${{ secrets.HF_TOKEN }}" #<------ exposed secret here
    INFERENCE_MODEL: "meta-llama/Llama-3.2-3B-Instruct"
  run: uv run pytest -v tests/integration/${{ matrix.test-type }} --stack-config=experimental-post-training --text-model="meta-llama/Llama-3.2-3B-Instruct" --embedding-model=all-MiniLM-L6-v2
  if: matrix.test-type == 'post_training'

Next, let's say your PR is approved and merged into this repo's main (default) branch. Once that change is in main, a bad actor can create a PR against main that manipulates one of the existing Python integration tests invoked by your uv run pytest command in order to exfiltrate the maintainers' Hugging Face token, or do something else malicious with it, as long as that workflow is still automatically triggering on PR builds. Heck, someone could even do this from a very old PR, as long as they rebase against main to pick up these changes.

So if you do need HF_TOKEN set as an env var, you either need to structure your changes so that a bad actor cannot perform that sort of manipulation, *or* the maintainers should make sure this workflow only runs on PRs after manual approval so they can catch attempts to manipulate the code. Typically, the latter approach is easier (and safer), but the downside is that PRs could pile up waiting for manual approval.
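
For example, one way to implement the manual-approval gate (a sketch only, assuming the maintainers create a GitHub environment with required reviewers; the name integration-secrets is hypothetical):

jobs:
  post-training:
    # A job tied to an environment with required reviewers pauses until a
    # maintainer approves, so secrets are only released to reviewed code.
    environment: integration-secrets
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run integration tests
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }} # only available after approval
        run: uv run pytest -v tests/integration/post_training --stack-config=experimental-post-training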

Technically, a third option would be to create a new workflow file for these integration tests you're proposing and leave the existing integration-tests.yml workflow alone. Then we can continue to let integration-tests.yml auto-trigger on PR builds without security concerns, while your new workflow file can be one that maintainers manually trigger if/when desired.
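
A minimal sketch of what that separate workflow could look like (file name, workflow name, and job layout are illustrative assumptions; the pytest command is the one from your step above):

name: Post-training Integration Tests

on:
  workflow_dispatch: # manual trigger only; never runs automatically on PR builds

jobs:
  post-training:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run post-training tests
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
          INFERENCE_MODEL: "meta-llama/Llama-3.2-3B-Instruct"
        run: uv run pytest -v tests/integration/post_training --stack-config=experimental-post-training --text-model="meta-llama/Llama-3.2-3B-Instruct" --embedding-model=all-MiniLM-L6-v2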

WDYT?

@booxter
Contributor Author

booxter commented Mar 26, 2025

I think ultimately, we should run our tests against models that are not gated on HuggingFace. If the Llama models are gated, then we may instead run tests using some other small model like Granite. The problem is that, as of right now, the inline torchtune provider doesn't support Granite. (I think even torchtune itself lacks some model-specific changes needed to enable Granite tuning.) Until this is fixed, it seems we may not actually be able to enable these tests.

For the record, I tried to hack something together for Granite here but got stuck: main...booxter:llama-stack:post-training-granite, mostly because I am not well versed in tuning.

@booxter
Contributor Author

booxter commented Apr 28, 2025

@raghotham @ashwinb is there any guidance from Meta on how to pull Llama models in a CI environment without tokens? Is it possible / desirable?

@booxter
Contributor Author

booxter commented May 16, 2025

failure in pre-commit?

Error processing together: deadlock detected by _ModuleLock('llama_stack.providers.remote.inference.together.config') at 140349475893456
Traceback (most recent call last):
  File "/home/runner/work/llama-stack/llama-stack/./scripts/distro_codegen.py", line 159, in <module>
    main()
  File "/home/runner/work/llama-stack/llama-stack/./scripts/distro_codegen.py", line 143, in main
    list(executor.map(process_func, template_dirs))
  File "/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/concurrent/futures/_base.py", line 619, in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/concurrent/futures/_base.py", line 317, in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/llama-stack/llama-stack/./scripts/distro_codegen.py", line 76, in process_template
    raise e
  File "/home/runner/work/llama-stack/llama-stack/./scripts/distro_codegen.py", line 58, in process_template
    module = importlib.import_module(module_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1204, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/runner/work/llama-stack/llama-stack/llama_stack/templates/together/__init__.py", line 7, in <module>
    from .together import get_distribution_template  # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/llama-stack/llama-stack/llama_stack/templates/together/together.py", line 20, in <module>
    from llama_stack.providers.remote.inference.together import TogetherImplConfig
  File "/home/runner/work/llama-stack/llama-stack/llama_stack/providers/remote/inference/together/__init__.py", line 9, in <module>
    from .config import TogetherImplConfig
  File "<frozen importlib._bootstrap>", line 1173, in _find_and_load
  File "<frozen importlib._bootstrap>", line 171, in __enter__
  File "<frozen importlib._bootstrap>", line 116, in acquire
_frozen_importlib._DeadlockError: deadlock detected by _ModuleLock('llama_stack.providers.remote.inference.together.config') at 140349475893456

hm...

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
@booxter force-pushed the post-training-tests branch from 667a05f to 35dcfff on May 18, 2025 22:08
@booxter
Contributor Author

booxter commented May 18, 2025

The disk is overflowing because we download the inference models and the training model and write checkpoints... Need to clean up the runner, or avoid downloading models for inference.
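
One common workaround (a sketch, not something this PR settles on) is to delete the large preinstalled toolchains on the hosted runner before downloading any models:

- name: Free disk space on runner
  run: |
    # GitHub-hosted runners ship large toolchains these tests don't need
    sudo rm -rf /usr/share/dotnet /usr/local/lib/android /opt/ghc
    df -h # log remaining space for debugging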

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
@booxter force-pushed the post-training-tests branch 4 times, most recently from 41621de to 62ba825 on May 20, 2025 17:22
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
@booxter force-pushed the post-training-tests branch from 62ba825 to 01fdae0 on May 20, 2025 17:30
@booxter closed this Jun 23, 2025