Adds a Hugging Face distributed LLM fine tuning CPU workflow with k8s #98
Conversation
Dependency Review
The following issues were found:
- License issues: `workflows/charts/huggingface-llm/requirements.txt`

Scanned manifest files: `workflows/charts/huggingface-llm/requirements.txt`
tylertitsworth left a comment:
You need:
- a `.actions.json` file to enable build CI
- a `tests.yaml` file for container tests (see the sketch below)
- to run pre-commit over your code
- to add yourself to `CODEOWNERS` under `workflows/training`
- to fix any lint issues flagged
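For example, a container test entry might look something like the sketch below; the key names here are hypothetical, since the actual `tests.yaml` schema is defined by this repo's test runner:

```yaml
# Hypothetical sketch only -- the real tests.yaml schema is defined by this
# repo's test runner, so treat these key names as placeholders.
llm-finetune-import-check:
  img: intel/ai-workflows:torch-2.2.0-huggingface-multinode-py3.10
  cmd: python -c 'import torch, transformers'
```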
Do you want this README to be included in our repo website or uploaded to Docker Hub under intel/ai-workflows?
If you want to add configs now, we can do that, or it can go in a future PR since you want to update the docs later.
I think not on Docker Hub for now, because there are other container tags at intel/ai-workflows that we don't have in this table.
Not sure about the repo website; I will check with Ebi. If we do want to add it, we can do a follow-up PR.
tylertitsworth left a comment:
LGTM
Description
This LLM fine-tuning workflow was originally published in the TLT repository, but it doesn't use the TLT API/CLI (it uses PyTorch/Hugging Face code). The workflow includes a Dockerfile, a Helm chart, and a few different Helm values files covering different LLM fine-tuning use cases:
1. fine-tuning a financial chatbot with a dataset loaded from a file
2. instruction tuning with a medical dataset from the Hugging Face Hub
3. a values file intended to be a template for someone who wants to fine-tune an LLM with their own dataset/model (see the sketch below)
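As an illustration of that template, a custom values file might look roughly like this; the key names here are hypothetical, since the real ones are defined by the chart under `workflows/charts/huggingface-llm`:

```yaml
# Hypothetical values sketch for use case 3 (your own model/dataset);
# the real key names are defined by the chart's values.yaml.
modelNameOrPath: your-org/your-model      # any Hugging Face model ID or local path
dataFile: /workspace/data/dataset.json    # dataset loaded from a mounted file
workers: 4                                # number of distributed CPU workers
resources:
  cpuPerWorker: 32
  memoryPerWorker: 128Gi
```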
The Docker image is already published at `intel/ai-workflows:torch-2.2.0-huggingface-multinode-py3.10`. I've also tested this with 2.3 by building the PyTorch multinode base from the `main` branch, then building the LLM workflow container with the updated 2.3 base. I had to add extra `ENV` vars to the PyTorch multinode base in order for the distributed workflow to work in k8s for 2.3. These env vars would typically be set by the Torch CCL setvars.sh file, but those don't get applied in k8s, so they need to be set as `ENV` vars in the Dockerfile.
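For illustration, these are the kinds of variables setvars.sh would normally export, shown here as the equivalent pod-spec entries; the names and paths are illustrative rather than the exact set added to the Dockerfile:

```yaml
# Illustrative only: variables that oneCCL's setvars.sh would normally export,
# shown as they would appear in the rendered pod spec. The exact names and
# values baked into the Dockerfile may differ.
env:
  - name: CCL_ROOT
    value: /usr/local/lib/python3.10/dist-packages/oneccl_bindings_for_pytorch
  - name: LD_LIBRARY_PATH
    value: /usr/local/lib/python3.10/dist-packages/oneccl_bindings_for_pytorch/lib
```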
The test loops to check for an `eval_results.json` file in the mounted persistent volume claim, which would indicate that the training and evaluation have both completed.

Changes Made
- `docker-compose.yaml` file

Validation
The Helm chart can be tested using the `tests/distilgpt2_values.yaml` file, which fine-tunes distilgpt2 using the databricks-dolly-15k dataset for 5 steps and then evaluates the trained model with a subset of the dataset (see the sketch below).

I have run `test_runner.py` with all existing tests passing, and I have added new tests where applicable.
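For reference, a rough sketch of what that test values file might contain; the key names are hypothetical, while the model, dataset, and step count come from the description above:

```yaml
# Hypothetical sketch of tests/distilgpt2_values.yaml; the real key names are
# defined by the chart. The model, dataset, and step count below come from
# the PR description.
modelNameOrPath: distilgpt2
datasetName: databricks/databricks-dolly-15k
maxSteps: 5        # fine-tune for only 5 steps so the test finishes quickly
doEval: true       # then evaluate the trained model on a subset of the dataset
```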