Adds a Hugging Face distributed LLM fine tuning CPU workflow with k8s #98
Conversation
Dependency Review
The following issues were found:
- License issues: `workflows/charts/huggingface-llm/requirements.txt`

Scanned manifest files: `workflows/charts/huggingface-llm/requirements.txt`
tylertitsworth left a comment:
You need:
- a `.actions.json` file to enable build CI
- a `tests.yaml` file for container tests (see the sketch below)
- to run pre-commit over your code
- to add yourself to `CODEOWNERS` under `workflows/training`
- to fix any lint issues flagged
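For example, a container test entry might look something like the sketch below; the key names here are hypothetical, since the actual `tests.yaml` schema is defined by this repo's test runner:

```yaml
# Hypothetical sketch only -- the real tests.yaml schema is defined by this
# repo's test runner, so treat these key names as placeholders.
llm-finetune-import-check:
  img: intel/ai-workflows:torch-2.2.0-huggingface-multinode-py3.10
  cmd: python -c 'import torch, transformers'
```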
Do you want this README to be included in our repo website or uploaded to Docker Hub under intel/ai-workflows?
If you want to add configs now, we can do that, or it can go in a future PR since you want to update the docs later.
I think not on Docker Hub for now, because there are other container tags at intel/ai-workflows that we don't have in this table.
Not sure about the repo website; I will check with Ebi. If we do want to add it, we can do a follow-up PR.
tylertitsworth left a comment:
LGTM
Description
This LLM fine-tuning workflow was originally published in the TLT repository, but it doesn't use the TLT API/CLI (it uses PyTorch/Hugging Face code). The workflow includes a Dockerfile, a Helm chart, and a few different Helm values files covering different LLM fine-tuning use cases:
1. fine-tuning a financial chatbot with a dataset loaded from a file
2. instruction tuning with a medical dataset from the Hugging Face Hub
3. a values file intended to be a template for someone who wants to fine-tune an LLM with their own dataset/model (see the sketch below)
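As an illustration of that template, a custom values file might look roughly like this; the key names here are hypothetical, since the real ones are defined by the chart under `workflows/charts/huggingface-llm`:

```yaml
# Hypothetical values sketch for use case 3 (your own model/dataset);
# the real key names are defined by the chart's values.yaml.
modelNameOrPath: your-org/your-model      # any Hugging Face model ID or local path
dataFile: /workspace/data/dataset.json    # dataset loaded from a mounted file
workers: 4                                # number of distributed CPU workers
resources:
  cpuPerWorker: 32
  memoryPerWorker: 128Gi
```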
The Docker image is already published at `intel/ai-workflows:torch-2.2.0-huggingface-multinode-py3.10`. I've also tested this with 2.3 by building the PyTorch multinode base from the `main` branch, then building the LLM workflow container with the updated 2.3 base. I had to add extra `ENV` vars to the PyTorch multinode base in order for the distributed workflow to work in k8s for 2.3. These env vars would typically be set by the Torch CCL setvars.sh file, but those don't get applied in k8s, so they need to be set as `ENV` vars in the Dockerfile.
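For illustration, these are the kinds of variables setvars.sh would normally export, shown here as the equivalent pod-spec entries; the names and paths are illustrative rather than the exact set added to the Dockerfile:

```yaml
# Illustrative only: variables that oneCCL's setvars.sh would normally export,
# shown as they would appear in the rendered pod spec. The exact names and
# values baked into the Dockerfile may differ.
env:
  - name: CCL_ROOT
    value: /usr/local/lib/python3.10/dist-packages/oneccl_bindings_for_pytorch
  - name: LD_LIBRARY_PATH
    value: /usr/local/lib/python3.10/dist-packages/oneccl_bindings_for_pytorch/lib
```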
The test loops to check for an `eval_results.json` file in the mounted persistent volume claim, which would indicate that the training and evaluation have both completed.

Changes Made
- `docker-compose.yaml` file

Validation
The Helm chart can be tested using the `tests/distilgpt2_values.yaml` file, which fine-tunes distilgpt2 using the databricks-dolly-15k dataset for 5 steps and then evaluates the trained model with a subset of the dataset (see the sketch below).

I have run `test_runner.py` with all existing tests passing, and I have added new tests where applicable.
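For reference, a rough sketch of what that test values file might contain; the key names are hypothetical, while the model, dataset, and step count come from the description above:

```yaml
# Hypothetical sketch of tests/distilgpt2_values.yaml; the real key names are
# defined by the chart. The model, dataset, and step count below come from
# the PR description.
modelNameOrPath: distilgpt2
datasetName: databricks/databricks-dolly-15k
maxSteps: 5        # fine-tune for only 5 steps so the test finishes quickly
doEval: true       # then evaluate the trained model on a subset of the dataset
```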