
SGLang doc user flow updates #703

Merged · 19 commits · Dec 23, 2024
Conversation

@stbaione (Contributor) commented Dec 16, 2024

  • Expand examples in user docs to cover all supported features
  • Default docs to targeting cluster application
  • General cleanup and removal of unnecessary text
  • Add k8s instructions for shortfin deployment (see the kubectl sketch below)
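
For orientation, the cluster flow these docs target reduces to a few kubectl commands; a minimal sketch, where "llama-app-deployment.yaml" is a placeholder name for the manifest this PR adds:

```bash
# "llama-app-deployment.yaml" is a placeholder for the manifest added by this PR.
kubectl apply -f llama-app-deployment.yaml
kubectl get pods        # wait for the shortfin server pod to reach Running
kubectl get service     # note the service address to point SGLang clients at
```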

Expand examples for targeting shortfin from sglang
@stbaione requested a review from saienduri December 16, 2024 18:46

@kumardeepakamd left a comment

So, we should have steps for generating artifacts before launching shortfin/sglang and the cluster: downloading the GGUF or safetensors file, ingesting it using sharktank, exporting MLIR, compiling a VMFB, and then launching shortfin with those artifacts, in that sequence.
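
For reference, a condensed sketch of that sequence in shell, assuming the GGUF is already downloaded. The sharktank module path and IREE flag spellings are assumptions (they vary across versions); llama_serving.md is the authoritative reference:

```bash
# 1. Ingest the GGUF with sharktank: export MLIR plus the model config.
#    Module path is an assumption based on the sharktank examples.
python -m sharktank.examples.export_paged_llm_v1 \
  --gguf-file=meta-llama-3.1-8b-instruct.f16.gguf \
  --output-mlir=model.mlir \
  --output-config=config.json

# 2. Compile the exported MLIR to a VMFB for the target GPU
#    (gfx942 is an example HIP target; flag names differ across IREE versions).
iree-compile model.mlir \
  --iree-hal-target-backends=rocm \
  --iree-hip-target=gfx942 \
  -o model.vmfb

# 3. Launch shortfin with the generated artifacts (flags mirror the server
#    invocation quoted from the k8s manifest later in this thread).
python -m shortfin_apps.llm.server \
  --tokenizer_json=tokenizer.json \
  --model_config=config.json \
  --vmfb=model.vmfb \
  --parameters=meta-llama-3.1-8b-instruct.f16.gguf \
  --device_ids 0 --device=hip
```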

docs/shortfin/llm/user/e2e_llama8b_k8s.md (outdated, resolved)
@stbaione (Contributor, Author) commented Dec 18, 2024

> So, we should have steps for generating artifacts before launching shortfin/sglang and the cluster: downloading the GGUF or safetensors file, ingesting it using sharktank, exporting MLIR, compiling a VMFB, and then launching shortfin with those artifacts, in that sequence.

@kumardeepakamd Should we have instructions here, or link to the shortfin user docs? If we link to the shortfin user docs, it'll stay consistent as we make updates over there. Otherwise, we'll have to reflect updates in both places.

@kumardeepakamd commented Dec 18, 2024 via email

Whatever is easier to maintain is fine. Your call.

@stbaione added a commit: Add iree-base-runtime, iree-base-compiler and iree-turbine to nightly install instructions

Made links to existing docs for shortfin more visible, updated install instructions on Shortfin 8b user docs

@stbaione marked this pull request as ready for review December 20, 2024 15:43
@stbaione requested a review from ScottTodd December 20, 2024 16:44
@amd-chrissosa (Contributor) left a comment

A few nits to clean up the example code samples.

docs/shortfin/llm/user/e2e_llama8b_k8s.md (outdated, resolved)
Comment on lines 20 to 37
```yaml
# update to your artifacts and change cli flags for instantiation of server to match your intended llama configuration
args:
  - |
    sudo apt update &&
    curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash &&
    sudo apt install git -y &&
    sudo apt install python3.11 python3.11-dev python3.11-venv -y &&
    sudo apt-get install wget -y &&
    python3.11 -m venv shark_venv && source shark_venv/bin/activate &&
    mkdir shark_artifacts &&
    wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_8b/config.json -O shark_artifacts/config.json &&
    wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_8b/meta-llama-3.1-8b-instruct.f16.gguf -O shark_artifacts/meta-llama-3.1-8b-instruct.f16.gguf &&
    wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_8b/model.vmfb -O shark_artifacts/model.vmfb &&
    wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_8b/tokenizer_config.json -O shark_artifacts/tokenizer_config.json &&
    wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_8b/tokenizer.json -O shark_artifacts/tokenizer.json &&
    pip install --pre shortfin[apps] -f https://github.com/nod-ai/shark-ai/releases/expanded_assets/dev-wheels &&
    pip install pandas &&
    python -m shortfin_apps.llm.server --tokenizer_json=shark_artifacts/tokenizer.json --model_config=shark_artifacts/config.json --vmfb=shark_artifacts/model.vmfb --parameters=shark_artifacts/meta-llama-3.1-8b-instruct.f16.gguf --device_ids 0 --device=hip;
```
Member commented:

These files should come from Hugging Face, and we shouldn't use a precompiled .vmfb file; we should use sharktank + iree-compile.

Maybe file an issue for now to generalize this.
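
For the download step, something along these lines; the repo id and filename below are placeholders, not a confirmed source:

```bash
# Placeholder repo id and filename; substitute a Llama 3.1 8B Instruct GGUF
# you actually have access to on Hugging Face.
huggingface-cli download <org>/<llama-3.1-8b-instruct-gguf> \
  meta-llama-3.1-8b-instruct.f16.gguf --local-dir shark_artifacts
```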

@saienduri (Contributor) commented Dec 20, 2024

So here, I don't think we want to put the whole flow in the Kubernetes instantiation. It would require each instance to pull down a 100+ GB GGUF and go through the whole flow, which isn't ideal at scale. Instead, we expect users to follow llama_serving.md and generate artifacts that they push somewhere (NFS, S3, a CSP) and pull down on each instantiation. I can make this clearer in the docs, but that's why I put "configure with your own artifacts" in the docs.
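
A minimal sketch of that split, assuming an S3 bucket (the bucket name is hypothetical; NFS or another CSP store works the same way):

```bash
# One-time, outside the cluster: generate artifacts per llama_serving.md,
# then push them to shared storage.
aws s3 cp shark_artifacts/ s3://my-llama-artifacts/llama3.1_8b/ --recursive

# Per instantiation: each pod pulls the prebuilt artifacts instead of
# regenerating them, then starts the server as in the manifest above.
aws s3 cp s3://my-llama-artifacts/llama3.1_8b/ shark_artifacts/ --recursive
```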

Member commented:

SGTM. We can provide an example, but if we expect people to run this, then it should be really clear what they are expected to fork/edit and what is shared.

```shell
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_8b/model.vmfb -O shark_artifacts/model.vmfb &&
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_8b/tokenizer_config.json -O shark_artifacts/tokenizer_config.json &&
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_8b/tokenizer.json -O shark_artifacts/tokenizer.json &&
pip install --pre shortfin[apps] -f https://github.com/nod-ai/shark-ai/releases/expanded_assets/dev-wheels &&
```
Member commented:

We should point users at stable releases. The file could use nightly releases until we push 3.1.0, though.
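
Roughly the difference; the nightly line matches the snippet above, and the stable line assumes shortfin[apps] is published under that name with 3.1.0 as the next tag:

```bash
# Nightly dev wheels (what the guide uses today):
pip install --pre "shortfin[apps]" -f https://github.com/nod-ai/shark-ai/releases/expanded_assets/dev-wheels

# Stable release, once 3.1.0 ships (assumed package name and version pin):
pip install "shortfin[apps]==3.1.0"
```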

Contributor commented:

Yeah, I can update when we push 3.1.0.

```diff
@@ -0,0 +1,42 @@
+# LLama 8b GPU instructions on Kubernetes
```
Member commented:

I'd also keep this guide general, maybe keep it next to llama_end_to_end.md as llama_serving_on_kubernetes.md, dropping "8B" and "GPU" from the title. Could then also rename llama_end_to_end.md to llama_serving.md? IDK. Naming is hard.

I'm being picky about file names since I want to link to these guides in the release notes, which would make it harder to rename them later without creating 404s.

Contributor commented:

Cool, I think we should go with llama_serving_on_kubernetes.md and llama_serving.md. "End to end" can be confusing as to what it entails (especially with the sglang layer on top).
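
If that's the direction, the renames would be along these lines; paths are assumptions based on the files this PR touches:

```bash
# Paths assumed from this PR's file list; verify before running.
git mv docs/shortfin/llm/user/e2e_llama8b_k8s.md docs/shortfin/llm/user/llama_serving_on_kubernetes.md
git mv docs/shortfin/llm/user/llama_end_to_end.md docs/shortfin/llm/user/llama_serving.md
```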

saienduri and others added 4 commits December 20, 2024 11:25
Co-authored-by: Scott Todd <scott.todd0@gmail.com>
Co-authored-by: Scott Todd <scott.todd0@gmail.com>
…ang`,

Added `Server Options` to `llama_serving.md`, detailing server flags
@saienduri merged commit f5e9cb4 into nod-ai:main Dec 23, 2024
27 of 28 checks passed