
SGLang doc user flow updates #703

Merged · 19 commits · Dec 23, 2024
Conversation

@stbaione (Contributor) commented Dec 16, 2024

  • Expand examples in user docs to cover all supported features
  • Default docs to targeting cluster application
  • General cleanup and removal of unnecessary text
  • Add k8s instructions for shortfin deployment (see the kubectl sketch below)
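
For orientation, the cluster flow these docs target reduces to a few kubectl commands; a minimal sketch, where "llama-app-deployment.yaml" is a placeholder name for the manifest this PR adds:

```bash
# "llama-app-deployment.yaml" is a placeholder for the manifest added by this PR.
kubectl apply -f llama-app-deployment.yaml
kubectl get pods        # wait for the shortfin server pod to reach Running
kubectl get service     # note the service address to point SGLang clients at
```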

Expand examples for targeting shortfin from sglang
@stbaione requested a review from saienduri December 16, 2024 18:46

@kumardeepakamd left a comment

So, we should have steps for generating artifacts before launching shortfin/sglang and the cluster: downloading the GGUF or safetensors file, ingesting it using sharktank, exporting MLIR, compiling a VMFB, and then launching shortfin with those artifacts, in that sequence.
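
For reference, a condensed sketch of that sequence in shell, assuming the GGUF is already downloaded. The sharktank module path and IREE flag spellings are assumptions (they vary across versions); llama_serving.md is the authoritative reference:

```bash
# 1. Ingest the GGUF with sharktank: export MLIR plus the model config.
#    Module path is an assumption based on the sharktank examples.
python -m sharktank.examples.export_paged_llm_v1 \
  --gguf-file=meta-llama-3.1-8b-instruct.f16.gguf \
  --output-mlir=model.mlir \
  --output-config=config.json

# 2. Compile the exported MLIR to a VMFB for the target GPU
#    (gfx942 is an example HIP target; flag names differ across IREE versions).
iree-compile model.mlir \
  --iree-hal-target-backends=rocm \
  --iree-hip-target=gfx942 \
  -o model.vmfb

# 3. Launch shortfin with the generated artifacts (flags mirror the server
#    invocation quoted from the k8s manifest later in this thread).
python -m shortfin_apps.llm.server \
  --tokenizer_json=tokenizer.json \
  --model_config=config.json \
  --vmfb=model.vmfb \
  --parameters=meta-llama-3.1-8b-instruct.f16.gguf \
  --device_ids 0 --device=hip
```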

docs/shortfin/llm/user/e2e_llama8b_k8s.md (outdated, resolved)
@stbaione (Contributor, Author) commented Dec 18, 2024

> So, we should have steps for generating artifacts before launching shortfin/sglang and the cluster: downloading the GGUF or safetensors file, ingesting it using sharktank, exporting MLIR, compiling a VMFB, and then launching shortfin with those artifacts, in that sequence.

@kumardeepakamd Should we have instructions here, or link to the shortfin user docs? If we link to the shortfin user docs, it'll stay consistent as we make updates over there. Otherwise, we'll have to reflect updates in both places.

@kumardeepakamd commented Dec 18, 2024 via email

Whatever is easier to maintain is fine. Your call.

@stbaione added a commit: Add iree-base-runtime, iree-base-compiler and iree-turbine to nightly install instructions

Made links to existing docs for shortfin more visible, updated install instructions on Shortfin 8b user docs

@stbaione marked this pull request as ready for review December 20, 2024 15:43
@stbaione requested a review from ScottTodd December 20, 2024 16:44
@amd-chrissosa (Contributor) left a comment

A few nits to clean up the example code samples.

docs/shortfin/llm/user/e2e_llama8b_k8s.md (outdated, resolved)
Comment on lines 20 to 37
```yaml
# update to your artifacts and change cli flags for instantiation of server to match your intended llama configuration
args:
  - |
    sudo apt update &&
    curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash &&
    sudo apt install git -y &&
    sudo apt install python3.11 python3.11-dev python3.11-venv -y &&
    sudo apt-get install wget -y &&
    python3.11 -m venv shark_venv && source shark_venv/bin/activate &&
    mkdir shark_artifacts &&
    wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_8b/config.json -O shark_artifacts/config.json &&
    wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_8b/meta-llama-3.1-8b-instruct.f16.gguf -O shark_artifacts/meta-llama-3.1-8b-instruct.f16.gguf &&
    wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_8b/model.vmfb -O shark_artifacts/model.vmfb &&
    wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_8b/tokenizer_config.json -O shark_artifacts/tokenizer_config.json &&
    wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_8b/tokenizer.json -O shark_artifacts/tokenizer.json &&
    pip install --pre shortfin[apps] -f https://github.com/nod-ai/shark-ai/releases/expanded_assets/dev-wheels &&
    pip install pandas &&
    python -m shortfin_apps.llm.server --tokenizer_json=shark_artifacts/tokenizer.json --model_config=shark_artifacts/config.json --vmfb=shark_artifacts/model.vmfb --parameters=shark_artifacts/meta-llama-3.1-8b-instruct.f16.gguf --device_ids 0 --device=hip;
```
Member commented:

These files should come from Hugging Face, and we shouldn't use a precompiled .vmfb file; we should use sharktank + iree-compile.

Maybe file an issue for now to generalize this.
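
For the download step, something along these lines; the repo id and filename below are placeholders, not a confirmed source:

```bash
# Placeholder repo id and filename; substitute a Llama 3.1 8B Instruct GGUF
# you actually have access to on Hugging Face.
huggingface-cli download <org>/<llama-3.1-8b-instruct-gguf> \
  meta-llama-3.1-8b-instruct.f16.gguf --local-dir shark_artifacts
```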

@saienduri (Contributor) commented Dec 20, 2024

So here, I don't think we want to put the whole flow in the Kubernetes instantiation. It would require each instance to pull down a 100+ GB GGUF and go through the whole flow, which isn't ideal at scale. Instead, we expect users to follow llama_serving.md and generate artifacts that they push somewhere (NFS, S3, a CSP) and pull down on each instantiation. I can make this clearer in the docs, but that's why I put "configure with your own artifacts" in the docs.
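
A minimal sketch of that split, assuming an S3 bucket (the bucket name is hypothetical; NFS or another CSP store works the same way):

```bash
# One-time, outside the cluster: generate artifacts per llama_serving.md,
# then push them to shared storage.
aws s3 cp shark_artifacts/ s3://my-llama-artifacts/llama3.1_8b/ --recursive

# Per instantiation: each pod pulls the prebuilt artifacts instead of
# regenerating them, then starts the server as in the manifest above.
aws s3 cp s3://my-llama-artifacts/llama3.1_8b/ shark_artifacts/ --recursive
```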

Member commented:

SGTM. We can provide an example, but if we expect people to run this, then it should be really clear what they are expected to fork/edit and what is shared.

```shell
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_8b/model.vmfb -O shark_artifacts/model.vmfb &&
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_8b/tokenizer_config.json -O shark_artifacts/tokenizer_config.json &&
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_8b/tokenizer.json -O shark_artifacts/tokenizer.json &&
pip install --pre shortfin[apps] -f https://github.com/nod-ai/shark-ai/releases/expanded_assets/dev-wheels &&
```
Member commented:

We should point users at stable releases. The file could use nightly releases until we push 3.1.0, though.
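
Roughly the difference; the nightly line matches the snippet above, and the stable line assumes shortfin[apps] is published under that name with 3.1.0 as the next tag:

```bash
# Nightly dev wheels (what the guide uses today):
pip install --pre "shortfin[apps]" -f https://github.com/nod-ai/shark-ai/releases/expanded_assets/dev-wheels

# Stable release, once 3.1.0 ships (assumed package name and version pin):
pip install "shortfin[apps]==3.1.0"
```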

Contributor commented:

Yeah, I can update when we push 3.1.0.

```diff
@@ -0,0 +1,42 @@
+# LLama 8b GPU instructions on Kubernetes
```
Member commented:

I'd also keep this guide general, maybe keep it next to llama_end_to_end.md as llama_serving_on_kubernetes.md, dropping "8B" and "GPU" from the title. Could then also rename llama_end_to_end.md to llama_serving.md? IDK. Naming is hard.

I'm being picky about file names since I want to link to these guides in the release notes, which would make it harder to rename them later without creating 404s.

Contributor commented:

Cool, I think we should go with llama_serving_on_kubernetes.md and llama_serving.md. "End to end" can be confusing as to what it entails (especially with the sglang layer on top).
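
If that's the direction, the renames would be along these lines; paths are assumptions based on the files this PR touches:

```bash
# Paths assumed from this PR's file list; verify before running.
git mv docs/shortfin/llm/user/e2e_llama8b_k8s.md docs/shortfin/llm/user/llama_serving_on_kubernetes.md
git mv docs/shortfin/llm/user/llama_end_to_end.md docs/shortfin/llm/user/llama_serving.md
```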

saienduri and others added 4 commits December 20, 2024 11:25
Co-authored-by: Scott Todd <scott.todd0@gmail.com>
Co-authored-by: Scott Todd <scott.todd0@gmail.com>
…ang`,

Added `Server Options` to `llama_serving.md`, detailing server flags
@saienduri merged commit f5e9cb4 into nod-ai:main Dec 23, 2024
27 of 28 checks passed