[LLM] Add llama 4 example #5125
Merged
Changes from all commits (12 commits):

- 1b4db35 llama4 support (Michaelvll)
- e017e0e Add service section (Michaelvll)
- 5838aac Add llama 4 (Michaelvll)
- 640d3bc use 0.8.3 for vllm (Michaelvll)
- fe35563 minor readme fix (Michaelvll)
- e11dc18 fix input (Michaelvll)
- e1c18f0 Add video (Michaelvll)
- 5fec58c Add video (Michaelvll)
- 8c54cd1 Update vllm version in README.md (Michaelvll)
- 326d901 Update video (Michaelvll)
- a422415 Update readme (Michaelvll)
- 9de9ed2 update accelerators (Michaelvll)
The first new file is a single line, `../../../../llm/llama-4/README.md`, pointing the examples tree at the Llama 4 README.
The new README, `llm/llama-4/README.md` (156 lines):
<!-- $REMOVE -->
# Run Llama 4 on Kubernetes or Any Cloud
<!-- $END_REMOVE -->
<!-- $UNCOMMENT# Llama 4 -->

The [Llama 4](https://ai.meta.com/blog/llama-4-multimodal-intelligence/) family was released by Meta on Apr 5, 2025.



## Prerequisites

- Go to the [HuggingFace model page](https://huggingface.co/meta-llama/) and request access to the model [meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8](https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8). A quick way to verify that access has been granted is sketched after this list.
- Check that you have installed SkyPilot ([docs](https://docs.skypilot.co/en/latest/getting-started/installation.html)).
- Check that `sky check` shows clouds or Kubernetes are enabled.
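
For example, the following snippet (not part of this PR; it assumes the `huggingface_hub` package is installed and `HF_TOKEN` is set in your environment) is one way to confirm that your token can see the gated repo before launching:

```python
# Hypothetical pre-flight check (not from this PR): verify the HF token can
# access the gated Llama 4 repo before spending time on `sky launch`.
import os

from huggingface_hub import model_info  # assumes: pip install huggingface_hub

REPO = "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8"

try:
    info = model_info(REPO, token=os.environ["HF_TOKEN"])
    print(f"Access OK: {info.id}")
except Exception as err:  # a 401/403 here usually means access is not granted yet
    print(f"Cannot access {REPO}: {err}")
```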

## Run Llama 4

```bash
sky launch llama4.yaml -c llama4 --env HF_TOKEN
```

https://github.com/user-attachments/assets/48cdc44a-31a5-45f0-93be-7a8b6c6a0ded

The `llama4.yaml` file is as follows:
```yaml
envs:
  MODEL_NAME: meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.

resources:
  accelerators: { H100:8, H200:8, B100:8, B200:8, GB200:8 }
  cpus: 32+
  disk_size: 512  # Ensure model checkpoints can fit.
  disk_tier: best
  ports: 8081  # Expose to internet traffic.

setup: |
  uv pip install vllm==0.8.3

run: |
  echo 'Starting vllm api server...'

  vllm serve $MODEL_NAME \
    --port 8081 \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-model-len 430000
```

You can use other models by setting a different `MODEL_NAME`.
```bash
sky launch llama4.yaml -c llama4 --env HF_TOKEN --env MODEL_NAME=meta-llama/Llama-4-Scout-17B-16E-Instruct
```

🎉 **Congratulations!** 🎉 You have now launched the Llama 4 Maverick Instruct LLM on your infra.

### Chat with Llama 4 via the OpenAI API

To curl `/v1/chat/completions`:
```console
ENDPOINT=$(sky status --endpoint 8081 llama4)

curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Who are you?"
      }
    ]
  }' | jq .
```
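
Because the server speaks the OpenAI API, any OpenAI-compatible client also works. As a rough sketch (not part of this PR; it assumes the `openai` Python package is installed and the `ENDPOINT` variable is set as above):

```python
# Minimal OpenAI-client sketch against the vLLM server launched above.
# Assumes: pip install openai, and ENDPOINT set via
#   ENDPOINT=$(sky status --endpoint 8081 llama4)
import os

from openai import OpenAI

client = OpenAI(
    base_url=f"http://{os.environ['ENDPOINT']}/v1",
    api_key="EMPTY",  # vLLM's OpenAI-compatible server does not require a real key by default
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
    ],
)
print(resp.choices[0].message.content)
```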

To stop the instance:
```console
sky stop llama4
```

To shut down all resources:
```console
sky down llama4
```

## Serving Llama 4: scaling up with SkyServe

With no change to the YAML, launch a fully managed service on your infra:
```console
HF_TOKEN=xxx sky serve up llama4.yaml -n llama4 --env HF_TOKEN
```

Wait until the service is ready:
```console
watch -n10 sky serve status llama4
```

<details>
<summary>Example outputs:</summary>

```console
Services
NAME    VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
llama4  1        35s     READY   2/2       xx.yy.zz.100:30001

Service Replicas
SERVICE_NAME  ID  VERSION  IP            LAUNCHED     RESOURCES                  STATUS  REGION
llama4        1   1        xx.yy.zz.121  18 mins ago  1x GCP([Spot]{'H100': 8})  READY   us-east4
llama4        2   1        xx.yy.zz.245  18 mins ago  1x GCP([Spot]{'H100': 8})  READY   us-east4
```
</details>

Get a single endpoint that load-balances across replicas:
```console
ENDPOINT=$(sky serve status --endpoint llama4)
```

> **Tip:** SkyServe fully manages the lifecycle of your replicas. For example, if a spot replica is preempted, the controller will automatically replace it. This significantly reduces the operational burden while saving costs.

To curl the endpoint:
```console
curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Who are you?"
      }
    ]
  }' | jq .
```

To shut down all resources:
```console
sky serve down llama4
```

See more details in [SkyServe docs](https://docs.skypilot.co/en/latest/serving/sky-serve.html).
The new SkyPilot YAML (34 lines), referenced in the README as `llama4.yaml`:

```yaml
envs:
  MODEL_NAME: meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
  # MODEL_NAME: meta-llama/Llama-3.2-3B-Vision
  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.

resources:
  accelerators: { H100:8, H200:8, B100:8, B200:8, GB200:8 }
  cpus: 32+
  disk_size: 512  # Ensure model checkpoints can fit.
  disk_tier: best
  ports: 8081

setup: |
  uv pip install "vllm>=0.8.3"

run: |
  echo 'Starting vllm api server...'

  vllm serve $MODEL_NAME \
    --port 8081 \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-model-len 430000

service:
  replicas: 2
  # An actual request for readiness probe.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_tokens: 1
```
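
The `service` section makes the readiness probe an actual chat completion request with `max_tokens: 1`. For illustration only (not part of this PR), the probe is roughly equivalent to the following Python `requests` call, assuming `ENDPOINT` comes from `sky serve status --endpoint llama4`:

```python
# Rough equivalent of the readiness probe above, sent by hand.
# Assumes: pip install requests, and ENDPOINT set via
#   ENDPOINT=$(sky serve status --endpoint llama4)
import os

import requests

resp = requests.post(
    f"http://{os.environ['ENDPOINT']}/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
        "messages": [{"role": "user", "content": "Hello! What is your name?"}],
        "max_tokens": 1,
    },
    timeout=60,
)
print(resp.status_code, resp.json())
```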