feat: Add a new template for dell (#978)

- Added new template `dell` and its documentation
- Update docs
- [minor] uv fix I came across
- codegen for all templates

Tested with:

```bash
export INFERENCE_PORT=8181
export DEH_URL=http://0.0.0.0:$INFERENCE_PORT
export INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct
export CHROMADB_HOST=localhost
export CHROMADB_PORT=6601
export CHROMA_URL=http://$CHROMADB_HOST:$CHROMADB_PORT
export CUDA_VISIBLE_DEVICES=0
export LLAMA_STACK_PORT=8321

# build the stack template
llama stack build --template=dell

# start the TGI inference server
podman run --rm -it --network host -v $HOME/.cache/huggingface:/data -e HF_TOKEN=$HF_TOKEN -p $INFERENCE_PORT:$INFERENCE_PORT --gpus $CUDA_VISIBLE_DEVICES ghcr.io/huggingface/text-generation-inference --dtype bfloat16 --usage-stats off --sharded false --cuda-memory-fraction 0.7 --model-id $INFERENCE_MODEL --port $INFERENCE_PORT --hostname 0.0.0.0

# start chroma-db for vector-io (aka RAG)
podman run --rm -it --network host --name chromadb -v .:/chroma/chroma -e IS_PERSISTENT=TRUE chromadb/chroma:latest --port $CHROMADB_PORT --host $(hostname)

# build docker
llama stack build --template=dell --image-type=container

# run llama stack server (via docker)
# NOTE: mount the llama-stack / llama-models directories only if testing local changes
podman run -it \
  --network host \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ~/.llama:/root/.llama \
  -v /home/hjshah/git/llama-stack:/app/llama-stack-source \
  -v /home/hjshah/git/llama-models:/app/llama-models-source \
  localhost/distribution-dell:dev \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env DEH_URL=$DEH_URL \
  --env CHROMA_URL=$CHROMA_URL

# test the server
cd <PATH_TO_LLAMA_STACK_REPO>
LLAMA_STACK_BASE_URL=http://0.0.0.0:$LLAMA_STACK_PORT pytest -s -v tests/client-sdk/agents/test_agents.py
```

---------

Co-authored-by: Hardik Shah <hjshah@fb.com>
1 parent: dd1265b. Commit: a84e766. Showing 24 changed files with 895 additions and 71 deletions.
@@ -0,0 +1,186 @@

<!-- This file was auto-generated by distro_codegen.py, please edit source -->
---
orphan: true
---

# Dell Distribution of Llama Stack

```{toctree}
:maxdepth: 2
:hidden:
self
```

The `llamastack/distribution-dell` distribution consists of the following provider configurations.

| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| datasetio | `remote::huggingface`, `inline::localfs` |
| eval | `inline::meta-reference` |
| inference | `remote::tgi` |
| safety | `inline::llama-guard` |
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
| telemetry | `inline::meta-reference` |
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::code-interpreter`, `inline::rag-runtime` |
| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |

You can use this distribution if you have GPUs and want to run an independent TGI or Dell Enterprise Hub container for running inference.

### Environment Variables

The following environment variables can be configured:

- `DEH_URL`: URL for the Dell inference server (default: `http://0.0.0.0:8181`)
- `DEH_SAFETY_URL`: URL for the Dell safety inference server (default: `http://0.0.0.0:8282`)
- `CHROMA_URL`: URL for the Chroma server (default: `http://localhost:6601`)
- `INFERENCE_MODEL`: Inference model loaded into the TGI server (default: `meta-llama/Llama-3.2-3B-Instruct`)
- `SAFETY_MODEL`: Name of the safety (Llama-Guard) model to use (default: `meta-llama/Llama-Guard-3-1B`)
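
For reference, a minimal sketch of setting these variables to their documented defaults before building or running the stack (the full, tested command sequences follow below):

```bash
# Defaults taken from the list above; override as needed for your deployment.
export DEH_URL=http://0.0.0.0:8181
export DEH_SAFETY_URL=http://0.0.0.0:8282
export CHROMA_URL=http://localhost:6601
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
```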

## Setting up the inference server using Dell Enterprise Hub's custom TGI container

NOTE: This is a placeholder to run inference with TGI. This will be updated to use [Dell Enterprise Hub's containers](https://dell.huggingface.co/authenticated/models) once verified.

```bash
export INFERENCE_PORT=8181
export DEH_URL=http://0.0.0.0:$INFERENCE_PORT
export INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct
export CHROMADB_HOST=localhost
export CHROMADB_PORT=6601
export CHROMA_URL=http://$CHROMADB_HOST:$CHROMADB_PORT
export CUDA_VISIBLE_DEVICES=0
export LLAMA_STACK_PORT=8321

docker run --rm -it \
  --network host \
  -v $HOME/.cache/huggingface:/data \
  -e HF_TOKEN=$HF_TOKEN \
  -p $INFERENCE_PORT:$INFERENCE_PORT \
  --gpus $CUDA_VISIBLE_DEVICES \
  ghcr.io/huggingface/text-generation-inference \
  --dtype bfloat16 \
  --usage-stats off \
  --sharded false \
  --cuda-memory-fraction 0.7 \
  --model-id $INFERENCE_MODEL \
  --port $INFERENCE_PORT --hostname 0.0.0.0
```
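
Once the TGI container reports it is ready, a quick sanity check can confirm it is serving before you move on. This is a hedged sketch assuming TGI's standard `/health` and `/generate` endpoints; it is not part of the original template docs:

```bash
# Liveness check: prints 200 once the model has finished loading.
curl -s -o /dev/null -w "%{http_code}\n" http://0.0.0.0:$INFERENCE_PORT/health

# Minimal generation request against TGI's native API.
curl -s http://0.0.0.0:$INFERENCE_PORT/generate \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "Hello", "parameters": {"max_new_tokens": 16}}'
```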

If you are using Llama Stack Safety / Shield APIs, you will also need to run another TGI instance with a corresponding safety model such as `meta-llama/Llama-Guard-3-1B`, using a script like:

```bash
export SAFETY_INFERENCE_PORT=8282
export DEH_SAFETY_URL=http://0.0.0.0:$SAFETY_INFERENCE_PORT
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
export CUDA_VISIBLE_DEVICES=1

docker run --rm -it \
  --network host \
  -v $HOME/.cache/huggingface:/data \
  -e HF_TOKEN=$HF_TOKEN \
  -p $SAFETY_INFERENCE_PORT:$SAFETY_INFERENCE_PORT \
  --gpus $CUDA_VISIBLE_DEVICES \
  ghcr.io/huggingface/text-generation-inference \
  --dtype bfloat16 \
  --usage-stats off \
  --sharded false \
  --cuda-memory-fraction 0.7 \
  --model-id $SAFETY_MODEL \
  --hostname 0.0.0.0 \
  --port $SAFETY_INFERENCE_PORT
```

## Setting up ChromaDB for vector IO (RAG)

The Dell distribution relies on ChromaDB for vector database usage. You can start a ChromaDB container easily using Podman (or Docker):

```bash
# This is where the indices are persisted
mkdir -p $HOME/chromadb

podman run --rm -it \
  --network host \
  --name chromadb \
  -v $HOME/chromadb:/chroma/chroma \
  -e IS_PERSISTENT=TRUE \
  chromadb/chroma:latest \
  --port $CHROMADB_PORT \
  --host $CHROMADB_HOST
```
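
To verify the vector store is reachable, Chroma exposes a heartbeat endpoint; the path below assumes the v1 REST API served by the `chromadb/chroma` image used above:

```bash
# Returns a JSON payload containing a nanosecond heartbeat when the server is healthy.
curl -s http://$CHROMADB_HOST:$CHROMADB_PORT/api/v1/heartbeat
```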

## Running Llama Stack

Now you are ready to run Llama Stack with TGI as the inference provider. You can do this via Conda (build code) or Docker, which has a pre-built image.

### Via Docker

This method allows you to get started quickly without having to build the distribution code.

```bash
# NOTE: if you are testing local changes, also mount the llama-stack / llama-models
# source directories and use the locally built image (localhost/distribution-dell:dev):
#   -v /home/hjshah/git/llama-stack:/app/llama-stack-source \
#   -v /home/hjshah/git/llama-models:/app/llama-models-source \
docker run -it \
  --network host \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v $HOME/.llama:/root/.llama \
  llamastack/distribution-dell \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env DEH_URL=$DEH_URL \
  --env CHROMA_URL=$CHROMA_URL
```
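
As noted in the commit description, a running server can be smoke-tested with the client SDK agent tests from a local llama-stack checkout:

```bash
cd <PATH_TO_LLAMA_STACK_REPO>
LLAMA_STACK_BASE_URL=http://0.0.0.0:$LLAMA_STACK_PORT \
  pytest -s -v tests/client-sdk/agents/test_agents.py
```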

If you are using Llama Stack Safety / Shield APIs, use:

```bash
# You need a local checkout of llama-stack to run this, get it using
# git clone https://github.com/meta-llama/llama-stack.git
cd /path/to/llama-stack

export SAFETY_INFERENCE_PORT=8282
export DEH_SAFETY_URL=http://0.0.0.0:$SAFETY_INFERENCE_PORT
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B

docker run -it \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v $HOME/.llama:/root/.llama \
  -v ./llama_stack/templates/dell/run-with-safety.yaml:/root/my-run.yaml \
  llamastack/distribution-dell \
  --yaml-config /root/my-run.yaml \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env DEH_URL=$DEH_URL \
  --env SAFETY_MODEL=$SAFETY_MODEL \
  --env DEH_SAFETY_URL=$DEH_SAFETY_URL \
  --env CHROMA_URL=$CHROMA_URL
```

### Via Conda

Make sure you have done `pip install llama-stack` and have the Llama Stack CLI available.

```bash
llama stack build --template dell --image-type conda
llama stack run dell \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env DEH_URL=$DEH_URL \
  --env CHROMA_URL=$CHROMA_URL
```

If you are using Llama Stack Safety / Shield APIs, use:

```bash
llama stack run ./run-with-safety.yaml \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env DEH_URL=$DEH_URL \
  --env SAFETY_MODEL=$SAFETY_MODEL \
  --env DEH_SAFETY_URL=$DEH_SAFETY_URL \
  --env CHROMA_URL=$CHROMA_URL
```
@@ -1,3 +1,4 @@
<!-- This file was auto-generated by distro_codegen.py, please edit source -->
---
orphan: true
---
@@ -1,3 +1,4 @@
<!-- This file was auto-generated by distro_codegen.py, please edit source -->
---
orphan: true
---
@@ -1,3 +1,4 @@
<!-- This file was auto-generated by distro_codegen.py, please edit source -->
---
orphan: true
---
@@ -0,0 +1,7 @@
```python
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from .dell import get_distribution_template  # noqa: F401
```
@@ -0,0 +1,32 @@
```yaml
version: '2'
distribution_spec:
  description: Dell's distribution of Llama Stack. TGI inference via Dell's custom
    container
  providers:
    inference:
    - remote::tgi
    vector_io:
    - inline::faiss
    - remote::chromadb
    - remote::pgvector
    safety:
    - inline::llama-guard
    agents:
    - inline::meta-reference
    telemetry:
    - inline::meta-reference
    eval:
    - inline::meta-reference
    datasetio:
    - remote::huggingface
    - inline::localfs
    scoring:
    - inline::basic
    - inline::llm-as-judge
    - inline::braintrust
    tool_runtime:
    - remote::brave-search
    - remote::tavily-search
    - inline::code-interpreter
    - inline::rag-runtime
image_type: conda
```
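
For reference, this build specification is what `--template dell` resolves to; the corresponding build commands used in the docs and in the commit's test run are:

```bash
# Build a conda environment for the dell template
llama stack build --template dell --image-type conda

# ...or build the container image
llama stack build --template dell --image-type container
```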