feat: Add a new template for dell (#978)
- Added new template `dell` and its documentation
- Updated docs
- [minor] A `uv` fix I came across
- Ran codegen for all templates

Tested with:

```bash
export INFERENCE_PORT=8181
export DEH_URL=http://0.0.0.0:$INFERENCE_PORT
export INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct
export CHROMADB_HOST=localhost
export CHROMADB_PORT=6601
export CHROMA_URL=http://$CHROMADB_HOST:$CHROMADB_PORT
export CUDA_VISIBLE_DEVICES=0
export LLAMA_STACK_PORT=8321

# build the stack template 
llama stack build --template=dell 

# start the TGI inference server 
podman run --rm -it --network host -v $HOME/.cache/huggingface:/data -e HF_TOKEN=$HF_TOKEN -p $INFERENCE_PORT:$INFERENCE_PORT --gpus $CUDA_VISIBLE_DEVICES ghcr.io/huggingface/text-generation-inference --dtype bfloat16 --usage-stats off --sharded false --cuda-memory-fraction 0.7 --model-id $INFERENCE_MODEL --port $INFERENCE_PORT --hostname 0.0.0.0

# start chroma-db for vector-io (aka RAG)
podman run --rm -it --network host --name chromadb -v .:/chroma/chroma -e IS_PERSISTENT=TRUE chromadb/chroma:latest --port $CHROMADB_PORT --host $(hostname)

# build docker 
llama stack build --template=dell --image-type=container

# run llama stack server (via docker)
# NOTE: mount the llama-stack / llama-models directories only if testing local changes
podman run -it \
--network host \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ~/.llama:/root/.llama \
-v /home/hjshah/git/llama-stack:/app/llama-stack-source \
-v /home/hjshah/git/llama-models:/app/llama-models-source \
localhost/distribution-dell:dev \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env DEH_URL=$DEH_URL \
--env CHROMA_URL=$CHROMA_URL

# test the server 
cd <PATH_TO_LLAMA_STACK_REPO>
LLAMA_STACK_BASE_URL=http://0.0.0.0:$LLAMA_STACK_PORT pytest -s -v tests/client-sdk/agents/test_agents.py

```

---------

Co-authored-by: Hardik Shah <hjshah@fb.com>
hardikjshah and Hardik Shah authored Feb 6, 2025
1 parent dd1265b commit a84e766
Showing 24 changed files with 895 additions and 71 deletions.
152 changes: 93 additions & 59 deletions distributions/dependencies.json

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions docs/source/distributions/remote_hosted_distro/nvidia.md
@@ -1,3 +1,4 @@
<!-- This file was auto-generated by distro_codegen.py, please edit source -->
# NVIDIA Distribution

The `llamastack/distribution-nvidia` distribution consists of the following provider configurations.
1 change: 1 addition & 0 deletions docs/source/distributions/self_hosted_distro/bedrock.md
@@ -1,3 +1,4 @@
<!-- This file was auto-generated by distro_codegen.py, please edit source -->
# Bedrock Distribution

```{toctree}
1 change: 1 addition & 0 deletions docs/source/distributions/self_hosted_distro/cerebras.md
@@ -1,3 +1,4 @@
<!-- This file was auto-generated by distro_codegen.py, please edit source -->
# Cerebras Distribution

The `llamastack/distribution-cerebras` distribution consists of the following provider configurations.
186 changes: 186 additions & 0 deletions docs/source/distributions/self_hosted_distro/dell.md
@@ -0,0 +1,186 @@
<!-- This file was auto-generated by distro_codegen.py, please edit source -->
---
orphan: true
---

# Dell Distribution of Llama Stack

```{toctree}
:maxdepth: 2
:hidden:
self
```

The `llamastack/distribution-dell` distribution consists of the following provider configurations.

| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| datasetio | `remote::huggingface`, `inline::localfs` |
| eval | `inline::meta-reference` |
| inference | `remote::tgi` |
| safety | `inline::llama-guard` |
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
| telemetry | `inline::meta-reference` |
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::code-interpreter`, `inline::rag-runtime` |
| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |


You can use this distribution if you have GPUs and want to run inference through a standalone TGI or Dell Enterprise Hub container.

### Environment Variables

The following environment variables can be configured:

- `DEH_URL`: URL for the Dell inference server (default: `http://0.0.0.0:8181`)
- `DEH_SAFETY_URL`: URL for the Dell safety inference server (default: `http://0.0.0.0:8282`)
- `CHROMA_URL`: URL for the Chroma server (default: `http://localhost:6601`)
- `INFERENCE_MODEL`: Inference model loaded into the TGI server (default: `meta-llama/Llama-3.2-3B-Instruct`)
- `SAFETY_MODEL`: Name of the safety (Llama-Guard) model to use (default: `meta-llama/Llama-Guard-3-1B`)


## Setting up the inference server using Dell Enterprise Hub's custom TGI container

NOTE: This is a placeholder to run inference with TGI. This will be updated to use [Dell Enterprise Hub's containers](https://dell.huggingface.co/authenticated/models) once verified.

```bash
export INFERENCE_PORT=8181
export DEH_URL=http://0.0.0.0:$INFERENCE_PORT
export INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct
export CHROMADB_HOST=localhost
export CHROMADB_PORT=6601
export CHROMA_URL=http://$CHROMADB_HOST:$CHROMADB_PORT
export CUDA_VISIBLE_DEVICES=0
export LLAMA_STACK_PORT=8321

docker run --rm -it \
--network host \
-v $HOME/.cache/huggingface:/data \
-e HF_TOKEN=$HF_TOKEN \
-p $INFERENCE_PORT:$INFERENCE_PORT \
--gpus $CUDA_VISIBLE_DEVICES \
ghcr.io/huggingface/text-generation-inference \
--dtype bfloat16 \
--usage-stats off \
--sharded false \
--cuda-memory-fraction 0.7 \
--model-id $INFERENCE_MODEL \
--port $INFERENCE_PORT \
--hostname 0.0.0.0
```
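
Before wiring TGI into Llama Stack, it can help to confirm the container is actually serving requests. A minimal sanity check, assuming TGI's standard `/health` and `/generate` REST endpoints are exposed on `$INFERENCE_PORT`:

```bash
# Health check against the TGI server (standard TGI endpoint, assumed reachable on $INFERENCE_PORT)
curl -f http://0.0.0.0:$INFERENCE_PORT/health && echo "TGI is healthy"

# Generate a short completion to confirm the model is loaded
curl -s http://0.0.0.0:$INFERENCE_PORT/generate \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "Hello, world!", "parameters": {"max_new_tokens": 16}}'
```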

If you are using the Llama Stack Safety / Shield APIs, you will also need to run another TGI instance with a corresponding safety model such as `meta-llama/Llama-Guard-3-1B`, using a script like:

```bash
export SAFETY_INFERENCE_PORT=8282
export DEH_SAFETY_URL=http://0.0.0.0:$SAFETY_INFERENCE_PORT
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
export CUDA_VISIBLE_DEVICES=1

docker run --rm -it \
--network host \
-v $HOME/.cache/huggingface:/data \
-e HF_TOKEN=$HF_TOKEN \
-p $SAFETY_INFERENCE_PORT:$SAFETY_INFERENCE_PORT \
--gpus $CUDA_VISIBLE_DEVICES \
ghcr.io/huggingface/text-generation-inference \
--dtype bfloat16 \
--usage-stats off \
--sharded false \
--cuda-memory-fraction 0.7 \
--model-id $SAFETY_MODEL \
--hostname 0.0.0.0 \
--port $SAFETY_INFERENCE_PORT
```

## Setting up the vector database (ChromaDB)

The Dell distribution relies on ChromaDB for vector storage. You can start a Chroma container easily with Podman (or Docker):
```bash
# This is where the indices are persisted
mkdir -p $HOME/chromadb

podman run --rm -it \
--network host \
--name chromadb \
-v $HOME/chromadb:/chroma/chroma \
-e IS_PERSISTENT=TRUE \
chromadb/chroma:latest \
--port $CHROMADB_PORT \
--host $CHROMADB_HOST
```
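
To confirm the Chroma server is reachable before starting the stack, you can hit its heartbeat endpoint (a lightweight check, assuming Chroma's default v1 REST API):

```bash
# Should return a JSON payload containing a nanosecond heartbeat timestamp
curl http://$CHROMADB_HOST:$CHROMADB_PORT/api/v1/heartbeat
```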

## Running Llama Stack

Now you are ready to run Llama Stack with TGI as the inference provider. You can do this via Conda (build the code yourself) or Docker (which has a pre-built image).

### Via Docker

This method allows you to get started quickly without having to build the distribution code.

```bash
# NOTE: mount the llama-stack / llama-models directories only if testing local changes,
# and in that case use the locally built image (e.g. localhost/distribution-dell:dev)
# instead of llamastack/distribution-dell.
docker run -it \
--network host \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v $HOME/.llama:/root/.llama \
-v /home/hjshah/git/llama-stack:/app/llama-stack-source \
-v /home/hjshah/git/llama-models:/app/llama-models-source \
llamastack/distribution-dell \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env DEH_URL=$DEH_URL \
--env CHROMA_URL=$CHROMA_URL

```
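
Once the container is up, a quick way to confirm the server is responding is to list its registered models. The exact route can vary across Llama Stack versions, so treat this as a sketch:

```bash
# List models registered with the stack (endpoint path assumed from current Llama Stack API docs)
curl -s http://0.0.0.0:$LLAMA_STACK_PORT/v1/models
```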

If you are using Llama Stack Safety / Shield APIs, use:

```bash
# You need a local checkout of llama-stack to run this; get it with:
# git clone https://github.com/meta-llama/llama-stack.git
cd /path/to/llama-stack

export SAFETY_INFERENCE_PORT=8282
export DEH_SAFETY_URL=http://0.0.0.0:$SAFETY_INFERENCE_PORT
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B

docker run \
-it \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v $HOME/.llama:/root/.llama \
-v ./llama_stack/templates/tgi/run-with-safety.yaml:/root/my-run.yaml \
llamastack/distribution-dell \
--yaml-config /root/my-run.yaml \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env DEH_URL=$DEH_URL \
--env SAFETY_MODEL=$SAFETY_MODEL \
--env DEH_SAFETY_URL=$DEH_SAFETY_URL \
--env CHROMA_URL=$CHROMA_URL
```

### Via Conda

Make sure you have done `pip install llama-stack` and have the Llama Stack CLI available.

```bash
llama stack build --template dell --image-type conda
llama stack run dell \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env DEH_URL=$DEH_URL \
--env CHROMA_URL=$CHROMA_URL
```

If you are using Llama Stack Safety / Shield APIs, use:

```bash
llama stack run ./run-with-safety.yaml \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env DEH_URL=$DEH_URL \
--env SAFETY_MODEL=$SAFETY_MODEL \
--env DEH_SAFETY_URL=$DEH_SAFETY_URL \
--env CHROMA_URL=$CHROMA_URL
```
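
However you launch the server, a quick end-to-end sanity check is to run the client SDK agent tests against it (this mirrors the test run in the commit description above):

```bash
cd <PATH_TO_LLAMA_STACK_REPO>
LLAMA_STACK_BASE_URL=http://0.0.0.0:$LLAMA_STACK_PORT pytest -s -v tests/client-sdk/agents/test_agents.py
```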
1 change: 1 addition & 0 deletions docs/source/distributions/self_hosted_distro/fireworks.md
@@ -1,3 +1,4 @@
<!-- This file was auto-generated by distro_codegen.py, please edit source -->
---
orphan: true
---
@@ -1,3 +1,4 @@
<!-- This file was auto-generated by distro_codegen.py, please edit source -->
---
orphan: true
---
@@ -82,7 +83,7 @@ docker run \

### Via Conda

Make sure you have done `pip install llama-stack` and have the Llama Stack CLI available.
Make sure you have done `uv pip install llama-stack` and have the Llama Stack CLI available.

```bash
llama stack build --template meta-reference-gpu --image-type conda
@@ -1,3 +1,4 @@
<!-- This file was auto-generated by distro_codegen.py, please edit source -->
---
orphan: true
---
@@ -82,7 +83,7 @@ docker run \

### Via Conda

Make sure you have done `pip install llama-stack` and have the Llama Stack CLI available.
Make sure you have done `uv pip install llama-stack` and have the Llama Stack CLI available.

```bash
llama stack build --template meta-reference-quantized-gpu --image-type conda
3 changes: 2 additions & 1 deletion docs/source/distributions/self_hosted_distro/ollama.md
@@ -1,3 +1,4 @@
<!-- This file was auto-generated by distro_codegen.py, please edit source -->
---
orphan: true
---
@@ -103,7 +104,7 @@ docker run \

### Via Conda

Make sure you have done `pip install llama-stack` and have the Llama Stack CLI available.
Make sure you have done `uv pip install llama-stack` and have the Llama Stack CLI available.

```bash
export LLAMA_STACK_PORT=5001
3 changes: 2 additions & 1 deletion docs/source/distributions/self_hosted_distro/remote-vllm.md
@@ -1,3 +1,4 @@
<!-- This file was auto-generated by distro_codegen.py, please edit source -->
---
orphan: true
---
@@ -131,7 +132,7 @@ docker run \

### Via Conda

Make sure you have done `pip install llama-stack` and have the Llama Stack CLI available.
Make sure you have done `uv pip install llama-stack` and have the Llama Stack CLI available.

```bash
export INFERENCE_PORT=8000
1 change: 1 addition & 0 deletions docs/source/distributions/self_hosted_distro/sambanova.md
@@ -1,3 +1,4 @@
<!-- This file was auto-generated by distro_codegen.py, please edit source -->
---
orphan: true
---
3 changes: 2 additions & 1 deletion docs/source/distributions/self_hosted_distro/tgi.md
@@ -1,3 +1,4 @@
<!-- This file was auto-generated by distro_codegen.py, please edit source -->
---
orphan: true
---
@@ -122,7 +123,7 @@ docker run \

### Via Conda

Make sure you have done `pip install llama-stack` and have the Llama Stack CLI available.
Make sure you have done `uv pip install llama-stack` and have the Llama Stack CLI available.

```bash
llama stack build --template tgi --image-type conda
1 change: 1 addition & 0 deletions docs/source/distributions/self_hosted_distro/together.md
@@ -1,3 +1,4 @@
<!-- This file was auto-generated by distro_codegen.py, please edit source -->
---
orphan: true
---
2 changes: 1 addition & 1 deletion llama_stack/distribution/build_conda_env.sh
@@ -125,7 +125,7 @@ ensure_conda_env_python310() {
fi

printf "Installing from LLAMA_MODELS_DIR: $LLAMA_MODELS_DIR\n"
uv pip uninstall -y llama-models
uv pip uninstall llama-models
uv pip install --no-cache-dir -e "$LLAMA_MODELS_DIR"
fi

2 changes: 1 addition & 1 deletion llama_stack/distribution/build_venv.sh
@@ -89,7 +89,7 @@ run() {
fi

printf "Installing from LLAMA_MODELS_DIR: $LLAMA_MODELS_DIR\n"
uv pip uninstall -y llama-models
uv pip uninstall llama-models
uv pip install --no-cache-dir -e "$LLAMA_MODELS_DIR"
fi

1 change: 1 addition & 0 deletions llama_stack/providers/remote/inference/nvidia/nvidia.py
@@ -26,6 +26,7 @@
Message,
ResponseFormat,
ToolChoice,
ToolConfig,
)
from llama_stack.providers.utils.inference.model_registry import (
build_model_alias,
6 changes: 3 additions & 3 deletions llama_stack/providers/utils/memory/vector_store.py
@@ -17,9 +17,6 @@
import numpy as np

from llama_models.llama3.api.tokenizer import Tokenizer
from numpy.typing import NDArray

from pypdf import PdfReader

from llama_stack.apis.common.content_types import (
InterleavedContent,
@@ -33,6 +30,9 @@
from llama_stack.providers.utils.inference.prompt_adapter import (
interleaved_content_as_str,
)
from numpy.typing import NDArray

from pypdf import PdfReader

log = logging.getLogger(__name__)

7 changes: 7 additions & 0 deletions llama_stack/templates/dell/__init__.py
@@ -0,0 +1,7 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from .dell import get_distribution_template # noqa: F401
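
This export is what the distro codegen and `llama stack build` machinery import when rendering the template. A minimal sketch of exercising it directly from a llama-stack source checkout (assuming only that it returns the template object defined in `dell.py`):

```bash
# Hypothetical sketch: import the exported symbol and print the resulting template object
python -c "from llama_stack.templates.dell import get_distribution_template; print(get_distribution_template())"
```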
32 changes: 32 additions & 0 deletions llama_stack/templates/dell/build.yaml
@@ -0,0 +1,32 @@
version: '2'
distribution_spec:
description: Dell's distribution of Llama Stack. TGI inference via Dell's custom
container
providers:
inference:
- remote::tgi
vector_io:
- inline::faiss
- remote::chromadb
- remote::pgvector
safety:
- inline::llama-guard
agents:
- inline::meta-reference
telemetry:
- inline::meta-reference
eval:
- inline::meta-reference
datasetio:
- remote::huggingface
- inline::localfs
scoring:
- inline::basic
- inline::llm-as-judge
- inline::braintrust
tool_runtime:
- remote::brave-search
- remote::tavily-search
- inline::code-interpreter
- inline::rag-runtime
image_type: conda
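
This `build.yaml` is what the CLI consumes when the `dell` template is selected; building the distribution from it follows the pattern shown earlier, for either image type:

```bash
# Build the Dell distribution from the template above
llama stack build --template dell --image-type conda
# or produce a container image instead
llama stack build --template dell --image-type container
```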
