Add single command LLM deployment #3209

Merged
merged 33 commits into from
Jun 28, 2024
33 commits
c4f968a
move start_torchserve from test_utils into ts.launcher
mreso Jun 14, 2024
08377fe
Move register model into launcher
mreso Jun 14, 2024
fbb1f5d
Readd imports to register_model in test_util
mreso Jun 21, 2024
449603d
Move vllm_handler into ts/torch_handler and add vllm to dependencies
mreso Jun 21, 2024
bdb3f80
Register vllm_handler in model_archiver
mreso Jun 22, 2024
5b8801d
Remove gen_mars from launcher
mreso Jun 22, 2024
6781012
Add llm_launcher script + llm docker
mreso Jun 22, 2024
0da0b47
Use model_path as mode id if path does not exist
mreso Jun 24, 2024
202d137
Add arguments to llm_launcher
mreso Jun 24, 2024
4161c9d
Wait for load command to finish
mreso Jun 24, 2024
18016d5
Optionally skip waiting in launcher.stop
mreso Jun 24, 2024
8ab85b8
remove custom loading of model archiver
mreso Jun 24, 2024
5bc4914
Move llm_launcher to ts
mreso Jun 24, 2024
0009390
Set model load timeout to 10 min
mreso Jun 24, 2024
c5476fc
Finalize dockerfile.llm
mreso Jun 24, 2024
e9de819
Adjust default value of ts launcher for token auth and model api
mreso Jun 26, 2024
a69c79c
updated llm_launcher.py
mreso Jun 26, 2024
5eb640f
Add llm deployment to readme.md
mreso Jun 26, 2024
3122616
Added documentation for llm launcher
mreso Jun 26, 2024
2003bd0
Added section on supported models
mreso Jun 26, 2024
de572a0
Enable tensor parallelism in llm launcher
mreso Jun 26, 2024
332fb43
Add reference to go beyond quickstart
mreso Jun 26, 2024
f3508dd
fix spellcheck lint
mreso Jun 26, 2024
61b8820
HPC->HPU
mreso Jun 26, 2024
286e034
doc
mreso Jun 26, 2024
65b6480
Move margen import below path changes
mreso Jun 26, 2024
3b1f27c
Merge remote-tracking branch 'origin/master' into feature/single_cmd_…
mreso Jun 27, 2024
c7fdbf4
Fix java formatting
mreso Jun 27, 2024
8155398
Remove gen_mar kw
mreso Jun 27, 2024
474494a
Fix error if model_store is used as positional argument
mreso Jun 27, 2024
57de73e
Remove .queue
mreso Jun 27, 2024
a12dd8b
Merge branch 'master' into feature/single_cmd_llm_deployment
mreso Jun 27, 2024
b49d8d8
Merge branch 'master' into feature/single_cmd_llm_deployment
mreso Jun 27, 2024
13 changes: 13 additions & 0 deletions README.md
@@ -56,6 +56,19 @@ docker pull pytorch/torchserve-nightly

Refer to [torchserve docker](docker/README.md) for details.

### 🤖 Quick Start LLM Deployment

```bash
#export token=<HUGGINGFACE_HUB_TOKEN>
docker build . -f docker/Dockerfile.llm -t ts/llm

docker run --rm -ti --gpus all -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:8080 -v data:/data ts/llm --model_id meta-llama/Meta-Llama-3-8B-Instruct --disable_token

curl -X POST -d '{"prompt":"Hello, my name is", "max_new_tokens": 50}' --header "Content-Type: application/json" "http://localhost:8080/predictions/model"
```

Refer to [LLM deployment](docs/llm_deployment.md) for details and other methods.

## ⚡ Why TorchServe
* Write once, run anywhere, on-prem, on-cloud, supports inference on CPUs, GPUs, AWS Inf1/Inf2/Trn1, Google Cloud TPUs, [Nvidia MPS](docs/nvidia_mps.md)
* [Model Management API](docs/management_api.md): multi model management with optimized worker to model allocation
9 changes: 9 additions & 0 deletions docker/Dockerfile.llm
@@ -0,0 +1,9 @@
FROM pytorch/torchserve-nightly:latest-gpu as server

USER root

RUN mkdir /data && chown -R model-server /data

USER model-server

ENTRYPOINT [ "python", "-m", "ts.llm_launcher", "--vllm_engine.download_dir", "/data" ]
1 change: 1 addition & 0 deletions docs/README.md
@@ -32,6 +32,7 @@ TorchServe is a performant, flexible and easy to use tool for serving PyTorch ea

## Examples

* [Deploying LLMs](./llm_deployment.md) - How to easily deploy LLMs using TorchServe
* [HuggingFace Language Model](https://github.com/pytorch/serve/blob/master/examples/Huggingface_Transformers/Transformer_handler_generalized.py) - This handler takes an input sentence and can return sequence classifications, token classifications or Q&A answers
* [Multi Modal Framework](https://github.com/pytorch/serve/blob/master/examples/MMF-activity-recognition/handler.py) - Build and deploy a classifier that combines text, audio and video input data
* [Dual Translation Workflow](https://github.com/pytorch/serve/tree/master/examples/Workflows/nmt_transformers_pipeline) -
73 changes: 73 additions & 0 deletions docs/llm_deployment.md
@@ -0,0 +1,73 @@
# LLM Deployment with TorchServe

This document describes how to serve large language models (LLMs) such as Meta-Llama3 with TorchServe.
Besides a quick start guide using our vLLM integration, we also provide a list of examples that describe other methods to deploy LLMs with TorchServe.

## Quickstart LLM Deployment

TorchServe offers easy LLM deployment through its vLLM integration.
With our [LLM launcher script](https://github.com/pytorch/serve/blob/7a9b145204b4d7cfbb114fe737cf980221e6181e/ts/llm_launcher.py), users can deploy any model supported by vLLM with a single command.
The launcher can be used either standalone or in combination with our provided TorchServe GPU Docker image.

To launch the Docker container, first build the image:
```bash
docker build . -f docker/Dockerfile.llm -t ts/llm
```

Models are usually loaded from the HuggingFace hub and are cached in a [docker volume](https://docs.docker.com/storage/volumes/) for faster reloads.
To access gated models such as Meta-Llama3, you need to provide a HuggingFace hub token:
```bash
export token=<HUGGINGFACE_HUB_TOKEN>
```

You can then go ahead and launch a TorchServe instance serving your selected model:
```bash
docker run --rm -ti --gpus all -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:8080 -v data:/data ts/llm --model_id meta-llama/Meta-Llama-3-8B-Instruct --disable_token
```

Review comment (Collaborator):

Is `-e HUGGING_FACE_HUB_TOKEN=$token` needed? Why not directly set `export HUGGING_FACE_HUB_TOKEN=<HUGGINGFACE_HUB_TOKEN>`?

Also, from a security point of view, does the existing command print the token?

Reply (Collaborator Author):

AFAIK Docker will not pick up environment variables from the calling environment, so you would still have

    export HUGGING_FACE_HUB_TOKEN=<HUGGINGFACE_HUB_TOKEN>
    docker run --rm -ti --gpus all -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN ...

which is even longer and comes down to the same process. For a RL deployment the token variable would be set through a secret, so the token would not show up in any of the logs.

To change the model, replace the identifier passed to the `--model_id` parameter.
You can test the model with:
```bash
curl -X POST -d '{"prompt":"Hello, my name is", "max_new_tokens": 50}' --header "Content-Type: application/json" "http://localhost:8080/predictions/model"
```

You can change any of the sampling arguments for the request by using the [vLLM SamplingParams keywords](https://docs.vllm.ai/en/stable/dev/sampling_params.html#vllm.SamplingParams).
For example, to set the sampling temperature to 0:
```bash
curl -X POST -d '{"prompt":"Hello, my name is", "max_new_tokens": 50, "temperature": 0}' --header "Content-Type: application/json" "http://localhost:8080/predictions/model"
```
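
The same request can also be built from Python. The following is a minimal sketch using only the standard library. The helper names `build_request` and `send` are illustrative and not part of TorchServe, and the sketch assumes the server started with the `docker run` command above is listening on `localhost:8080`. Extra keyword arguments are forwarded unchanged in the JSON body, mirroring the curl examples.

```python
import json
import urllib.request


def build_request(prompt, max_new_tokens=50, model_name="model",
                  host="http://localhost:8080", **sampling_params):
    """Build (but do not send) a POST request for the predictions endpoint.

    `build_request` is a hypothetical helper, not a TorchServe API.
    """
    payload = {"prompt": prompt, "max_new_tokens": max_new_tokens}
    # Extra keywords such as temperature=0 are passed through as-is.
    payload.update(sampling_params)
    return urllib.request.Request(
        url=f"{host}/predictions/{model_name}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With a running server, `send(build_request("Hello, my name is", temperature=0))` would issue the same call as the curl command with the temperature set to 0. The `model_name` argument only changes the URL path, so it has to match the endpoint name the server was started with.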

TorchServe's LLM launcher script offers some customization options as well.
To rename the model endpoint from `predictions/model` to something else, add `--model_name <SOME_NAME>` to the `docker run` command.

The launcher script can also be used outside a Docker container. After installing TorchServe following the [installation instructions](https://github.com/pytorch/serve/blob/feature/single_cmd_llm_deployment/README.md#-quick-start-with-torchserve), run:
```bash
python -m ts.llm_launcher --disable_token
```

Please note that the launcher script as well as the docker command will automatically run on all available GPUs, so make sure to restrict the number of visible devices by setting `CUDA_VISIBLE_DEVICES`.
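
One way to restrict the devices when starting the launcher from Python is to set `CUDA_VISIBLE_DEVICES` only in the environment of the child process. This is a sketch under assumptions: the helper names `launcher_env`, `launcher_cmd` and `launch` are hypothetical, and only the `--model_id` and `--disable_token` flags shown in this document are used.

```python
import os
import subprocess
import sys


def launcher_env(gpu_ids, base_env=None):
    """Return a copy of the environment exposing only the given GPU indices."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def launcher_cmd(model_id):
    """Command line for the launcher, using only flags shown in this document."""
    return [sys.executable, "-m", "ts.llm_launcher",
            "--model_id", model_id, "--disable_token"]


def launch(model_id, gpu_ids):
    """Start the launcher as a child process restricted to `gpu_ids`."""
    return subprocess.Popen(launcher_cmd(model_id), env=launcher_env(gpu_ids))
```

Setting the variable on a copy of the environment leaves the calling process untouched, so other jobs started from the same shell or script still see all devices.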

For further customization of the handler and for adding third-party dependencies, have a look at our [vLLM example](https://github.com/pytorch/serve/tree/master/examples/large_models/vllm).

## Supported models
The quickstart launcher should be able to launch any model that is [supported by vLLM](https://docs.vllm.ai/en/latest/models/supported_models.html).
Here is a list of model identifiers tested by the TorchServe team:

* meta-llama/Meta-Llama-3-8B
* meta-llama/Meta-Llama-3-8B-Instruct
* meta-llama/Llama-2-7b-hf
* meta-llama/Llama-2-7b-chat-hf
* mistralai/Mistral-7B-v0.1
* mistralai/Mistral-7B-Instruct-v0.1

## Other ways to deploy LLMs with TorchServe

TorchServe offers a variety of examples on how to deploy large models.
Here is a list of the current examples:

* [Llama 2/3 chat bot](https://github.com/pytorch/serve/tree/master/examples/LLM/llama)
* [GPT-fast](https://github.com/pytorch/serve/tree/master/examples/large_models/gpt_fast)
* [Inferentia2](https://github.com/pytorch/serve/tree/master/examples/large_models/inferentia2)
* [IPEX optimized](https://github.com/pytorch/serve/tree/master/examples/large_models/ipex_llm_int8)
* [Tensor Parallel Llama](https://github.com/pytorch/serve/tree/master/examples/large_models/tp_llama)
* [vLLM Integration](https://github.com/pytorch/serve/tree/master/examples/large_models/vllm)
1 change: 0 additions & 1 deletion examples/large_models/vllm/config.properties
@@ -2,4 +2,3 @@ inference_address=http://127.0.0.1:8080
management_address=http://127.0.0.1:8081
metrics_address=http://127.0.0.1:8082
enable_envvars_config=true
-install_py_dep_per_model=true
2 changes: 1 addition & 1 deletion examples/large_models/vllm/llama3/Readme.md
@@ -21,7 +21,7 @@ python ../../utils/Download_model.py --model_path model --model_name meta-llama/
Add the downloaded path to "model_path:" in `model-config.yaml` and run the following.

```bash
-torch-model-archiver --model-name llama3-8b --version 1.0 --handler ../base_vllm_handler.py --config-file model-config.yaml -r ../requirements.txt --archive-format no-archive
+torch-model-archiver --model-name llama3-8b --version 1.0 --handler vllm_handler --config-file model-config.yaml --archive-format no-archive
mv model llama3-8b
```

2 changes: 1 addition & 1 deletion examples/large_models/vllm/lora/Readme.md
@@ -24,7 +24,7 @@ cd ..
Add the downloaded path to "model_path:" and "adapter_1:" in `model-config.yaml` and run the following.

```bash
-torch-model-archiver --model-name llama-7b-lora --version 1.0 --handler ../base_vllm_handler.py --config-file model-config.yaml -r ../requirements.txt --archive-format no-archive
+torch-model-archiver --model-name llama-7b-lora --version 1.0 --handler vllm_handler --config-file model-config.yaml --archive-format no-archive
mv model llama-7b-lora
mv adapters llama-7b-lora
```
2 changes: 1 addition & 1 deletion examples/large_models/vllm/mistral/Readme.md
@@ -21,7 +21,7 @@ python ../../utils/Download_model.py --model_path model --model_name mistralai/M
Add the downloaded path to "model_path:" in `model-config.yaml` and run the following.

```bash
-torch-model-archiver --model-name mistral --version 1.0 --handler ../base_vllm_handler.py --config-file model-config.yaml -r ../requirements.txt --archive-format no-archive
+torch-model-archiver --model-name mistral --version 1.0 --handler vllm_handler --config-file model-config.yaml --archive-format no-archive
mv model mistral
```

1 change: 0 additions & 1 deletion examples/large_models/vllm/requirements.txt

This file was deleted.

@@ -33,8 +33,10 @@
public class AsyncWorkerThread extends WorkerThread {
    // protected ConcurrentHashMap requestsInBackend;
    protected static final Logger logger = LoggerFactory.getLogger(AsyncWorkerThread.class);
    protected static final long MODEL_LOAD_TIMEOUT = 10L;

    protected boolean loadingFinished;
    protected CountDownLatch latch;

    public AsyncWorkerThread(
            ConfigManager configManager,
@@ -75,6 +77,17 @@ public void run() {
try {
    backendChannel.get(0).writeAndFlush(req).sync();
    logger.debug("Successfully flushed req");

    if (loadingFinished == false) {
        latch = new CountDownLatch(1);
        if (!latch.await(MODEL_LOAD_TIMEOUT, TimeUnit.MINUTES)) {
            throw new WorkerInitializationException(
                    "Worker did not load the model within "
                            + MODEL_LOAD_TIMEOUT
                            + " mins");
        }
    }

} catch (InterruptedException e) {
    logger.error("Failed to send request to backend", e);
}
@@ -240,6 +253,7 @@ public void channelRead0(ChannelHandlerContext ctx, ModelWorkerResponse msg) {
    setState(WorkerState.WORKER_MODEL_LOADED, HttpURLConnection.HTTP_OK);
    backoffIdx = 0;
    loadingFinished = true;
    latch.countDown();
} else {
    setState(WorkerState.WORKER_ERROR, msg.getCode());
}
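
The wait-with-timeout pattern used in these two hunks can be illustrated outside the Java code base. The following Python sketch is an analogue and not the TorchServe implementation: `LoadGate` and `WorkerInitializationError` are hypothetical names, and `threading.Event` stands in for the `CountDownLatch`.

```python
import threading


class WorkerInitializationError(RuntimeError):
    """Raised when the model does not finish loading in time."""


class LoadGate:
    """One-shot gate: the sender blocks until the receiver reports success."""

    def __init__(self):
        self._loaded = threading.Event()

    def mark_loaded(self):
        # Called from the response-handling thread once the model is loaded.
        self._loaded.set()

    def wait_until_loaded(self, timeout_s):
        # Event.wait returns False if the timeout elapsed before set().
        if not self._loaded.wait(timeout_s):
            raise WorkerInitializationError(
                f"Worker did not load the model within {timeout_s} s"
            )
```

In this sketch the gate object exists before any request is sent, so a response that arrives very quickly still finds a gate to release. The sending side calls `wait_until_loaded` after flushing the load request, and the receiving side calls `mark_loaded` when the success response arrives.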
1 change: 1 addition & 0 deletions model-archiver/model_archiver/model_packaging_utils.py
@@ -34,6 +34,7 @@
"object_detector": "vision",
"image_segmenter": "vision",
"dali_image_classifier": "vision",
"vllm_handler": "text",
}

MODEL_SERVER_VERSION = "1.0"
1 change: 1 addition & 0 deletions requirements/torch_linux.txt
@@ -5,3 +5,4 @@ torch==2.3.0+cpu; sys_platform == 'linux'
torchvision==0.18.0+cpu; sys_platform == 'linux'
torchtext==0.18.0; sys_platform == 'linux'
torchaudio==2.3.0+cpu; sys_platform == 'linux'
vllm==0.5.0; sys_platform == 'linux'
115 changes: 18 additions & 97 deletions test/pytest/test_utils.py
@@ -5,103 +5,25 @@
import subprocess
import sys
import tempfile
import threading
from io import TextIOWrapper
from os import path
from pathlib import Path
from queue import Queue
from subprocess import PIPE, STDOUT, Popen

import orjson
import requests

# To help discover margen modules
REPO_ROOT = os.path.join(os.path.dirname(os.path.abspath(__file__)), "../../")
sys.path.append(REPO_ROOT)

from ts.launcher import register_model, register_model_with_params, start # noqa
from ts.launcher import stop as stop_torchserve
from ts_scripts import marsgen as mg

ROOT_DIR = os.path.join(tempfile.gettempdir(), "workspace")
MODEL_STORE = path.join(ROOT_DIR, "model_store/")
CODEBUILD_WD = path.abspath(path.join(__file__, "../../.."))


class PrintTillTheEnd(threading.Thread):
    def __init__(self, queue):
        super().__init__()
        self._queue = queue

    def run(self):
        while True:
            line = self._queue.get()
            if not line:
                break
            print(line.strip())


class Tee(threading.Thread):
    def __init__(self, reader):
        super().__init__()
        self.reader = reader
        self.queue1 = Queue()
        self.queue2 = Queue()

    def run(self):
        for line in self.reader:
            self.queue1.put(line)
            self.queue2.put(line)
        self.queue1.put(None)
        self.queue2.put(None)


def start_torchserve(
    model_store=None,
    snapshot_file=None,
    no_config_snapshots=False,
    gen_mar=True,
    plugin_folder=None,
    disable_token=True,
    models=None,
    model_api_enabled=True,
):
    stop_torchserve()
    crate_mar_file_table()
    cmd = ["torchserve", "--start"]
    model_store = model_store if model_store else MODEL_STORE
    if gen_mar:
        mg.gen_mar(model_store)
    cmd.extend(["--model-store", model_store])
    if plugin_folder:
        cmd.extend(["--plugins-path", plugin_folder])
    if snapshot_file:
        cmd.extend(["--ts-config", snapshot_file])
    if no_config_snapshots:
        cmd.extend(["--no-config-snapshots"])
    if disable_token:
        cmd.append("--disable-token")
    if models:
        cmd.extend(["--models", models])
    if model_api_enabled:
        cmd.extend(["--model-api-enabled"])
    print(cmd)

    p = Popen(cmd, stdin=PIPE, stdout=PIPE, stderr=STDOUT)
    for line in p.stdout:
        print(line.decode("utf8").strip())
        if "Model server started" in str(line).strip():
            break

    splitter = Tee(TextIOWrapper(p.stdout))
    splitter.start()
    print_thread = PrintTillTheEnd(splitter.queue1)
    print_thread.start()

    return splitter.queue2


def stop_torchserve():
    subprocess.run(["torchserve", "--stop", "--foreground"])


def delete_all_snapshots():
    for f in glob.glob("logs/config/*"):
        os.remove(f)
@@ -115,27 +37,26 @@ def delete_model_store(model_store=None):
        os.remove(f)


def start_torchserve(*args, **kwargs):
    create_mar_file_table()
    # In case someone uses model_store as positional argument
    if len(args) == 0:
        kwargs.update({"model_store": kwargs.get("model_store", MODEL_STORE)})
    if kwargs.get("gen_mar", True):
        mg.gen_mar(kwargs.get("model_store"))
    if "gen_mar" in kwargs:
        del kwargs["gen_mar"]
    kwargs.update({"disable_token": kwargs.get("disable_token", True)})
    kwargs.update({"model_api_enabled": kwargs.get("model_api_enabled", True)})
    return start(*args, **kwargs)


def torchserve_cleanup():
    stop_torchserve()
    delete_model_store()
    delete_all_snapshots()


def register_model(model_name, url):
    params = (
        ("model_name", model_name),
        ("url", url),
        ("initial_workers", "1"),
        ("synchronous", "true"),
    )
    return register_model_with_params(params)


def register_model_with_params(params):
    response = requests.post("http://localhost:8081/models", params=params)
    return response


def unregister_model(model_name):
    response = requests.delete("http://localhost:8081/models/{}".format(model_name))
    return response
@@ -163,7 +84,7 @@ def delete_mar_file_from_model_store(model_store=None, model_mar=None):
mar_file_table = {}


-def crate_mar_file_table():
+def create_mar_file_table():
    if not mar_file_table:
        with open(
            os.path.join(os.path.dirname(__file__), *environment_json.split("/")), "rb"