Enable vLLM Gaudi support for LLM service based on the official Habana vLLM release #137

Merged
merged 8 commits on Jun 12, 2024
14 changes: 9 additions & 5 deletions README.md
@@ -134,8 +134,8 @@ The initially supported `Microservices` are described in the below table. More `
<td>Dataprep on Xeon CPU</td>
</tr>
<tr>
<td rowspan="5"><a href="./comps/llms/README.md">LLM</a></td>
<td rowspan="5"><a href="https://www.langchain.com">LangChain</a></td>
<td rowspan="6"><a href="./comps/llms/README.md">LLM</a></td>
<td rowspan="6"><a href="https://www.langchain.com">LangChain</a></td>
<td rowspan="2"><a href="https://huggingface.co/Intel/neural-chat-7b-v3-3">Intel/neural-chat-7b-v3-3</a></td>
<td><a href="https://github.com/huggingface/tgi-gaudi">TGI Gaudi</a></td>
<td>Gaudi2</td>
@@ -147,7 +147,7 @@ The initially supported `Microservices` are described in the below table. More `
<td>LLM on Xeon CPU</td>
</tr>
<tr>
<td rowspan="2"><a href="https://huggingface.co/meta-llama/Llama-2-7b-chat-hf">meta-llama/Llama-2-7b-chat-hf</a></td>
<td rowspan="2"><a href="https://huggingface.co/Intel/neural-chat-7b-v3-3">Intel/neural-chat-7b-v3-3</a></td>
<td rowspan="2"><a href="https://github.com/ray-project/ray">Ray Serve</a></td>
<td>Gaudi2</td>
<td>LLM on Gaudi2</td>
@@ -157,8 +157,12 @@ The initially supported `Microservices` are described in the below table. More `
<td>LLM on Xeon CPU</td>
</tr>
<tr>
<td><a href="https://huggingface.co/mistralai/Mistral-7B-v0.1">mistralai/Mistral-7B-v0.1</a></td>
<td><a href="https://github.com/vllm-project/vllm/">vLLM</a></td>
<td rowspan="2"><a href="https://huggingface.co/Intel/neural-chat-7b-v3-3">Intel/neural-chat-7b-v3-3</a></td>
<td rowspan="2"><a href="https://github.com/vllm-project/vllm/">vLLM</a></td>
<td>Gaudi2</td>
<td>LLM on Gaudi2</td>
</tr>
<tr>
<td>Xeon</td>
<td>LLM on Xeon CPU</td>
</tr>
17 changes: 10 additions & 7 deletions comps/llms/text-generation/vllm/README.md
@@ -1,17 +1,19 @@
# vLLM Endpoint Serve

[vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference and serving, it delivers state-of-the-art serving throughput with a set of advanced features such as PagedAttention, Continuous batching and etc.. Besides GPUs, vLLM already supported [Intel CPUs](https://www.intel.com/content/www/us/en/products/overview.html), Gaudi accelerators support will be added soon. This guide provides an example on how to launch vLLM serving endpoint on CPU.
[vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference and serving. It delivers state-of-the-art serving throughput with advanced features such as PagedAttention and continuous batching. Besides GPUs, vLLM already supports [Intel CPUs](https://www.intel.com/content/www/us/en/products/overview.html) and [Gaudi accelerators](https://habana.ai/products). This guide provides an example of how to launch a vLLM serving endpoint on CPUs and Gaudi accelerators.

## Getting Started

### Launch vLLM CPU Service
### Launch vLLM Service

#### Launch a local server instance:

```bash
bash ./serving/vllm/launch_vllm_service.sh
```

The `./serving/vllm/launch_vllm_service.sh` script accepts a `hw_mode` parameter that specifies the hardware mode of the service; the default is `cpu`, and `hpu` can be selected instead (see the full parameter list below).

For gated models such as `LLAMA-2`, you will have to pass `-e HF_TOKEN=<token>` (a valid Hugging Face Hub read token) to the `docker run` command invoked by the script above.

Please follow this [huggingface token](https://huggingface.co/docs/hub/security-tokens) guide to get an access token and export the `HF_TOKEN` environment variable with that token.
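
For example, a minimal sketch of launching the service on Gaudi for a gated model (assuming the `vllm:hpu` image has already been built and passing the script's default port and model explicitly):

```bash
export HF_TOKEN=<your_huggingface_read_token>
bash ./serving/vllm/launch_vllm_service.sh 8080 Intel/neural-chat-7b-v3-3 hpu
```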
@@ -33,16 +35,17 @@ curl http://127.0.0.1:8080/v1/completions \
}'
```

#### Customize vLLM CPU Service
#### Customize vLLM Service

The `./serving/vllm/launch_vllm_service.sh` script accepts two parameters:
The `./serving/vllm/launch_vllm_service.sh` script accepts three parameters:

- port_number: The port number assigned to the vLLM endpoint, with the default being 8080.
- model_name: The model name utilized for LLM, with the default set to "mistralai/Mistral-7B-v0.1".
- model_name: The model name utilized for LLM, with the default set to "Intel/neural-chat-7b-v3-3".
- hw_mode: The hardware mode utilized for LLM, with the default set to "cpu"; "hpu" can be selected instead.

You have the flexibility to customize two parameters according to your specific needs. Additionally, you can set the vLLM CPU endpoint by exporting the environment variable `vLLM_LLM_ENDPOINT`:
You have the flexibility to customize these three parameters according to your specific needs. Additionally, you can set the vLLM endpoint by exporting the environment variable `vLLM_LLM_ENDPOINT`:

```bash
export vLLM_LLM_ENDPOINT="http://xxx.xxx.xxx.xxx:8080"
export LLM_MODEL=<model_name> # example: export LLM_MODEL="mistralai/Mistral-7B-v0.1"
export LLM_MODEL=<model_name> # example: export LLM_MODEL="Intel/neural-chat-7b-v3-3"
```
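
For reference, a completion request against the customized endpoint might then look like the following (a sketch mirroring the consume-service example above; adjust host, port, and model to your settings):

```bash
curl ${vLLM_LLM_ENDPOINT}/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "Intel/neural-chat-7b-v3-3",
  "prompt": "What is deep learning?",
  "max_tokens": 32,
  "temperature": 0
  }'
```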
38 changes: 38 additions & 0 deletions comps/llms/text-generation/vllm/build_docker.sh
@@ -0,0 +1,38 @@
#!/bin/bash

# Copyright (c) 2024 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Set default values
default_hw_mode="cpu"

# Assign arguments to variable
hw_mode=${1:-$default_hw_mode}

# Validate the number of arguments
if [ "$#" -gt 1 ]; then
echo "Usage: $0 [hw_mode]"
echo "Please customize the arguments you want to use.
- hw_mode: The hardware mode for the vLLM endpoint, with the default being 'cpu'; the optional selections are 'cpu' and 'hpu'."
exit 1
fi

# Build the docker image for vLLM based on the hardware mode
if [ "$hw_mode" = "hpu" ]; then
docker build -f docker/Dockerfile.hpu -t vllm:hpu --shm-size=128g . --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy
else
git clone https://github.com/vllm-project/vllm.git
cd ./vllm/
docker build -f Dockerfile.cpu -t vllm:cpu --shm-size=128g . --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy
fi
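
A hypothetical invocation, assuming the script is run from `comps/llms/text-generation/vllm` (it builds `vllm:hpu` from `docker/Dockerfile.hpu`, or `vllm:cpu` from the upstream vLLM repository):

```bash
bash ./build_docker.sh hpu
```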
9 changes: 0 additions & 9 deletions comps/llms/text-generation/vllm/build_docker_cpu.sh

This file was deleted.

20 changes: 20 additions & 0 deletions comps/llms/text-generation/vllm/docker/Dockerfile.hpu
@@ -0,0 +1,20 @@
FROM vault.habana.ai/gaudi-docker/1.16.0/ubuntu22.04/habanalabs/pytorch-installer-2.2.2:latest

ENV LANG=en_US.UTF-8

WORKDIR /root

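# Install the Optimum Habana integration for running Hugging Face models on Gaudi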
RUN pip install --upgrade-strategy eager optimum[habana]

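# Install the Habana fork of vLLM, pinned to a specific commit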
RUN pip install -v git+https://github.com/HabanaAI/vllm-fork.git@ae3d6121

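# Allow root login over SSH inside the container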
RUN sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config && \
service ssh restart

ENV no_proxy=localhost,127.0.0.1

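# Gaudi (HPU) lazy-mode runtime settings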
ENV PT_HPU_LAZY_ACC_PAR_MODE=0

ENV PT_HPU_ENABLE_LAZY_COLLECTIVES=true

CMD ["/bin/bash"]
19 changes: 14 additions & 5 deletions comps/llms/text-generation/vllm/launch_vllm_service.sh
@@ -6,20 +6,29 @@

# Set default values
default_port=8080
default_model="mistralai/Mistral-7B-v0.1"
default_hw_mode="cpu"
default_model="Intel/neural-chat-7b-v3-3"

# Assign arguments to variables
port_number=${1:-$default_port}
model_name=${2:-$default_model}
hw_mode=${3:-$default_hw_mode}

# Validate the number of arguments
if [ "$#" -lt 0 ] || [ "$#" -gt 2 ]; then
echo "Usage: $0 [port_number] [model_name]"
if [ "$#" -gt 3 ]; then
echo "Usage: $0 [port_number] [model_name] [hw_mode]"
echo "port_number: The port number assigned to the vLLM endpoint, with the default being 8080."
echo "model_name: The model name utilized for LLM, with the default set to 'Intel/neural-chat-7b-v3-3'."
echo "hw_mode: The hardware mode utilized for LLM, with the default set to 'cpu'; 'hpu' can be selected instead."
exit 1
fi

# Set the volume variable
volume=$PWD/data

# Build the Docker run command based on the number of cards
docker run -it --rm --name="ChatQnA_server" -p $port_number:$port_number --network=host -v $volume:/data -e HTTPS_PROXY=$https_proxy -e HTTP_PROXY=$https_proxy -e HF_TOKEN=${HUGGINGFACEHUB_API_TOKEN} vllm:cpu /bin/bash -c "cd / && export VLLM_CPU_KVCACHE_SPACE=40 && python3 -m vllm.entrypoints.openai.api_server --model $model_name --host 0.0.0.0 --port $port_number"
# Build the Docker run command based on hardware mode
if [ "$hw_mode" = "hpu" ]; then
docker run -it --runtime=habana --rm --name="ChatQnA_server" -p $port_number:$port_number -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host -e HTTPS_PROXY=$https_proxy -e HTTP_PROXY=$http_proxy -e HF_TOKEN=${HF_TOKEN} vllm:hpu /bin/bash -c "export VLLM_CPU_KVCACHE_SPACE=40 && python3 -m vllm.entrypoints.openai.api_server --enforce-eager --model $model_name --host 0.0.0.0 --port $port_number"
else
docker run -it --rm --name="ChatQnA_server" -p $port_number:$port_number --network=host -v $volume:/data -e HTTPS_PROXY=$https_proxy -e HTTP_PROXY=$http_proxy -e HF_TOKEN=${HF_TOKEN} vllm:cpu /bin/bash -c "cd / && export VLLM_CPU_KVCACHE_SPACE=40 && python3 -m vllm.entrypoints.openai.api_server --enforce-eager --model $model_name --host 0.0.0.0 --port $port_number"
fi
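
Once the container is up, a quick sanity check might look like the following (a sketch; `ChatQnA_server` is the container name used by the script above, and `/v1/models` is the model-listing route of vLLM's OpenAI-compatible server):

```bash
docker logs ChatQnA_server            # check that the API server started without errors
curl http://127.0.0.1:8080/v1/models  # list the served model
```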