-
Notifications
You must be signed in to change notification settings - Fork 37
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
3 changed files
with
234 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,40 @@ | ||
# Run Llama 3 8B Model Inference with vLLM on GCP | ||
|
||
This example demonstrates how to run a Llama 3 8B model from Hugging Face with vLLM on GCP using Runhouse. | ||
|
||
Make sure to sign the waiver on the [Hugging Face model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | ||
so that you can access it. | ||
|
||
## Setup credentials and dependencies | ||
|
||
Optionally, set up a virtual environment: | ||
```shell | ||
$ conda create -n llama3-rh python=3.9.15 | ||
$ conda activate llama3-rh | ||
``` | ||
|
||
Install the required dependencies: | ||
|
||
```shell | ||
$ pip install -r requirements.txt | ||
``` | ||
|
||
We'll be launching a GCP instance via [SkyPilot](https://github.com/skypilot-org/skypilot), so we need to make sure your credentials are set up. You may be prompted to pick a cloud project to use after running `gcloud init`. If you don't have one ready yet, you can connect one later by listing your projects with `gcloud projects list` and setting one with `gcloud config set project <PROJECT_ID>`. | ||
|
||
```shell | ||
$ gcloud init | ||
$ gcloud auth application-default login | ||
$ sky check | ||
``` | ||
|
||
We'll be downloading the Llama 3 model from Hugging Face, so we need to set up our Hugging Face token: | ||
|
||
```shell | ||
$ export HF_TOKEN=<your huggingface token> | ||
``` | ||
|
||
## Run the Python script | ||
|
||
```shell | ||
$ python llama3_vllm_gcp.py | ||
``` |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,192 @@ | ||
# # Run Llama 3 8B Model Inference with vLLM on GCP | ||
|
||
# This example demonstrates how to run a Llama 3 8B model from Hugging Face | ||
# with vLLM on GCP using Runhouse. | ||
# | ||
# Make sure to sign the waiver on the [Hugging Face model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | ||
# so that you can access it. | ||
# | ||
# ## Setup credentials and dependencies | ||
# | ||
# Optionally, set up a virtual environment: | ||
# ```shell | ||
# $ conda create -n llama3-rh python=3.9.15 | ||
# $ conda activate llama3-rh | ||
# ``` | ||
# Install the required dependencies: | ||
# ```shell | ||
# $ pip install "runhouse[gcp] asyncio" | ||
# ``` | ||
# | ||
# We'll be launching a GCP instance via [SkyPilot](https://github.com/skypilot-org/skypilot), so we need to | ||
# make sure your credentials are set up. You may be prompted to pick a cloud project to use after running `gcloud init`. | ||
# If you don't have one ready yet, you can connect one later by listing your projects | ||
# with `gcloud projects list` and setting one with `gcloud config set project <PROJECT_ID>`. | ||
# ```shell | ||
# $ gcloud init | ||
# $ gcloud auth application-default login | ||
# $ sky check | ||
# ``` | ||
# We'll be downloading the Llama 3 model from Hugging Face, so we need to set up our Hugging Face token: | ||
# ```shell | ||
# $ export HF_TOKEN=<your huggingface token> | ||
# ``` | ||
# | ||
# ## Define a Llama 3 model class | ||
# We import `runhouse` and `asyncio` because that's all that's needed to run the script locally. | ||
# The actual vLLM imports are defined in the environment on the cluster in which the function itself is served. | ||
|
||
import asyncio | ||
|
||
import runhouse as rh | ||
|
||
# Next, we define a class that will hold the model and allow us to send prompts to it. | ||
# You'll notice this class inherits from `rh.Module`. | ||
# This is a Runhouse class that allows you to run code in your class on a remote machine. | ||
# | ||
# Learn more in the [Runhouse docs on functions and modules](/docs/tutorials/api-modules). | ||
class LlamaModel(rh.Module): | ||
def __init__(self, model_id="meta-llama/Meta-Llama-3-8B-Instruct", **model_kwargs): | ||
super().__init__() | ||
self.model_id, self.model_kwargs = model_id, model_kwargs | ||
self.engine = None | ||
|
||
def load_engine(self): | ||
from vllm.engine.arg_utils import AsyncEngineArgs | ||
from vllm.engine.async_llm_engine import AsyncLLMEngine | ||
|
||
args = AsyncEngineArgs( | ||
model=self.model_id, # Hugging Face Model ID | ||
tensor_parallel_size=1, # Increase if using additional GPUs | ||
trust_remote_code=True, # Trust remote code from Hugging Face | ||
enforce_eager=True, # Set to False for production use cases | ||
) | ||
self.engine = AsyncLLMEngine.from_engine_args(args) | ||
|
||
async def generate(self, prompt: str, **sampling_params): | ||
from vllm.sampling_params import SamplingParams | ||
from vllm.utils import random_uuid | ||
|
||
if not self.engine: | ||
self.load_engine() | ||
|
||
sampling_params = SamplingParams(**sampling_params) | ||
request_id = random_uuid() | ||
results_generator = self.engine.generate(prompt, sampling_params, request_id) | ||
|
||
async for output in results_generator: | ||
final_output = output | ||
responses = [] | ||
for output in final_output.outputs: | ||
responses.append(output.text) | ||
return responses | ||
|
||
|
||
# ## Set up Runhouse primitives | ||
# | ||
# Now, we define the main function that will run locally when we run this script and set up | ||
# our Runhouse module on a remote cluster. First, we create a cluster with the desired instance type and provider. | ||
# Our `instance_type` here is defined as `L4:1`, which is the accelerator type and count that we need. We could | ||
# alternatively specify a specific [GCP instance](https://cloud.google.com/compute/docs/gpus) type, such as `g2-standard-8`. | ||
# | ||
# Learn more in the [Runhouse docs on clusters](/docs/tutorials/api-clusters). | ||
# | ||
# :::note{.info title="Note"} | ||
# The Python code we'll run is contained in an asynchronous function, `main`. To make this guide more readable, it's | ||
# contents are rendered as top-level code snippets, but they should be included in the `main` method for running. | ||
# ::: | ||
async def main(): | ||
gpu_cluster = rh.cluster( | ||
name="rh-l4x", | ||
instance_type="L4:1", | ||
memory="32+", | ||
provider="gcp", | ||
autostop_mins=30, # Number of minutes to keep the cluster up after inactivity | ||
# (Optional) Include the following to create exposed TLS endpoints: | ||
# open_ports=[443], # Expose HTTPS port to public | ||
# server_connection_type="tls", # Specify how runhouse communicates with this cluster | ||
# den_auth=False, # No authentication required to hit this cluster (NOT recommended) | ||
) | ||
|
||
# We'll set an `autostop_mins` of 30 for this example. If you'd like your cluster to run indefinitely, set `autostop_mins=-1`. | ||
# You can use SkyPilot in the terminal to manage your active clusters with `sky status` and `sky down <cluster_id>`. | ||
# | ||
# Next, we define the environment for our module. This includes the required dependencies that need | ||
# to be installed on the remote machine, as well as any secrets that need to be synced up from local to remote. | ||
# Passing `huggingface` to the `secrets` parameter will load the Hugging Face token we set up earlier. | ||
# | ||
# Learn more in the [Runhouse docs on envs](/docs/tutorials/api-envs). | ||
env = rh.env( | ||
reqs=["vllm==0.2.7"], # >=0.3.0 causes Pydantic version error | ||
secrets=["huggingface"], # Needed to download Llama 3 from HuggingFace | ||
name="llama3inference", | ||
working_dir="./", | ||
) | ||
|
||
# Finally, we define our module and run it on the remote cluster. We construct it normally and then call | ||
# `get_or_to` to run it on the remote cluster. Using `get_or_to` allows us to load the exiting Module | ||
# by the name `llama3-8b-model` if it was already put on the cluster. If we want to update the module each | ||
# time we run this script, we can use `to` instead of `get_or_to`. | ||
# | ||
# Note that we also pass the `env` object to the `get_or_to` method, which will ensure that the environment is | ||
# set up on the remote machine before the module is run. | ||
remote_llama_model = LlamaModel().get_or_to( | ||
gpu_cluster, env=env, name="llama3-8b-model" | ||
) | ||
|
||
# ## Calling our remote function | ||
# | ||
# We can call the `generate` method on the model class instance as if it were running locally. | ||
# This will run the function on the remote cluster and return the response to our local machine automatically. | ||
# Further calls will also run on the remote machine, and maintain state that was updated between calls, like | ||
# `self.engine`. | ||
prompt = "The best chocolate chip cookie is" | ||
ans = await remote_llama_model.generate( | ||
prompt=prompt, temperature=0.8, top_p=0.95, max_tokens=100 | ||
) | ||
for text_output in ans: | ||
print(f"... Generated Text:\n{prompt}{text_output}\n") | ||
|
||
# :::note{.info title="Note"} | ||
# Your initial run of this script may take a few minutes to deploy an instance on GCP, set up the environment, | ||
# and load the Llama 3 model. Subsequent runs will reuse the cluster and generally take seconds. | ||
# ::: | ||
|
||
# ### Advanced: Sharing and TLS endpoints | ||
# Runhouse makes it easy to share your module or create a public endpoint you can curl or use in your apps. | ||
# Use the optional settings in your cluster definition above to expose an endpoint. You can additionally | ||
# enable [Runhouse Den](/dashboard) auth to require an auth token and provide access to your teammates. | ||
# | ||
# Fist, create or log in to your Runhouse account. | ||
# ```shell | ||
# $ runhouse login | ||
# ``` | ||
# | ||
# Once you've logged in to an account, use the following lines to enable Den Auth on the cluster, save | ||
# your resources to the Den UI, and grant access to your collaborators. | ||
# ```python | ||
# gpu_cluster.enable_den_auth() # Enable Den Auth | ||
# gpu_cluster.save() | ||
# remote_llama_model.save() # Save the module to Den for easy reloading | ||
# remote_llama_model.share(users=["friend@yourcompany.com"], access_level="read") | ||
# ``` | ||
# | ||
# Learn more: [Sharing](/docs/tutorials/quick-start-den#sharing) | ||
# | ||
# ### OpenAI Compatible Server | ||
# By default, vLLM implements OpenAI's Completions and Chat API. | ||
# This means that you can call your self-hosted Llama 3 model on GCP with OpenAI's Python library. | ||
# Read more about this and implementing chat templates in [vLLM's documentation](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html). | ||
|
||
|
||
# ## Run the script | ||
# Finally, we'll run the script to deploy the model and run inference. | ||
# :::note{.info title="Note"} | ||
# Make sure that your code runs within a `if __name__ == "__main__":` block. | ||
# Otherwise, the script code will run when Runhouse attempts to run code remotely. | ||
# ::: | ||
if __name__ == "__main__": | ||
asyncio.run(main()) | ||
|
||
# Please reference the Github link at the top of this page (if viewing via run.house/examples) | ||
# for the full Python file you can compare to or run yourself. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
asyncio | ||
runhouse[gcp] |