
feat: RayServe with vLLM using AWS Neuron on Amazon EKS #607

Merged · 6 commits · Aug 20, 2024

Conversation

ratnopamc
Collaborator

What does this PR do?

Adds the capability to deploy LLMs for inference on AWS Inferentia with Ray and vLLM.
🛑 Please open an issue first to discuss any significant work and flesh out details/direction - we would hate for your time to be wasted.
Consult the CONTRIBUTING guide for submitting pull-requests.

Motivation

#591

More

  • [ ] Yes, I have tested the PR using my local account setup (provide any test evidence report under Additional Notes)
  • [ ] Mandatory for new blueprints. Yes, I have added an example to support my blueprint PR
  • Mandatory for new blueprints. Yes, I have updated the website/docs or website/blog section for this feature
  • Yes, I ran pre-commit run -a with this PR. Link for installing pre-commit locally

For Moderators

  • E2E Test successfully complete before merge?

Additional Notes

@ratnopamc ratnopamc requested a review from vara-bonthu August 8, 2024 12:29
Contributor

@omrishiv omrishiv left a comment


Thanks for adding this! I think it's going to be really helpful. I left some comments, hopefully they make sense.

data:
hf-token: $HUGGING_FACE_HUB_TOKEN
---
apiVersion: ray.io/v1
Contributor

@omrishiv omrishiv Aug 12, 2024


I think we want to be a little more consistent with how we set up the resource configuration. We have 3 max replicas, so we should decide whether we want to scale nodes or scale actors. If we want to allow for intra-node scaling, I believe you want to set

neuron_cores = (6 * 2) / 3

or, more generally,

(NUM_NEURON_DEVICES * 2) / NUM_REPLICAS_PER_NODE

We also want to make sure we set num_cpus = total_node_cpus / num_replicas.

If you want only full-node scaling, max out the single-node replica count by setting the denominator to 1.
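The per-replica arithmetic in the comment above can be sketched as a small helper. A minimal sketch: the device count (6) and the 2 NeuronCores per device are from the comment; the 96-vCPU node size is an assumption for illustration, as is the helper's name.

```python
def per_replica_resources(num_neuron_devices: int,
                          total_node_cpus: int,
                          replicas_per_node: int) -> tuple[float, float]:
    """Split one node's NeuronCores and vCPUs evenly across Ray replicas."""
    # Each Neuron device exposes 2 NeuronCores.
    neuron_cores = (num_neuron_devices * 2) / replicas_per_node
    num_cpus = total_node_cpus / replicas_per_node
    return neuron_cores, num_cpus

# Intra-node scaling: 3 replicas share one node.
print(per_replica_resources(6, 96, 3))  # → (4.0, 32.0)

# Full-node scaling: a denominator of 1 gives a single replica the whole node.
print(per_replica_resources(6, 96, 1))  # → (12.0, 96.0)
```

With these values, each of the 3 replicas would request 4 NeuronCores and 32 CPUs in its Ray Serve deployment config.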

gen-ai/inference/vllm-rayserve-inf2/Dockerfile (outdated; resolved)
gen-ai/inference/vllm-rayserve-inf2/vllm_asyncllmengine.py (outdated; resolved)
gen-ai/inference/vllm-rayserve-inf2/Dockerfile (outdated; resolved)
gnupg2 \
&& sudo rm -rf /var/lib/apt/lists/*

RUN sudo wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB > ./GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB
Contributor

These can be grouped into one layer by having only 1 RUN command and using && \ between all of the commands, like:

RUN sudo wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB > ./GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB && \
sudo gpg --no-default-keyring --keyring ./aws_neuron_keyring.gpg --import ./GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB && \
sudo gpg --no-default-keyring --keyring ./aws_neuron_keyring.gpg --export > /etc/apt/trusted.gpg.d/aws_neuron.gpg && \
sudo rm ./GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB

RUN sudo mv ./aws_neuron.gpg /etc/apt/trusted.gpg.d/
RUN sudo rm ./GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB

RUN sudo add-apt-repository -y "deb https://apt.repos.neuron.amazonaws.com jammy main"
Contributor

You can also group all the apt commands together and follow it up with rm -rf /var/lib/apt/lists/*

RUN sudo apt-get -y install aws-neuronx-runtime-lib=2.*
RUN sudo apt-get -y install aws-neuronx-tools=2.*
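Grouping the apt commands into one layer, as the reviewer suggests, might look like the following sketch. The package pins come from the two RUN lines above; the `apt-get update` and trailing cleanup are assumptions based on common Dockerfile practice.

```dockerfile
RUN sudo apt-get update && \
    sudo apt-get -y install \
        aws-neuronx-runtime-lib=2.* \
        aws-neuronx-tools=2.* && \
    sudo rm -rf /var/lib/apt/lists/*
```

Collapsing the installs into a single RUN keeps the apt metadata out of the final image and produces one cacheable layer instead of several.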

RUN pip3 config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
Contributor

Please group all the pip installs
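A sketch of what grouping the pip steps into a single layer might look like. The extra-index-url is taken from the Dockerfile line above; the package names are hypothetical placeholders, not taken from the PR.

```dockerfile
# Package names below are placeholders -- substitute the blueprint's actual
# pip requirements.
RUN pip3 config set global.extra-index-url https://pip.repos.neuron.amazonaws.com && \
    pip3 install --no-cache-dir neuronx-cc torch-neuronx
```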

ENV VLLM_TARGET_DEVICE=neuron
RUN git clone https://github.com/vllm-project/vllm.git
RUN cd vllm && git checkout v0.5.0
COPY patches/vllm_v0.5.0_neuron.patch vllm/vllm_v0.5.0_neuron.patch
Contributor

If you put the copy ahead of the RUN, you can chain all the RUNs together
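Moving the COPY ahead of the RUN and chaining the RUNs, as suggested, might look like this sketch. The `git apply` step is an assumption about how the patch is consumed; the diff shown here does not include that step.

```dockerfile
ENV VLLM_TARGET_DEVICE=neuron
COPY patches/vllm_v0.5.0_neuron.patch /tmp/vllm_v0.5.0_neuron.patch
# Assumption: the patch is applied with `git apply`; the PR may do this
# differently.
RUN git clone https://github.com/vllm-project/vllm.git && \
    cd vllm && \
    git checkout v0.5.0 && \
    git apply /tmp/vllm_v0.5.0_neuron.patch
```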

schedulerName: my-scheduler # Correct placement
containers:
- name: worker
image: public.ecr.aws/data-on-eks/vllm-ray-neuron-mistral7b:latest
Collaborator

@ratnopamc Update this image with new one

Collaborator Author

Sure, will create a new ecr repo and image.

Collaborator Author

@vara-bonthu , Pushed the below changes -

  • Fixed indentation, whitespace, and formatting of the Neuron patch file.
  • Created a public repository vllm-ray2.32.0-inf2-llama3 under data-on-eks and pushed image vllm-ray2.32.0-inf2-llama3 built with the above patch.
  • Updated deployment.yaml to include openai install command.
  • Tested autoscaling across nodes, works fine!

Please review.

@vara-bonthu vara-bonthu changed the title feat: Add blueprint for using rayserve with vLLM on Inferentia2 feat: vLLM and RayServe with Neuron on Amazon EKS Aug 20, 2024
@vara-bonthu
Collaborator

Next Steps once this PR is merged

1/ Add HF Token to deployment yaml and config map serving script to handle gated models
2/ Website Doc for the deployment

@ratnopamc ratnopamc changed the title feat: vLLM and RayServe with Neuron on Amazon EKS feat: RayServe with vLLM using AWS Neuron on Amazon EKS Aug 20, 2024
@vara-bonthu vara-bonthu merged commit 1f682cd into awslabs:main Aug 20, 2024
37 of 38 checks passed
rm -rf /var/lib/apt/lists/* && \
wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | gpg --no-default-keyring --keyring ./aws_neuron_keyring.gpg --import && \
gpg --no-default-keyring --keyring ./aws_neuron_keyring.gpg --export > /etc/apt/trusted.gpg.d/aws_neuron.gpg && \
add-apt-repository -y "deb https://apt.repos.neuron.amazonaws.com jammy main" && \
Contributor

I think we'd want to separate this into another layer to make it easier to update

Collaborator

@ratnopamc Could you please update this in your second PR along with Website doc?

@ratnopamc ratnopamc deleted the ray-vllm-inf2-updates branch September 3, 2024 22:30
lindarr915 added a commit to lindarr915/data-on-eks that referenced this pull request Sep 4, 2024
* fix: bump data on eks addons to 1.33 to support karpenter helm resources with bottlerocket

* feat: RayServe with vLLM using AWS Neuron on Amazon EKS (awslabs#607)

Co-authored-by: Vara Bonthu <vara.bonthu@gmail.com>

* feat: Mountpoint S3 for loading additional Spark Jars (awslabs#606)

Co-authored-by: Karanbir Bains <bainskb@amazon.com>

* fixes for pre-commit

* fix pre-commit on the merged main

* chore: Delete ai-ml/kubeflow directory (awslabs#619)

* feat: Updated mountpoint-s3 for spark readme (awslabs#618)

Co-authored-by: Karanbir Bains <bainskb@amazon.com>

* feat: Trainium blueprint upgrade (awslabs#622)

* feat: Neuron scheduler update for trainium-inferentia blueprints (awslabs#624)

* feat: Website Updates (awslabs#626)

* feat: Updates to the sidebar (awslabs#627)

* feat: Added deprecating notes; added Jark stack doc;added warnings for ML p… (awslabs#628)

* feat: NVIDIA NIM Updates (awslabs#631)

* feat: Udate NVIDIA NIM blueprint with grafana dashboard and docs (awslabs#633)

* feat: Add OpenWebUI for vllm-rayserve-inf2 blueprint (awslabs#635)

---------

Co-authored-by: Ratnopam Charabarti <ratnopamc@yahoo.com>
Co-authored-by: Vara Bonthu <vara.bonthu@gmail.com>
Co-authored-by: Karanbir Bains <166257900+bainskb@users.noreply.github.com>
Co-authored-by: Karanbir Bains <bainskb@amazon.com>
Co-authored-by: Apoorva Kulkarni <kuapoorv@amazon.com>
lindarr915 pushed a commit to lindarr915/data-on-eks that referenced this pull request Sep 6, 2024
Co-authored-by: Vara Bonthu <vara.bonthu@gmail.com>
4 participants