Feat/basic pipeline parallelism #422
How it works
|
Example Snippet

servingEngineSpec:
  runtimeClassName: ""
  raySpec:
    headNode:
      requestCPU: 2
      requestMemory: "20Gi"
      requestGPU: 1
  modelSpec:
  - name: "distilgpt2"
    repository: "vllm/vllm-openai"
    tag: "latest"
    modelURL: "distilbert/distilgpt2"
    replicaCount: 1
    requestCPU: 2
    requestMemory: "20Gi"
    requestGPU: 1
    vllmConfig:
      tensorParallelSize: 1
      pipelineParallelSize: 2
    shmSize: "20Gi"
    hf_token: <YOUR HF TOKEN>
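For intuition about what the snippet above requests, here is a minimal Python sketch. It computes how many GPUs the two parallelism settings imply (vLLM's world size is the tensor-parallel size multiplied by the pipeline-parallel size) and composes a `vllm serve` command line carrying the same settings. The `values` dict and both helper names are hypothetical, chosen only for illustration; the command the chart templates in this PR actually generate may differ.

```python
import shlex

# Hypothetical in-memory mirror of the relevant fields from the example snippet.
values = {
    "modelURL": "distilbert/distilgpt2",
    "requestGPU": 1,
    "vllmConfig": {"tensorParallelSize": 1, "pipelineParallelSize": 2},
}


def required_gpus(vllm_config: dict) -> int:
    """Total GPUs needed: tensor-parallel size times pipeline-parallel size."""
    return vllm_config["tensorParallelSize"] * vllm_config["pipelineParallelSize"]


def compose_vllm_command(model_spec: dict) -> str:
    """Build a shell-safe `vllm serve` command string (illustrative helper)."""
    cfg = model_spec["vllmConfig"]
    args = [
        "vllm", "serve", model_spec["modelURL"],
        "--tensor-parallel-size", str(cfg["tensorParallelSize"]),
        "--pipeline-parallel-size", str(cfg["pipelineParallelSize"]),
        # Ray is the executor backend used for multi-node inference.
        "--distributed-executor-backend", "ray",
    ]
    return shlex.join(args)


if __name__ == "__main__":
    print(required_gpus(values["vllmConfig"]))
    print(compose_vllm_command(values))
```

With `tensorParallelSize: 1` and `pipelineParallelSize: 2`, the sketch reports two GPUs in total, which is consistent with one GPU requested per pod across two Ray nodes.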
|
Force-pushed from c0c7607 to a671425.
|
I'm going to add tutorials for:
Also, I will test pipeline parallelism on the node to confirm multi-node distributed inference.
|
Confirmed working on a Kubernetes cluster of 2 nodes with 2 GPUs:
|
Documentation is still in progress. |
Force-pushed from c40bbfc to af93456.
|
Initial documentation complete.
|
|
@YuhanLiu11 @haitwang-cloud Thanks for your comments and suggestions! This PR contains multiple new files and changes such as:
It took me some time to test everything and write the tutorial documents (especially for initializing a multi-node K8s cluster and installing the container runtime and container network interface).
Force-pushed from aa112fe to 29dcb6d.
That's awesome! I'll review it. |
|
@insukim1994 LGTM with a few nice to have comments |
- Basic understanding of Linux shell commands.

4. **Kubernetes Installation:**
   - Follow the instructions in [`00-install-kubernetes-env.md`](00-install-kubernetes-env.md) to set up your Kubernetes environment.
Do we need to follow tutorials/00-a-install-mulitnode-kubernetes-env.md to install the multi-node k8s cluster before running this tutorial?
Oh you are right. I should fix it since what we need is a multi-node K8s cluster. I will also add a comment that installation might not be needed if someone already has it.
I added more explanations on K8s prerequisite for it. Thank you!
|
@insukim1994 This is awesome! Thanks for this awesome PR again. I only left one minor comment. After you fix that I will merge this PR. Thanks again! |
|
Seems like I need to resolve my conflicts. I will do it and let you know once it is done! |
Force-pushed from ccd0e60 to 3c7810f.
|
There seems to be a typo on the main branch. I will fix it and include the fix in my PR:
Hey @insukim1994, sorry, I just saw this message. It would be great if you could fix this in your PR too. Thanks!
YuhanLiu11 left a comment
LGTM! Thanks for the awesome PR!
|
Great work! Does the main Chart.yaml need to be updated to run this? I followed the instructions and the pods are created, but in the standard production-stack manner rather than according to the new RayCluster template.
values.yaml
kubectl get pods
|
@jcrock7 Thank you for letting me know the possible issue. I will check it and will leave a comment here thanks! |
|
@jcrock7 Thank you. I've identified the issue. Seems like vllm |
|
@jcrock7 I should have updated the Helm chart version, packaged the chart, and uploaded it to the repo. I will create a separate issue for this and resolve it in the corresponding PR. Thanks!
This worked. Thanks again for your work on this - it really extends the capability for multi-node clusters! |
|
I just noticed that one of the pods fails to start due to a multi-attach error on the PVC. I believe the pvc.yaml template needs to be updated to ReadWriteMany. I will create a separate PR unless you want to include the change in yours.
|
@jcrock7 Thank you! You are right that a PVC with the RWO (ReadWriteOnce) access mode cannot be shared between pods running on different nodes. Yes, it would be very nice if you could create a PR for it!
|
Never mind, I confirmed the existing pvc.yaml works. I had a small typo in my values.yaml file. |
* [Feat] Added kuberay installation script via helm. Initial commit.
* Added initial helm chart template file for ray cluster creation.
* [Feat] Fixed typo at ray cluster template file. Added example values for the helm chart.
* [Feat] Removed unused fields at the moment. Bugfixed conflicting resource creation for kuberay.
* [Feat] Added startup probe to check if all ray cluster nodes are up.
* [Feat] Added vllm command composing and execute logic in the background script due to kuberay operator args concatenation.
* [Feat] Added pod relevant settings from servingEngineSpec for both head and worker groups.
* [Feat] Added env templates for head and worker spec.
* [Feat] Added volumemounts template for head and worker spec.
* [Feat] Added templates for resource, probe, port, etc.
* [Feat] Initial working example.
* [Doc] Added documentation to run vllm with kuberay for pipeline parallelism.
* [Doc] Elaborated tutorial documentation.
* [Chore] Fixed typo in kuberay operator installation tutorial document.
* [Chore] Fixed a wording in kuberay operator installation tutorial document.
* [Chore] Fixed typo in kuberay operator installation tutorial document.
* [Chore] Removed unused value from helm chart default value.
* [Chore] Elaborated expression on tutorial document.
* [Chore] Elaborated expression on tutorial document.
* [Feat] Set readiness httpGet probe for ray head node. Removed unused container ports from ray worker nodes.
* [Feat] Added VLLM_HOST_IP based on official vllm docs. Added ray installation step.
* [Feat] Added missing dashboard related setting and a step for reinstalling ray.
* [Feat] Removed initContainer section that will be overwritten by kuberay operator.
* [Feat] Kuberay operator version updated needed.
* [Doc] Minor fix in tutorial.
* [Doc] Added sample gpu usage example for each ray head and worker node.
* [Chore] Fixed typo in basic pipeline parallel tutorial doc.
* [Chore] Reverted unnecessary change.
* [Chore] Fixed typo in kuberay install util script.
* [Doc] Added utility script to install kubeadm.
* [Doc] Added cri-o container runtime installation script & a script to create a control plane node.
* [Doc] Added script to join worker nodes. Elaborated control plane init script and cni installation.
* [Doc] Added nvidia gpu setup script for each node.
* [Doc] Script modification during testing.
* [Doc] Elaborated k8s controlplane initialization and worker node join script.
* [Doc] Elaborated basic pipeline parallelism tutorial document.
* [Doc] Added guide for setting up kubernetes cluster with 2 nodes (control and worker).
* [Doc] Elaborated K8s cluster initialization guide and applied a review comment.
* [Chore] Strict total number of ray node checking. Tested helm chart with helm template command.
* [Doc] Elaborated important note when applying pipeline parallelism (with ray).
* [Doc] Elaborated basic pipeline parallelism tutorial example.
* [Doc] Review updates (prevent duplicated line appends & added warning message of docker restart).
* [Doc] Review updates (elaborated prerequisites for kuberay operator installation).
* [Bugfix] Fixed version typo of lmcache from toml file.

Signed-off-by: insukim1994 <insu.kim@moreh.io>
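The commit list above mentions a startup probe that waits until all Ray cluster nodes are up, plus a strict check on the total number of Ray nodes. The following Python sketch illustrates that kind of check; it is not the script from this PR. The function name is hypothetical, and the input is assumed to be a list of dicts carrying an `Alive` flag, which is the shape `ray.nodes()` returns.

```python
def ray_cluster_ready(nodes: list, expected_nodes: int) -> bool:
    """Return True only when exactly `expected_nodes` nodes report as alive.

    The comparison is strict (==, not >=), so a cluster with too many
    or too few live nodes is treated as not ready.
    """
    alive = sum(1 for node in nodes if node.get("Alive"))
    return alive == expected_nodes


if __name__ == "__main__":
    # Sample data standing in for ray.nodes() output (assumed shape).
    sample = [
        {"NodeID": "head", "Alive": True},
        {"NodeID": "worker-0", "Alive": False},
    ]
    print(ray_cluster_ready(sample, expected_nodes=2))
```

A probe built on this idea would exit non-zero while the function returns `False`, so Kubernetes keeps retrying until every expected node has joined.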

FIX #101 (link existing issues this PR will resolve)