[Serve] On KubeRay, vLLM 0.7.2 reports "No CUDA GPUs are available" while vLLM 0.6.6.post1 works fine when deploying a RayService #51154
Comments
same issue
Issue Summary: I think I have identified the issue in RayDistributedExecutor. In the Ray Serve framework, the deployment replica actor itself (which uses the first bundle in the placement group) needs GPU resources in order to create the local worker that the executor initializes in-process. In Ray's demo code, no GPU resources are allocated for the deployment replica actor, which leads to a CUDA error when the local worker is initialized:

pg_resources.append({"CPU": 4})  # for the deployment replica, CPU ONLY!

Current Problem: However, simply adding GPU resources to the first bundle may not resolve the issue. This is because vLLM creates a dummy Ray actor to hold the resources used by the local actor. If GPU resources are already allocated for the deployment replica actor, this results in an extra resource allocation, causing an infinite wait.

Request for Feedback: I'm just starting to investigate this issue, so there may be inaccuracies in my understanding. Any comments, suggestions, or corrections are welcome! Let me know if you have insights or ideas to address this problem.
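For context, a minimal sketch of how the placement group is assembled in the Ray Serve vLLM example being discussed, reconstructed from the snippets quoted in this thread; the value of tp, the CPU counts, and the VLLMDeployment class from the example are assumptions:

# tp: tensor parallel size; VLLMDeployment: the Serve deployment class from the example.
tp = 1
pg_resources = []
pg_resources.append({"CPU": 4})  # first bundle: deployment replica, CPU ONLY!
for _ in range(tp):
    pg_resources.append({"CPU": 1, "GPU": 1})  # one bundle per vLLM worker actor

# The replica actor is scheduled into that first, CPU-only bundle, so everything
# that runs inside it (including creating vLLM's local/driver worker) sees no GPU.
app = VLLMDeployment.options(
    placement_group_bundles=pg_resources,
    placement_group_strategy="STRICT_PACK",
).bind()  # in the real example, engine arguments are passed to bind()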
We ran into the same issue when trying to serve Qwen2.5 VL AWQ in KubeRay. Running vllm serve in the same pod does not have the problem.
Hi! There is a HACK technique that works for me. (This was on vLLM 0.6.3; I'm not sure whether it still applies in 0.7.0+. Once I figure that out, I will open a PR.) Based on the understanding above, I ultimately found that we can add a branch to hack the code related to the dummy worker (the resource placeholder for the local worker).

Firstly, we skip one bundle in the worker creation loop to avoid the extra resource allocation by adding the following code:

for bundle_id, bundle in enumerate(placement_group.bundle_specs):
    if not bundle.get("GPU", 0):
        continue
    #### BEGIN
    # only take over the first GPU bundle; later bundles still become remote workers
    if ray.get_runtime_context().get_actor_id() and self.driver_dummy_worker is None:
        # since we are in an actor, we should not create another dummy worker.
        self.driver_worker = RayWorkerWrapper(**worker_wrapper_kwargs)
        self.driver_dummy_worker = 1  # HACK!! just because there will be a None check for dummy_worker
        if not ray.get_gpu_ids():
            # instead of checking the dummy worker, we directly check the GPU allocation of the current actor
            raise ValueError(
                "Ray does not allocate any GPUs on the driver node. Consider "
                "adjusting the Ray placement group or running the driver on a "
                "GPU node.")
        continue
    #### END
Then, we get the Ray context info without the dummy worker.

Original:

worker_node_and_gpu_ids = self._run_workers("get_node_and_gpu_ids",
                                             use_dummy_driver=True)

Changed:

### BEGIN
if ray.get_runtime_context().get_actor_id():
    # the driver worker should be enough to get the ray context
    worker_node_and_gpu_ids = self._run_workers("get_node_and_gpu_ids")
else:
    worker_node_and_gpu_ids = self._run_workers("get_node_and_gpu_ids",
                                                use_dummy_driver=True)
### END

Finally, change the Ray Serve entrypoint. Note that we drop one bundle (the CPU-only one for the replica) and instead give the deployment replica actor its own GPU via ray_actor_options:

pg_resources = []
# Deployment replica will also use GPU for AsyncLLMEngine.
### BEGIN
for i in range(tp):
    pg_resources.append({"CPU": 1, "GPU": 1})  # for the vLLM actors

# We use the "STRICT_PACK" strategy below to ensure all vLLM actors are placed on
# the same Ray node.
return VLLMDeployment.options(
    # allocate resource for the deployment replica actor
    ray_actor_options={
        "num_gpus": 1,
        "num_cpus": 1,
    },
    placement_group_bundles=pg_resources,
    placement_group_strategy="STRICT_PACK",
).bind(
### END
@huiyeruzhou can you try the new native LLM API in Ray Serve and see if the issue persists?
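For anyone who wants to try that route, here is a minimal sketch of the native Ray Serve LLM API, assuming Ray 2.43+ with the ray.serve.llm module available; the model ID, model source, autoscaling numbers, and engine kwargs below are placeholders, and parameter names may differ between Ray versions:

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

# Placeholder configuration roughly matching the model in this issue;
# adjust model_source, autoscaling, and engine_kwargs for your cluster.
llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen2.5-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=1),
    ),
    engine_kwargs=dict(tensor_parallel_size=1),
)

# build_openai_app wires the LLM deployments into an OpenAI-compatible app.
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)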
Hi! Here are my experiment findings. I discovered that the problem is tied to the default placement group (PG) configuration. For a detailed analysis, see #51242.
|
What happened + What you expected to happen
Description
When deploying the Qwen2.5-0.5B model using KubeRay with vLLM 0.7.2, I encounter a "RuntimeError: No CUDA GPUs are available" error. However, the same deployment works fine with vLLM 0.6.6.post1 under identical environment conditions.
Environment Information
Steps to Reproduce
Using KubeRay to deploy a RayService with the image rayproject/ray:2.43.0-py39-cu124, the RayService is:

and the latest-serve.py in https://xxx/vllm_script.zip is from: https://github.com/ray-project/ray/blob/master/doc/source/serve/doc_code/vllm_openai_example.py

The exception traceback:
Related issues
I've been searching for solutions and found two issues that match my symptoms, but the solutions provided in those issues don't work in my case:
vllm-project/vllm#6896
#50275
Versions / Dependencies
Reproduction script
The vLLM deployment code:
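For orientation, a condensed sketch of the deployment pattern used by the referenced vllm_openai_example.py; the structure is reconstructed from memory of that example, so class names, routes, and arguments are assumptions and the exact script linked above may differ:

from fastapi import FastAPI
from ray import serve
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

app = FastAPI()

@serve.deployment(name="VLLMDeployment")
@serve.ingress(app)
class VLLMDeployment:
    def __init__(self, engine_args: AsyncEngineArgs):
        # The AsyncLLMEngine is constructed inside the Ray Serve replica actor,
        # which is why that actor needs visibility of a GPU (or must hand work
        # off to Ray worker actors).
        self.engine = AsyncLLMEngine.from_engine_args(engine_args)

    @app.post("/v1/chat/completions")
    async def create_chat_completion(self, request: dict):
        # The real example parses an OpenAI-style ChatCompletionRequest and
        # streams the results; elided here for brevity.
        ...

def build_app(cli_args: dict) -> serve.Application:
    tp = int(cli_args.get("tensor_parallel_size", 1))
    engine_args = AsyncEngineArgs(
        model=cli_args["model"],
        tensor_parallel_size=tp,
        distributed_executor_backend="ray",
    )
    pg_resources = [{"CPU": 1}]  # first bundle: deployment replica (CPU only in the demo)
    for _ in range(tp):
        pg_resources.append({"CPU": 1, "GPU": 1})  # one bundle per vLLM worker
    return VLLMDeployment.options(
        placement_group_bundles=pg_resources,
        placement_group_strategy="STRICT_PACK",
    ).bind(engine_args)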
Issue Severity
High: It blocks me from completing my task.