[Bug]: trying to run vLLM inference behind a FastAPI server, but it gets stuck #3747
Comments
Hi, please paste your environment using https://github.com/vllm-project/vllm/blob/main/collect_env.py so that we can help you better.
@youkaichao I tried to run it before but got this error. Can you help me out?
This is strange. Your environment might be broken. What happens when you manually execute the script?
@youkaichao I am using uv, which is a Rust-based Python package manager. Here's the `uv pip freeze` output:
@youkaichao the same issue also happens in a Docker container.
I don't know if vllm works correctly inside a uv-managed environment.
@youkaichao uv uses virtualenv under the hood, so do you mean only conda is supported for the vllm library?
I would say conda is the environment we test against; I can't speak for uv.
okay, is anyone else trying to run vllm in a Docker setting? Most of the issues I ran into were related to cupy.
First, I suggest you switch to a plain pip or conda environment to rule out the package manager. Second, which version of vllm do you use? We recently removed the cupy dependency and also released v0.4.0. You can try the new version.
@youkaichao okay, I will try the new version.
@sigridjineth Just curious: why not run with the vllm API server as opposed to rebuilding your own? The API server code you have written is not the right way to use the LLM class.

The way our API server works is that we load the model weights from disk, run the profiler step to see how much memory there is, and allocate the full KV cache once; then during inference we reuse that state.

If you really do need to build an API server yourself rather than using the interfaces we provide, I would suggest looking at how the provided server is implemented. But you should have a very good reason for remaking this yourself.
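For reference, here is a minimal sketch of the pattern described above: build the engine once at process startup and reuse it for every request. This is not the project's bundled server, just an illustration; the model name and request schema are placeholder assumptions, and the API shown matches vLLM around the v0.4.x releases discussed in this thread.

```python
# Minimal sketch: initialize the engine once, reuse it per request.
# Model name, route, and request fields are placeholder assumptions.
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.sampling_params import SamplingParams
from vllm.utils import random_uuid

app = FastAPI()

# Weights are loaded, the memory profiler runs, and the KV cache is
# allocated exactly once, at startup -- not inside the request handler.
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="facebook/opt-125m")  # placeholder model
)


@app.post("/generate")
async def generate(request: Request) -> JSONResponse:
    body = await request.json()
    prompt = body.pop("prompt")
    sampling_params = SamplingParams(**body)

    # Stream results from the shared engine; keep only the final output here.
    results = engine.generate(prompt, sampling_params, random_uuid())
    final_output = None
    async for request_output in results:
        final_output = request_output

    return JSONResponse({"text": [o.text for o in final_output.outputs]})
```

If you run something like this, keep it to a single uvicorn worker process; each additional worker would load its own copy of the weights and allocate its own KV cache.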
Hey @sigridjineth, regarding your "stuck init" issue, how are you starting your container?
and, if it still fails, before you load the model, add:
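The exact snippet meant by "add:" is not shown above. As an illustration only, a pre-load addition that makes this kind of hang easier to diagnose usually turns on verbose logging before vllm and torch initialize; the specific variables below are my assumption, not the commenter's original suggestion.

```python
# Hypothetical pre-load debugging setup (assumed, not the original snippet).
# Set these before importing vllm/torch so NCCL and CUDA pick them up.
import os

os.environ["NCCL_DEBUG"] = "TRACE"          # verbose NCCL initialization logs
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"    # surface CUDA errors where they occur
os.environ["VLLM_LOGGING_LEVEL"] = "DEBUG"  # more detailed vLLM logs

from vllm import LLM  # import only after the environment is configured

llm = LLM(model="facebook/opt-125m")  # placeholder model
```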
We have added documentation for this situation in #5430. Please take a look. |
Your current environment
A100 x 8, ubuntu
🐛 Describe the bug
Hello, I am trying to run vllm inference behind a FastAPI server, but it gets stuck at `Using model weights format ['*.safetensors']`. Is anyone else experiencing such a case? The code I am using is like the below.
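The reporter's code is not reproduced in this excerpt. A hypothetical minimal version of the setup being described (the model name, parallelism degree, and endpoint shape are assumptions, not the actual code), and the pattern the maintainers advise against above, would look roughly like this:

```python
# Hypothetical reconstruction for illustration only; not the reporter's actual code.
from fastapi import FastAPI
from pydantic import BaseModel

from vllm import LLM, SamplingParams

app = FastAPI()

# Loading happens at import time; with 8 GPUs this is also where a hang at
# "Using model weights format ['*.safetensors']" would show up.
llm = LLM(model="meta-llama/Llama-2-7b-hf", tensor_parallel_size=8)  # placeholder model


class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 128


@app.post("/generate")
def generate(req: GenerateRequest) -> dict:
    params = SamplingParams(max_tokens=req.max_tokens)
    outputs = llm.generate([req.prompt], params)  # blocking, synchronous call
    return {"text": outputs[0].outputs[0].text}
```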