fix RAM OOM when loading large models in tensor parallel mode. #1395
Conversation
Force-pushed from bb0773c to 304e006
This PR seems like it hasn't been reviewed yet. Just wanted to bring it to attention in case it slipped through the cracks. If anyone has some time to take a look and provide feedback, that would be great. I believe getting more eyes on this would really help improve the quality of the vLLM project.
@boydfd Hey, thanks for your work! Error info:
Thank you for your contribution! The code in general looks good. Left some small comments on naming.
Force-pushed from d0c6bbf to 1422828
…t: model_load_batch_size, max_parallel_loading_workers -> max_parallel_loading_workers, batch_size -> max_concurrent_workers.
Force-pushed from 1422828 to 5ee5bcd
@jaywongs I have already updated all the naming mentioned in the review comments in another commit.
Thank you for your reply. Since you have only 2 GPUs and set tensor-parallel-size to 2, you should set tensor-parallel-model-load-batch-size to 1, which means model loading will happen one worker at a time. I still can't tell why the OOM happened. Can you try these two methods to help find the root cause:
One last question: do you run your code in K8s or Docker? There might be memory limits there, so you should consider setting a higher value.
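For illustration, a minimal serving sketch, assuming a vLLM build that includes this patch and using the option's final name max_parallel_loading_workers (earlier revisions of this branch called it tensor-parallel-model-load-batch-size); the model name is a placeholder:

```python
# Hypothetical example: 2-way tensor parallelism, but only one worker loads
# weights at a time, which keeps peak host RAM near one checkpoint copy.
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine_args = AsyncEngineArgs(
    model="facebook/opt-13b",        # placeholder model
    tensor_parallel_size=2,
    max_parallel_loading_workers=1,  # assumed field name introduced by this PR
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```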
@boydfd How do I use your patch? Here is my code. Could you provide some tips about this?
If you hit RAM OOM, try setting max_parallel_loading_workers to a smaller number (like 1).
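For the offline LLM API, a minimal sketch assuming the LLM constructor forwards this option to the engine args (parameter name per this PR; the model name is a placeholder):

```python
# Hypothetical usage: limit concurrent weight loading to one tensor-parallel
# worker at a time to avoid host-RAM OOM while loading a large model.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-13b",        # placeholder model
    tensor_parallel_size=2,
    max_parallel_loading_workers=1,  # assumed keyword, forwarded to EngineArgs
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```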
LGTM! Thank you for your contribution!
…ject#1395) Co-authored-by: ran_lin <rlin@thoughtworks.com>
Hey, do we know why this was removed in recent versions? If we pass --max-parallel-loading-workers 1 in the engine args, we get NotImplementedError(
fix bug: #322 #872
Test log:
Memory usage in model loading
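For context, a rough sketch of the batched-loading idea behind this fix; the helper below is hypothetical and only illustrates the scheme, while the actual implementation dispatches weight loading to Ray workers inside the engine:

```python
# Sketch: run per-worker weight-loading callables at most N at a time, so peak
# host RAM is bounded by N checkpoint copies instead of tensor_parallel_size.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def load_model_in_batches(
    load_fns: List[Callable[[], None]],
    max_parallel_loading_workers: Optional[int] = None,
) -> None:
    batch_size = max_parallel_loading_workers or len(load_fns) or 1
    for start in range(0, len(load_fns), batch_size):
        batch = load_fns[start:start + batch_size]
        # Each batch loads concurrently; the next batch starts only after the
        # previous one finishes, capping peak memory during loading.
        with ThreadPoolExecutor(max_workers=len(batch)) as pool:
            for future in [pool.submit(fn) for fn in batch]:
                future.result()
```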