-
-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BugFix] Fix multiprocessing shutdown errors #7041
Conversation
1. Don't use daemon threads in the multiproc worker machinery 2. Ensure that the LLMEngine is garbage collected properly, so that the executor and its non-daemon threads are shut down and don't cause the process to hang There are still two warnings that appear consistently but I think that these are benign and we can investigate as a follow-on to this.
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge). To run full CI, you can do one of these:
🚀 |
@youkaichao with some more experimentation I found that the try/finally block there wasn't really sufficient anyhow. I've changed it now to include an excepthook to run the multiproc shutdown if the main thread exits abnormally. And with the other fix, the GC seems to work when it exits normally. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have the full expertise to figure out the root cause, and will see how ci tells.
Thanks for the hard working! Debugging gc-related problems is quite a pain.
# Clean up globals | ||
for var in ("openai_serving_chat", "openai_serving_completion", | ||
"openai_serving_embedding", "openai_serving_tokenization", | ||
"engine_args", "engine"): | ||
globals().pop(var, None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does del
work? globals().pop
is too hacky.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@youkaichao we'd have to check and del
each one individually so it would be a lot more code. I want this to work whether or not each is defined, in case some error occurs after setting some and not others.
The way globals are used here is already hacky imo and I think we'll clean it up later. I wanted to keep this change as simple as possible as there are overlapping changes in #6883 which will be merged very soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it. this is a minor concern, and you can skip it if it is difficult to solve.
the most important part is still make the ci pass 🙏
Hi, any update on it? 👀 |
Especially with multiprocessing Replaces vllm-project#7041
Especially with multiprocessing Replaces vllm-project#7041
Superseded by #8492 |
sys.is_finalizing()
to avoid logging any error messages in case they are killed non-cleanly prior to the main process (though this should no longer happen with changes 1 and 2)There are still two warnings that appear consistently but I think that these are benign and we can investigate as a follow-on to this: