🚀 The feature, motivation and pitch
Feedback from the RL community is that vLLM's fp8 weight loading works poorly for RL workflows, where weights must be updated in place after the engine has started.
The cause is clear: in fp8.py, process_weights_after_loading re-wraps layer parameters in fresh Parameter objects, which drops the .weight_loader attribute.
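For illustration, here is a minimal, self-contained sketch (not vLLM's actual code; the Linear layer and dummy_weight_loader are stand-ins) of how re-wrapping a parameter in a fresh torch.nn.Parameter silently loses the attribute, and how copying it forward when re-wrapping would avoid that:

```python
import torch


def dummy_weight_loader(param: torch.nn.Parameter, loaded_weight: torch.Tensor) -> None:
    # Stand-in for a vLLM-style loader that knows how to place a (possibly
    # sharded/quantized) checkpoint tensor into the parameter.
    param.data.copy_(loaded_weight)


layer = torch.nn.Linear(4, 4, bias=False)
layer.weight.weight_loader = dummy_weight_loader

# What process_weights_after_loading effectively does: wrap a new tensor in a
# fresh Parameter. The attached .weight_loader is silently lost.
old_param = layer.weight
layer.weight = torch.nn.Parameter(old_param.data.clone(), requires_grad=False)
print(hasattr(layer.weight, "weight_loader"))  # False -> later load_weights breaks

# A fix in the spirit of the patch discussed below: carry the loader over
# explicitly whenever the parameter is re-wrapped.
if hasattr(old_param, "weight_loader"):
    layer.weight.weight_loader = old_param.weight_loader
print(hasattr(layer.weight, "weight_loader"))  # True
```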
There's a patch from the Moonshot team that fixes this issue, and a PR with that patch that never received any comments. The patch only applies cleanly on top of v0.10.2rc1. Shortly after that tag, this PR made fp8 weight updates even trickier by transposing the weight_inv_scale parameter for CUTLASS.
I don't know how to patch any vLLM version after that PR so that model.load_weights can still be called after the engine has started. That is a bummer, because DeepSeek wide-EP inference is quite a bit faster in v0.11.0.
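To make the failure mode concrete, here is a hedged sketch of the update path; push_updated_weights is a hypothetical RL-side helper, but the getattr fallback mirrors the pattern vLLM model load_weights implementations typically use (exact details vary per model and version):

```python
from typing import Iterable, Tuple

import torch


def default_weight_loader(param: torch.nn.Parameter, loaded_weight: torch.Tensor) -> None:
    # Fallback loader: a plain copy. Fine for unquantized weights, but wrong
    # once fp8 weights/scales have been re-laid-out (e.g. transposed for
    # CUTLASS) by process_weights_after_loading.
    param.data.copy_(loaded_weight)


def push_updated_weights(model: torch.nn.Module,
                         weights: Iterable[Tuple[str, torch.Tensor]]) -> None:
    # Hypothetical helper streaming updated trainer weights into the
    # already-initialized inference model, the way model.load_weights is used.
    params_dict = dict(model.named_parameters())
    for name, loaded_weight in weights:
        param = params_dict[name]
        # After fp8's process_weights_after_loading re-wraps the parameter,
        # this getattr silently falls back to default_weight_loader, and the
        # quantized weight's layout no longer matches the checkpoint tensor,
        # so the update either errors out or corrupts the weights.
        weight_loader = getattr(param, "weight_loader", default_weight_loader)
        weight_loader(param, loaded_weight)
```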
We need to fix this ASAP.
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.