DeepSeek fix: awq x mergedreplicatedlinear #23764
Conversation
Signed-off-by: Mickael Seznec <mickael@mistral.ai>
Code Review
This pull request aims to fix an issue with AWQ quantization in MergedReplicatedLinear layers by refactoring the weight loading to use a dedicated method, load_merged_column_weight. This is a good approach for handling specialized logic in custom parameter classes. However, the current implementation introduces a critical regression for unquantized models: it unconditionally calls param.load_merged_column_weight, but for unquantized layers the parameter is a standard torch.nn.Parameter which lacks this method, so model loading would fail with an AttributeError. A check on the parameter type is needed to maintain backward compatibility for unquantized models.
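A minimal sketch of the guarded loading path the review is asking for (the function wrapper and the plain-copy fallback are illustrative, not the exact vLLM code):

    from vllm.model_executor.parameter import BasevLLMParameter

    def _load_merged_shard(param, loaded_weight, loaded_shard_id,
                           shard_offset, shard_size):
        if isinstance(param, BasevLLMParameter):
            # Custom vLLM parameters (AWQ, FP8, ...) know how to place a merged
            # column shard themselves, so delegate to their loader.
            param.load_merged_column_weight(loaded_weight=loaded_weight,
                                            shard_id=loaded_shard_id,
                                            shard_offset=shard_offset,
                                            shard_size=shard_size,
                                            tp_rank=0)
        else:
            # Plain torch.nn.Parameter (unquantized path): copy the shard directly.
            param.data[shard_offset:shard_offset + shard_size] = loaded_weight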
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Mickaël Seznec <mickael.seznec@gmail.com>
It would be great if you could assign a reviewer to get it applied quickly!
I have also checked that this PR works for AWQ and FP8, thanks for the fix.
BTW, we may not need the isinstance(param, PerTensorScaleParameter) check now?
param.data[shard_offset:shard_offset + shard_size] = loaded_weight
if isinstance(param, BasevLLMParameter):
    param.load_merged_column_weight(loaded_weight=loaded_weight,
                                    shard_id=loaded_shard_id,
                                    shard_offset=shard_offset,
                                    shard_size=shard_size,
                                    tp_rank=0)
While I do think calling the kwarg tp_rank is misleading, it does the trick and I can't come up with a better name.
@mickaelseznec We can ping the corresponding reviewers by prepending
Signed-off-by: Mickael Seznec <mickael@mistral.ai>
Don't know who would be best to review; @mgoin @robertgshaw2-redhat because it's quantization related? Feel free to dispatch to others as well :)
@tlrmchlsmth @yewentao256 This PR fixes the weight loading logic of DeepSeek V2/V3 AWQ-quantized models, which was broken by an oversight in the fused MLA qkv kernel update #21116.
elif isinstance(param, PerTensorScaleParameter):
    shard_offset = loaded_shard_id
    shard_size = 1
Why remove this case?
PerTensorScaleParameter is a subclass of BasevLLMParameter, and PerTensorScaleParameter.load_merged_column_weight does the right weight loading, so I reckon this is a code quality improvement.

TLDR: We previously set shard_offset and shard_size to values that don't match what their names suggest, just to make the subsequent weight-overwriting line (param.data[shard_offset:shard_offset + shard_size] = loaded_weight) work, but PerTensorScaleParameter.load_merged_column_weight, which is just BasevLLMParameter._assert_and_load, does exactly the same thing.
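For reference, a rough sketch of what that amounts to, assuming the BasevLLMParameter helper simply verifies the shape and copies the tensor in place (simplified; the actual implementation may differ):

    # Simplified view of BasevLLMParameter._assert_and_load (illustrative only).
    def _assert_and_load(self, loaded_weight):
        assert self.data.shape == loaded_weight.shape
        self.data.copy_(loaded_weight)

So for per-tensor scales, the dedicated loader and the old manual slice assignment end up writing the same data.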
Can we get this PR reviewed and merged in 0.10.2? The fix is not too complicated; if e2e validation (beyond what is already given in the PR body) is needed for the review, I can help.
LGTM
Hello, I would like to know which version of vllm this PR will be merged into. |
+1 |
Fix should already be there with #23024 |
Purpose
Fixing #23530
Test Plan
Check manually that DeepSeek AWQ output is correct
Test Result
vllm (pretrained=/models/DeepSeek-R1-AWQ,tensor_parallel_size=8,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
Essential Elements of an Effective PR Description Checklist
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.