[Fix] Skip lm_head quantization for q4f16_ft #1731

Merged
merged 1 commit into mlc-ai:main on Feb 9, 2024

Conversation

CharlieFRuan
Contributor

This PR fixes the issue brought up in #1723.

The issue stems from the `lm_head` of Llama having its `out_features` defined as a `tir.Var("vocab_size")`. In the FT quantization scheme, the linear layer's weight is quantized to shape `(in_features, tir.ceildiv(out_features, config.num_elem_per_storage))`. This makes the parameter's shape a FloorDiv expression rather than a plain `tir.Var`, which later causes an error during shape lowering:

tvm-unity/src/relax/backend/vm/vm_shape_lower.cc", line 305
InternalError: Check failed: (it != slot_map_.end()) is false: Var vocab_size is not defined in the function but is referenced by (vocab_size + T.int64(2) - T.int64(1)) // T.int64(2)
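
For context, a minimal sketch (assuming a tvm-unity Python environment; the names are illustrative) of why packing the storage dimension turns the symbolic shape into a FloorDiv expression:

```python
# Minimal sketch, assuming a tvm-unity install; shows that packing a symbolic
# dimension yields a FloorDiv expression instead of a plain tir.Var.
from tvm import tir

vocab_size = tir.Var("vocab_size", "int64")  # dynamic out_features of lm_head
num_elem_per_storage = 2                     # matches the divisor of 2 in the error above

# ceildiv written the way it is lowered: (vocab_size + 2 - 1) // 2
packed_dim = tir.floordiv(vocab_size + num_elem_per_storage - 1, num_elem_per_storage)

print(type(vocab_size))  # tvm.tir.expr.Var
print(type(packed_dim))  # tvm.tir.expr.FloorDiv -- no longer a bare Var, so the
                         # shape-lowering pass cannot resolve "vocab_size"
```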

Instead, we skip quantizing `lm_head` here.
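
The skip amounts to a name-based exception in the quantization mapping; a hedged sketch of the idea (the helper below is illustrative, not the actual mlc-llm API):

```python
# Hypothetical illustration of the skip: leave any parameter whose name contains
# "lm_head" in its original (unquantized) form; everything else goes through the
# FT quantization path. Function names are placeholders, not mlc-llm internals.
def maybe_quantize_linear(name: str, module, ft_quantize):
    if "lm_head" in name:
        return module           # keep lm_head unquantized to avoid the symbolic-shape issue
    return ft_quantize(module)  # quantize all other linear layers as usual
```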

@MasterJH5574 MasterJH5574 merged commit 3feed05 into mlc-ai:main Feb 9, 2024
1 check passed
MasterJH5574 pushed a commit that referenced this pull request Feb 17, 2024
This PR reverts the change introduced in #1731 where we skip quantizing `lm_head` to avoid the dynamic vocab issue.

This led to performance degradation as pointed out in #1723.

Instead, we fall back to `GroupQuantizeLinear` for `lm_head`, which preserves performance and avoids the dynamic vocab size issue.
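
This presumably works because group quantization packs along the static `in_features` axis rather than the symbolic `out_features`; that reading is an assumption based on the shapes quoted in this PR, not stated in the commit. A hedged sketch of the shape contrast:

```python
# Hedged sketch contrasting the two quantized-weight layouts for lm_head.
# The exact layouts are assumptions, not copied from the mlc-llm source.
from tvm import tir

in_features = 4096                              # hidden size: a static int
out_features = tir.Var("vocab_size", "int64")   # dynamic vocab size
num_elem_per_storage = 2                        # illustrative packing factor

# FT-style packing (as quoted above): divides the symbolic out_features,
# producing a FloorDiv over vocab_size.
ft_shape = (in_features,
            tir.floordiv(out_features + num_elem_per_storage - 1, num_elem_per_storage))

# Group-quantize-style packing: divides the static in_features instead,
# leaving vocab_size as a plain Var that shape lowering can resolve.
group_shape = (out_features,
               (in_features + num_elem_per_storage - 1) // num_elem_per_storage)
```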

Performance change on RTX 4090

Prefill: 
throughput: **950.792 --> 973.859 tok/s**
total tokens: 7 tok

Decode:
throughput: **214.372 --> 223.491 tok/s**
total tokens: 256 tok