[Fix] Skip lm_head quantization for q4f16_ft #1731

Merged
merged 1 commit into mlc-ai:main on Feb 9, 2024

Conversation

CharlieFRuan
Contributor

This PR fixes the issue brought up in #1723.

The issue stems from the `lm_head` of Llama having its `out_features` defined as a `tir.Var("vocab_size")`. In the FT quantization scheme, the linear layer's weight is quantized to shape `(in_features, tir.ceildiv(out_features, config.num_elem_per_storage))`. This makes the parameter's shape a FloorDiv expression rather than a plain `tir.Var`, which later causes an error during shape lowering:

tvm-unity/src/relax/backend/vm/vm_shape_lower.cc", line 305
InternalError: Check failed: (it != slot_map_.end()) is false: Var vocab_size is not defined in the function but is referenced by (vocab_size + T.int64(2) - T.int64(1)) // T.int64(2)
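
For context, a minimal sketch (assuming a tvm-unity Python environment; the names are illustrative) of why packing the storage dimension turns the symbolic shape into a FloorDiv expression:

```python
# Minimal sketch, assuming a tvm-unity install; shows that packing a symbolic
# dimension yields a FloorDiv expression instead of a plain tir.Var.
from tvm import tir

vocab_size = tir.Var("vocab_size", "int64")  # dynamic out_features of lm_head
num_elem_per_storage = 2                     # matches the divisor of 2 in the error above

# ceildiv written the way it is lowered: (vocab_size + 2 - 1) // 2
packed_dim = tir.floordiv(vocab_size + num_elem_per_storage - 1, num_elem_per_storage)

print(type(vocab_size))  # tvm.tir.expr.Var
print(type(packed_dim))  # tvm.tir.expr.FloorDiv -- no longer a bare Var, so the
                         # shape-lowering pass cannot resolve "vocab_size"
```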

Instead, we skip quantizing `lm_head` here.
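
The skip amounts to a name-based exception in the quantization mapping; a hedged sketch of the idea (the helper below is illustrative, not the actual mlc-llm API):

```python
# Hypothetical illustration of the skip: leave any parameter whose name contains
# "lm_head" in its original (unquantized) form; everything else goes through the
# FT quantization path. Function names are placeholders, not mlc-llm internals.
def maybe_quantize_linear(name: str, module, ft_quantize):
    if "lm_head" in name:
        return module           # keep lm_head unquantized to avoid the symbolic-shape issue
    return ft_quantize(module)  # quantize all other linear layers as usual
```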

@MasterJH5574 MasterJH5574 merged commit 3feed05 into mlc-ai:main Feb 9, 2024
1 check passed
MasterJH5574 pushed a commit that referenced this pull request Feb 17, 2024
This PR reverts the change introduced in #1731 where we skip quantizing `lm_head` to avoid the dynamic vocab issue.

This led to performance degradation as pointed out in #1723.

Instead, we fall back to `GroupQuantizeLinear` for `lm_head`, which preserves performance and avoids the dynamic vocab size issue.
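
This presumably works because group quantization packs along the static `in_features` axis rather than the symbolic `out_features`; that reading is an assumption based on the shapes quoted in this PR, not stated in the commit. A hedged sketch of the shape contrast:

```python
# Hedged sketch contrasting the two quantized-weight layouts for lm_head.
# The exact layouts are assumptions, not copied from the mlc-llm source.
from tvm import tir

in_features = 4096                              # hidden size: a static int
out_features = tir.Var("vocab_size", "int64")   # dynamic vocab size
num_elem_per_storage = 2                        # illustrative packing factor

# FT-style packing (as quoted above): divides the symbolic out_features,
# producing a FloorDiv over vocab_size.
ft_shape = (in_features,
            tir.floordiv(out_features + num_elem_per_storage - 1, num_elem_per_storage))

# Group-quantize-style packing: divides the static in_features instead,
# leaving vocab_size as a plain Var that shape lowering can resolve.
group_shape = (out_features,
               (in_features + num_elem_per_storage - 1) // num_elem_per_storage)
```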

Performance change on RTX 4090

Prefill: 
throughput: **950.792 --> 973.859 tok/s**
total tokens: 7 tok

Decode:
throughput: **214.372 --> 223.491 tok/s**
total tokens: 256 tok