Conversation

gbyu-amd (Contributor) commented on Jan 27, 2026

Motivation

This PR adds support for amd/Kimi-K2-Thinking-MXFP4, which quantizes only the linear layers in the MoE experts to MXFP4 format; the attention layers, MoE gate, dense MLP layers, MoE shared experts, and lm_head remain in bf16.
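For illustration, this kind of selective quantization can be expressed with an `exclude_layers` list in the checkpoint's `quantization_config`. The sketch below is hypothetical: the field values and regex patterns are made up to show the idea and are not the actual contents of amd/Kimi-K2-Thinking-MXFP4.

```python
# Hypothetical quantization_config keeping non-expert layers in bf16.
quantization_config = {
    "quant_method": "mxfp4",          # assumed field name
    "exclude_layers": [
        "re:.*self_attn.*",           # attention layers stay bf16
        "re:.*mlp\\.gate$",           # MoE gate
        "re:.*shared_experts.*",      # MoE shared experts
        "lm_head",                    # matched by last name component
    ],
}
```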

Test Result

TP4 result

python -m atom.entrypoints.openai_server --model /data/models/Kimi-K2-Thinking-MXFP4 --trust-remote-code -tp 4 --kv_cache_dtype fp8

Accuracy

[accuracy results screenshot]

NOTE: torch.compile hits an error with triton==3.5.1. This is a known issue, detailed in pytorch/pytorch#161618 and already fixed by pytorch/pytorch@05eeb29.
Either upgrading torch or downgrading triton (e.g. `pip install triton==3.4.0`) resolves the issue.
The result above was obtained by downgrading triton to 3.4.0.

TP8 result

Since Kimi-K2 has 64 attention heads (num_heads=64), each rank handles 8 heads when running TP8. Some existing kernels are not applicable in this case:

Submission Checklist



```python
def should_ignore_layer(quantization_config: Optional[QuantizationConfig], prefix: str) -> bool:
    exclude_layers: List[str] = quantization_config["exclude_layers"]
```
Review comment (Collaborator):

If `quantization_config` is None, accessing `["exclude_layers"]` will raise a TypeError.
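A quick REPL check confirms the failure mode:

```python
>>> quantization_config = None
>>> quantization_config["exclude_layers"]
Traceback (most recent call last):
  ...
TypeError: 'NoneType' object is not subscriptable
```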

```python
            regex_pattern = exclude_layer[3:]
            if re.search(regex_pattern, prefix):
                return True
        elif exclude_layer.startswith(prefix):
```
Review comment (Collaborator):

`exclude_layer.startswith(prefix)` is backwards - it should be `prefix.startswith(exclude_layer)`.
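A concrete check (the module names below are only illustrative):

```python
exclude_layer = "model.layers.0.self_attn"
prefix = "model.layers.0.self_attn.q_proj"

# Intended match: the layer's full prefix falls under an excluded scope.
print(prefix.startswith(exclude_layer))   # True  -> layer should be ignored
# The reversed check almost never fires, since exclude_layer is the shorter string.
print(exclude_layer.startswith(prefix))   # False
```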

```python
        # case "lm_head". Common practice won't quant lm_head, however.
        if prefix.split(".")[-1] == exclude_layer:
            return True
    return False
```
Review comment (Collaborator):

Consider something like this?

```python
def should_ignore_layer(quantization_config: Optional[QuantizationConfig], prefix: str) -> bool:
    if quantization_config is None:
        return False
    exclude_layers: List[str] = quantization_config.get("exclude_layers", [])
    if not exclude_layers:
        return False
    for exclude_layer in exclude_layers:
        if exclude_layer.startswith("re:"):
            # case "re:model.layers.self_attn.", remove the 're:' prefix
            regex_pattern = exclude_layer[3:]
            if re.search(regex_pattern, prefix):
                return True
        elif prefix.startswith(exclude_layer):
            # case "model.layers.0.self_attn.q_a_proj"
            return True
        elif prefix.split(".")[-1] == exclude_layer:
            # case "lm_head". Common practice won't quant lm_head, however.
            return True
    return False
```
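A quick sanity check of that suggestion (prefixes and patterns below are made up for illustration, assuming the `should_ignore_layer` sketch above):

```python
cfg = {"exclude_layers": ["re:.*self_attn.*", "model.layers.0.mlp.gate", "lm_head"]}

print(should_ignore_layer(cfg, "model.layers.3.self_attn.q_proj"))   # True  (regex match)
print(should_ignore_layer(cfg, "model.layers.0.mlp.gate"))           # True  (prefix match)
print(should_ignore_layer(cfg, "model.lm_head"))                     # True  (last-component match)
print(should_ignore_layer(cfg, "model.layers.3.mlp.experts.0.w1"))   # False (stays quantized)
print(should_ignore_layer(None, "model.layers.3.mlp.experts.0.w1"))  # False (no quant config)
```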

```diff
             hidden_size,
             bias=False,
-            quant_config=quant_config,
+            quant_config=None if should_ignore_layer(quant_config, prefix=f"{prefix}.down_proj") else quant_config,
```
Review comment (Collaborator):

The pattern `None if should_ignore_layer(...) else quant_config` is repeated 4 times. Consider a helper function or local variable.
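For example, a small helper (the name is hypothetical) would keep each call site to one expression:

```python
def _maybe_quant_config(quant_config, prefix: str):
    """Return quant_config unless the layer at `prefix` is excluded from quantization."""
    return None if should_ignore_layer(quant_config, prefix=prefix) else quant_config

# Call sites then become, e.g.:
#     quant_config=_maybe_quant_config(quant_config, f"{prefix}.down_proj"),
```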
