[ROCm][FEAT] Fuse DeepSeek shared experts into AITER fused_moe ops #24097
Conversation
Deepseek 085 sharedexperts aiter jun new Signed-off-by: chenjun <junchen2@amd.com> Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces an optimization for DeepSeek models on ROCm by fusing shared experts into the AITER FusedMoE kernel. This is controlled by a new environment flag. The changes span across environment variable setup, the core FusedMoE layer, quantization layers, and the DeepSeek model implementation to correctly handle the fused logic and weight loading.
The implementation looks solid and the changes are consistent with the goal of the PR. I've found one area for improvement in the initialization logic for the shared expert metadata, which could be made more memory and performance efficient. My detailed feedback is in the comment below.
```python
    if is_EP:
        s_topk_ids_list = [[fake_expertid] *
                           (n_shared_experts + is_EP)] * max_num_tokens
        for i in range(tp_rank, max_num_tokens, tp_size):
            s_topk_ids_list[i] = shared_expert_ids
    else:
        s_topk_ids_list = [range(n_routed_experts, fake_expertid)
                           ] * max_num_tokens
    s_topk_ids[:] = torch.tensor(s_topk_ids_list,
                                 dtype=torch.int32,
                                 device='cuda')
```
The current implementation for initializing `s_topk_ids` can be inefficient. It constructs a large Python list of lists (`s_topk_ids_list`) on the host, which is then converted to a PyTorch tensor on the CPU before being moved to the GPU. For a large `max_num_tokens`, this can lead to significant host memory consumption and slow down the initialization process.
A more efficient approach would be to perform these operations directly on the GPU tensor, avoiding the large intermediate host-side data structures. This can be achieved using tensor broadcasting and slicing.
Suggested change:
```diff
-    if is_EP:
-        s_topk_ids_list = [[fake_expertid] *
-                           (n_shared_experts + is_EP)] * max_num_tokens
-        for i in range(tp_rank, max_num_tokens, tp_size):
-            s_topk_ids_list[i] = shared_expert_ids
-    else:
-        s_topk_ids_list = [range(n_routed_experts, fake_expertid)
-                           ] * max_num_tokens
-    s_topk_ids[:] = torch.tensor(s_topk_ids_list,
-                                 dtype=torch.int32,
-                                 device='cuda')
+    if is_EP:
+        s_topk_ids.fill_(fake_expertid)
+        shared_expert_ids_tensor = torch.tensor(shared_expert_ids,
+                                                dtype=torch.int32,
+                                                device='cuda')
+        s_topk_ids[tp_rank::tp_size] = shared_expert_ids_tensor
+    else:
+        s_topk_ids_row = torch.arange(n_routed_experts,
+                                      fake_expertid,
+                                      dtype=torch.int32,
+                                      device='cuda')
+        s_topk_ids.copy_(s_topk_ids_row.expand(max_num_tokens, -1))
```
Code Review
This pull request introduces a significant performance optimization for DeepSeek models on ROCm by fusing shared experts into the AITER MoE kernel. The implementation is gated behind environment variables and includes comprehensive benchmark and accuracy tests, which is great. However, I've identified two critical issues that need to be addressed. The first is related to the use of a global variable for model-specific metadata, which can lead to race conditions and incorrect behavior when serving multiple models. The second is a bug in the weight loading logic for the fused shared experts, which fails to correctly track loaded parameters and will likely cause errors. Addressing these issues will ensure the stability and correctness of this new feature.
```python
aiter_topK_meta_data = None


@lru_cache(maxsize=1)
def init_aiter_topK_meta_data(n_routed_experts: int,
                              n_shared_experts: int,
                              top_k: int,
                              tp_rank: int,
                              tp_size: int,
                              shared_experts_score: float = 1.0,
                              max_num_tokens: int = 32768,
                              is_EP: bool = False):
    global aiter_topK_meta_data
    fake_expertid = n_routed_experts + n_shared_experts

    # all layers reuse same buffer
    total_topk_ids = torch.empty(
        (max_num_tokens, top_k + n_shared_experts + is_EP),
        dtype=torch.int32,
        device='cuda')
    ns_topk_ids, s_topk_ids = total_topk_ids.split(
        [top_k, n_shared_experts + is_EP], dim=1)
    shared_expert_ids = [
        n_routed_experts + i for i in range(n_shared_experts + is_EP)
    ]
    if is_EP:
        s_topk_ids_list = [[fake_expertid] *
                           (n_shared_experts + is_EP)] * max_num_tokens
        for i in range(tp_rank, max_num_tokens, tp_size):
            s_topk_ids_list[i] = shared_expert_ids
    else:
        s_topk_ids_list = [range(n_routed_experts, fake_expertid)
                           ] * max_num_tokens
    s_topk_ids[:] = torch.tensor(s_topk_ids_list,
                                 dtype=torch.int32,
                                 device='cuda')

    total_topk_weights = torch.empty(
        (max_num_tokens, top_k + n_shared_experts + is_EP),
        dtype=torch.float32,
        device='cuda')
    ns_topk_weights, s_topk_weights = total_topk_weights.split(
        [top_k, n_shared_experts + is_EP], dim=1)
    s_topk_weights.fill_(shared_experts_score)
    aiter_topK_meta_data = (total_topk_weights, total_topk_ids)
```
The use of a global variable `aiter_topK_meta_data` to store model-specific metadata is problematic. If vLLM serves multiple models with different MoE configurations in the same process, this global variable will be overwritten, leading to incorrect behavior for one of the models. This can cause race conditions and hard-to-debug errors.
The metadata should be managed without using a global variable. A better approach would be:
- Modify `init_aiter_topK_meta_data` to return the metadata tuple instead of modifying a global variable. The `@lru_cache` decorator should then be used on a function that is pure (has no side effects).
- In `FusedMoE.__init__`, store the returned metadata in an instance attribute, e.g., `self.aiter_topK_meta_data`.
- Pass this instance attribute down through the call chain (`forward_cuda` -> `select_experts` -> `rocm_aiter_grouped_topk`). `rocm_aiter_grouped_topk` should then use the passed metadata instead of the global variable.
This change will ensure that each model's metadata is properly encapsulated and avoids race conditions.
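A minimal sketch of how that pure, cached variant could look, assuming the signature quoted above; the `device` keyword and the consolidated, GPU-side index construction are illustrative additions, not the PR's actual code:

```python
from functools import lru_cache

import torch


@lru_cache(maxsize=1)
def init_aiter_topK_meta_data(n_routed_experts: int,
                              n_shared_experts: int,
                              top_k: int,
                              tp_rank: int,
                              tp_size: int,
                              shared_experts_score: float = 1.0,
                              max_num_tokens: int = 32768,
                              is_EP: bool = False,
                              device: str = "cuda"):
    """Pure, cached builder that returns (total_topk_weights, total_topk_ids)
    instead of writing to a module-level global."""
    n_extra = n_shared_experts + is_EP
    total_topk_ids = torch.empty((max_num_tokens, top_k + n_extra),
                                 dtype=torch.int32, device=device)
    _, s_topk_ids = total_topk_ids.split([top_k, n_extra], dim=1)
    shared_ids = torch.arange(n_routed_experts, n_routed_experts + n_extra,
                              dtype=torch.int32, device=device)
    if is_EP:
        # Only this TP rank's rows point at real shared-expert ids.
        s_topk_ids.fill_(n_routed_experts + n_shared_experts)
        s_topk_ids[tp_rank::tp_size] = shared_ids
    else:
        s_topk_ids.copy_(shared_ids.expand(max_num_tokens, -1))

    total_topk_weights = torch.empty((max_num_tokens, top_k + n_extra),
                                     dtype=torch.float32, device=device)
    _, s_topk_weights = total_topk_weights.split([top_k, n_extra], dim=1)
    s_topk_weights.fill_(shared_experts_score)
    return total_topk_weights, total_topk_ids


# FusedMoE.__init__ would then keep the result on the instance, e.g.
#     self.aiter_topK_meta_data = init_aiter_topK_meta_data(...)
# and pass it down through forward_cuda -> select_experts ->
# rocm_aiter_grouped_topk instead of reading a global.
```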
```diff
 else:
     is_expert_weight = False
-    for mapping in expert_params_mapping:
-        param_name, weight_name, expert_id, shard_id = mapping
-        if weight_name not in name:
-            continue
-
-        # Anyway, this is an expert weight and should not be
-        # attempted to load as other weights later
-        is_expert_weight = True
-
-        # Do not modify `name` since the loop may continue here
-        # Instead, create a new variable
-        name_mapped = name.replace(weight_name, param_name)
-
-        if is_pp_missing_parameter(name_mapped, self):
-            continue
-
-        param = params_dict[name_mapped]
-        # We should ask the weight loader to return success or not
-        # here since otherwise we may skip experts with other
-        # available replicas.
-        weight_loader = typing.cast(Callable[..., bool],
-                                    param.weight_loader)
-        success = weight_loader(param,
-                                loaded_weight,
-                                name_mapped,
-                                shard_id=shard_id,
-                                expert_id=expert_id,
-                                return_success=True)
-        if success:
-            name = name_mapped
-            break
-    else:
-        if is_expert_weight:
-            # We've checked that this is an expert weight
-            # However it's not mapped locally to this rank
-            # So we simply skip it
-            continue
-
-        # Skip loading extra bias for GPTQ models.
-        if name.endswith(".bias") and name not in params_dict:
-            continue
-
-        # Remapping the name of FP8 kv-scale.
-        name = maybe_remap_kv_scale_name(name, params_dict)
-        if name is None:
-            continue
-
-        if is_pp_missing_parameter(name, self):
-            continue
-
-        param = params_dict[name]
-        weight_loader = getattr(param, "weight_loader",
-                                default_weight_loader)
-        weight_loader(param, loaded_weight)
+
+    # Special handling: when AITER fusion_shared_experts is enabled,
+    # checkpoints may provide a single widened shared_experts tensor
+    # without explicit expert indices
+    # (e.g. ...mlp.shared_experts.gate_proj.weight).
+    # For models with multiple shared experts, split that tensor
+    # evenly into per-shared-expert slices and load them into
+    # appended expert slots mlp.experts.{n_routed_experts + j}.*
+    # accordingly.
+    num_chunks = 1
+    if is_fuse_shared_experts_layer:
+        num_chunks = getattr(self.config, "n_shared_experts",
+                             1) or 1
+        # Determine split axis based on op type
+        # gate/up: ColumnParallel → split along dim 0
+        # down: RowParallel → split along dim 1
+        split_dim = 1 if "down_proj.weight" in name else 0
+        total = loaded_weight.shape[split_dim]
+        assert total % num_chunks == 0, (
+            f"Shared expert weight dim {total} "
+            f"not divisible by num_chunks {num_chunks}")
+        chunk_size = total // num_chunks
+
+    for j in range(num_chunks):
+        chunk_name = name
+        weight_to_load = loaded_weight
+
+        if is_fuse_shared_experts_layer:
+            if split_dim == 0:
+                weight_to_load = loaded_weight[j * chunk_size:(j + 1) *
+                                               chunk_size, :]
+            else:
+                weight_to_load = loaded_weight[:, j * chunk_size:(j + 1) *
+                                               chunk_size]
+            # Synthesize an expert-style name so expert mapping
+            # can route it
+            chunk_name = name.replace(
+                "mlp.shared_experts",
+                f"mlp.experts.{self.config.n_routed_experts + j}")
+
+        # Use expert_params_mapping to locate the destination
+        # param and delegate to its expert-aware weight_loader
+        # with expert_id.
+        for mapping in expert_params_mapping:
+            param_name, weight_name, expert_id, shard_id = mapping
+            if weight_name not in chunk_name:
+                continue
+
+            # Anyway, this is an expert weight and should not be
+            # attempted to load as other weights later
+            is_expert_weight = True
+
+            # Do not modify `name` since the loop may continue here
+            # Instead, create a new variable
+            name_mapped = chunk_name.replace(weight_name, param_name)
+
+            if is_pp_missing_parameter(name_mapped, self):
+                continue
+
+            param = params_dict[name_mapped]
+            # We should ask the weight loader to return success or
+            # not here since otherwise we may skip experts with
+            # other available replicas.
+            weight_loader = typing.cast(Callable[..., bool],
+                                        param.weight_loader)
+            success = weight_loader(param,
+                                    weight_to_load,
+                                    name_mapped,
+                                    shard_id=shard_id,
+                                    expert_id=expert_id,
+                                    return_success=True)
+            if success:
+                if not is_fuse_shared_experts_layer:
+                    name = name_mapped
+                break
+        else:
+            if is_expert_weight:
+                # We've checked that this is an expert weight
+                # However it's not mapped locally to this rank
+                # So we simply skip it
+                continue
+
+            # Skip loading extra bias for GPTQ models.
+            if name.endswith(".bias") and name not in params_dict:
+                continue
+
+            # Remapping the name of FP8 kv-scale.
+            name = maybe_remap_kv_scale_name(name, params_dict)
+            if name is None:
+                continue
+
+            if is_pp_missing_parameter(name, self):
+                continue
+
+            param = params_dict[name]
+            weight_loader = getattr(param, "weight_loader",
+                                    default_weight_loader)
+            weight_loader(param, loaded_weight)
 loaded_params.add(name)
```
|
The logic for loading shared expert weights when `is_fuse_shared_experts_layer` is true does not correctly update the `loaded_params` set. It adds the original shared expert tensor name (e.g., `...mlp.shared_experts.gate_proj.weight`) to `loaded_params`, but it actually loads the weights into multiple, chunked expert parameters (e.g., `mlp.experts.64.*`, `mlp.experts.65.*`, etc.).
As a result, vLLM will not be aware that these chunked expert parameters have been loaded, which will likely lead to "missing keys" errors at the end of the weight loading process or incorrect model behavior if those checks are bypassed.
The fix is to add each `name_mapped` to `loaded_params` as it is successfully loaded within the `for j in range(num_chunks):` loop, and then prevent the original shared expert name from being added to `loaded_params` at the end of the outer loop over weights.
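To make the suggested bookkeeping concrete, here is a small self-contained toy sketch of it; the layer name, the expert counts, and the `success = True` stand-in are illustrative, not vLLM's actual loader:

```python
# Toy illustration of the suggested bookkeeping, not vLLM's real loader.
n_routed_experts, n_shared_experts = 64, 2
is_fuse_shared_experts_layer = True
loaded_params: set[str] = set()
name = "model.layers.1.mlp.shared_experts.gate_proj.weight"

for j in range(n_shared_experts):
    # Synthesized per-chunk name that the expert mapping actually loads into.
    name_mapped = name.replace("mlp.shared_experts",
                               f"mlp.experts.{n_routed_experts + j}")
    success = True  # stand-in for weight_loader(..., return_success=True)
    if success and is_fuse_shared_experts_layer:
        loaded_params.add(name_mapped)  # record what was really loaded

# Only non-fused weights keep recording the original checkpoint name.
if not is_fuse_shared_experts_layer:
    loaded_params.add(name)

print(sorted(loaded_params))
# ['model.layers.1.mlp.experts.64.gate_proj.weight',
#  'model.layers.1.mlp.experts.65.gate_proj.weight']
```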
@kliuae, is this comment relevant? I'm not familiar enough with the weight loading to know.
Currently it's not checked in vLLM, and yes down the line it'd be good to align the loaded_params with the names actually loaded. We'll make updates accordingly to reflect this.
cc @qli88
This pull request has merge conflicts that must be resolved before it can be merged.
```diff
-        if config.n_shared_experts is None:
+        if (
+            config.n_shared_experts is None
+            or is_rocm_aiter_fusion_shared_expert_enabled()
+        ):
```
Note that I refactored this recently so that there's only an instance of `SharedFusedMoE`. It can handle when `self.shared_experts` is None, so it should be simple to keep it mostly the same except for passing `n_shared_experts` when `config.n_shared_experts` is true.
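A rough sketch of what that construction could look like; the keyword arguments are assumptions based on this comment rather than the exact `SharedFusedMoE` signature:

```python
# Sketch only: keyword names here are assumptions, not the exact
# SharedFusedMoE signature in vLLM.
fuse_shared = (config.n_shared_experts is not None
               and is_rocm_aiter_fusion_shared_expert_enabled())

self.experts = SharedFusedMoE(
    # With fusion enabled the shared experts run inside the fused MoE op,
    # so no separate shared-experts module is passed.
    shared_experts=None if fuse_shared else self.shared_experts,
    n_shared_experts=config.n_shared_experts if fuse_shared else None,
    # ...remaining FusedMoE arguments unchanged...
)
```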
```python
# Use expert_params_mapping to locate the destination
# param and delegate to its expert-aware weight_loader
# with expert_id.
for mapping in expert_params_mapping:
    param_name, weight_name, expert_id, shard_id = mapping
    if weight_name not in chunk_name:
        continue

    # Anyway, this is an expert weight and should not be
    # attempted to load as other weights later
    is_expert_weight = True

    # Do not modify `name` since the loop may continue here
    # Instead, create a new variable
    name_mapped = chunk_name.replace(weight_name, param_name)

    if is_pp_missing_parameter(name_mapped, self):
        continue

    param = params_dict[name_mapped]
    # We should ask the weight loader to return success or
    # not here since otherwise we may skip experts with
    # other available replicas.
    weight_loader = typing.cast(
        Callable[..., bool], param.weight_loader
    )
    success = weight_loader(
        param,
        weight_to_load,
        name_mapped,
        shard_id=shard_id,
        expert_id=expert_id,
        return_success=True,
    )
    if success:
        if not is_fuse_shared_experts_layer:
            name = name_mapped
        break
else:
    if is_expert_weight:
        # We've checked that this is an expert weight
        # However it's not mapped locally to this rank
        # So we simply skip it
        continue

    # Skip loading extra bias for GPTQ models.
    if name.endswith(".bias") and name not in params_dict:
        continue

    # Remapping the name of FP8 kv-scale.
    name = maybe_remap_kv_scale_name(name, params_dict)
    if name is None:
        continue

    if is_pp_missing_parameter(name, self):
        continue

    param = params_dict[name]
    weight_loader = getattr(
        param, "weight_loader", default_weight_loader
    )
    weight_loader(param, loaded_weight)
```
Is this bit basically the same as before? It's a little hard to tell the way the diff shows up.
Yes, this is largely the same as before. The change is that, since in DeepSeek-V2-Lite the weights of the multiple shared experts are provided as single widened tensors, `load_weights` now chunks those tensors by the number of shared experts and wraps their loading in a loop (see the sketch below). For the other layers, and when shared-experts fusion is not enabled, the number of chunks is set to one and the loading logic remains the same as before.
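For illustration, a minimal standalone sketch of that chunking; the tensor shapes roughly follow DeepSeek-V2-Lite's configuration (two shared experts, hidden size 2048, MoE intermediate size 1408) and are used here only as an example:

```python
import torch

# Hypothetical widened shared-experts weights for a model with two shared
# experts: gate/up projections stack along dim 0, down projections along dim 1.
n_shared_experts = 2
gate_proj = torch.randn(2 * 1408, 2048)   # ColumnParallel-style weight
down_proj = torch.randn(2048, 2 * 1408)   # RowParallel-style weight

for name, weight in [("gate_proj.weight", gate_proj),
                     ("down_proj.weight", down_proj)]:
    split_dim = 1 if "down_proj.weight" in name else 0
    chunk_size = weight.shape[split_dim] // n_shared_experts
    for j in range(n_shared_experts):
        chunk = weight.narrow(split_dim, j * chunk_size, chunk_size)
        # Each chunk would be loaded as synthetic expert n_routed_experts + j.
        print(name, j, tuple(chunk.shape))
```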
```python
    ns_topk_weights, s_topk_weights = total_topk_weights.split(
        [top_k, n_shared_experts + is_EP], dim=1
    )
    s_topk_weights.fill_(shared_experts_score)
    aiter_topK_meta_data = (total_topk_weights, total_topk_ids)
```
Can you assert `aiter_topK_meta_data is None` here so that if we run into the situation where the parameters to the init function change, we don't silently overwrite the global with a different value? I assume since the init function is cached it should only ever do this assignment once as long as the input parameters remain unchanged.
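Something along these lines at the assignment site would capture that; the assertion message is arbitrary:

```python
# Guard against a later call with different arguments silently replacing the
# shared buffers (lru_cache only deduplicates calls with identical arguments).
assert aiter_topK_meta_data is None, (
    "aiter_topK_meta_data should only be initialized once")
aiter_topK_meta_data = (total_topk_weights, total_topk_ids)
```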
LGTM, nice work! I had a few final questions/comments though.
Purpose
This PR targets ROCm AITER, introducing a flag-gated path that fuses DeepSeek models' shared experts into the AITER FusedMoE kernel, removing the separate shared-experts MLP pass and the subsequent addition while preserving numeric behavior.
When shared-experts fusion is enabled, the shared experts are treated as synthetic routed experts appended after the original routed experts and are allocated top-k slots through `grouped_topk`, enabling a single fused MoE dispatch for both shared and routed experts.
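As a purely conceptual illustration of that id layout (the expert counts follow DeepSeek-V2-Lite-style values, and the routed ids below are placeholders for router-selected ones):

```python
# Routed experts occupy ids [0, n_routed_experts); shared experts are viewed
# as synthetic experts appended right after them, so a single fused dispatch
# can cover both.
n_routed_experts, n_shared_experts, top_k = 64, 2, 6

shared_ids = [n_routed_experts + i for i in range(n_shared_experts)]  # [64, 65]

# Per token, grouped_topk fills top_k routed slots (really chosen by the
# router; the first six ids are just a stand-in here) plus the shared slots.
routed_slots = list(range(top_k))           # stand-in for router-selected ids
token_topk_ids = routed_slots + shared_ids  # e.g. [0, 1, 2, 3, 4, 5, 64, 65]
print(token_topk_ids)
```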
This feature can be controlled by the environment flag `VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS`, which is only effective when `VLLM_ROCM_USE_AITER_MOE` is set.

Test Plan
The following tests validate DeepSeek models by collecting benchmark metrics and performing correctness tests through lm_eval.
vLLM server launch command:
Benchmark commands:
lm_eval command:
Test Result
Benchmark results

deepseek-ai/DeepSeek-R1 on the sharegpt dataset (table columns: Shared Experts, Rate)

deepseek-ai/DeepSeek-R1 on the random dataset, input-len/output-len: 1k/1k (table columns: Shared Experts, Rate)

Accuracy test

deepseek-ai/DeepSeek-R1 (table column: Shared Experts)

deepseek-ai/DeepSeek-V3 (table column: Shared Experts)

deepseek-ai/DeepSeek-V2-Lite-Chat (table column: Shared Experts)
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.