
Conversation

@Jintao-Huang Jintao-Huang commented Sep 14, 2025

huggingface/transformers#40795

Shell scripts: https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_vl

Use Transformers

Training:

# 8 * 70GiB
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift sft \
    --model Qwen/Qwen3-VL-235B-A22B-Instruct \
    --dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite#20000' \
    --split_dataset_ratio 0.01 \
    --train_type lora \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --attn_impl flash_attn \
    --padding_free true \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --router_aux_loss_coef 1e-3 \
    --freeze_vit true \
    --freeze_aligner true \
    --gradient_accumulation_steps 4 \
    --gradient_checkpointing true \
    --eval_steps 50 \
    --save_steps 50 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataset_num_proc 4 \
    --dataloader_num_workers 4

Inference:

Perform inference using the validation set:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift infer \
    --adapters output/vx-xxx/checkpoint-xxx \
    --stream true \
    --load_data_args true \
    --max_new_tokens 512

Use Megatron

HF -> MCore checkpoint

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift export \
    --model Qwen/Qwen3-VL-235B-A22B-Instruct \
    --to_mcore true \
    --torch_dtype bfloat16 \
    --output_dir Qwen3-VL-235B-A22B-Instruct-mcore

Training:

# 8 * 80GiB; 45min
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=8 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
megatron sft \
    --load Qwen3-VL-235B-A22B-Instruct-mcore \
    --dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite#20000' \
    --split_dataset_ratio 0.01 \
    --train_type lora \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --moe_permute_fusion true \
    --tensor_model_parallel_size 2 \
    --expert_tensor_parallel_size 1 \
    --expert_model_parallel_size 8 \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 1e-6 \
    --micro_batch_size 1 \
    --global_batch_size 4 \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --max_epochs 1 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --lr 1e-4 \
    --lr_warmup_fraction 0.05 \
    --min_lr 1e-5 \
    --save megatron_output/Qwen3-VL-235B-A22B-Instruct \
    --eval_interval 200 \
    --save_interval 200 \
    --max_length 2048 \
    --packing true \
    --num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --sequence_parallel true \
    --attention_backend flash
[Screenshots: Megatron training run output, 2025-09-24 07:40]

MCore checkpoint -> HF

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift export \
    --mcore_adapters megatron_output/Qwen3-VL-235B-A22B-Instruct/vx-xxx \
    --to_hf true \
    --torch_dtype bfloat16 \
    --output_dir megatron_output/Qwen3-VL-235B-A22B-Instruct/vx-xxx-hf

Inference:

Perform inference using the validation set:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift infer \
    --model megatron_output/Qwen3-VL-235B-A22B-Instruct/vx-xxx-hf \
    --stream true \
    --load_data_args true \
    --max_new_tokens 512
[Screenshot: inference output, 2025-09-24 08:46]

@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Summary of Changes

Hello @Jintao-Huang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates the Qwen3-VL and Qwen3-MoE-VL models into the system, enabling their use for multimodal tasks. The changes involve updating model constants, registering new model architectures, and implementing a specific template to manage the unique vision processing requirements of these models. Additionally, new test cases have been added to ensure the correct functionality of the newly supported models for both image and video inputs.

Highlights

  • New Model Support: Added support for Qwen3-VL and Qwen3-MoE-VL multimodal large language models, expanding the range of available models.
  • Vision Processing Enhancements: Introduced new parameters for vision processing within the Qwen models, including image and video token management, and spatial merging.
  • Qwen3-VL Template Integration: Implemented a dedicated template (Qwen3VLTemplate) to correctly handle image and video token encoding for the Qwen3-VL model, including dynamic token length adjustments (a short sketch of the token-length computation follows this list).
  • Updated Dependencies: Updated the required version for qwen_vl_utils to 0.0.12 to support the new Qwen3-VL features.
  • New Test Cases: Added comprehensive test cases for both image and video inference to validate the integration of Qwen3-VL and Qwen3-MoE-VL models.
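
As a rough illustration of the "dynamic token length adjustments" mentioned in the template-integration highlight above, here is a small, hypothetical sketch of how the number of placeholder tokens for an image can be derived from its patch grid and spatial merge size. The variable names and the placeholder string are assumptions for illustration, not code from this PR:

import torch

# Hypothetical example: one image with a (t, h, w) patch grid of (1, 32, 24)
# and a spatial merge size of 2, so merge_length = 2 * 2 = 4.
media_grid_thw = torch.tensor([[1, 32, 24]])
merge_length = 2 ** 2

token_len = int(media_grid_thw[0].prod()) // merge_length  # 1 * 32 * 24 // 4 = 192
placeholder_tokens = ['<image_placeholder>'] * token_len   # expanded placeholder tokens
print(len(placeholder_tokens))  # 192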

@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request adds support for the Qwen3-VL and Qwen3-Moe-VL models. The changes are mostly about adding new model types, templates, and architecture definitions. While the implementation is largely complete, I've found a couple of critical issues that could lead to loading incorrect models or bugs when processing multiple videos. Additionally, there are several TODO comments that should be addressed, and some of the new tests are either incomplete or disabled. My review includes suggestions to fix these issues.

    token_len = (media_grid_thw[i].prod() // merge_length)
    return [media_token] * token_len
else:
    return splited_tokens[0]

critical

When processing multiple videos, splited_tokens contains token lists for each video. However, splited_tokens[0] is always used, which is incorrect for any video after the first one. It should use splited_tokens[i] to get the tokens for the i-th video.

Suggested change
- return splited_tokens[0]
+ return splited_tokens[i]
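
To make the indexing issue concrete, here is a minimal, hypothetical sketch; the list name mirrors the snippet above, but the data is invented:

splited_tokens = [['<video_0>'] * 4, ['<video_1>'] * 9]  # one token list per video

def expand_buggy(i):
    return splited_tokens[0]  # always returns the first video's tokens

def expand_fixed(i):
    return splited_tokens[i]  # returns the tokens for the i-th video

assert len(expand_buggy(1)) == 4  # wrong length for the second video
assert len(expand_fixed(1)) == 9  # correct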

get_model_tokenizer_qwen3_vl,
model_arch=ModelArch.qwen3_vl,
architectures=['Qwen3VLForConditionalGeneration'],
requires=[], # TODO

medium

The requires list is empty and marked with a TODO. This should be populated with the correct dependencies to ensure environment compatibility. Based on the code, it should likely include transformers, qwen_vl_utils, and decord with appropriate version constraints.

Suggested change
- requires=[], # TODO
+ requires=['transformers>=4.51', 'qwen_vl_utils>=0.0.12', 'decord'],

get_model_tokenizer_qwen3_moe_vl,
model_arch=ModelArch.qwen3_vl,
architectures=['Qwen3VLMoeForConditionalGeneration'],
requires=[], # TODO

medium

The requires list is empty and marked with a TODO. This should be populated with the correct dependencies to ensure environment compatibility. Based on the code, it should likely include transformers, qwen_vl_utils, and decord with appropriate version constraints.

Suggested change
- requires=[], # TODO
+ requires=['transformers>=4.51', 'qwen_vl_utils>=0.0.12', 'decord'],

videos=mm_data,
return_tensors='pt',
do_resize=False,
**inputs.mm_processor_kwargs) # TODO: check

medium

This TODO comment should be addressed. Please verify if mm_processor_kwargs are being passed and handled correctly before merging.

return encoded

def _post_encode(self, model, inputs: Dict[str, Any]) -> Dict[str, Any]:
    return inputs  # TODO: Waiting for transformers to support passing deepstack_visual_embeds.

medium

This TODO indicates that the implementation is incomplete and waiting for an upstream change in transformers. It would be good to track this and update the implementation once the feature is available, or remove the comment if it's no longer relevant.



def test_qwen3_vl():
    # TODO: fix

medium

This test is marked with a TODO: fix, which suggests it might be failing or is incomplete. This should be resolved before merging the pull request.

test_minicpmv4_5()
# test_minicpmv4_5()
test_qwen3_vl()
# test_qwen3_moe_vl()

medium

The test test_qwen3_moe_vl is commented out and will not be run. Please enable it to ensure the functionality for qwen3-moe-vl is properly tested.

Suggested change
- # test_qwen3_moe_vl()
+ test_qwen3_moe_vl()

@modelscope modelscope deleted a comment from gemini-code-assist bot Sep 14, 2025
@acdart acdart commented Sep 15, 2025

Will Megatron support be added?

@Jintao-Huang Jintao-Huang (Collaborator, Author) replied:

> Will Megatron support be added?

Yes, it will be supported.

@Jintao-Huang Jintao-Huang (Collaborator, Author) commented:

/gemini review

@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces support for the Qwen3-VL and Qwen3-VL-Moe models. The changes are comprehensive, spanning model type definitions, architecture registration, and new templates. The Megatron-LM integration is also updated to accommodate these models, with modifications for mROPE and MoE layer handling. My review highlights several TODO comments in the code, which point to areas that seem incomplete or require further verification. Addressing these points will help ensure the new models are robustly and correctly integrated.

get_model_tokenizer_qwen3_vl,
model_arch=ModelArch.qwen3_vl,
architectures=['Qwen3VLForConditionalGeneration'],
requires=[], # TODO

medium

The requires list is empty and marked with a TODO. This should be populated with the necessary dependencies for qwen3_vl and qwen3_moe_vl (on line 843) to ensure users have the correct environment. For example, it likely needs transformers and qwen_vl_utils with appropriate version constraints.

videos=mm_data,
return_tensors='pt',
do_resize=False,
**inputs.mm_processor_kwargs) # TODO: check

medium

The TODO: check comment suggests uncertainty about passing **inputs.mm_processor_kwargs. It's important to verify if this is the correct way to pass these arguments to the processor for videos and remove the comment once confirmed.

return encoded

def _post_encode(self, model, inputs: Dict[str, Any]) -> Dict[str, Any]:
    return inputs  # TODO: Waiting for transformers to support passing deepstack_visual_embeds.

medium

The TODO comment indicates that this part of the implementation is a workaround, pending support for deepstack_visual_embeds in the transformers library. While this is an external dependency, it would be beneficial to clarify the expected behavior or any potential limitations of the current implementation until the upstream support is available.



def test_qwen3_vl():
    # TODO: fix

medium

The TODO: fix comment in the test_qwen3_vl test function indicates a known issue or an incomplete test. This should be addressed to ensure that the new functionality is properly validated and to prevent merging potentially broken or untested code.

@Jintao-Huang Jintao-Huang changed the title from "[model] support Qwen3-VL" to "[model] support Qwen3-VL (transformers/megatron)" on Sep 15, 2025
@acdart acdart commented Sep 21, 2025

Enabling context_parallel_size raises an error.

Settings:
context_parallel_size=4
expert_model_parallel_size=4
pipeline_model_parallel_size=8
tensor_model_parallel_size=2

Error:
File "/root/code/ms-swift/swift/megatron/model/mm_gpt/qwen3_vl.py", line 422, in _deepstack_process
local_this = hidden_states[visual_pos_masks, :].clone() + visual_embeds
IndexError: The shape of the mask [16384, 1] at index 0 does not match the shape of the indexed tensor [4096, 1, 4096] at index 0
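
A minimal sketch that reproduces the same error message, assuming (not confirmed in this thread) that sequence/context parallelism leaves hidden_states with only its local 4096-token shard while visual_pos_masks still spans the full 16384-token sequence:

import torch

full_seq, local_seq, hidden = 16384, 4096, 4096
hidden_states = torch.zeros(local_seq, 1, hidden)              # per-rank shard
visual_pos_masks = torch.zeros(full_seq, 1, dtype=torch.bool)  # unsharded mask

try:
    hidden_states[visual_pos_masks, :]
except IndexError as e:
    print(e)  # The shape of the mask [16384, 1] at index 0 does not match ...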

@Jintao-Huang Jintao-Huang (Collaborator, Author) commented:

/gemini review

@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request adds support for the Qwen3-VL model series. The changes are comprehensive, covering model definitions, templates, documentation, and tests for both transformers and Megatron backends. The implementation correctly follows the existing patterns for adding new models.

I've identified a critical issue in how the automodel_class is determined, which could lead to loading models with an incorrect class. I've provided suggestions to fix this. Additionally, I've pointed out a naming issue in the Megatron implementation where class names are overly specific for their shared usage, which could impact future maintainability. I've also noted in this summary that a file was renamed to qwen3_vl.py but contains logic for both qwen3_vl and qwen3_omni, which can be confusing; a more generic name like qwen3_mm.py would be more appropriate.

Overall, the PR is well-structured, but addressing these points will improve correctness and code clarity.

Comment on lines +830 to +834
from transformers import Qwen3VLForConditionalGeneration
require_version('qwen_vl_utils>=0.0.14')
kwargs['automodel_class'] = kwargs['automodel_class'] or Qwen3VLForConditionalGeneration
kwargs['_check_qwen_vl_utils'] = False
return get_model_tokenizer_qwen2_vl(model_dir, *args, **kwargs)

critical

The logic kwargs['automodel_class'] or ... is incorrect because the default automodel_class from the caller is AutoModelForCausalLM, which is truthy. This prevents the intended model class Qwen3VLForConditionalGeneration from being used, leading to the model being loaded with the wrong class. The suggested change correctly checks for the default class and replaces it, ensuring the correct model architecture is loaded.

Suggested change
- from transformers import Qwen3VLForConditionalGeneration
- require_version('qwen_vl_utils>=0.0.14')
- kwargs['automodel_class'] = kwargs['automodel_class'] or Qwen3VLForConditionalGeneration
- kwargs['_check_qwen_vl_utils'] = False
- return get_model_tokenizer_qwen2_vl(model_dir, *args, **kwargs)
+ from transformers import Qwen3VLForConditionalGeneration, AutoModelForCausalLM
+ require_version('qwen_vl_utils>=0.0.14')
+ if kwargs.get('automodel_class') is AutoModelForCausalLM:
+     kwargs['automodel_class'] = Qwen3VLForConditionalGeneration
+ kwargs['_check_qwen_vl_utils'] = False
+ return get_model_tokenizer_qwen2_vl(model_dir, *args, **kwargs)
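
The pitfall behind this suggestion in one self-contained example; the two classes below are stand-ins for the real transformers classes, used purely for illustration:

class AutoModelForCausalLM:  # stand-in for the caller's truthy default
    pass

class Qwen3VLForConditionalGeneration:  # stand-in for the intended override
    pass

kwargs = {'automodel_class': AutoModelForCausalLM}

# Buggy pattern: `or` never applies the override because a class object is truthy.
buggy = kwargs['automodel_class'] or Qwen3VLForConditionalGeneration
assert buggy is AutoModelForCausalLM

# Fixed pattern: replace the value only when it is still the generic default.
if kwargs.get('automodel_class') is AutoModelForCausalLM:
    kwargs['automodel_class'] = Qwen3VLForConditionalGeneration
assert kwargs['automodel_class'] is Qwen3VLForConditionalGeneration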

Comment on lines +849 to +853
from transformers import Qwen3VLMoeForConditionalGeneration
require_version('qwen_vl_utils>=0.0.14')
kwargs['automodel_class'] = kwargs['automodel_class'] or Qwen3VLMoeForConditionalGeneration
kwargs['_check_qwen_vl_utils'] = False
return get_model_tokenizer_qwen2_vl(model_dir, *args, **kwargs)

critical

The logic kwargs['automodel_class'] or ... is incorrect because the default automodel_class from the caller is AutoModelForCausalLM, which is truthy. This prevents the intended model class Qwen3VLMoeForConditionalGeneration from being used, leading to the model being loaded with the wrong class. The suggested change correctly checks for the default class and replaces it, ensuring the correct model architecture is loaded.

Suggested change
- from transformers import Qwen3VLMoeForConditionalGeneration
- require_version('qwen_vl_utils>=0.0.14')
- kwargs['automodel_class'] = kwargs['automodel_class'] or Qwen3VLMoeForConditionalGeneration
- kwargs['_check_qwen_vl_utils'] = False
- return get_model_tokenizer_qwen2_vl(model_dir, *args, **kwargs)
+ from transformers import Qwen3VLMoeForConditionalGeneration, AutoModelForCausalLM
+ require_version('qwen_vl_utils>=0.0.14')
+ if kwargs.get('automodel_class') is AutoModelForCausalLM:
+     kwargs['automodel_class'] = Qwen3VLMoeForConditionalGeneration
+ kwargs['_check_qwen_vl_utils'] = False
+ return get_model_tokenizer_qwen2_vl(model_dir, *args, **kwargs)



- class Qwen3OmniGPTModel(MultimodalGPTModel):
+ class Qwen3VLGPTModel(MultimodalGPTModel):

medium

The class Qwen3VLGPTModel (and Qwen3VLTransformerBlock) is used for both qwen3_vl and qwen3_omni models. The VL suffix is misleading and harms maintainability. Consider renaming them to something more generic, like Qwen3MMGPTModel and Qwen3MMTransformerBlock respectively, to reflect their shared usage across different Qwen3 multimodal models.

@Jintao-Huang Jintao-Huang merged commit 9f5fcf5 into modelscope:main Sep 24, 2025
1 of 2 checks passed