
Conversation

@Jintao-Huang Jintao-Huang commented Sep 14, 2025

huggingface/transformers#40795

Shell scripts: https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_vl

Use Transformers

Training:

# 8 * 70GiB
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift sft \
    --model Qwen/Qwen3-VL-235B-A22B-Instruct \
    --dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite#20000' \
    --split_dataset_ratio 0.01 \
    --train_type lora \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --attn_impl flash_attn \
    --padding_free true \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --router_aux_loss_coef 1e-3 \
    --freeze_vit true \
    --freeze_aligner true \
    --gradient_accumulation_steps 4 \
    --gradient_checkpointing true \
    --eval_steps 50 \
    --save_steps 50 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataset_num_proc 4 \
    --dataloader_num_workers 4

Inference:

Perform inference using the validation set:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift infer \
    --adapters output/vx-xxx/checkpoint-xxx \
    --stream true \
    --load_data_args true \
    --max_new_tokens 512

Use Megatron

HF -> MCore checkpoint

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift export \
    --model Qwen/Qwen3-VL-235B-A22B-Instruct \
    --to_mcore true \
    --torch_dtype bfloat16 \
    --output_dir Qwen3-VL-235B-A22B-Instruct-mcore

Training:

# 8 * 80GiB; 45min
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=8 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
megatron sft \
    --load Qwen3-VL-235B-A22B-Instruct-mcore \
    --dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite#20000' \
    --split_dataset_ratio 0.01 \
    --train_type lora \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --moe_permute_fusion true \
    --tensor_model_parallel_size 2 \
    --expert_tensor_parallel_size 1 \
    --expert_model_parallel_size 8 \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 1e-6 \
    --micro_batch_size 1 \
    --global_batch_size 4 \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --max_epochs 1 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --lr 1e-4 \
    --lr_warmup_fraction 0.05 \
    --min_lr 1e-5 \
    --save megatron_output/Qwen3-VL-235B-A22B-Instruct \
    --eval_interval 200 \
    --save_interval 200 \
    --max_length 2048 \
    --packing true \
    --num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --sequence_parallel true \
    --attention_backend flash
[Screenshots: Megatron training run output, 2025-09-24 07:40]

MCore checkpoint -> HF

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift export \
    --mcore_adapters megatron_output/Qwen3-VL-235B-A22B-Instruct/vx-xxx \
    --to_hf true \
    --torch_dtype bfloat16 \
    --output_dir megatron_output/Qwen3-VL-235B-A22B-Instruct/vx-xxx-hf

Inference:

Perform inference using the validation set:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift infer \
    --model megatron_output/Qwen3-VL-235B-A22B-Instruct/vx-xxx-hf \
    --stream true \
    --load_data_args true \
    --max_new_tokens 512
[Screenshot: inference output, 2025-09-24 08:46]

@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Summary of Changes

Hello @Jintao-Huang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates the Qwen3-VL and Qwen3-MoE-VL models into the system, enabling their use for multimodal tasks. The changes involve updating model constants, registering new model architectures, and implementing a specific template to manage the unique vision processing requirements of these models. Additionally, new test cases have been added to ensure the correct functionality of the newly supported models for both image and video inputs.

Highlights

  • New Model Support: Added support for Qwen3-VL and Qwen3-MoE-VL multimodal large language models, expanding the range of available models.
  • Vision Processing Enhancements: Introduced new parameters for vision processing within the Qwen models, including image and video token management, and spatial merging.
  • Qwen3-VL Template Integration: Implemented a dedicated template (Qwen3VLTemplate) to correctly handle image and video token encoding for the Qwen3-VL model, including dynamic token length adjustments (a short sketch of the token-length computation follows this list).
  • Updated Dependencies: Updated the required version for qwen_vl_utils to 0.0.12 to support the new Qwen3-VL features.
  • New Test Cases: Added comprehensive test cases for both image and video inference to validate the integration of Qwen3-VL and Qwen3-MoE-VL models.
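
As a rough illustration of the "dynamic token length adjustments" mentioned in the template-integration highlight above, here is a small, hypothetical sketch of how the number of placeholder tokens for an image can be derived from its patch grid and spatial merge size. The variable names and the placeholder string are assumptions for illustration, not code from this PR:

import torch

# Hypothetical example: one image with a (t, h, w) patch grid of (1, 32, 24)
# and a spatial merge size of 2, so merge_length = 2 * 2 = 4.
media_grid_thw = torch.tensor([[1, 32, 24]])
merge_length = 2 ** 2

token_len = int(media_grid_thw[0].prod()) // merge_length  # 1 * 32 * 24 // 4 = 192
placeholder_tokens = ['<image_placeholder>'] * token_len   # expanded placeholder tokens
print(len(placeholder_tokens))  # 192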

@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request adds support for the Qwen3-VL and Qwen3-Moe-VL models. The changes are mostly about adding new model types, templates, and architecture definitions. While the implementation is largely complete, I've found a couple of critical issues that could lead to loading incorrect models or bugs when processing multiple videos. Additionally, there are several TODO comments that should be addressed, and some of the new tests are either incomplete or disabled. My review includes suggestions to fix these issues.

    token_len = (media_grid_thw[i].prod() // merge_length)
    return [media_token] * token_len
else:
    return splited_tokens[0]

critical

When processing multiple videos, splited_tokens contains token lists for each video. However, splited_tokens[0] is always used, which is incorrect for any video after the first one. It should use splited_tokens[i] to get the tokens for the i-th video.

Suggested change
- return splited_tokens[0]
+ return splited_tokens[i]
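
To make the indexing issue concrete, here is a minimal, hypothetical sketch; the list name mirrors the snippet above, but the data is invented:

splited_tokens = [['<video_0>'] * 4, ['<video_1>'] * 9]  # one token list per video

def expand_buggy(i):
    return splited_tokens[0]  # always returns the first video's tokens

def expand_fixed(i):
    return splited_tokens[i]  # returns the tokens for the i-th video

assert len(expand_buggy(1)) == 4  # wrong length for the second video
assert len(expand_fixed(1)) == 9  # correct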

get_model_tokenizer_qwen3_vl,
model_arch=ModelArch.qwen3_vl,
architectures=['Qwen3VLForConditionalGeneration'],
requires=[], # TODO

medium

The requires list is empty and marked with a TODO. This should be populated with the correct dependencies to ensure environment compatibility. Based on the code, it should likely include transformers, qwen_vl_utils, and decord with appropriate version constraints.

Suggested change
- requires=[], # TODO
+ requires=['transformers>=4.51', 'qwen_vl_utils>=0.0.12', 'decord'],

get_model_tokenizer_qwen3_moe_vl,
model_arch=ModelArch.qwen3_vl,
architectures=['Qwen3VLMoeForConditionalGeneration'],
requires=[], # TODO

medium

The requires list is empty and marked with a TODO. This should be populated with the correct dependencies to ensure environment compatibility. Based on the code, it should likely include transformers, qwen_vl_utils, and decord with appropriate version constraints.

Suggested change
- requires=[], # TODO
+ requires=['transformers>=4.51', 'qwen_vl_utils>=0.0.12', 'decord'],

videos=mm_data,
return_tensors='pt',
do_resize=False,
**inputs.mm_processor_kwargs) # TODO: check

medium

This TODO comment should be addressed. Please verify if mm_processor_kwargs are being passed and handled correctly before merging.

return encoded

def _post_encode(self, model, inputs: Dict[str, Any]) -> Dict[str, Any]:
    return inputs  # TODO: Waiting for transformers to support passing deepstack_visual_embeds.

medium

This TODO indicates that the implementation is incomplete and waiting for an upstream change in transformers. It would be good to track this and update the implementation once the feature is available, or remove the comment if it's no longer relevant.



def test_qwen3_vl():
    # TODO: fix

medium

This test is marked with a TODO: fix, which suggests it might be failing or is incomplete. This should be resolved before merging the pull request.

test_minicpmv4_5()
# test_minicpmv4_5()
test_qwen3_vl()
# test_qwen3_moe_vl()

medium

The test test_qwen3_moe_vl is commented out and will not be run. Please enable it to ensure the functionality for qwen3-moe-vl is properly tested.

Suggested change
- # test_qwen3_moe_vl()
+ test_qwen3_moe_vl()

@modelscope modelscope deleted a comment from gemini-code-assist bot Sep 14, 2025
@acdart acdart commented Sep 15, 2025

Will Megatron support be added?

@Jintao-Huang Jintao-Huang (Collaborator, Author) replied:

> Will Megatron support be added?

Yes, it will be supported.

@Jintao-Huang Jintao-Huang (Collaborator, Author) commented:

/gemini review

@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces support for the Qwen3-VL and Qwen3-VL-Moe models. The changes are comprehensive, spanning model type definitions, architecture registration, and new templates. The Megatron-LM integration is also updated to accommodate these models, with modifications for mROPE and MoE layer handling. My review highlights several TODO comments in the code, which point to areas that seem incomplete or require further verification. Addressing these points will help ensure the new models are robustly and correctly integrated.

get_model_tokenizer_qwen3_vl,
model_arch=ModelArch.qwen3_vl,
architectures=['Qwen3VLForConditionalGeneration'],
requires=[], # TODO

medium

The requires list is empty and marked with a TODO. This should be populated with the necessary dependencies for qwen3_vl and qwen3_moe_vl (on line 843) to ensure users have the correct environment. For example, it likely needs transformers and qwen_vl_utils with appropriate version constraints.

videos=mm_data,
return_tensors='pt',
do_resize=False,
**inputs.mm_processor_kwargs) # TODO: check

medium

The TODO: check comment suggests uncertainty about passing **inputs.mm_processor_kwargs. It's important to verify if this is the correct way to pass these arguments to the processor for videos and remove the comment once confirmed.

return encoded

def _post_encode(self, model, inputs: Dict[str, Any]) -> Dict[str, Any]:
    return inputs  # TODO: Waiting for transformers to support passing deepstack_visual_embeds.

medium

The TODO comment indicates that this part of the implementation is a workaround, pending support for deepstack_visual_embeds in the transformers library. While this is an external dependency, it would be beneficial to clarify the expected behavior or any potential limitations of the current implementation until the upstream support is available.



def test_qwen3_vl():
    # TODO: fix

medium

The TODO: fix comment in the test_qwen3_vl test function indicates a known issue or an incomplete test. This should be addressed to ensure that the new functionality is properly validated and to prevent merging potentially broken or untested code.

@Jintao-Huang Jintao-Huang changed the title from "[model] support Qwen3-VL" to "[model] support Qwen3-VL (transformers/megatron)" on Sep 15, 2025
@acdart acdart commented Sep 21, 2025

Enabling context_parallel_size raises an error.

Settings:
context_parallel_size=4
expert_model_parallel_size=4
pipeline_model_parallel_size=8
tensor_model_parallel_size=2

Error:
File "/root/code/ms-swift/swift/megatron/model/mm_gpt/qwen3_vl.py", line 422, in _deepstack_process
local_this = hidden_states[visual_pos_masks, :].clone() + visual_embeds
IndexError: The shape of the mask [16384, 1] at index 0 does not match the shape of the indexed tensor [4096, 1, 4096] at index 0
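
A minimal sketch that reproduces the same error message, assuming (not confirmed in this thread) that sequence/context parallelism leaves hidden_states with only its local 4096-token shard while visual_pos_masks still spans the full 16384-token sequence:

import torch

full_seq, local_seq, hidden = 16384, 4096, 4096
hidden_states = torch.zeros(local_seq, 1, hidden)              # per-rank shard
visual_pos_masks = torch.zeros(full_seq, 1, dtype=torch.bool)  # unsharded mask

try:
    hidden_states[visual_pos_masks, :]
except IndexError as e:
    print(e)  # The shape of the mask [16384, 1] at index 0 does not match ...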

@Jintao-Huang Jintao-Huang (Collaborator, Author) commented:

/gemini review

@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request adds support for the Qwen3-VL model series. The changes are comprehensive, covering model definitions, templates, documentation, and tests for both transformers and Megatron backends. The implementation correctly follows the existing patterns for adding new models.

I've identified a critical issue in how the automodel_class is determined, which could lead to loading models with an incorrect class. I've provided suggestions to fix this. Additionally, I've pointed out a naming issue in the Megatron implementation where class names are overly specific for their shared usage, which could impact future maintainability. I've also noted in this summary that a file was renamed to qwen3_vl.py but contains logic for both qwen3_vl and qwen3_omni, which can be confusing; a more generic name like qwen3_mm.py would be more appropriate.

Overall, the PR is well-structured, but addressing these points will improve correctness and code clarity.

Comment on lines +830 to +834
from transformers import Qwen3VLForConditionalGeneration
require_version('qwen_vl_utils>=0.0.14')
kwargs['automodel_class'] = kwargs['automodel_class'] or Qwen3VLForConditionalGeneration
kwargs['_check_qwen_vl_utils'] = False
return get_model_tokenizer_qwen2_vl(model_dir, *args, **kwargs)

critical

The logic kwargs['automodel_class'] or ... is incorrect because the default automodel_class from the caller is AutoModelForCausalLM, which is truthy. This prevents the intended model class Qwen3VLForConditionalGeneration from being used, leading to the model being loaded with the wrong class. The suggested change correctly checks for the default class and replaces it, ensuring the correct model architecture is loaded.

Suggested change
- from transformers import Qwen3VLForConditionalGeneration
- require_version('qwen_vl_utils>=0.0.14')
- kwargs['automodel_class'] = kwargs['automodel_class'] or Qwen3VLForConditionalGeneration
- kwargs['_check_qwen_vl_utils'] = False
- return get_model_tokenizer_qwen2_vl(model_dir, *args, **kwargs)
+ from transformers import Qwen3VLForConditionalGeneration, AutoModelForCausalLM
+ require_version('qwen_vl_utils>=0.0.14')
+ if kwargs.get('automodel_class') is AutoModelForCausalLM:
+     kwargs['automodel_class'] = Qwen3VLForConditionalGeneration
+ kwargs['_check_qwen_vl_utils'] = False
+ return get_model_tokenizer_qwen2_vl(model_dir, *args, **kwargs)
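
The pitfall behind this suggestion in one self-contained example; the two classes below are stand-ins for the real transformers classes, used purely for illustration:

class AutoModelForCausalLM:  # stand-in for the caller's truthy default
    pass

class Qwen3VLForConditionalGeneration:  # stand-in for the intended override
    pass

kwargs = {'automodel_class': AutoModelForCausalLM}

# Buggy pattern: `or` never applies the override because a class object is truthy.
buggy = kwargs['automodel_class'] or Qwen3VLForConditionalGeneration
assert buggy is AutoModelForCausalLM

# Fixed pattern: replace the value only when it is still the generic default.
if kwargs.get('automodel_class') is AutoModelForCausalLM:
    kwargs['automodel_class'] = Qwen3VLForConditionalGeneration
assert kwargs['automodel_class'] is Qwen3VLForConditionalGeneration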

Comment on lines +849 to +853
from transformers import Qwen3VLMoeForConditionalGeneration
require_version('qwen_vl_utils>=0.0.14')
kwargs['automodel_class'] = kwargs['automodel_class'] or Qwen3VLMoeForConditionalGeneration
kwargs['_check_qwen_vl_utils'] = False
return get_model_tokenizer_qwen2_vl(model_dir, *args, **kwargs)

critical

The logic kwargs['automodel_class'] or ... is incorrect because the default automodel_class from the caller is AutoModelForCausalLM, which is truthy. This prevents the intended model class Qwen3VLMoeForConditionalGeneration from being used, leading to the model being loaded with the wrong class. The suggested change correctly checks for the default class and replaces it, ensuring the correct model architecture is loaded.

Suggested change
- from transformers import Qwen3VLMoeForConditionalGeneration
- require_version('qwen_vl_utils>=0.0.14')
- kwargs['automodel_class'] = kwargs['automodel_class'] or Qwen3VLMoeForConditionalGeneration
- kwargs['_check_qwen_vl_utils'] = False
- return get_model_tokenizer_qwen2_vl(model_dir, *args, **kwargs)
+ from transformers import Qwen3VLMoeForConditionalGeneration, AutoModelForCausalLM
+ require_version('qwen_vl_utils>=0.0.14')
+ if kwargs.get('automodel_class') is AutoModelForCausalLM:
+     kwargs['automodel_class'] = Qwen3VLMoeForConditionalGeneration
+ kwargs['_check_qwen_vl_utils'] = False
+ return get_model_tokenizer_qwen2_vl(model_dir, *args, **kwargs)



- class Qwen3OmniGPTModel(MultimodalGPTModel):
+ class Qwen3VLGPTModel(MultimodalGPTModel):

medium

The class Qwen3VLGPTModel (and Qwen3VLTransformerBlock) is used for both qwen3_vl and qwen3_omni models. The VL suffix is misleading and harms maintainability. Consider renaming them to something more generic, like Qwen3MMGPTModel and Qwen3MMTransformerBlock respectively, to reflect their shared usage across different Qwen3 multimodal models.

@Jintao-Huang Jintao-Huang merged commit 9f5fcf5 into modelscope:main Sep 24, 2025
1 of 2 checks passed