[model] support Qwen3-VL (transformers/megatron) #5805
Conversation
Summary of Changes
Hello @Jintao-Huang, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request integrates the Qwen3-VL and Qwen3-MoE-VL models into the system, enabling their use for multimodal tasks. The changes involve updating model constants, registering new model architectures, and implementing a specific template to manage the unique vision processing requirements of these models. Additionally, new test cases have been added to ensure the correct functionality of the newly supported models for both image and video inputs.
Highlights
- New Model Support: Added support for Qwen3-VL and Qwen3-MoE-VL multimodal large language models, expanding the range of available models.
- Vision Processing Enhancements: Introduced new parameters for vision processing within the Qwen models, including image and video token management, and spatial merging.
- Qwen3-VL Template Integration: Implemented a dedicated template (`Qwen3VLTemplate`) to correctly handle image and video token encoding for the Qwen3-VL model, including dynamic token length adjustments (see the sketch after this list).
- Updated Dependencies: Updated the required version of `qwen_vl_utils` to 0.0.12 to support the new Qwen3-VL features.
- New Test Cases: Added comprehensive test cases for both image and video inference to validate the integration of Qwen3-VL and Qwen3-MoE-VL models.
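For reviewers unfamiliar with the token accounting the template relies on, here is a minimal sketch of the placeholder-token arithmetic, assuming the Qwen2-VL-style convention where `merge_length = merge_size ** 2`; the grid values and variable names are illustrative only, not the PR's code:

```python
import torch

# Illustrative sketch: each image/video contributes a (t, h, w) patch grid,
# and spatial merging collapses merge_size x merge_size patches into one
# token, so the number of placeholder tokens is grid.prod() // merge_size**2.
merge_size = 2                       # assumed default, as in Qwen2-VL processors
merge_length = merge_size ** 2

grid_thw = torch.tensor([1, 34, 46])             # hypothetical patch grid
token_len = grid_thw.prod().item() // merge_length
print(token_len)                                 # 391 placeholder tokens
```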
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution. ↩
Code Review
This pull request adds support for the Qwen3-VL and Qwen3-Moe-VL models. The changes are mostly about adding new model types, templates, and architecture definitions. While the implementation is largely complete, I've found a couple of critical issues that could lead to loading incorrect models or bugs when processing multiple videos. Additionally, there are several `TODO` comments that should be addressed, and some of the new tests are either incomplete or disabled. My review includes suggestions to fix these issues.
swift/llm/template/template/qwen.py (Outdated)
```python
                token_len = (media_grid_thw[i].prod() // merge_length)
                return [media_token] * token_len
            else:
                return splited_tokens[0]
```
When processing multiple videos, `splited_tokens` contains token lists for each video. However, `splited_tokens[0]` is always used, which is incorrect for any video after the first one. It should use `splited_tokens[i]` to get the tokens for the i-th video.
```diff
-                return splited_tokens[0]
+                return splited_tokens[i]
```
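For context, a sketch of the branch under review with the suggested fix applied; the function name and signature are reconstructed for illustration and are not the PR's exact code:

```python
from typing import List, Optional

import torch


def media_tokens(i: int,
                 media_grid_thw: Optional[torch.Tensor],
                 merge_length: int,
                 media_token: int,
                 splited_tokens: List[List[int]]) -> List[int]:
    # Hypothetical reconstruction: expand the placeholder token when grid
    # info is available; otherwise fall back to the pre-split token list
    # for the i-th media, not always the first.
    if media_grid_thw is not None:
        token_len = int(media_grid_thw[i].prod()) // merge_length
        return [media_token] * token_len
    return splited_tokens[i]  # was splited_tokens[0], wrong for i > 0
```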
swift/llm/model/model/qwen.py (Outdated)
```python
        get_model_tokenizer_qwen3_vl,
        model_arch=ModelArch.qwen3_vl,
        architectures=['Qwen3VLForConditionalGeneration'],
        requires=[],  # TODO
```
The `requires` list is empty and marked with a `TODO`. This should be populated with the correct dependencies to ensure environment compatibility. Based on the code, it should likely include `transformers`, `qwen_vl_utils`, and `decord` with appropriate version constraints.
```diff
-        requires=[],  # TODO
+        requires=['transformers>=4.51', 'qwen_vl_utils>=0.0.12', 'decord'],
```
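As a usage note, constraints like these are typically enforced at model-load time; below is a minimal sketch using the `transformers` version helper (the pins mirror the suggestion above and are assumptions, not verified minimum versions):

```python
from transformers.utils.versions import require_version

# Fail fast with a clear error if the environment is too old for Qwen3-VL.
# The pins mirror the suggested `requires` list and are assumptions;
# 'decord' would be checked the same way.
for requirement in ('transformers>=4.51', 'qwen_vl_utils>=0.0.12'):
    require_version(requirement)
```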
swift/llm/model/model/qwen.py (Outdated)
```python
        get_model_tokenizer_qwen3_moe_vl,
        model_arch=ModelArch.qwen3_vl,
        architectures=['Qwen3VLMoeForConditionalGeneration'],
        requires=[],  # TODO
```
The `requires` list is empty and marked with a `TODO`. This should be populated with the correct dependencies to ensure environment compatibility. Based on the code, it should likely include `transformers`, `qwen_vl_utils`, and `decord` with appropriate version constraints.
```diff
-        requires=[],  # TODO
+        requires=['transformers>=4.51', 'qwen_vl_utils>=0.0.12', 'decord'],
```
swift/llm/template/template/qwen.py (Outdated)
```python
            videos=mm_data,
            return_tensors='pt',
            do_resize=False,
            **inputs.mm_processor_kwargs)  # TODO: check
```
swift/llm/template/template/qwen.py (Outdated)
```python
        return encoded

    def _post_encode(self, model, inputs: Dict[str, Any]) -> Dict[str, Any]:
        return inputs  # TODO: Waiting for transformers to support passing deepstack_visual_embeds.
```
```python
def test_qwen3_vl():
    # TODO: fix
```
```diff
-    test_minicpmv4_5()
+    # test_minicpmv4_5()
+    test_qwen3_vl()
+    # test_qwen3_moe_vl()
```
Will there be Megatron support?

It will be supported.

/gemini review
Code Review
This pull request introduces support for the Qwen3-VL and Qwen3-VL-Moe models. The changes are comprehensive, spanning model type definitions, architecture registration, and new templates. The Megatron-LM integration is also updated to accommodate these models, with modifications for mROPE and MoE layer handling. My review highlights several `TODO` comments in the code, which point to areas that seem incomplete or require further verification. Addressing these points will help ensure the new models are robustly and correctly integrated.
swift/llm/model/model/qwen.py (Outdated)
```python
        get_model_tokenizer_qwen3_vl,
        model_arch=ModelArch.qwen3_vl,
        architectures=['Qwen3VLForConditionalGeneration'],
        requires=[],  # TODO
```
swift/llm/template/template/qwen.py (Outdated)
```python
            videos=mm_data,
            return_tensors='pt',
            do_resize=False,
            **inputs.mm_processor_kwargs)  # TODO: check
```
swift/llm/template/template/qwen.py (Outdated)
```python
        return encoded

    def _post_encode(self, model, inputs: Dict[str, Any]) -> Dict[str, Any]:
        return inputs  # TODO: Waiting for transformers to support passing deepstack_visual_embeds.
```
The `TODO` comment indicates that this part of the implementation is a workaround, pending support for `deepstack_visual_embeds` in the `transformers` library. While this is an external dependency, it would be beneficial to clarify the expected behavior or any potential limitations of the current implementation until the upstream support is available.
```python
def test_qwen3_vl():
    # TODO: fix
```
Enabling context_parallel_size raises an error. Setting: Error:

/gemini review
Code Review
This pull request adds support for the Qwen3-VL model series. The changes are comprehensive, covering model definitions, templates, documentation, and tests for both transformers and Megatron backends. The implementation correctly follows the existing patterns for adding new models.
I've identified a critical issue in how the `automodel_class` is determined, which could lead to loading models with an incorrect class. I've provided suggestions to fix this. Additionally, I've pointed out a naming issue in the Megatron implementation where class names are overly specific for their shared usage, which could impact future maintainability. I've also noted in this summary that a file was renamed to `qwen3_vl.py` but contains logic for both `qwen3_vl` and `qwen3_omni`, which can be confusing; a more generic name like `qwen3_mm.py` would be more appropriate.
Overall, the PR is well-structured, but addressing these points will improve correctness and code clarity.
```python
    from transformers import Qwen3VLForConditionalGeneration
    require_version('qwen_vl_utils>=0.0.14')
    kwargs['automodel_class'] = kwargs['automodel_class'] or Qwen3VLForConditionalGeneration
    kwargs['_check_qwen_vl_utils'] = False
    return get_model_tokenizer_qwen2_vl(model_dir, *args, **kwargs)
```
The logic `kwargs['automodel_class'] or ...` is incorrect because the default `automodel_class` from the caller is `AutoModelForCausalLM`, which is truthy. This prevents the intended model class `Qwen3VLForConditionalGeneration` from being used, leading to the model being loaded with the wrong class. The suggested change correctly checks for the default class and replaces it, ensuring the correct model architecture is loaded.
```diff
-    from transformers import Qwen3VLForConditionalGeneration
+    from transformers import Qwen3VLForConditionalGeneration, AutoModelForCausalLM
     require_version('qwen_vl_utils>=0.0.14')
-    kwargs['automodel_class'] = kwargs['automodel_class'] or Qwen3VLForConditionalGeneration
+    if kwargs.get('automodel_class') is AutoModelForCausalLM:
+        kwargs['automodel_class'] = Qwen3VLForConditionalGeneration
     kwargs['_check_qwen_vl_utils'] = False
     return get_model_tokenizer_qwen2_vl(model_dir, *args, **kwargs)
```
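To make the truthiness pitfall concrete, here is a small self-contained demonstration; `FakeQwen3VLClass` is a placeholder standing in for `Qwen3VLForConditionalGeneration`:

```python
from transformers import AutoModelForCausalLM


class FakeQwen3VLClass:  # placeholder for Qwen3VLForConditionalGeneration
    pass


kwargs = {'automodel_class': AutoModelForCausalLM}  # caller's truthy default

# Buggy pattern: `or` never substitutes because the default class is truthy.
buggy = kwargs['automodel_class'] or FakeQwen3VLClass
assert buggy is AutoModelForCausalLM  # the intended override never happens

# Fixed pattern: replace only the known default, keep explicit overrides.
if kwargs.get('automodel_class') is AutoModelForCausalLM:
    kwargs['automodel_class'] = FakeQwen3VLClass
assert kwargs['automodel_class'] is FakeQwen3VLClass
```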
```python
    from transformers import Qwen3VLMoeForConditionalGeneration
    require_version('qwen_vl_utils>=0.0.14')
    kwargs['automodel_class'] = kwargs['automodel_class'] or Qwen3VLMoeForConditionalGeneration
    kwargs['_check_qwen_vl_utils'] = False
    return get_model_tokenizer_qwen2_vl(model_dir, *args, **kwargs)
```
The logic `kwargs['automodel_class'] or ...` is incorrect because the default `automodel_class` from the caller is `AutoModelForCausalLM`, which is truthy. This prevents the intended model class `Qwen3VLMoeForConditionalGeneration` from being used, leading to the model being loaded with the wrong class. The suggested change correctly checks for the default class and replaces it, ensuring the correct model architecture is loaded.
```diff
-    from transformers import Qwen3VLMoeForConditionalGeneration
+    from transformers import Qwen3VLMoeForConditionalGeneration, AutoModelForCausalLM
     require_version('qwen_vl_utils>=0.0.14')
-    kwargs['automodel_class'] = kwargs['automodel_class'] or Qwen3VLMoeForConditionalGeneration
+    if kwargs.get('automodel_class') is AutoModelForCausalLM:
+        kwargs['automodel_class'] = Qwen3VLMoeForConditionalGeneration
     kwargs['_check_qwen_vl_utils'] = False
     return get_model_tokenizer_qwen2_vl(model_dir, *args, **kwargs)
```
```diff
-class Qwen3OmniGPTModel(MultimodalGPTModel):
+class Qwen3VLGPTModel(MultimodalGPTModel):
```
The class `Qwen3VLGPTModel` (and `Qwen3VLTransformerBlock`) is used for both `qwen3_vl` and `qwen3_omni` models. The `VL` suffix is misleading and harms maintainability. Consider renaming them to something more generic, like `Qwen3MMGPTModel` and `Qwen3MMTransformerBlock` respectively, to reflect their shared usage across different Qwen3 multimodal models.
huggingface/transformers#40795
shell: https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_vl
Use Transformers
- Training
- Inference
- Perform inference using the validation set

Use Megatron
- HF -> MCore checkpoint conversion
- Training
- MCore checkpoint -> HF conversion
- Inference
- Perform inference using the validation set
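The command blocks for these steps did not survive extraction; the sketch below reconstructs the workflow under the ms-swift CLI conventions used for other multimodal models. The model id, dataset, and checkpoint paths in angle brackets are placeholders; see the linked examples directory for the authoritative scripts.

```shell
# Transformers backend: training (LoRA shown as an assumed example)
swift sft \
    --model <qwen3-vl-model-id> \
    --dataset <your-dataset> \
    --train_type lora

# Inference with the trained adapter on the validation set
swift infer \
    --adapters output/<checkpoint-dir> \
    --load_data_args true

# Megatron backend: convert HF weights to MCore format
swift export \
    --model <qwen3-vl-model-id> \
    --to_mcore true \
    --torch_dtype bfloat16 \
    --output_dir <mcore-dir>

# Train with Megatron
megatron sft \
    --load <mcore-dir> \
    --dataset <your-dataset>

# Convert the trained MCore checkpoint back to HF format
swift export \
    --mcore_model <megatron-output-dir> \
    --to_hf true \
    --torch_dtype bfloat16 \
    --output_dir <hf-dir>
```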