<!--Copyright 2025 The Qwen Team and The HuggingFace Inc. team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Qwen2.5-Omni

<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>

## Overview

The [Qwen2.5-Omni](https://qwenlm.github.io/blog/) model is a unified end-to-end multimodal model proposed in the [Qwen2.5-Omni Technical Report]() by the Qwen team at Alibaba Group.

The abstract from the technical report is the following:

*We present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable the streaming of multimodal information inputs, both audio and visual encoders utilize a block-wise processing approach. This strategy effectively decouples the handling of long sequences of multimodal data, assigning the perceptual responsibilities to the multimodal encoder and entrusting the modeling of extended sequences to a large language model. Such a division of labor enhances the fusion of different modalities via the shared attention mechanism. To synchronize the timestamps of video inputs with audio, we organized the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE (Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose Thinker-Talker architecture. In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and inferred in an end-to-end manner. For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial package delay. Qwen2.5-Omni outperforms the similarly sized Qwen2-VL and Qwen2-Audio in both image and audio capabilities. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench. Notably, Qwen2.5-Omni is the first open-source model to achieve a level of performance in end-to-end speech instruction following that is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni’s streaming Talker outperform most existing streaming and non-streaming alternatives in robustness and naturalness.*

## Usage example

`Qwen2.5-Omni` can be found on the [Hugging Face Hub](https://huggingface.co/Qwen).

### Single Media Inference

The model can accept text, images, audio, and videos as input. Here is example code for inference.

```python
import soundfile as sf

from transformers import Qwen2_5OmniModel, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

model = Qwen2_5OmniModel.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")
USE_AUDIO_IN_VIDEO = True

conversation = [
    {
        "role": "system",
        "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "/path/to/video.mp4"},
            {"type": "text", "text": "What can you hear and see in this video?"},
        ],
    },
]

text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

# ffmpeg must be installed to read audio formats other than wav and flac
audios, images, videos = process_mm_info(conversation, USE_AUDIO_IN_VIDEO)

inputs = processor(text=text, audios=audios, images=images, videos=videos, return_tensors="pt", padding=True)
inputs = inputs.to(model.device)

text_ids, audio = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO)

text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
sf.write(
    "output.wav",
    audio.reshape(-1).detach().cpu().numpy(),
    samplerate=24000,
)
print(text)
```

### Batch Mixed Media Inference

When `return_audio=False` is set, the model can batch inputs composed of mixed samples of various types, such as text, images, audio, and videos. Here is an example.

```python
import soundfile as sf

from transformers import Qwen2_5OmniModel, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

model = Qwen2_5OmniModel.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")
USE_AUDIO_IN_VIDEO = True

# Conversation with video only
conversation1 = [
    {
        "role": "system",
        "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "/path/to/video.mp4"},
        ]
    }
]

# Conversation with audio only
conversation2 = [
    {
        "role": "system",
        "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
    },
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "/path/to/audio.wav"},
        ]
    }
]

# Conversation with pure text
conversation3 = [
    {
        "role": "system",
        "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
    },
    {
        "role": "user",
        "content": "who are you?"
    }
]

# Conversation with mixed media
conversation4 = [
    {
        "role": "system",
        "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "/path/to/image.jpg"},
            {"type": "video", "video": "/path/to/video.mp4"},
            {"type": "audio", "audio": "/path/to/audio.wav"},
            {"type": "text", "text": "What elements can you see and hear in these media?"},
        ],
    }
]

conversations = [conversation1, conversation2, conversation3, conversation4]

text = processor.apply_chat_template(conversations, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversations, USE_AUDIO_IN_VIDEO)

inputs = processor(text=text, audios=audios, images=images, videos=videos, return_tensors="pt", padding=True)
inputs = inputs.to(model.thinker.device)

text_ids = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO, return_audio=False)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)

print(text)
```

### Usage Tips

#### Prompt for audio output

If audio output is needed, the system prompt must be set to "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."; otherwise, audio output may not work as expected.

```python
{
    "role": "system",
    "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
}
```
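
If conversations are built programmatically, it can help to guard against forgetting this prompt. The sketch below is only an illustration; `ensure_speech_system_prompt` is a hypothetical helper name, not part of Transformers or the Qwen utilities:

```python
SPEECH_SYSTEM_PROMPT = (
    "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, "
    "capable of perceiving auditory and visual inputs, as well as generating text and speech."
)

def ensure_speech_system_prompt(conversation):
    """Prepend the required system prompt when the conversation does not already start with one."""
    if not conversation or conversation[0].get("role") != "system":
        return [{"role": "system", "content": SPEECH_SYSTEM_PROMPT}] + conversation
    return conversation

# Hypothetical usage: the user turn is kept, and the required system prompt is added in front.
conversation = ensure_speech_system_prompt([
    {"role": "user", "content": [{"type": "text", "text": "Introduce yourself."}]},
])
```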

#### Use audio output or not

The model supports both text and audio outputs. If audio output is not needed, set `enable_audio_output=False` in the `from_pretrained` function. This saves about 2GB of GPU memory, but the `return_audio` option of the `generate` function can then only be set to `False`.

```python
model = Qwen2_5OmniModel.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto",
    enable_audio_output=False,
)
```

For a more flexible experience, we recommend setting `enable_audio_output=True` when initializing the model through the `from_pretrained` function, and then deciding whether to return audio when the `generate` function is called. When `return_audio` is set to `False`, the model returns only text outputs, which makes text responses faster.

```python
model = Qwen2_5OmniModel.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto",
    enable_audio_output=True,
)
...
text_ids = model.generate(**inputs, return_audio=False)
```
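
With `enable_audio_output=True`, the same model instance can still produce speech on demand for a given call. A minimal continuation of the snippet above (reusing `inputs` prepared as in the earlier examples) might look like this:

```python
...
# Request speech for this particular call; `audio` is a waveform tensor,
# which the single-media example above writes out at a 24 kHz sample rate.
text_ids, audio = model.generate(**inputs, return_audio=True)
```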

#### Change voice type of output audio

Qwen2.5-Omni supports changing the voice of the output audio. Use the `spk` parameter of the `generate` function to specify the voice type. The `"Qwen/Qwen2.5-Omni-7B"` checkpoint supports two voice types: `Cherry`, a female voice, and `Ethan`, a male voice. If `spk` is not specified, the default voice type is `Cherry`.

```python
text_ids, audio = model.generate(**inputs, spk="Cherry")
```

```python
text_ids, audio = model.generate(**inputs, spk="Ethan")
```
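
To compare the voices side by side, a small sketch along the following lines (reusing `inputs` from the earlier examples, the voice names listed above, and the 24 kHz sample rate used in the single-media example) writes one file per speaker:

```python
import soundfile as sf

# Generate the same response with each voice listed above and save it to disk.
for voice in ("Cherry", "Ethan"):
    text_ids, audio = model.generate(**inputs, spk=voice)
    sf.write(f"output_{voice.lower()}.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)
```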

#### Flash-Attention 2 to speed up generation

First, make sure to install the latest version of Flash Attention 2:

```bash
pip install -U flash-attn --no-build-isolation
```

Also, you should have hardware that is compatible with FlashAttention-2. Read more about it in the official documentation of the [flash attention repository](https://github.com/Dao-AILab/flash-attention). FlashAttention-2 can only be used when a model is loaded in `torch.float16` or `torch.bfloat16`.

To load and run a model using FlashAttention-2, add `attn_implementation="flash_attention_2"` when loading the model:

```python
import torch

from transformers import Qwen2_5OmniModel

model = Qwen2_5OmniModel.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```
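
If Flash Attention 2 may not be available on every target machine, a small sketch like the one below (assuming the same checkpoint as above) can fall back to PyTorch's SDPA implementation, which this model also supports:

```python
import importlib.util

import torch
from transformers import Qwen2_5OmniModel

# Use FlashAttention-2 when the flash_attn package is installed, otherwise fall back to SDPA.
attn_implementation = "flash_attention_2" if importlib.util.find_spec("flash_attn") else "sdpa"

model = Qwen2_5OmniModel.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    device_map="auto",
    torch_dtype=torch.bfloat16,  # FlashAttention-2 requires float16 or bfloat16 weights
    attn_implementation=attn_implementation,
)
```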

## Qwen2_5OmniConfig

[[autodoc]] Qwen2_5OmniConfig

## Qwen2_5OmniProcessor

[[autodoc]] Qwen2_5OmniProcessor

## Qwen2_5OmniModel

[[autodoc]] Qwen2_5OmniModel
    - forward