Commit b4ff115

Author: lvyuanjun.lyj (committed)

Add qwen2.5-omni

1 parent fc8764c commit b4ff115

17 files changed, +9178 −0 lines changed

docs/source/en/_toctree.yml

Lines changed: 2 additions & 0 deletions

@@ -971,6 +971,8 @@
       title: Pix2Struct
     - local: model_doc/pixtral
       title: Pixtral
+    - local: model_doc/qwen2_5_omni
+      title: Qwen2.5-Omni
     - local: model_doc/qwen2_5_vl
       title: Qwen2.5-VL
     - local: model_doc/qwen2_audio

docs/source/en/model_doc/qwen2_5_omni.md

Lines changed: 259 additions & 0 deletions

@@ -0,0 +1,259 @@
<!--Copyright 2025 The Qwen Team and The HuggingFace Inc. team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Qwen2.5-Omni

<div class="flex flex-wrap space-x-1">
    <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
    <img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
    <img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>

## Overview

The [Qwen2.5-Omni](https://qwenlm.github.io/blog/) model is a unified multimodal model proposed in the [Qwen2.5-Omni Technical Report]() by the Qwen team at Alibaba Group.

The abstract from the technical report is the following:

*We present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable the streaming of multimodal information inputs, both audio and visual encoders utilize a block-wise processing approach. This strategy effectively decouples the handling of long sequences of multimodal data, assigning the perceptual responsibilities to the multimodal encoder and entrusting the modeling of extended sequences to a large language model. Such a division of labor enhances the fusion of different modalities via the shared attention mechanism. To synchronize the timestamps of video inputs with audio, we organized the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE (Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose Thinker-Talker architecture. In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and inferred in an end-to-end manner. For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial package delay. Qwen2.5-Omni outperforms the similarly sized Qwen2-VL and Qwen2-Audio in both image and audio capabilities. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench. Notably, Qwen2.5-Omni is the first open-source model to achieve a level of performance in end-to-end speech instruction following that is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni’s streaming Talker outperform most existing streaming and non-streaming alternatives in robustness and naturalness.*

## Usage example

`Qwen2.5-Omni` can be found on the [Hugging Face Hub](https://huggingface.co/Qwen). The examples below also use the `process_mm_info` helper from the `qwen_omni_utils` package, which is distributed separately from `transformers`.

### Single Media Inference

The model can accept text, images, audio, and video as input. Here is example code for inference.

```python
import soundfile as sf

from transformers import Qwen2_5OmniModel, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

model = Qwen2_5OmniModel.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")
USE_AUDIO_IN_VIDEO = True

conversation = [
    {
        "role": "system",
        "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "/path/to/video.mp4"},
            {"type": "text", "text": "What can you hear and see in this video?"},
        ],
    },
]

text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

# ffmpeg must be installed to read audio formats other than wav and flac
audios, images, videos = process_mm_info(conversation, USE_AUDIO_IN_VIDEO)

inputs = processor(text=text, audios=audios, images=images, videos=videos, return_tensors="pt", padding=True)
inputs = inputs.to(model.device)

text_ids, audio = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO)

text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
sf.write(
    "output.wav",
    audio.reshape(-1).detach().cpu().numpy(),
    samplerate=24000,
)
print(text)
```

### Batch Mixed Media Inference

The model can batch inputs composed of mixed samples of various types, such as text, images, audio, and video, when `return_audio=False` is set. Here is an example.

```python
from transformers import Qwen2_5OmniModel, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

model = Qwen2_5OmniModel.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")
USE_AUDIO_IN_VIDEO = True

# Conversation with video only
conversation1 = [
    {
        "role": "system",
        "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "/path/to/video.mp4"},
        ],
    },
]

# Conversation with audio only
conversation2 = [
    {
        "role": "system",
        "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
    },
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "/path/to/audio.wav"},
        ],
    },
]

# Conversation with pure text
conversation3 = [
    {
        "role": "system",
        "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
    },
    {
        "role": "user",
        "content": "who are you?",
    },
]

# Conversation with mixed media
conversation4 = [
    {
        "role": "system",
        "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "/path/to/image.jpg"},
            {"type": "video", "video": "/path/to/video.mp4"},
            {"type": "audio", "audio": "/path/to/audio.wav"},
            {"type": "text", "text": "What elements can you see and hear in these media?"},
        ],
    },
]

conversations = [conversation1, conversation2, conversation3, conversation4]

text = processor.apply_chat_template(conversations, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversations, USE_AUDIO_IN_VIDEO)

inputs = processor(text=text, audios=audios, images=images, videos=videos, return_tensors="pt", padding=True)
inputs = inputs.to(model.thinker.device)

text_ids = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO, return_audio=False)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)

print(text)
```

### Usage Tips

#### Prompt for audio output

If audio output is needed, the system prompt must be set to "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."; otherwise the audio output may not work as expected.

```python
{
    "role": "system",
    "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
}
```

#### Use audio output or not

The model supports both text and audio outputs. If audio output is not needed, set `enable_audio_output=False` in the `from_pretrained` call. This saves about 2GB of GPU memory, but the `return_audio` option of the `generate` function can then only be set to `False`.

```python
model = Qwen2_5OmniModel.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto",
    enable_audio_output=False,
)
```

For a more flexible experience, we recommend setting `enable_audio_output=True` when initializing the model with `from_pretrained`, and then deciding whether to return audio each time `generate` is called. When `return_audio` is set to `False`, the model returns only text outputs, which makes text responses faster.

```python
model = Qwen2_5OmniModel.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto",
    enable_audio_output=True,
)
...
text_ids = model.generate(**inputs, return_audio=False)
```

#### Change voice type of output audio

Qwen2.5-Omni supports changing the voice of the output audio. Use the `spk` parameter of the `generate` function to specify the voice type. The `"Qwen/Qwen2.5-Omni-7B"` checkpoint supports two voice types, `Cherry` and `Ehtan`: `Cherry` is a female voice and `Ehtan` is a male voice. By default, if `spk` is not specified, the voice type is `Cherry`.

```python
text_ids, audio = model.generate(**inputs, spk="Cherry")
```

```python
text_ids, audio = model.generate(**inputs, spk="Ehtan")
```

#### Flash-Attention 2 to speed up generation

First, make sure to install the latest version of Flash Attention 2:

```bash
pip install -U flash-attn --no-build-isolation
```

Also, you should have hardware that is compatible with FlashAttention 2. Read more about it in the official documentation of the [flash attention repository](https://github.com/Dao-AILab/flash-attention). FlashAttention-2 can only be used when a model is loaded in `torch.float16` or `torch.bfloat16`.

To load and run a model using FlashAttention-2, add `attn_implementation="flash_attention_2"` when loading the model:

```python
import torch

from transformers import Qwen2_5OmniModel

model = Qwen2_5OmniModel.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```

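The badges at the top of this document also list SDPA. As a minimal sketch that is not part of the original documentation, PyTorch's scaled dot-product attention can be requested the same way via the standard `attn_implementation` argument, assuming the model supports it as the badge indicates:

```python
import torch

from transformers import Qwen2_5OmniModel

# Sketch: select PyTorch SDPA attention instead of FlashAttention-2.
# Assumes SDPA support as advertised by the badges above.
model = Qwen2_5OmniModel.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
)
```
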
## Qwen2_5OmniConfig

[[autodoc]] Qwen2_5OmniConfig

## Qwen2_5OmniProcessor

[[autodoc]] Qwen2_5OmniProcessor

## Qwen2_5OmniModel

[[autodoc]] Qwen2_5OmniModel
    - forward

src/transformers/__init__.py

Lines changed: 30 additions & 0 deletions
@@ -720,6 +720,13 @@
         "Qwen2Config",
         "Qwen2Tokenizer",
     ],
+    "models.qwen2_5_omni": [
+        "Qwen2_5OmniThinkerConfig",
+        "Qwen2_5OmniTalkerConfig",
+        "Qwen2_5OmniToken2WavConfig",
+        "Qwen2_5OmniConfig",
+        "Qwen2_5OmniProcessor",
+    ],
     "models.qwen2_5_vl": [
         "Qwen2_5_VLConfig",
         "Qwen2_5_VLProcessor",

@@ -1294,6 +1301,7 @@
     _import_structure["models.pixtral"].append("PixtralImageProcessor")
     _import_structure["models.poolformer"].extend(["PoolFormerFeatureExtractor", "PoolFormerImageProcessor"])
     _import_structure["models.pvt"].extend(["PvtImageProcessor"])
+    _import_structure["models.qwen2_5_omni"].extend(["Qwen2_5OmniProcessor"])
     _import_structure["models.qwen2_vl"].extend(["Qwen2VLImageProcessor"])
     _import_structure["models.rt_detr"].extend(["RTDetrImageProcessor"])
     _import_structure["models.sam"].extend(["SamImageProcessor"])

@@ -3358,6 +3366,14 @@
             "Qwen2PreTrainedModel",
         ]
     )
+    _import_structure["models.qwen2_5_omni"].extend(
+        [
+            "Qwen2_5OmniModel",
+            "Qwen2_5OmniTalkerForConditionalGeneration",
+            "Qwen2_5OmniThinkerForConditionalGeneration",
+            "Qwen2_5OmniToken2WavModel",
+        ]
+    )
     _import_structure["models.qwen2_5_vl"].extend(
         [
             "Qwen2_5_VLForConditionalGeneration",

@@ -5915,6 +5931,13 @@
     from .models.pvt import PvtConfig
     from .models.pvt_v2 import PvtV2Config
     from .models.qwen2 import Qwen2Config, Qwen2Tokenizer
+    from .models.qwen2_5_omni import (
+        Qwen2_5OmniConfig,
+        Qwen2_5OmniProcessor,
+        Qwen2_5OmniTalkerConfig,
+        Qwen2_5OmniThinkerConfig,
+        Qwen2_5OmniToken2WavConfig,
+    )
     from .models.qwen2_5_vl import (
         Qwen2_5_VLConfig,
         Qwen2_5_VLProcessor,

@@ -6513,6 +6536,7 @@
         PoolFormerImageProcessor,
     )
     from .models.pvt import PvtImageProcessor
+    from .models.qwen2_5_omni import Qwen2_5OmniProcessor
     from .models.qwen2_vl import Qwen2VLImageProcessor
     from .models.rt_detr import RTDetrImageProcessor
     from .models.sam import SamImageProcessor

@@ -8171,6 +8195,12 @@
         Qwen2Model,
         Qwen2PreTrainedModel,
     )
+    from .models.qwen2_5_omni import (
+        Qwen2_5OmniModel,
+        Qwen2_5OmniTalkerForConditionalGeneration,
+        Qwen2_5OmniThinkerForConditionalGeneration,
+        Qwen2_5OmniToken2WavModel,
+    )
     from .models.qwen2_5_vl import (
         Qwen2_5_VLForConditionalGeneration,
         Qwen2_5_VLModel,
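
With the hunks above applied, the new public names become importable from the top-level `transformers` namespace. A minimal sanity-check sketch (nothing below loads weights; the class list is copied from the diff and the `print` loop is only illustrative):

```python
# Sketch: verify that the classes exported by this commit resolve at import time.
from transformers import (
    Qwen2_5OmniConfig,
    Qwen2_5OmniModel,
    Qwen2_5OmniProcessor,
    Qwen2_5OmniTalkerForConditionalGeneration,
    Qwen2_5OmniThinkerForConditionalGeneration,
    Qwen2_5OmniToken2WavModel,
)

for cls in (
    Qwen2_5OmniConfig,
    Qwen2_5OmniModel,
    Qwen2_5OmniProcessor,
    Qwen2_5OmniTalkerForConditionalGeneration,
    Qwen2_5OmniThinkerForConditionalGeneration,
    Qwen2_5OmniToken2WavModel,
):
    print(cls.__name__)
```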

src/transformers/models/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -222,6 +222,7 @@
     pvt,
     pvt_v2,
     qwen2,
+    qwen2_5_omni,
     qwen2_5_vl,
     qwen2_audio,
     qwen2_moe,

src/transformers/models/auto/configuration_auto.py

Lines changed: 2 additions & 0 deletions
@@ -245,6 +245,7 @@
         ("pvt_v2", "PvtV2Config"),
         ("qdqbert", "QDQBertConfig"),
         ("qwen2", "Qwen2Config"),
+        ("qwen2_5_omni", "Qwen2_5OmniConfig"),
         ("qwen2_5_vl", "Qwen2_5_VLConfig"),
         ("qwen2_audio", "Qwen2AudioConfig"),
         ("qwen2_audio_encoder", "Qwen2AudioEncoderConfig"),

@@ -595,6 +596,7 @@
         ("pvt_v2", "PVTv2"),
         ("qdqbert", "QDQBert"),
         ("qwen2", "Qwen2"),
+        ("qwen2_5_omni", "Qwen2_5Omni"),
         ("qwen2_5_vl", "Qwen2_5_VL"),
         ("qwen2_audio", "Qwen2Audio"),
         ("qwen2_audio_encoder", "Qwen2AudioEncoder"),

src/transformers/models/auto/modeling_auto.py

Lines changed: 2 additions & 0 deletions
@@ -226,6 +226,7 @@
         ("pvt_v2", "PvtV2Model"),
         ("qdqbert", "QDQBertModel"),
         ("qwen2", "Qwen2Model"),
+        ("qwen2_5_omni", "Qwen2_5OmniModel"),
         ("qwen2_5_vl", "Qwen2_5_VLModel"),
         ("qwen2_audio_encoder", "Qwen2AudioEncoder"),
         ("qwen2_moe", "Qwen2MoeModel"),

@@ -1394,6 +1395,7 @@
         ("fastspeech2_conformer", "FastSpeech2ConformerWithHifiGan"),
         ("musicgen", "MusicgenForConditionalGeneration"),
         ("musicgen_melody", "MusicgenMelodyForConditionalGeneration"),
+        ("qwen2_5_omni", "Qwen2_5OmniModel"),
         ("seamless_m4t", "SeamlessM4TForTextToSpeech"),
         ("seamless_m4t_v2", "SeamlessM4Tv2ForTextToSpeech"),
         ("vits", "VitsModel"),

src/transformers/models/auto/processing_auto.py

Lines changed: 1 addition & 0 deletions
@@ -93,6 +93,7 @@
         ("pix2struct", "Pix2StructProcessor"),
         ("pixtral", "PixtralProcessor"),
         ("pop2piano", "Pop2PianoProcessor"),
+        ("qwen2_5_omni", "Qwen2_5OmniProcessor"),
         ("qwen2_5_vl", "Qwen2_5_VLProcessor"),
         ("qwen2_audio", "Qwen2AudioProcessor"),
         ("qwen2_vl", "Qwen2VLProcessor"),

src/transformers/models/auto/tokenization_auto.py

Lines changed: 1 addition & 0 deletions
@@ -430,6 +430,7 @@
             "Qwen2TokenizerFast" if is_tokenizers_available() else None,
         ),
     ),
+    ("qwen2_5_omni", ("Qwen2Tokenizer", "Qwen2TokenizerFast" if is_tokenizers_available() else None)),
     ("qwen2_5_vl", ("Qwen2Tokenizer", "Qwen2TokenizerFast" if is_tokenizers_available() else None)),
     ("qwen2_audio", ("Qwen2Tokenizer", "Qwen2TokenizerFast" if is_tokenizers_available() else None)),
     (
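
Taken together, the auto-mapping entries above register the `qwen2_5_omni` model type with the `Auto*` classes. A minimal sketch of what that enables, assuming a checkpoint whose config declares `model_type="qwen2_5_omni"` (the repo id below is the one used in the documentation added by this commit):

```python
# Sketch: Auto-class resolution for the newly registered "qwen2_5_omni" model type.
from transformers import AutoConfig, AutoModel, AutoProcessor, AutoTokenizer

repo_id = "Qwen/Qwen2.5-Omni-7B"

config = AutoConfig.from_pretrained(repo_id)        # -> Qwen2_5OmniConfig
processor = AutoProcessor.from_pretrained(repo_id)  # -> Qwen2_5OmniProcessor
tokenizer = AutoTokenizer.from_pretrained(repo_id)  # -> Qwen2Tokenizer / Qwen2TokenizerFast
model = AutoModel.from_pretrained(repo_id)          # -> Qwen2_5OmniModel

print(type(config).__name__, type(processor).__name__, type(model).__name__)
```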
