Device selection for Qwen2Audio #32621

Open
wants to merge 2 commits into base: main
Conversation

ifsheldon

What does this PR do?

Fixes runtime errors like "inputs are on different devices" when Qwen2 Audio runs on devices like "mps". This problem occurred when I tried to run the model on my Mac using the mps device.

The tests in transformers/tests/models/qwen2_audio pass, and I have also tested the change with the official Qwen2 Audio demo, with a few modifications to run it on the MPS device (see below).

Problem and Code References

Here it calls

which has an optional argument device that defaults to "cpu". So, by default, the output of the Whisper feature extractor stays on the CPU.

But we can't simply pass device="mps" when calling Qwen2AudioProcessor.__call__, because that causes another runtime error saying self.tokenizer() does not accept a device argument.
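To illustrate the shape of the fix, here is a minimal sketch with hypothetical stand-in callables (feature_extractor, tokenizer, and processor_call are placeholders, not the actual Qwen2AudioProcessor code): the device argument is forwarded only to the component that understands it, and the tokenizer's tensors are moved afterwards so everything ends up on one device.

```python
import torch

# Hypothetical stand-ins: the real WhisperFeatureExtractor accepts a
# `device` argument, while the tokenizer does not.
def feature_extractor(audios, device="cpu"):
    return {"input_features": torch.zeros(len(audios), 80, device=device)}

def tokenizer(text):
    return {"input_ids": torch.tensor([[1, 2, 3]])}

def processor_call(text, audios, device="cpu"):
    # Forward `device` only where it is understood; move the
    # tokenizer's output to the same device afterwards.
    out = feature_extractor(audios, device=device)
    out["input_ids"] = tokenizer(text)["input_ids"].to(device)
    return out

inputs = processor_call("What's that sound?", [b"<audio bytes>"], device="cpu")
print(inputs["input_ids"].device)
```

With this shape, passing device="mps" would put both the extracted features and the token ids on MPS without the tokenizer ever seeing a device keyword.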

Who can review?

Probably @faychu @ylacombe can take a look because of #32137?

Modified Demo Code

Modified from https://github.com/QwenLM/Qwen2-Audio?tab=readme-ov-file#audio-analysis-inference

from io import BytesIO
from urllib.request import urlopen
import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

DEFAULT_DEVICE = "mps"  # or "cuda", "cpu", NEW

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")

conversation = [
    {'role': 'system', 'content': 'You are a helpful assistant.'}, 
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3"},
        {"type": "text", "text": "What's that sound?"},
    ]},
    {"role": "assistant", "content": "It is the sound of glass shattering."},
    {"role": "user", "content": [
        {"type": "text", "text": "What can you do when you hear that?"},
    ]},
    {"role": "assistant", "content": "Stay alert and cautious, and check if anyone is hurt or if there is any damage to property."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"},
        {"type": "text", "text": "What does the person say?"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios = []
for message in conversation:
    if isinstance(message["content"], list):
        for ele in message["content"]:
            if ele["type"] == "audio":
                audios.append(
                    librosa.load(
                        BytesIO(urlopen(ele['audio_url']).read()), 
                        sr=processor.feature_extractor.sampling_rate)[0]
                )

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True, device=DEFAULT_DEVICE)  # NEW
# inputs.input_ids = inputs.input_ids.to("cuda") # COMMENTED OUT, NOT NEEDED ANYMORE

generate_ids = model.generate(**inputs, max_length=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]

response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

@ylacombe
Contributor

ylacombe commented Aug 12, 2024

Hey @ifsheldon, thanks for opening this PR!
Have you considered using the following way of switching devices?

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True).to(DEFAULT_DEVICE)

I'm not sure adding device to the processor fits with our current design.

@ifsheldon
Author

@ylacombe

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True).to(DEFAULT_DEVICE)

That works as well, but then the WhisperFeatureExtractor will run on the CPU, so performance is not optimal.
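To make the distinction concrete, here is a minimal sketch in plain torch (not transformers code) of the two options: with .to(device), the spectrogram is computed on the CPU and only the finished tensor is copied over, whereas a device= argument lets the STFT math itself run on the accelerator.

```python
import torch

audio = torch.randn(16000)  # one second of fake 16 kHz audio
window = torch.hann_window(400)

# Option A (.to(device) on the output): the STFT runs on the CPU;
# only the finished tensor would be copied to the accelerator afterwards.
spec = torch.stft(audio, n_fft=400, window=window, return_complex=True)
# spec = spec.to("mps")  # uncomment on a Mac with MPS available

# Option B (device= argument): move the raw audio first, so the STFT
# itself runs on the accelerator.
# spec = torch.stft(audio.to("mps"), n_fft=400,
#                   window=window.to("mps"), return_complex=True)
print(spec.shape)
```

Whether option B is actually faster will depend on the device and audio length, but the point is that option A never moves the extraction work off the CPU.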

I'm not sure adding device to the processor fits with our current design

I don't know what you mean by "current design". Looking at class Qwen2AudioProcessor(ProcessorMixin), there seems to be no restriction on __call__, so I guess it's free to add any arguments that make sense for processing?
