Device selection for Qwen2Audio #32621

Open
wants to merge 2 commits into base: main
Conversation

ifsheldon

What does this PR do?

Fixes runtime errors like "inputs are on different devices" when Qwen2 Audio runs on devices like "mps". This problem occurred when I tried to run the model on my Mac using the mps device.

The tests in transformers/tests/models/qwen2_audio pass, and I have also tested the change with the official Qwen2 Audio demo, with a few modifications to run it on the MPS device (see below).

Problem and Code References

Here it calls

which has an optional argument device that defaults to "cpu". So, by default, the output of the Whisper feature extractor stays on the CPU.

But we can't simply pass device="mps" when calling Qwen2AudioProcessor.__call__, because that causes another runtime error saying self.tokenizer() does not accept a device argument.
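To illustrate the shape of the fix, here is a minimal sketch with hypothetical stand-in callables (feature_extractor, tokenizer, and processor_call are placeholders, not the actual Qwen2AudioProcessor code): the device argument is forwarded only to the component that understands it, and the tokenizer's tensors are moved afterwards so everything ends up on one device.

```python
import torch

# Hypothetical stand-ins: the real WhisperFeatureExtractor accepts a
# `device` argument, while the tokenizer does not.
def feature_extractor(audios, device="cpu"):
    return {"input_features": torch.zeros(len(audios), 80, device=device)}

def tokenizer(text):
    return {"input_ids": torch.tensor([[1, 2, 3]])}

def processor_call(text, audios, device="cpu"):
    # Forward `device` only where it is understood; move the
    # tokenizer's output to the same device afterwards.
    out = feature_extractor(audios, device=device)
    out["input_ids"] = tokenizer(text)["input_ids"].to(device)
    return out

inputs = processor_call("What's that sound?", [b"<audio bytes>"], device="cpu")
print(inputs["input_ids"].device)
```

With this shape, passing device="mps" would put both the extracted features and the token ids on MPS without the tokenizer ever seeing a device keyword.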

Who can review?

Probably @faychu @ylacombe can take a look because of #32137?

Modified Demo Code

Modified from https://github.com/QwenLM/Qwen2-Audio?tab=readme-ov-file#audio-analysis-inference

from io import BytesIO
from urllib.request import urlopen
import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

DEFAULT_DEVICE = "mps"  # or "cuda", "cpu", NEW

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")

conversation = [
    {'role': 'system', 'content': 'You are a helpful assistant.'}, 
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3"},
        {"type": "text", "text": "What's that sound?"},
    ]},
    {"role": "assistant", "content": "It is the sound of glass shattering."},
    {"role": "user", "content": [
        {"type": "text", "text": "What can you do when you hear that?"},
    ]},
    {"role": "assistant", "content": "Stay alert and cautious, and check if anyone is hurt or if there is any damage to property."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"},
        {"type": "text", "text": "What does the person say?"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios = []
for message in conversation:
    if isinstance(message["content"], list):
        for ele in message["content"]:
            if ele["type"] == "audio":
                audios.append(
                    librosa.load(
                        BytesIO(urlopen(ele['audio_url']).read()), 
                        sr=processor.feature_extractor.sampling_rate)[0]
                )

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True, device=DEFAULT_DEVICE)  # NEW
# inputs.input_ids = inputs.input_ids.to("cuda") # COMMENTED OUT, NOT NEEDED ANYMORE

generate_ids = model.generate(**inputs, max_length=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]

response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

@ylacombe
Contributor

ylacombe commented Aug 12, 2024

Hey @ifsheldon, thanks for opening this PR!
Have you considered using the following way of switching devices?

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True).to(DEFAULT_DEVICE)

I'm not sure adding device to the processor fits with our current design.

@ifsheldon
Author

@ylacombe

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True).to(DEFAULT_DEVICE)

That works as well, but then the WhisperFeatureExtractor will run on the CPU, so performance is not optimal.
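To make the distinction concrete, here is a minimal sketch in plain torch (not transformers code) of the two options: with .to(device), the spectrogram is computed on the CPU and only the finished tensor is copied over, whereas a device= argument lets the STFT math itself run on the accelerator.

```python
import torch

audio = torch.randn(16000)  # one second of fake 16 kHz audio
window = torch.hann_window(400)

# Option A (.to(device) on the output): the STFT runs on the CPU;
# only the finished tensor would be copied to the accelerator afterwards.
spec = torch.stft(audio, n_fft=400, window=window, return_complex=True)
# spec = spec.to("mps")  # uncomment on a Mac with MPS available

# Option B (device= argument): move the raw audio first, so the STFT
# itself runs on the accelerator.
# spec = torch.stft(audio.to("mps"), n_fft=400,
#                   window=window.to("mps"), return_complex=True)
print(spec.shape)
```

Whether option B is actually faster will depend on the device and audio length, but the point is that option A never moves the extraction work off the CPU.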

I'm not sure adding device to the processor fits with our current design

I don't know what you mean by "current design". Looking at class Qwen2AudioProcessor(ProcessorMixin), there seems to be no restriction on __call__, so I guess it's free to add any arguments that make sense for processing?
