This is the official repository for the paper "VoiceAssistant-Eval: Benchmarking AI Assistants across Listening, Speaking, and Viewing", which contains the evaluation code for the VoiceAssistant-Eval benchmark.
[Homepage] [Huggingface Dataset] [Leaderboard] [Detailed Leaderboard] [Roleplay Leaderboard] [Paper]
- [2025-09-27] Qwen2.5-Omni-7B achieves 59.2% accuracy on image + text queries but only 42.9% on image + audio queries, reflecting a 16.3-point drop.
- [2025-09-27] Step-Audio-2-mini achieves more than double the listening accuracy of the 32B LLaMA-Omni2 model (40.06 vs. 16.00).
- [2025-09-27] We observe that 20 out of 22 models score higher on Speaking than on Listening, and this mismatch highlights the need for more balanced development.
- [2025-09-27] GPT-4o-Audio fails to surpass open-source models in 4 out of 13 tasks.
- [2025-09-27] Our dataset is now available on Hugging Face.
- [2025-09-27] Our paper is now available on arXiv.
The growing capabilities of large language models and multimodal systems have spurred interest in voice-first AI assistants, yet existing benchmarks are inadequate for evaluating the full range of these systems' capabilities. We summarize four key weaknesses of current benchmarks, highlighting the urgent need for a new evaluation framework:
- W1: Lack of voice personalization evaluation. Current benchmarks rarely test how well models mimic specific voices, which is key for personalized assistants (e.g., in healthcare). Without this, models may fail in real-world personalized applications.
- W2: Limited focus on hands-free interaction. Benchmarks often use text-based instructions, ignoring true voice-first, hands-free use. This limits reliability in critical contexts like driving or accessibility for visually impaired users.
- W3: Neglect of real-world audio contexts. Datasets seldom cover varied, realistic audio environments. Models aren't tested on understanding beyond speech (e.g., music, nature sounds), reducing their everyday usefulness.
- W4: Insufficient multi-modal (vision + audio) assessment. Benchmarks rarely test joint speech and visual input, missing key scenarios like smart tutors. This gap means benchmarks don't reflect real-world multimodal needs.
We introduce VoiceAssistant-Eval, a comprehensive benchmark designed to assess AI assistants across listening, speaking, and viewing. VoiceAssistant-Eval comprises 10,497 curated examples spanning 13 task categories. These tasks include natural sounds, music, and spoken dialogue for listening; multi-turn dialogue, role-play imitation, and various scenarios for speaking; and highly heterogeneous images for viewing.
To demonstrate its utility, we evaluate 21 open-source models and GPT-4o-Audio, measuring the quality of the response content and speech, as well as their consistency. The results reveal three key findings: (1) proprietary models do not universally outperform open-source models; (2) most models excel at speaking tasks but lag in audio understanding; and (3) well-designed smaller models can rival much larger ones. Notably, the mid-sized Step-Audio-2-mini (7B) achieves more than double the listening accuracy of LLaMA-Omni2-32B-Bilingual. However, challenges remain: multimodal (audio+visual) input and role-play voice imitation tasks are difficult for current models, and significant gaps persist in robustness and safety alignment. VoiceAssistant-Eval identifies these gaps and establishes a rigorous framework for evaluating and guiding the development of next-generation multimodal voice assistants.
Figure 1: (a) Scores of six prominent omni-models across 13 tasks. (b) Examples from four newly designed tasks for voice assistants: I. Example from the role-play task with reference audio. II. A truly voice-based multi-turn conversation, instead of providing multi-round context in text. III. Multi-modal (vision + audio) integration understanding. IV. An audio question with music context.
Please refer to our project homepage and the paper for more details.
Explore the comprehensive evaluation results of AI assistants across multiple dimensions:
- Official Leaderboard: Overall scores across Listening, Speaking, and Viewing tasks
- Detailed Leaderboard: In-depth scores across 13 specific tasks
- Roleplay Leaderboard: Performance on the Speaking Roleplay task
This repository uses Git LFS to store large files (audio files, model weights, etc.). To clone the repository properly:
```bash
# Install Git LFS if not already installed
git lfs install

# Clone the repository with LFS files
git clone https://github.com/mathllm/VoiceAssistant-Eval.git
cd VoiceAssistant-Eval

# Alternatively, if you've already cloned without LFS:
git lfs pull
```
```bash
conda create -p envs/voiceassistant python=3.12
conda activate envs/voiceassistant

# CUDA 12.4
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install git+https://github.com/wenet-e2e/wespeaker.git
pip install PyYAML
pip install requests
pip install librosa
pip install openai-whisper
pip install funasr
pip install openai
pip install transformers==4.55.4
pip install accelerate
```
Our evaluation provides a comprehensive assessment of both generated speech and text responses, as well as their consistency. Unlike previous studies that focus solely on text responses, we aggregate multiple detailed metrics into a single, unified score for holistic model performance evaluation.
We evaluate model responses across three key dimensions:
- Content Quality: Response accuracy, helpfulness, and appropriateness
- Speech Quality: Audio naturalness and fluency
- Consistency: Alignment between intended content and actual speech output
Final Score Calculation:
Final Score = Content Score × Speech Score × Consistency Score × 100%
Analyzes emotional expression in generated speech for the Speaking/Emotion task:
python 0_emotion2vec/evaluate_emotion.py
- Model: emotion2vec_plus_large
- Purpose: Extracts emotion probabilities (angry, disgusted, fearful, happy, neutral, sad, surprised)
- Output: Emotions with >1% probability are included in evaluation prompts
- Target: examples in `./results/MODEL_NAME/Speaking/Emotion/`
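As a rough illustration of this step (not the repository's exact script), the emotion probabilities can be read out with FunASR's `AutoModel`. The model ID and audio path below are assumptions based on the public model card; only emotions above the 1% threshold are kept, as described above.

```python
from funasr import AutoModel

# Load emotion2vec_plus_large through FunASR; the model ID follows the public model card
# and the wav path is a placeholder.
model = AutoModel(model="iic/emotion2vec_plus_large")

def emotions_above_threshold(wav_path: str, threshold: float = 0.01) -> dict[str, float]:
    """Return {emotion: probability} for emotions whose probability exceeds the threshold."""
    result = model.generate(wav_path, granularity="utterance", extract_embedding=False)[0]
    # Labels are bilingual tags such as "开心/happy"; keep only the English part.
    probs = {label.split("/")[-1]: float(score)
             for label, score in zip(result["labels"], result["scores"])}
    return {emotion: p for emotion, p in probs.items() if p > threshold}

print(emotions_above_threshold("generated_speech.wav"))
```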
Measures voice similarity for roleplay tasks:
python 0_speaker_similarity/evaluate_roleplay.py
- Model: WeSpeaker voxblink2_samresnet100_ft
- Purpose: Computes speaker similarity between generated speech and reference role audio
- Output: Speaker similarity scores incorporated into roleplay task evaluation
- Target: examples in `./results/MODEL_NAME/Speaking/Roleplay/`
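A minimal sketch of this comparison using WeSpeaker's Python API; loading the voxblink2_samresnet100_ft checkpoint from a local directory via `load_model_local` is an assumption, and both audio paths are placeholders.

```python
import wespeaker

# Load a locally downloaded voxblink2_samresnet100_ft checkpoint (directory path is a placeholder).
model = wespeaker.load_model_local("./models/voxblink2_samresnet100_ft")

# Similarity between the generated roleplay speech and the reference role audio
# (higher values indicate a closer voice match).
similarity = model.compute_similarity("generated_roleplay.wav", "reference_role.wav")
print(f"speaker similarity: {similarity:.3f}")
```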
Evaluates response content using LLM-based judgment:
python 1_content_quality/evaluate_content.py
- Model: gpt-oss-20b
- Evaluation Prompts: 13 task-specific prompts from `evaluation_prompts.py`
- Tasks Covered:
- Listening: General, Music, Sound, Speech
- Speaking: Assistant, Emotion, Instruction_Following, Multi_Round, Reasoning, Robustness, Roleplay, Safety
- Viewing: Multi_Discipline
- Output: Binary judgments (Correct/Incorrect, Good/Bad)
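The LLM judgment can be driven through an OpenAI-compatible client. The sketch below assumes gpt-oss-20b is served locally (e.g., with vLLM); the endpoint, prompt wording, and field names are illustrative only, since the actual prompts live in `evaluation_prompts.py`.

```python
from openai import OpenAI

# Assumes gpt-oss-20b is exposed through an OpenAI-compatible endpoint; URL and key are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

JUDGE_TEMPLATE = (
    "You are grading a voice assistant's answer.\n"
    "Question: {question}\nReference: {reference}\nModel answer: {answer}\n"
    "Reply with [Correct] or [Incorrect]."
)

def judge(question: str, reference: str, answer: str) -> str:
    response = client.chat.completions.create(
        model="gpt-oss-20b",
        messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(
            question=question, reference=reference, answer=answer)}],
        temperature=0.0,
    )
    # The raw text is later parsed for [Correct]/[Incorrect] (or [Good]/[Bad]) tags.
    return response.choices[0].message.content
```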
Measures audio naturalness and fluency:
python 2_speech_quality/evaluate_speech.py
- Model: UTMOS22_strong
- Purpose: Provides Mean Opinion Score (MOS) reflecting speech quality
- Output: MOS scores for all generated audio responses
- Score Range: 1-5 (converted to 20-100 scale in final calculation)
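For reference, a UTMOS22_strong predictor can be run as follows; loading it from the community SpeechMOS torch.hub package is an assumption about packaging, not necessarily how `evaluate_speech.py` loads the model, and the audio path is a placeholder.

```python
import librosa
import torch

# Load a UTMOS22_strong predictor via torch.hub (community SpeechMOS packaging; an assumption).
predictor = torch.hub.load("tarepan/SpeechMOS:v1.2.0", "utmos22_strong", trust_repo=True)

wave, sr = librosa.load("generated_speech.wav", sr=None, mono=True)  # placeholder path
mos = predictor(torch.from_numpy(wave).unsqueeze(0), sr).item()      # roughly in [1, 5]
print(f"MOS: {mos:.2f} -> Speech Score: {20 * mos:.1f}")             # scaled to 0-100 in the final step
```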
Step 5a: Speech Transcription (`3_content_speech_consistency/evaluate_whisper.py`)
python 3_content_speech_consistency/evaluate_whisper.py
- Model: Whisper-Large-v3
- Purpose: Transcribes generated speech to text for consistency analysis
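A minimal transcription sketch with the openai-whisper package installed above; the audio path is a placeholder.

```python
import whisper

# Transcribe a generated audio response with the Whisper checkpoint named above.
model = whisper.load_model("large-v3")
transcript = model.transcribe("generated_speech.wav")["text"]
print(transcript)  # compared against the model's text response in Step 5b
```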
Step 5b: Consistency Analysis (`3_content_speech_consistency/evaluate_consistency.py`)
python 3_content_speech_consistency/evaluate_consistency.py
- Method: Modified Word Error Rate (WER) calculation
- Formula: Character-level Levenshtein distance with length thresholds
- Special Handling: Addresses multiple-choice questions where models output only final letters
Modified WER Calculation:

```
Let n = number of non-space characters in text1.lower()
Let m = number of non-space characters in text2.lower()

WER'(text1, text2) =
    1,                                      if min(n, m) < 10 and max(n, m) > 10
    0,                                      if min(n, m) < 10 and max(n, m) ≤ 10
    Levenshtein(text1, text2) / max(n, m),  otherwise
```
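A direct Python rendering of this definition (a sketch; computing the Levenshtein distance on the lowercased, whitespace-stripped strings is an assumption about the exact normalization):

```python
def _levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def modified_wer(text1: str, text2: str) -> float:
    """WER' as defined above, computed on lowercased, whitespace-stripped text."""
    s1 = "".join(text1.lower().split())
    s2 = "".join(text2.lower().split())
    n, m = len(s1), len(s2)
    if min(n, m) < 10:
        # Short outputs (e.g. a bare multiple-choice letter): a short/long pair counts as
        # fully inconsistent, while two short strings count as consistent.
        return 1.0 if max(n, m) > 10 else 0.0
    return _levenshtein(s1, s2) / max(n, m)
```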
python extract_judge_result.py
- Purpose: Parses binary decisions from content evaluation outputs
- Method: Extracts [Correct]/[Incorrect] and [Good]/[Bad] judgments
- Output: Structured results in `res.json`
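A minimal sketch of the extraction logic; the bracketed tag set comes from the description above, while taking the last occurrence is an assumed convention.

```python
import re

def extract_judgment(raw_response: str) -> str | None:
    """Return the last bracketed verdict found in a judge response, if any."""
    tags = re.findall(r"\[(Correct|Incorrect|Good|Bad)\]", raw_response)
    return tags[-1] if tags else None

assert extract_judgment("Reasoning ... Final verdict: [Correct]") == "Correct"
```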
python get_final_scores.py
- Purpose: Computes comprehensive performance metrics
- Output Metrics:
- Content Score: Binary accuracy (0-100%)
- Consistency Score: 100 - WER% (0-100%)
- Speech Score: MOS × 20 (0-100 scale)
- Overall Score: Product of three dimensions
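A sketch of how the metrics above combine; averaging each dimension before multiplying (rather than multiplying per example) is an assumption about the exact aggregation.

```python
def final_scores(correct: list[bool], wer: list[float], mos: list[float]) -> dict[str, float]:
    """Combine the three dimensions into the metrics listed above (all on a 0-100 scale)."""
    content = 100.0 * sum(correct) / len(correct)          # binary accuracy
    consistency = 100.0 * (1.0 - sum(wer) / len(wer))      # 100 - WER%
    speech = 20.0 * sum(mos) / len(mos)                    # MOS (1-5) scaled to 20-100
    overall = content * consistency * speech / 10000.0     # product of the three, kept on 0-100
    return {"content": content, "consistency": consistency,
            "speech": speech, "overall": overall}
```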
- Setup: Ensure model outputs follow the directory structure `./results/MODEL_NAME/Task_Name/Sub_Task_Name/Example_ID/example.json`, for example: `./results/gpt-4o-audio-preview-2025-06-03/Listening/Music/Listening_Music_0/example.json`
- Run Core Evaluations (can be executed in parallel):
```bash
python 0_emotion2vec/evaluate_emotion.py                  # Emotion analysis
python 0_speaker_similarity/evaluate_roleplay.py          # Speaker similarity
python 2_speech_quality/evaluate_speech.py                # Speech quality
python 3_content_speech_consistency/evaluate_whisper.py   # Transcription
```
- Content Evaluation:
python 1_content_quality/evaluate_content.py # Content quality
- Consistency Analysis:
python 3_content_speech_consistency/evaluate_consistency.py # WER calculation
- Generate Results:
```bash
python extract_judge_result.py   # Extract judgments
python get_final_scores.py       # Compute final scores
```
| Dimension | Method | Models Used | Output Range |
|---|---|---|---|
| Emotion | Emotion Classification | emotion2vec | Probability distribution |
| Speaker Similarity | Voice Verification | WeSpeaker | 0-1 similarity score |
| Content Quality | LLM Judgment | gpt-oss-20b | 0-100% |
| Speech Quality | MOS Prediction | UTMOS22 | 0-100 (MOS × 20) |
| Consistency | Modified WER | Whisper-Large-v3 | 0-100% (100 - WER) |
This comprehensive evaluation framework enables thorough assessment of multimodal AI assistants across listening, speaking, and viewing capabilities, providing both granular insights and unified performance metrics.
If you find this benchmark useful in your research, please consider citing it with the following BibTeX entry:
@misc{wang2025voiceassistantevalbenchmarkingaiassistants,
title={VoiceAssistant-Eval: Benchmarking AI Assistants across Listening, Speaking, and Viewing},
author={Ke Wang and Houxing Ren and Zimu Lu and Mingjie Zhan and Hongsheng Li},
year={2025},
eprint={2509.22651},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.22651},
}
- [MathVision] Measuring Multimodal Mathematical Reasoning with the MATH-Vision Dataset
- [MathCoder-VL] MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning
- [CSV] Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification
- [MathGenie] MathGenie: Generating Synthetic Data with Question Back-translation for Enhancing Mathematical Reasoning of LLMs
- [MathCoder] MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning
- [MathCoder2] MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code