
🔥 VoiceAssistant-Eval: Benchmarking AI Assistants across Listening, Speaking, and Viewing


🌟 This is the official repository for the paper "VoiceAssistant-Eval: Benchmarking AI Assistants across Listening, Speaking, and Viewing", which contains the evaluation code for the VoiceAssistant-Eval benchmark.

[🌐 Homepage] [🤗 Huggingface Dataset] [📊 Leaderboard] [📊 Detailed Leaderboard] [📊 Roleplay Leaderboard] [📖 Paper]

💥 News

  • [2025-09-27] Qwen2.5-Omni-7B achieves 59.2% accuracy on image + text queries but only 42.9% on image + audio queries, reflecting a 16.3-point drop.
  • [2025-09-27] Step-Audio-2-mini achieves more than double the listening accuracy of the 32B LLaMA-Omni2 model (40.06 vs. 16.00).
  • [2025-09-27] We observe that 20 out of 22 models score higher on Speaking than on Listening, and this mismatch highlights the need for more balanced development.
  • [2025-09-27] GPT-4o-Audio fails to surpass open-source models in 4 out of 13 tasks.
  • [2025-09-27] Our dataset is now available on Hugging Face.
  • [2025-09-27] Our paper is now available on arXiv.

👀 Introduction

The growing capabilities of large language models and multimodal systems have spurred interest in voice-first AI assistants, yet existing benchmarks are inadequate for evaluating the full range of these systems' capabilities. We summarize four key weaknesses of current benchmarks, highlighting the urgent need for a new evaluation framework:

  1. W1: Lack of voice personalization evaluation.
    Current benchmarks rarely test how well models mimic specific voices, which is key for personalized assistants (e.g., in healthcare). Without this, models may fail in real-world personalized applications.

  2. W2: Limited focus on hands-free interaction.
    Benchmarks often use text-based instructions, ignoring true voice-first, hands-free use. This limits reliability in critical contexts like driving or accessibility for visually impaired users.

  3. W3: Neglect of real-world audio contexts.
    Datasets seldom cover varied, realistic audio environments. Models aren't tested on understanding beyond speech (e.g., music, nature sounds), reducing their everyday usefulness.

  4. W4: Insufficient multi-modal (vision + audio) assessment.
    Benchmarks rarely test joint speech and visual input, missing key scenarios like smart tutors. This gap means benchmarks don't reflect real-world multimodal needs.

We introduce VoiceAssistant-Eval, a comprehensive benchmark designed to assess AI assistants across listening, speaking, and viewing. VoiceAssistant-Eval comprises 10,497 curated examples spanning 13 task categories. These tasks include natural sounds, music, and spoken dialogue for listening; multi-turn dialogue, role-play imitation, and various scenarios for speaking; and highly heterogeneous images for viewing.

To demonstrate its utility, we evaluate 21 open-source models and GPT-4o-Audio, measuring the quality of the response content and speech, as well as their consistency. The results reveal three key findings: (1) proprietary models do not universally outperform open-source models; (2) most models excel at speaking tasks but lag in audio understanding; and (3) well-designed smaller models can rival much larger ones. Notably, the mid-sized Step-Audio-2-mini (7B) achieves more than double the listening accuracy of LLaMA-Omni2-32B-Bilingual. However, challenges remain: multimodal (audio+visual) input and role-play voice imitation tasks are difficult for current models, and significant gaps persist in robustness and safety alignment. VoiceAssistant-Eval identifies these gaps and establishes a rigorous framework for evaluating and guiding the development of next-generation multimodal voice assistants.



Figure 1: (a) Scores of six prominent omni-models across 13 tasks. (b) Examples from four newly designed tasks for voice assistants: I. Example from the role-play task with reference audio. II. A truly voice-based multi-turn conversation, instead of providing multi-round context in text. III. Multi-modal (vision + audio) integration understanding. IV. An audio question with music context.

Please refer to our project homepage and the paper for more details.

πŸ“ Dataset Overview

Overview of principal statistics for VoiceAssistant-Eval. Proportional distribution of tasks and the corresponding weaknesses addressed in VoiceAssistant-Eval.

πŸ† Leaderboards

Explore the comprehensive evaluation results of AI assistants across multiple dimensions:

📈 Evaluation

📥 Repository Setup

This repository uses Git LFS to store large files (audio files, model weights, etc.). To clone the repository properly:

# Install Git LFS if not already installed
git lfs install

# Clone the repository with LFS files
git clone https://github.com/mathllm/VoiceAssistant-Eval.git
cd VoiceAssistant-Eval

# Alternatively, if you've already cloned without LFS:
git lfs pull

πŸ› οΈ Installation

conda create -p envs/voiceassistant python=3.12
conda activate envs/voiceassistant

# CUDA 12.4
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124

pip install git+https://github.com/wenet-e2e/wespeaker.git
pip install PyYAML
pip install requests
pip install librosa
pip install openai-whisper
pip install funasr
pip install openai
pip install transformers==4.55.4
pip install accelerate

🔬 Evaluation Protocols

Our evaluation provides a comprehensive assessment of both generated speech and text responses, as well as their consistency. Unlike previous studies that focus solely on text responses, we aggregate multiple detailed metrics into a single, unified score for holistic model performance evaluation.

🎯 Triadic Evaluation System

We evaluate model responses across three key dimensions:

  1. Content Quality: Response accuracy, helpfulness, and appropriateness
  2. Speech Quality: Audio naturalness and fluency
  3. Consistency: Alignment between intended content and actual speech output

Final Score Calculation:

Final Score = Content Score × Speech Score × Consistency Score × 100%
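
For concreteness, this aggregation can be written as a small helper (a minimal sketch; the function name is ours, and the per-dimension conversions follow the score definitions given later in this README):

def final_score(content_pct: float, mos: float, wer_pct: float) -> float:
    """Combine the three dimensions into one 0-100 score.

    content_pct: binary content accuracy in percent (0-100)
    mos:         predicted MOS in [1, 5]; the speech score is MOS x 20
    wer_pct:     modified WER in percent; the consistency score is 100 - WER
    """
    content = content_pct / 100.0
    speech = (mos * 20.0) / 100.0
    consistency = (100.0 - wer_pct) / 100.0
    return content * speech * consistency * 100.0

# Example: 80% content accuracy, MOS 4.2, WER 5% -> 0.80 * 0.84 * 0.95 * 100 = 63.84
print(final_score(80.0, 4.2, 5.0))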

🚀 Running Evaluations

1. Emotion Analysis (0_emotion2vec/evaluate_emotion.py)

Analyzes emotional expression in generated speech for the Speaking/Emotion task (a usage sketch follows the list below):

python 0_emotion2vec/evaluate_emotion.py
  • Model: emotion2vec_plus_large
  • Purpose: Extracts emotion probabilities (angry, disgusted, fearful, happy, neutral, sad, surprised)
  • Output: Emotions with >1% probability are included in evaluation prompts
  • Target: ./results/MODEL_NAME/Speaking/Emotion/ examples
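
For reference, a minimal sketch of extracting emotion probabilities with emotion2vec through FunASR's AutoModel interface (the model id, file path, and post-processing here are illustrative and may differ from evaluate_emotion.py):

from funasr import AutoModel

# Load emotion2vec_plus_large through FunASR (model id as published on ModelScope).
model = AutoModel(model="iic/emotion2vec_plus_large")

# Utterance-level emotion recognition on one generated response
# (path is illustrative; the script walks ./results/MODEL_NAME/Speaking/Emotion/).
res = model.generate("response.wav", granularity="utterance", extract_embedding=False)

# Keep only emotions whose probability exceeds 1%, as described above.
emotions = {
    label: score
    for label, score in zip(res[0]["labels"], res[0]["scores"])
    if score > 0.01
}
print(emotions)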

2. Speaker Similarity (0_speaker_similarity/evaluate_roleplay.py)

Measures voice similarity for roleplay tasks (a usage sketch follows the list below):

python 0_speaker_similarity/evaluate_roleplay.py
  • Model: WeSpeaker voxblink2_samresnet100_ft
  • Purpose: Computes speaker similarity between generated speech and reference role audio
  • Output: Speaker similarity scores incorporated into roleplay task evaluation
  • Target: ./results/MODEL_NAME/Speaking/Roleplay/ examples
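
For reference, a minimal sketch using the wespeaker Python API (the checkpoint directory is a placeholder; the voxblink2_samresnet100_ft weights must be downloaded separately, and the actual script may load them differently):

import wespeaker

# Load a locally downloaded WeSpeaker checkpoint directory (placeholder path;
# assumes the voxblink2_samresnet100_ft model files are already on disk).
model = wespeaker.load_model_local("models/voxblink2_samresnet100_ft")

# Similarity between the generated speech and the reference role audio (roughly 0-1).
similarity = model.compute_similarity("generated.wav", "reference_role.wav")
print(f"speaker similarity: {similarity:.3f}")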

3. Content Quality Evaluation (1_content_quality/evaluate_content.py)

Evaluates response content using LLM-based judgment (a usage sketch follows the list below):

python 1_content_quality/evaluate_content.py
  • Model: gpt-oss-20b
  • Evaluation Prompts: 13 task-specific prompts from evaluation_prompts.py
  • Tasks Covered:
    • Listening: General, Music, Sound, Speech
    • Speaking: Assistant, Emotion, Instruction_Following, Multi_Round, Reasoning, Robustness, Roleplay, Safety
    • Viewing: Multi_Discipline
  • Output: Binary judgments (Correct/Incorrect, Good/Bad)
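
For reference, a hedged sketch of the judge call, assuming gpt-oss-20b is served behind an OpenAI-compatible endpoint (the base URL and prompt text are illustrative; the 13 real task-specific prompts come from evaluation_prompts.py):

from openai import OpenAI

# Assumes a local OpenAI-compatible server (e.g. vLLM) hosting gpt-oss-20b.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Illustrative judge prompt; the actual prompts live in evaluation_prompts.py.
judge_prompt = (
    "You are grading a voice assistant's answer.\n"
    "Question: What instrument is playing?\n"
    "Reference answer: A violin.\n"
    "Model answer: It sounds like a violin.\n"
    "Reply with [Correct] or [Incorrect]."
)

response = client.chat.completions.create(
    model="gpt-oss-20b",
    messages=[{"role": "user", "content": judge_prompt}],
    temperature=0.0,
)
print(response.choices[0].message.content)  # e.g. "[Correct]"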

4. Speech Quality Assessment (2_speech_quality/evaluate_speech.py)

Measures audio naturalness and fluency (a usage sketch follows the list below):

python 2_speech_quality/evaluate_speech.py
  • Model: UTMOS22_strong
  • Purpose: Provides Mean Opinion Score (MOS) reflecting speech quality
  • Output: MOS scores for all generated audio responses
  • Score Range: 1-5 (converted to 20-100 scale in final calculation)
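
For reference, a minimal sketch of MOS prediction with UTMOS22 (loading the strong checkpoint through the tarepan/SpeechMOS torch.hub wrapper is one convenient option; the official script may use the original UTMOS codebase instead):

import torch
import librosa

# One way to obtain the UTMOS22 strong checkpoint (an assumption, not necessarily
# how evaluate_speech.py loads it).
predictor = torch.hub.load("tarepan/SpeechMOS:v1.2.0", "utmos22_strong", trust_repo=True)

# Load one generated response at 16 kHz and predict a MOS in [1, 5].
wave, sr = librosa.load("response.wav", sr=16000, mono=True)
mos = predictor(torch.from_numpy(wave).unsqueeze(0), sr).item()

print(f"MOS: {mos:.2f} -> speech score: {mos * 20:.1f}/100")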

5. Content-Speech Consistency Evaluation

Step 5a: Speech Transcription (3_content_speech_consistency/evaluate_whisper.py)

python 3_content_speech_consistency/evaluate_whisper.py
  • Model: Whisper-Large-v3
  • Purpose: Transcribes generated speech to text for consistency analysis
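
A minimal transcription sketch using the openai-whisper package installed earlier (the audio path is illustrative):

import whisper

# Load Whisper-Large-v3 and transcribe one generated speech file.
model = whisper.load_model("large-v3")
result = model.transcribe("response.wav")

print(result["text"])  # transcript fed into the consistency (WER) analysis below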

Step 5b: Consistency Analysis (3_content_speech_consistency/evaluate_consistency.py)

python 3_content_speech_consistency/evaluate_consistency.py
  • Method: Modified Word Error Rate (WER) calculation
  • Formula: Character-level Levenshtein distance with length thresholds
  • Special Handling: Addresses multiple-choice questions where models output only final letters

Modified WER Calculation:

Let n = len(non-space chars in text1.lower())
Let m = len(non-space chars in text2.lower())

WER'(text1, text2) = {
  1,     if min(n,m) < 10 and max(n,m) > 10
  0,     if min(n,m) < 10 and max(n,m) ≤ 10
  Levenshtein(text1,text2)/max(n,m), otherwise
}
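
This rule translates directly into Python (a sketch; the helper names are ours, and the edit distance is computed character-wise on the lowercased, space-stripped strings, matching the definitions of n and m above):

def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via standard dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def modified_wer(text1: str, text2: str) -> float:
    """WER' as defined above."""
    s1 = "".join(text1.lower().split())
    s2 = "".join(text2.lower().split())
    n, m = len(s1), len(s2)
    if min(n, m) < 10 and max(n, m) > 10:
        return 1.0  # one side is a short answer (e.g. a choice letter), the other is long
    if min(n, m) < 10 and max(n, m) <= 10:
        return 0.0  # both are short answers; treat them as consistent
    return levenshtein(s1, s2) / max(n, m)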

📊 Score Extraction and Final Results

Extract Judgment Results

python extract_judge_result.py
  • Purpose: Parses binary decisions from content evaluation outputs
  • Method: Extracts [Correct]/[Incorrect] and [Good]/[Bad] judgments
  • Output: Structured results in res.json
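
The parsing step can be as simple as a regular-expression pass over the judge outputs (a sketch; the exact logic and the res.json layout in extract_judge_result.py may differ):

import re

def parse_judgment(judge_output: str):
    """Map the judge's bracketed verdict to 1/0, or None if no verdict is found."""
    match = re.search(r"\[(Correct|Incorrect|Good|Bad)\]", judge_output, re.IGNORECASE)
    if match is None:
        return None
    return 1 if match.group(1).lower() in ("correct", "good") else 0

print(parse_judgment("The answer covers the key points. [Correct]"))      # 1
print(parse_judgment("[Bad] The speech ignores the requested emotion."))  # 0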

Generate Final Scores

python get_final_scores.py
  • Purpose: Computes comprehensive performance metrics
  • Output Metrics:
    • Content Score: Binary accuracy (0-100%)
    • Consistency Score: 100 - WER% (0-100%)
    • Speech Score: MOS × 20 (0-100 scale)
    • Overall Score: Product of three dimensions

📋 Complete Evaluation Pipeline

  1. Setup: Ensure model outputs follow the directory structure:
    ./results/MODEL_NAME/Task_Name/Sub_Task_Name/Example_ID/example.json
    • for example: ./results/gpt-4o-audio-preview-2025-06-03/Listening/Music/Listening_Music_0/example.json.
  2. Run Core Evaluations (can be executed in parallel):
    python 0_emotion2vec/evaluate_emotion.py          # Emotion analysis
    python 0_speaker_similarity/evaluate_roleplay.py  # Speaker similarity
    python 2_speech_quality/evaluate_speech.py        # Speech quality
    python 3_content_speech_consistency/evaluate_whisper.py  # Transcription
  3. Content Evaluation:
    python 1_content_quality/evaluate_content.py      # Content quality
  4. Consistency Analysis:
    python 3_content_speech_consistency/evaluate_consistency.py  # WER calculation
  5. Generate Results:
    python extract_judge_result.py    # Extract judgments
    python get_final_scores.py        # Compute final scores
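
Before launching step 2, one can sanity-check that the model outputs follow the expected layout (a sketch; only the directory pattern from step 1 is assumed):

from glob import glob

# Count example.json files under ./results/MODEL_NAME/Task_Name/Sub_Task_Name/Example_ID/.
model_name = "gpt-4o-audio-preview-2025-06-03"  # illustrative model directory
files = glob(f"./results/{model_name}/*/*/*/example.json")
print(f"found {len(files)} examples for {model_name}")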

🎯 Evaluation Dimensions Summary

Dimension | Method | Models Used | Output Range
Emotion | Emotion Classification | emotion2vec | Probability distribution
Speaker Similarity | Voice Verification | WeSpeaker | 0-1 similarity score
Content Quality | LLM Judgment | gpt-oss-20b | 0-100%
Speech Quality | MOS Prediction | UTMOS22 | 0-100 (MOS × 20)
Consistency | Modified WER | Whisper-Large-v3 | 0-100% (100 - WER)

This comprehensive evaluation framework enables thorough assessment of multimodal AI assistants across listening, speaking, and viewing capabilities, providing both granular insights and unified performance metrics.

πŸ“ Citation

If you find this benchmark useful in your research, please consider citing it with the following BibTeX entry:

@misc{wang2025voiceassistantevalbenchmarkingaiassistants,
      title={VoiceAssistant-Eval: Benchmarking AI Assistants across Listening, Speaking, and Viewing}, 
      author={Ke Wang and Houxing Ren and Zimu Lu and Mingjie Zhan and Hongsheng Li},
      year={2025},
      eprint={2509.22651},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.22651}, 
}

🧠 Related Work
