
🔥 VoiceAssistant-Eval: Benchmarking AI Assistants across Listening, Speaking, and Viewing


🌟 This is the official repository for the paper "VoiceAssistant-Eval: Benchmarking AI Assistants across Listening, Speaking, and Viewing", which contains the evaluation code for the VoiceAssistant-Eval benchmark.

[🌐 Homepage] [🤗 Huggingface Dataset] [📊 Leaderboard] [📊 Detailed Leaderboard] [📊 Roleplay Leaderboard] [📖 Paper]

💥 News

  • [2025-09-27] Qwen2.5-Omni-7B achieves 59.2% accuracy on image + text queries but only 42.9% on image + audio queries, reflecting a 16.3-point drop.
  • [2025-09-27] Step-Audio-2-mini achieves more than double the listening accuracy of the 32B LLaMA-Omni2 model (40.06 vs. 16.00).
  • [2025-09-27] We observe that 20 out of 22 models score higher on Speaking than on Listening, and this mismatch highlights the need for more balanced development.
  • [2025-09-27] GPT-4o-Audio fails to surpass open-source models in 4 out of 13 tasks.
  • [2025-09-27] Our dataset is now available on Hugging Face.
  • [2025-09-27] Our paper is now available on arXiv.

👀 Introduction

The growing capabilities of large language models and multimodal systems have spurred interest in voice-first AI assistants, yet existing benchmarks are inadequate for evaluating the full range of these systems' capabilities. We summarize four key weaknesses of current benchmarks, highlighting the urgent need for a new evaluation framework:

  1. W1: Lack of voice personalization evaluation.
    Current benchmarks rarely test how well models mimic specific voices, which is key for personalized assistants (e.g., in healthcare). Without this, models may fail in real-world personalized applications.

  2. W2: Limited focus on hands-free interaction.
    Benchmarks often use text-based instructions, ignoring true voice-first, hands-free use. This limits reliability in critical contexts like driving or accessibility for visually impaired users.

  3. W3: Neglect of real-world audio contexts.
    Datasets seldom cover varied, realistic audio environments. Models aren't tested on understanding beyond speech (e.g., music, nature sounds), reducing their everyday usefulness.

  4. W4: Insufficient multi-modal (vision + audio) assessment.
    Benchmarks rarely test joint speech and visual input, missing key scenarios like smart tutors. This gap means benchmarks don't reflect real-world multimodal needs.

We introduce VoiceAssistant-Eval, a comprehensive benchmark designed to assess AI assistants across listening, speaking, and viewing. VoiceAssistant-Eval comprises 10,497 curated examples spanning 13 task categories. These tasks include natural sounds, music, and spoken dialogue for listening; multi-turn dialogue, role-play imitation, and various scenarios for speaking; and highly heterogeneous images for viewing.

To demonstrate its utility, we evaluate 21 open-source models and GPT-4o-Audio, measuring the quality of the response content and speech, as well as their consistency. The results reveal three key findings: (1) proprietary models do not universally outperform open-source models; (2) most models excel at speaking tasks but lag in audio understanding; and (3) well-designed smaller models can rival much larger ones. Notably, the mid-sized Step-Audio-2-mini (7B) achieves more than double the listening accuracy of LLaMA-Omni2-32B-Bilingual. However, challenges remain: multimodal (audio+visual) input and role-play voice imitation tasks are difficult for current models, and significant gaps persist in robustness and safety alignment. VoiceAssistant-Eval identifies these gaps and establishes a rigorous framework for evaluating and guiding the development of next-generation multimodal voice assistants.



Figure 1: (a) Scores of six prominent omni-models across 13 tasks. (b) Examples from four newly designed tasks for voice assistants: I. Example from the role-play task with reference audio. II. A truly voice-based multi-turn conversation, instead of providing multi-round context in text. III. Multi-modal (vision + audio) integration understanding. IV. An audio question with music context.

Please refer to our project homepage and the paper for more details.

πŸ“ Dataset Overview

Overview of principal statistics for VoiceAssistant-Eval. Proportional distribution of tasks and the corresponding weaknesses addressed in VoiceAssistant-Eval.

πŸ† Leaderboards

Explore the comprehensive evaluation results of AI assistants across multiple dimensions:

📈 Evaluation

📥 Repository Setup

This repository uses Git LFS to store large files (audio files, model weights, etc.). To clone the repository properly:

# Install Git LFS if not already installed
git lfs install

# Clone the repository with LFS files
git clone https://github.com/mathllm/VoiceAssistant-Eval.git
cd VoiceAssistant-Eval

# Alternatively, if you've already cloned without LFS:
git lfs pull

πŸ› οΈ Installation

conda create -p envs/voiceassistant python=3.12
conda activate envs/voiceassistant

# CUDA 12.4
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124

pip install git+https://github.com/wenet-e2e/wespeaker.git
pip install PyYAML
pip install requests
pip install librosa
pip install openai-whisper
pip install funasr
pip install openai
pip install transformers==4.55.4
pip install accelerate

🔬 Evaluation Protocols

Our evaluation provides a comprehensive assessment of both generated speech and text responses, as well as their consistency. Unlike previous studies that focus solely on text responses, we aggregate multiple detailed metrics into a single, unified score for holistic model performance evaluation.

🎯 Triadic Evaluation System

We evaluate model responses across three key dimensions:

  1. Content Quality: Response accuracy, helpfulness, and appropriateness
  2. Speech Quality: Audio naturalness and fluency
  3. Consistency: Alignment between intended content and actual speech output

Final Score Calculation:

Final Score = Content Score × Speech Score × Consistency Score × 100%
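
For concreteness, this aggregation can be written as a small helper (a minimal sketch; the function name is ours, and the per-dimension conversions follow the score definitions given later in this README):

def final_score(content_pct: float, mos: float, wer_pct: float) -> float:
    """Combine the three dimensions into one 0-100 score.

    content_pct: binary content accuracy in percent (0-100)
    mos:         predicted MOS in [1, 5]; the speech score is MOS x 20
    wer_pct:     modified WER in percent; the consistency score is 100 - WER
    """
    content = content_pct / 100.0
    speech = (mos * 20.0) / 100.0
    consistency = (100.0 - wer_pct) / 100.0
    return content * speech * consistency * 100.0

# Example: 80% content accuracy, MOS 4.2, WER 5% -> 0.80 * 0.84 * 0.95 * 100 = 63.84
print(final_score(80.0, 4.2, 5.0))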

🚀 Running Evaluations

1. Emotion Analysis (0_emotion2vec/evaluate_emotion.py)

Analyzes emotional expression in generated speech for the Speaking/Emotion task (a usage sketch follows the list below):

python 0_emotion2vec/evaluate_emotion.py
  • Model: emotion2vec_plus_large
  • Purpose: Extracts emotion probabilities (angry, disgusted, fearful, happy, neutral, sad, surprised)
  • Output: Emotions with >1% probability are included in evaluation prompts
  • Target: ./results/MODEL_NAME/Speaking/Emotion/ examples
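
For reference, a minimal sketch of extracting emotion probabilities with emotion2vec through FunASR's AutoModel interface (the model id, file path, and post-processing here are illustrative and may differ from evaluate_emotion.py):

from funasr import AutoModel

# Load emotion2vec_plus_large through FunASR (model id as published on ModelScope).
model = AutoModel(model="iic/emotion2vec_plus_large")

# Utterance-level emotion recognition on one generated response
# (path is illustrative; the script walks ./results/MODEL_NAME/Speaking/Emotion/).
res = model.generate("response.wav", granularity="utterance", extract_embedding=False)

# Keep only emotions whose probability exceeds 1%, as described above.
emotions = {
    label: score
    for label, score in zip(res[0]["labels"], res[0]["scores"])
    if score > 0.01
}
print(emotions)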

2. Speaker Similarity (0_speaker_similarity/evaluate_roleplay.py)

Measures voice similarity for roleplay tasks (a usage sketch follows the list below):

python 0_speaker_similarity/evaluate_roleplay.py
  • Model: WeSpeaker voxblink2_samresnet100_ft
  • Purpose: Computes speaker similarity between generated speech and reference role audio
  • Output: Speaker similarity scores incorporated into roleplay task evaluation
  • Target: ./results/MODEL_NAME/Speaking/Roleplay/ examples
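
For reference, a minimal sketch using the wespeaker Python API (the checkpoint directory is a placeholder; the voxblink2_samresnet100_ft weights must be downloaded separately, and the actual script may load them differently):

import wespeaker

# Load a locally downloaded WeSpeaker checkpoint directory (placeholder path;
# assumes the voxblink2_samresnet100_ft model files are already on disk).
model = wespeaker.load_model_local("models/voxblink2_samresnet100_ft")

# Similarity between the generated speech and the reference role audio (roughly 0-1).
similarity = model.compute_similarity("generated.wav", "reference_role.wav")
print(f"speaker similarity: {similarity:.3f}")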

3. Content Quality Evaluation (1_content_quality/evaluate_content.py)

Evaluates response content using LLM-based judgment (a usage sketch follows the list below):

python 1_content_quality/evaluate_content.py
  • Model: gpt-oss-20b
  • Evaluation Prompts: 13 task-specific prompts from evaluation_prompts.py
  • Tasks Covered:
    • Listening: General, Music, Sound, Speech
    • Speaking: Assistant, Emotion, Instruction_Following, Multi_Round, Reasoning, Robustness, Roleplay, Safety
    • Viewing: Multi_Discipline
  • Output: Binary judgments (Correct/Incorrect, Good/Bad)
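
For reference, a hedged sketch of the judge call, assuming gpt-oss-20b is served behind an OpenAI-compatible endpoint (the base URL and prompt text are illustrative; the 13 real task-specific prompts come from evaluation_prompts.py):

from openai import OpenAI

# Assumes a local OpenAI-compatible server (e.g. vLLM) hosting gpt-oss-20b.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Illustrative judge prompt; the actual prompts live in evaluation_prompts.py.
judge_prompt = (
    "You are grading a voice assistant's answer.\n"
    "Question: What instrument is playing?\n"
    "Reference answer: A violin.\n"
    "Model answer: It sounds like a violin.\n"
    "Reply with [Correct] or [Incorrect]."
)

response = client.chat.completions.create(
    model="gpt-oss-20b",
    messages=[{"role": "user", "content": judge_prompt}],
    temperature=0.0,
)
print(response.choices[0].message.content)  # e.g. "[Correct]"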

4. Speech Quality Assessment (2_speech_quality/evaluate_speech.py)

Measures audio naturalness and fluency (a usage sketch follows the list below):

python 2_speech_quality/evaluate_speech.py
  • Model: UTMOS22_strong
  • Purpose: Provides Mean Opinion Score (MOS) reflecting speech quality
  • Output: MOS scores for all generated audio responses
  • Score Range: 1-5 (converted to 20-100 scale in final calculation)
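
For reference, a minimal sketch of MOS prediction with UTMOS22 (loading the strong checkpoint through the tarepan/SpeechMOS torch.hub wrapper is one convenient option; the official script may use the original UTMOS codebase instead):

import torch
import librosa

# One way to obtain the UTMOS22 strong checkpoint (an assumption, not necessarily
# how evaluate_speech.py loads it).
predictor = torch.hub.load("tarepan/SpeechMOS:v1.2.0", "utmos22_strong", trust_repo=True)

# Load one generated response at 16 kHz and predict a MOS in [1, 5].
wave, sr = librosa.load("response.wav", sr=16000, mono=True)
mos = predictor(torch.from_numpy(wave).unsqueeze(0), sr).item()

print(f"MOS: {mos:.2f} -> speech score: {mos * 20:.1f}/100")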

5. Content-Speech Consistency Evaluation

Step 5a: Speech Transcription (3_content_speech_consistency/evaluate_whisper.py)

python 3_content_speech_consistency/evaluate_whisper.py
  • Model: Whisper-Large-v3
  • Purpose: Transcribes generated speech to text for consistency analysis
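
A minimal transcription sketch using the openai-whisper package installed earlier (the audio path is illustrative):

import whisper

# Load Whisper-Large-v3 and transcribe one generated speech file.
model = whisper.load_model("large-v3")
result = model.transcribe("response.wav")

print(result["text"])  # transcript fed into the consistency (WER) analysis below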

Step 5b: Consistency Analysis (3_content_speech_consistency/evaluate_consistency.py)

python 3_content_speech_consistency/evaluate_consistency.py
  • Method: Modified Word Error Rate (WER) calculation
  • Formula: Character-level Levenshtein distance with length thresholds
  • Special Handling: Addresses multiple-choice questions where models output only final letters

Modified WER Calculation:

Let n = len(non-space chars in text1.lower())
Let m = len(non-space chars in text2.lower())

WER'(text1, text2) = {
  1,     if min(n,m) < 10 and max(n,m) > 10
  0,     if min(n,m) < 10 and max(n,m) ≤ 10
  Levenshtein(text1,text2)/max(n,m), otherwise
}
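
This rule translates directly into Python (a sketch; the helper names are ours, and the edit distance is computed character-wise on the lowercased, space-stripped strings, matching the definitions of n and m above):

def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via standard dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def modified_wer(text1: str, text2: str) -> float:
    """WER' as defined above."""
    s1 = "".join(text1.lower().split())
    s2 = "".join(text2.lower().split())
    n, m = len(s1), len(s2)
    if min(n, m) < 10 and max(n, m) > 10:
        return 1.0  # one side is a short answer (e.g. a choice letter), the other is long
    if min(n, m) < 10 and max(n, m) <= 10:
        return 0.0  # both are short answers; treat them as consistent
    return levenshtein(s1, s2) / max(n, m)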

📊 Score Extraction and Final Results

Extract Judgment Results

python extract_judge_result.py
  • Purpose: Parses binary decisions from content evaluation outputs
  • Method: Extracts [Correct]/[Incorrect] and [Good]/[Bad] judgments
  • Output: Structured results in res.json
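
The parsing step can be as simple as a regular-expression pass over the judge outputs (a sketch; the exact logic and the res.json layout in extract_judge_result.py may differ):

import re

def parse_judgment(judge_output: str):
    """Map the judge's bracketed verdict to 1/0, or None if no verdict is found."""
    match = re.search(r"\[(Correct|Incorrect|Good|Bad)\]", judge_output, re.IGNORECASE)
    if match is None:
        return None
    return 1 if match.group(1).lower() in ("correct", "good") else 0

print(parse_judgment("The answer covers the key points. [Correct]"))      # 1
print(parse_judgment("[Bad] The speech ignores the requested emotion."))  # 0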

Generate Final Scores

python get_final_scores.py
  • Purpose: Computes comprehensive performance metrics
  • Output Metrics:
    • Content Score: Binary accuracy (0-100%)
    • Consistency Score: 100 - WER% (0-100%)
    • Speech Score: MOS × 20 (0-100 scale)
    • Overall Score: Product of three dimensions

📋 Complete Evaluation Pipeline

  1. Setup: Ensure model outputs follow the directory structure:
    ./results/MODEL_NAME/Task_Name/Sub_Task_Name/Example_ID/example.json
    • for example: ./results/gpt-4o-audio-preview-2025-06-03/Listening/Music/Listening_Music_0/example.json.
  2. Run Core Evaluations (can be executed in parallel):
    python 0_emotion2vec/evaluate_emotion.py          # Emotion analysis
    python 0_speaker_similarity/evaluate_roleplay.py  # Speaker similarity
    python 2_speech_quality/evaluate_speech.py        # Speech quality
    python 3_content_speech_consistency/evaluate_whisper.py  # Transcription
  3. Content Evaluation:
    python 1_content_quality/evaluate_content.py      # Content quality
  4. Consistency Analysis:
    python 3_content_speech_consistency/evaluate_consistency.py  # WER calculation
  5. Generate Results:
    python extract_judge_result.py    # Extract judgments
    python get_final_scores.py        # Compute final scores
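
Before launching step 2, one can sanity-check that the model outputs follow the expected layout (a sketch; only the directory pattern from step 1 is assumed):

from glob import glob

# Count example.json files under ./results/MODEL_NAME/Task_Name/Sub_Task_Name/Example_ID/.
model_name = "gpt-4o-audio-preview-2025-06-03"  # illustrative model directory
files = glob(f"./results/{model_name}/*/*/*/example.json")
print(f"found {len(files)} examples for {model_name}")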

🎯 Evaluation Dimensions Summary

Dimension | Method | Models Used | Output Range
Emotion | Emotion Classification | emotion2vec | Probability distribution
Speaker Similarity | Voice Verification | WeSpeaker | 0-1 similarity score
Content Quality | LLM Judgment | gpt-oss-20b | 0-100%
Speech Quality | MOS Prediction | UTMOS22 | 0-100 (MOS × 20)
Consistency | Modified WER | Whisper-Large-v3 | 0-100% (100 - WER)

This comprehensive evaluation framework enables thorough assessment of multimodal AI assistants across listening, speaking, and viewing capabilities, providing both granular insights and unified performance metrics.

πŸ“ Citation

If you find this benchmark useful in your research, please consider citing it with the following BibTeX entry:

@misc{wang2025voiceassistantevalbenchmarkingaiassistants,
      title={VoiceAssistant-Eval: Benchmarking AI Assistants across Listening, Speaking, and Viewing}, 
      author={Ke Wang and Houxing Ren and Zimu Lu and Mingjie Zhan and Hongsheng Li},
      year={2025},
      eprint={2509.22651},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.22651}, 
}

🧠 Related Work
