# AudioLLMs

This repository is a curated collection of research papers focused on the development, implementation, and evaluation of language models for audio data. Our goal is to provide researchers and practitioners with a comprehensive resource to explore the latest advancements in AudioLLMs. Contributions and suggestions for new papers are highly encouraged!

## Models

| Date | Model | Key Affiliations | Paper | Link |
|------|-------|------------------|-------|------|
| 2024-10 | SPIRIT LM | Meta | SPIRIT LM: Interleaved Spoken and Written Language Model | Paper / Code / Project |
| 2024-10 | DiVA | Georgia Tech, Stanford | Distilling an End-to-End Voice Assistant Without Instruction Training Data | Paper / Project |
| 2024-09 | Moshi | Kyutai | Moshi: a speech-text foundation model for real-time dialogue | Paper / Code |
| 2024-09 | LLaMA-Omni | CAS | LLaMA-Omni: Seamless Speech Interaction with Large Language Models | Paper / Code |
| 2024-09 | Ultravox | fixie-ai | GitHub Open Source | Code |
| 2024-08 | Mini-Omni | Tsinghua | Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming | Paper / Code |
| 2024-08 | Typhoon-Audio | Typhoon | Typhoon-Audio Preview Release | Page |
| 2024-08 | USDM | SNU | Integrating Paralinguistics in Speech-Empowered Large Language Models for Natural Conversation | Paper |
| 2024-08 | MooER | Moore Threads | MooER: LLM-based Speech Recognition and Translation Models from Moore Threads | Paper / Code |
| 2024-07 | GAMA | UMD | GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities | Paper / Code |
| 2024-07 | LLaST | CUHK-SZ | LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models | Paper / Code |
| 2024-07 | CompA | University of Maryland | CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models | Paper / Code / Project |
| 2024-07 | Qwen2-Audio | Alibaba | Qwen2-Audio Technical Report | Paper / Code |
| 2024-07 | FunAudioLLM | Alibaba | FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs | Paper / Code / Demo |
| 2024-06 | BESTOW | NVIDIA | BESTOW: Efficient and Streamable Speech Language Model with the Best of Two Worlds in GPT and T5 | Paper |
| 2024-06 | DeSTA | NTU-Taiwan, NVIDIA | DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment | Paper / Code |
| 2024-05 | AudioChatLlama | Meta | AudioChatLlama: Towards General-Purpose Speech Abilities for LLMs | Paper |
| 2024-05 | Audio Flamingo | NVIDIA | Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities | Paper / Code |
| 2024-05 | SpeechVerse | AWS | SpeechVerse: A Large-scale Generalizable Audio Language Model | Paper |
| 2024-04 | SALMONN | Tsinghua | SALMONN: Towards Generic Hearing Abilities for Large Language Models | Paper / Code / Demo |
| 2024-03 | WavLLM | CUHK | WavLLM: Towards Robust and Adaptive Speech Large Language Model | Paper / Code |
| 2024-02 | LTU | MIT | Listen, Think, and Understand | Paper / Code |
| 2024-02 | SLAM-LLM | SJTU | An Embarrassingly Simple Approach for LLM with Strong ASR Capacity | Paper / Code |
| 2024-01 | Pengi | Microsoft | Pengi: An Audio Language Model for Audio Tasks | Paper / Code |
| 2023-12 | Qwen-Audio | Alibaba | Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models | Paper / Code / Demo |
| 2023-12 | LTU-AS | MIT | Joint Audio and Speech Understanding | Paper / Code / Demo |
| 2023-10 | Speech-LLaMA | Microsoft | On decoder-only architecture for speech-to-text and large language model integration | Paper |
| 2023-10 | UniAudio | CUHK | An Audio Foundation Model Toward Universal Audio Generation | Paper / Code / Demo |
| 2023-09 | LLaSM | LinkSoul.AI | LLaSM: Large Language and Speech Model | Paper / Code |
| 2023-06 | AudioPaLM | Google | AudioPaLM: A Large Language Model that Can Speak and Listen | Paper / Demo |
| 2023-05 | VioLA | Microsoft | VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation | Paper |
| 2023-05 | SpeechGPT | Fudan | SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities | Paper / Code / Demo |
| 2023-04 | AudioGPT | Zhejiang University | AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head | Paper / Code |
| 2022-09 | AudioLM | Google | AudioLM: a Language Modeling Approach to Audio Generation | Paper / Demo |

## Models (language + audio + other modalities)

| Date | Model | Key Affiliations | Paper | Link |
|------|-------|------------------|-------|------|
| 2024-09 | EMOVA | HKUST | EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions | Paper / Demo |
| 2023-11 | CoDi-2 | UC Berkeley | CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation | Paper / Code / Demo |
| 2023-06 | Macaw-LLM | Tencent | Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration | Paper / Code |

## Methodology

| Date | Name | Key Affiliations | Paper | Link |
|------|------|------------------|-------|------|
| 2024-10 | SpeechEmotionLlama | MIT, Meta | Frozen Large Language Models Can Perceive Paralinguistic Aspects of Speech | Paper |
| 2024-09 | AudioBERT | POSTECH | AudioBERT: Audio Knowledge Augmented Language Model | Paper / Code |
| 2024-09 | MoWE-Audio | A*STAR | MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders | Paper |
| 2024-09 | - | Tsinghua SIGS | Comparing Discrete and Continuous Space LLMs for Speech Recognition | Paper |
| 2024-07 | - | NTU-Taiwan, Meta | Investigating Decoder-only Large Language Models for Speech-to-text Translation | Paper |
| 2024-06 | Speech ReaLLM | Meta | Speech ReaLLM – Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time | Paper |
| 2023-09 | Segment-level Q-Former | Tsinghua | Connecting Speech Encoder and Large Language Model for ASR | Paper |
| 2023-07 | - | Meta | Prompting Large Language Models with Speech Recognition Abilities | Paper |

## Adversarial Attacks

| Date | Name | Key Affiliations | Paper | Link |
|------|------|------------------|-------|------|
| 2024-05 | VoiceJailbreak | CISPA | Voice Jailbreak Attacks Against GPT-4o | Paper |

## Evaluation

| Date | Name | Key Affiliations | Paper | Link |
|------|------|------------------|-------|------|
| 2024-10 | VoiceBench | NUS | VoiceBench: Benchmarking LLM-Based Voice Assistants | Paper / Code |
| 2024-08 | MuChoMusic | UPF, QMUL, UMG | MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models | Paper / Code |
| 2024-07 | AudioEntailment | CMU, Microsoft | Audio Entailment: Assessing Deductive Reasoning for Audio Understanding | Paper / Code |
| 2024-06 | Audio Hallucination | NTU-Taiwan | Understanding Sounds, Missing the Questions: The Challenge of Object Hallucination in Large Audio-Language Models | Paper / Code |
| 2024-06 | AudioBench | A*STAR, Singapore | AudioBench: A Universal Benchmark for Audio Large Language Models | Paper / Code / LeaderBoard |
| 2024-05 | AIR-Bench | ZJU, Alibaba | AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension | Paper / Code |
| 2023-09 | Dynamic-SUPERB | NTU-Taiwan, etc. | Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech | Paper / Code |

## Audio Model

Audio models are distinct from audio large language models; the entries below evaluate general-purpose audio models rather than AudioLLMs.

### Evaluation

| Date | Name | Key Affiliations | Paper | Link |
|------|------|------------------|-------|------|
| 2024-09 | Salmon | Hebrew University of Jerusalem | A Suite for Acoustic Language Model Evaluation | Paper / Code |

## Survey

| Date | Key Affiliations | Paper | Link |
|------|------------------|-------|------|
| 2024-11 | Zhejiang University | WavChat: A Survey of Spoken Dialogue Models | Paper |
| 2024-10 | CUHK, Tencent | Recent Advances in Speech Language Models: A Survey | Paper |
| 2024-10 | SJTU, AISpeech | A Survey on Speech Large Language Models | Paper |