| Date | Model | Affiliation | Title | Links |
| --- | --- | --- | --- | --- |
| 2024-10 | SPIRIT LM | Meta | SPIRIT LM: Interleaved Spoken and Written Language Model | Paper / Code / Project |
| 2024-10 | DiVA | Georgia Tech, Stanford | Distilling an End-to-End Voice Assistant Without Instruction Training Data | Paper / Project |
| 2024-09 | Moshi | Kyutai | Moshi: a speech-text foundation model for real-time dialogue | Paper / Code |
| 2024-09 | LLaMA-Omni | CAS | LLaMA-Omni: Seamless Speech Interaction with Large Language Models | Paper / Code |
| 2024-09 | Ultravox | fixie-ai | GitHub Open Source | Code |
| 2024-08 | Mini-Omni | Tsinghua | Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming | Paper / Code |
| 2024-08 | Typhoon-Audio | Typhoon | Typhoon-Audio Preview Release | Page |
| 2024-08 | USDM | SNU | Integrating Paralinguistics in Speech-Empowered Large Language Models for Natural Conversation | Paper |
| 2024-08 | MooER | Moore Threads | MooER: LLM-based Speech Recognition and Translation Models from Moore Threads | Paper / Code |
| 2024-07 | GAMA | UMD | GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities | Paper / Code |
| 2024-07 | LLaST | CUHK-SZ | LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models | Paper / Code |
| 2024-07 | CompA | UMD | CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models | Paper / Code / Project |
| 2024-07 | Qwen2-Audio | Alibaba | Qwen2-Audio Technical Report | Paper / Code |
| 2024-07 | FunAudioLLM | Alibaba | FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs | Paper / Code / Demo |
| 2024-06 | BESTOW | NVIDIA | BESTOW: Efficient and Streamable Speech Language Model with the Best of Two Worlds in GPT and T5 | Paper |
| 2024-06 | DeSTA | NTU-Taiwan, NVIDIA | DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment | Paper / Code |
| 2024-05 | AudioChatLlama | Meta | AudioChatLlama: Towards General-Purpose Speech Abilities for LLMs | Paper |
| 2024-05 | Audio Flamingo | NVIDIA | Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities | Paper / Code |
| 2024-05 | SpeechVerse | AWS | SpeechVerse: A Large-scale Generalizable Audio Language Model | Paper |
| 2024-04 | SALMONN | Tsinghua | SALMONN: Towards Generic Hearing Abilities for Large Language Models | Paper / Code / Demo |
| 2024-03 | WavLLM | CUHK | WavLLM: Towards Robust and Adaptive Speech Large Language Model | Paper / Code |
| 2024-02 | LTU | MIT | Listen, Think, and Understand | Paper / Code |
| 2024-02 | SLAM-LLM | SJTU | An Embarrassingly Simple Approach for LLM with Strong ASR Capacity | Paper / Code |
| 2024-01 | Pengi | Microsoft | Pengi: An Audio Language Model for Audio Tasks | Paper / Code |
| 2023-12 | Qwen-Audio | Alibaba | Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models | Paper / Code / Demo |
| 2023-12 | LTU-AS | MIT | Joint Audio and Speech Understanding | Paper / Code / Demo |
| 2023-10 | Speech-LLaMA | Microsoft | On decoder-only architecture for speech-to-text and large language model integration | Paper |
| 2023-10 | UniAudio | CUHK | An Audio Foundation Model Toward Universal Audio Generation | Paper / Code / Demo |
| 2023-09 | LLaSM | LinkSoul.AI | LLaSM: Large Language and Speech Model | Paper / Code |
| 2023-06 | AudioPaLM | Google | AudioPaLM: A Large Language Model that Can Speak and Listen | Paper / Demo |
| 2023-05 | VioLA | Microsoft | VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation | Paper |
| 2023-05 | SpeechGPT | Fudan | SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities | Paper / Code / Demo |
| 2023-04 | AudioGPT | Zhejiang University | AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head | Paper / Code |
| 2022-09 | AudioLM | Google | AudioLM: a Language Modeling Approach to Audio Generation | Paper / Demo |