Audio Flamingo is a novel audio language model capable of
- understanding audio,
- quickly adapting to unseen tasks via in-context learning and retrieval, and
- understanding and responding to multi-turn dialogues.
We introduce a series of training techniques, architecture designs, and data strategies to equip our model with these abilities. Extensive evaluations across various audio understanding tasks confirm the efficacy of our method, setting the new state of the art on several benchmarks.
This model is intended for non-commercial research use only.
- Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities
- Project Page
- Demo Website
Architecture Type: Transformer
Network Architecture: Audio Flamingo
Audio Flamingo uses a Flamingo-style architecture with a frozen audio feature extractor, trainable transformation layers and cross-attention dense (xattn-dense) layers, and language model layers.
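The gated cross-attention blocks follow the general Flamingo recipe, in which audio features condition the language model through gated cross-attention followed by a feed-forward layer. Below is a minimal PyTorch sketch of one such xattn-dense block; the layer dimensions, names, and exact structure are assumptions for illustration, not the released implementation.

```python
# Minimal sketch of a Flamingo-style gated cross-attention ("xattn-dense") block.
# Hidden size, head count, and module names are assumptions, not the released code.
import torch
import torch.nn as nn

class GatedXAttnDense(nn.Module):
    def __init__(self, dim: int = 1024, n_heads: int = 16):
        super().__init__()
        # Cross-attention from text hidden states (queries) to audio features (keys/values).
        self.xattn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffw = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # tanh gates initialized at zero, so the language model is unchanged at the start of training.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffw_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden: torch.Tensor, audio_feats: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.xattn(text_hidden, audio_feats, audio_feats)
        x = text_hidden + torch.tanh(self.attn_gate) * attn_out
        x = x + torch.tanh(self.ffw_gate) * self.ffw(x)
        return x  # passed on to the next language-model layer
```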
Input Types: Audio, Text
Input Format: WAV/MP3/FLAC, String
Input Parameters: None
Maximum Audio Input Length: 33.25 seconds
Maximum Text Input Length: 512 tokens
Output Type: Text
Output Format: String
Output Parameters: None
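As a quick illustration of the input limits above, the sketch below loads an audio file and truncates audio and text to the stated maximums. The 16 kHz sample rate, the use of librosa, and the HuggingFace-style tokenizer are assumptions for illustration only; consult the project page for the actual preprocessing pipeline.

```python
# Sketch of input preparation within the stated limits (assumptions noted below).
import librosa

MAX_AUDIO_SECONDS = 33.25   # maximum audio input length from this card
MAX_TEXT_TOKENS = 512       # maximum text input length from this card
SAMPLE_RATE = 16_000        # assumed sample rate; not specified in this card

def load_audio(path: str):
    # librosa decodes WAV/MP3/FLAC and resamples to the target rate.
    wav, _ = librosa.load(path, sr=SAMPLE_RATE, mono=True)
    max_samples = int(MAX_AUDIO_SECONDS * SAMPLE_RATE)
    return wav[:max_samples]  # truncate to the maximum supported length

def truncate_text(prompt: str, tokenizer):
    # Any HuggingFace-style tokenizer; keep at most MAX_TEXT_TOKENS tokens.
    return tokenizer(prompt, truncation=True, max_length=MAX_TEXT_TOKENS)["input_ids"]
```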
Runtime Engine(s): PyTorch
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
- NVIDIA Hopper
Supported Operating System(s): Linux
Model Version(s): v1.0
Audio Flamingo is trained on publicly available datasets under various licenses, the most restrictive of which are non-commercial/research-only. The training data contains diverse audio types, including speech, environmental sounds, and music.
- OpenAQA
- Laion630K
- LP-MusicCaps
- SoundDescs
- WavCaps
- AudioSet
- AudioSet Strong Labeled
- WavText5K
- MSP-Podcast
- ClothoAQA
- Clotho-v2
- MACS
- FSD50k
- CochlScene
- NonSpeech 7k
- Chime-home
- Sonyc-UST
- Emov-DB
- JL-Corpus
- Tess
- OMGEmotion
- MELD
- MusicAVQA
- MusicQA
- MusicCaps
- NSynth
- MTG-Jamendo
- MusDB-HQ
- FMA
For all of these datasets, the data collection method is [Human]. For OpenAQA, Laion630K, LP-MusicCaps, WavCaps, and MusicQA, the data labeling method is [Synthetic]; for the rest, the data labeling method is [Human].
Audio Flamingo is evaluated on the test split of the following datasets.
- ClothoAQA
- MusicAVQA
- Clotho-v2
- FSD50k
- CochlScene
- NonSpeech 7k
- NSynth
- AudioCaps
- CREMA-D
- Ravdess
- US8K
- GTZAN
- Medley-solos-DB
For all of these datasets, the data collection method is [Human] and the data labeling method is [Human].
Engine: HuggingFace Transformers
Test Hardware: NVIDIA A100 80GB