AI Audio Datasets List (AI-ADL) 🎵

This is a list of datasets consisting of speech, music, and sound effects, which can provide training data for Generative AI, AIGC, AI model training, intelligent audio tool development, and audio applications. It is mainly used for speech recognition, speech synthesis, singing voice synthesis, music information retrieval, music generation, audio processing, sound synthesis, etc.

Speech
Music
Sound Effect

Speech

AISHELL-1 - AISHELL-1 is a corpus for speech recognition research and building speech recognition systems for Mandarin.
AISHELL-3 - AISHELL-3 is a large-scale and high-fidelity multi-speaker Mandarin speech corpus published by Beijing Shell Shell Technology Co.,Ltd. It can be used to train multi-speaker Text-to-Speech (TTS) systems.The corpus contains roughly 85 hours of emotion-neutral recordings spoken by 218 native Chinese mandarin speakers and total 88035 utterances.
Arabic speech Corpus - The Arabic Speech Corpus (1.5 GB) is a Modern Standard Arabic (MSA) speech corpus for speech synthesis. The corpus contains phonetic and orthographic transcriptions of more than 3.7 hours of MSA speech aligned with recorded speech on the phoneme level. The annotations include word stress marks on the individual phonemes.
AudioMNIST - The dataset consists of 30000 audio samples of spoken digits (0-9) of 60 different speakers.
AVSpeech - AVSpeech is a large-scale audio-visual dataset comprising speech clips with no interfering background signals. The segments are of varying length, between 3 and 10 seconds long, and in each clip the only visible face in the video and audible sound in the soundtrack belong to a single speaking person. In total, the dataset contains roughly 4700 hours of video segments with approximately 150,000 distinct speakers, spanning a wide variety of people, languages and face poses.
ATIS (Airline Travel Information Systems) - The ATIS (Airline Travel Information Systems) is a dataset consisting of audio recordings and corresponding manual transcripts about humans asking for flight information on automated airline travel inquiry systems. The data consists of 17 unique intent categories. The original split contains 4478, 500 and 893 intent-labeled reference utterances in train, development and test set respectively.
Casual Conversations - Casual Conversations dataset is designed to help researchers evaluate their computer vision and audio models for accuracy across a diverse set of age, genders, apparent skin tones and ambient lighting conditions.
CN-Celeb - CN-Celeb is a large-scale speaker recognition dataset collected `in the wild'. This dataset contains more than 130,000 utterances from 1,000 Chinese celebrities, and covers 11 different genres in real world.
Clotho - Clotho is an audio captioning dataset, consisting of 4981 audio samples, and each audio sample has five captions (a total of 24 905 captions). Audio samples are of 15 to 30 s duration and captions are eight to 20 words long.
Common Voice - Common Voice is an audio dataset that consists of a unique MP3 and corresponding text file. There are 9,283 recorded hours in the dataset. The dataset also includes demographic metadata like age, sex, and accent. The dataset consists of 7,335 validated hours in 60 languages.
CoVoST - CoVoST is a large-scale multilingual speech-to-text translation corpus. Its latest 2nd version covers translations from 21 languages into English and from English into 15 languages. It has total 2880 hours of speech and is diversified with 78K speakers and 66 accents.
CVSS - CVSS is a massively multilingual-to-English speech to speech translation (S2ST) corpus, covering sentence-level parallel S2ST pairs from 21 languages into English. CVSS is derived from the Common Voice speech corpus and the CoVoST 2 speech-to-text translation (ST) corpus, by synthesizing the translation text from CoVoST 2 into speech using state-of-the-art TTS systems.
EasyCom - The Easy Communications (EasyCom) dataset is a world-first dataset designed to help mitigate the cocktail party effect from an augmented-reality (AR) -motivated multi-sensor egocentric world view. The dataset contains AR glasses egocentric multi-channel microphone array audio, wide field-of-view RGB video, speech source pose, headset microphone audio, annotated voice activity, speech transcriptions, head and face bounding boxes and source identification labels. We have created and are releasing this dataset to facilitate research in multi-modal AR solutions to the cocktail party problem.
ESD (Emotional Speech Database) - ESD is an Emotional Speech Database for voice conversion research. The ESD database consists of 350 parallel utterances spoken by 10 native English and 10 native Chinese speakers and covers 5 emotion categories (neutral, happy, angry, sad and surprise). More than 29 hours of speech data were recorded in a controlled acoustic environment. The database is suitable for multi-speaker and cross-lingual emotional voice conversion studies.
FPT Open Speech Dataset (FOSD) - This dataset consists of 25,921 recorded Vietnamese speeches (with their transcripts and the labelled start and end times of each speech) manually compiled from 3 sub-datasets (approximately 30 hours in total) released publicly in 2018 by FPT Corporation.
Free Spoken Digit Dataset (FSDD) - A free audio dataset of spoken digits. Think MNIST for audio.A simple audio/speech dataset consisting of recordings of spoken digits in wav files at 8kHz. The recordings are trimmed so that they have near minimal silence at the beginnings and ends.
Fluent Speech Commands - Fluent Speech Commands is an open source audio dataset for spoken language understanding (SLU) experiments. Each utterance is labeled with "action", "object", and "location" values; for example, "turn the lights on in the kitchen" has the label {"action": "activate", "object": "lights", "location": "kitchen"}. A model must predict each of these values, and a prediction for an utterance is deemed to be correct only if all values are correct.
GenshinVoice - Voice dataset of Genshin Impact 原神语音数据集
GigaSpeech - GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training.
How2 - The How2 dataset contains 13,500 videos, or 300 hours of speech, and is split into 185,187 training, 2022 development (dev), and 2361 test utterances. It has subtitles in English and crowdsourced Portuguese translations.
LibriSpeech - The LibriSpeech corpus is a collection of approximately 1,000 hours of audiobooks that are a part of the LibriVox project. Most of the audiobooks come from the Project Gutenberg. The training data is split into 3 partitions of 100hr, 360hr, and 500hr sets while the dev and test data are split into the ’clean’ and ’other’ categories, respectively, depending upon how well or challening Automatic Speech Recognition systems would perform against. Each of the dev and test sets is around 5hr in audio length.
LibriTTS - LibriTTS is a multi-speaker English corpus of approximately 585 hours of read English speech at 24kHz sampling rate, prepared by Heiga Zen with the assistance of Google Speech and Google Brain team members. The LibriTTS corpus is designed for TTS research. It is derived from the original materials (mp3 audio files from LibriVox and text files from Project Gutenberg) of the LibriSpeech corpus.
LibriTTS-R - LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus. It is derived by applying speech restoration to the LibriTTS corpus, which consists of 585 hours of speech data at 24 kHz sampling rate from 2,456 speakers and the corresponding texts. The constituent samples of LibriTTS-R are identical to those of LibriTTS, with only the sound quality improved.
LJSpeech (The LJ Speech Dataset) - This is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours. The texts were published between 1884 and 1964, and are in the public domain. The audio was recorded in 2016-17 by the LibriVox project and is also in the public domain.
LRS2 (Lip Reading Sentences 2) - The Oxford-BBC Lip Reading Sentences 2 (LRS2) dataset is one of the largest publicly available datasets for lip reading sentences in-the-wild. The database consists of mainly news and talk shows from BBC programs. Each sentence is up to 100 characters in length.
LRW (Lip Reading in the Wild) - The Lip Reading in the Wild (LRW) dataset a large-scale audio-visual database that contains 500 different words from over 1,000 speakers. Each utterance has 29 frames, whose boundary is centered around the target word. The database is divided into training, validation and test sets. The training set contains at least 800 utterances for each class while the validation and test sets contain 50 utterances.
MuAViC - A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation.
MuST-C - MuST-C currently represents the largest publicly available multilingual corpus (one-to-many) for speech translation. It covers eight language directions, from English to German, Spanish, French, Italian, Dutch, Portuguese, Romanian and Russian. The corpus consists of audio, transcriptions and translations of English TED talks, and it comes with a predefined training, validation and test split.
MetaQA (MoviE Text Audio QA) - The MetaQA dataset consists of a movie ontology derived from the WikiMovies Dataset and three sets of question-answer pairs written in natural language: 1-hop, 2-hop, and 3-hop queries.
MELD (Multimodal EmotionLines Dataset) - Multimodal EmotionLines Dataset (MELD) has been created by enhancing and extending EmotionLines dataset. MELD contains the same dialogue instances available in EmotionLines, but it also encompasses audio and visual modality along with text. MELD has more than 1400 dialogues and 13000 utterances from Friends TV series. Multiple speakers participated in the dialogues. Each utterance in a dialogue has been labeled by any of these seven emotions -- Anger, Disgust, Sadness, Joy, Neutral, Surprise and Fear. MELD also has sentiment (positive, negative and neutral) annotation for each utterance.
Microsoft Speech Corpus (Indian languages) - Microsoft Speech Corpus (Indian languages) release contains conversational and phrasal speech training and test data for Telugu, Tamil and Gujarati languages. The data package includes audio and corresponding transcripts. Data provided in this dataset shall not be used for commercial purposes. You may use the data solely for research purposes. If you publish your findings, you must provide the following attribution: “Data provided by Microsoft and SpeechOcean.com”.
PATS (Pose Audio Transcript Style) - PATS dataset consists of a diverse and large amount of aligned pose, audio and transcripts. With this dataset, we hope to provide a benchmark that would help develop technologies for virtual agents which generate natural and relevant gestures.
SAVEE (Surrey Audio-Visual Expressed Emotion) - The Surrey Audio-Visual Expressed Emotion (SAVEE) dataset was recorded as a pre-requisite for the development of an automatic emotion recognition system. The database consists of recordings from 4 male actors in 7 different emotions, 480 British English utterances in total. The sentences were chosen from the standard TIMIT corpus and phonetically-balanced for each emotion.
3D-Speaker-Datasets - A large scale multi-Device, multi-Distance, and multi-Dialect audio dataset of human speech.
TED-LIUM - Audio transcription of TED talk. 1495 TED talk audio recordings along with full-text transcriptions of those recordings, create by Laboratoire d’Informatique de l’Université du Maine (LIUM).
The Flickr Audio Caption Corpus - The Flickr 8k Audio Caption Corpus contains 40,000 spoken captions of 8,000 natural images. It was collected in 2015 to investigate multimodal learning schemes for unsupervised speech pattern discovery.
The People’s Speech - The People's Speech is a free-to-download 30,000-hour and growing supervised conversational English speech recognition dataset licensed for academic and commercial usage under CC-BY-SA (with a CC-BY subset). The data is collected via searching the Internet for appropriately licensed audio data with existing transcriptions.
The Spoken Wikipedia Corpora - The Spoken Wikipedia project unite volunteer reader of Wikipedia article. Hundreds of spoken article in multiple languages are available to user who are – for one reason or another – unable or unwilling to consume the write version of the article.
TIMIT - The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus.
VoxCeleb2 - VoxCeleb2 is a large scale speaker recognition dataset obtained automatically from open-source media. VoxCeleb2 consists of over a million utterances from over 6k speakers. Since the dataset is collected ‘in the wild’, the speech segments are corrupted with real world noise including laughter, cross-talk, channel effects, music and other sounds. The dataset is also multilingual, with speech from speakers of 145 different nationalities, covering a wide range of accents, ages, ethnicities and languages.
VoxConverse - VoxConverse is an audio-visual diarisation dataset consisting of multispeaker clips of human speech, extracted from YouTube videos.
VoxLingua107 - VoxLingua107 is a dataset for spoken language recognition of 6628 hours (62 hours per language on the average) and it is accompanied by an evaluation set of 1609 verified utterances.
VoxPopuli - VoxPopuli is a large-scale multilingual corpus providing 100K hours of unlabelled speech data in 23 languages. It is the largest open data to date for unsupervised representation learning as well as semi-supervised learning. VoxPopuli also contains 1.8K hours of transcribed speeches in 16 languages and their aligned oral interpretations into 5 other languages totaling 5.1K hours.
VoxForge - VoxForge is an open speech dataset that was set up to collect transcribed speech for use with Free and Open Source Speech Recognition Engines (on Linux, Windows and Mac).
VocalSound - VocalSound is a free dataset consisting of 21,024 crowdsourced recordings of laughter, sighs, coughs, throat clearing, sneezes, and sniffs from 3,365 unique subjects. The VocalSound dataset also contains meta-information such as speaker age, gender, native language, country, and health condition.
VoiceBank + DEMAND - VoiceBank+DEMAND is a noisy speech database for training speech enhancement algorithms and TTS models. The database was designed to train and test speech enhancement methods that operate at 48kHz. A more detailed description can be found in the paper associated with the database.
WaveFake - WaveFake is a dataset for audio deepfake detection. The dataset consists of a large-scale dataset of over 100K generated audio clips.
WenetSpeech - WenetSpeech is a multi-domain Mandarin corpus consisting of 10,000+ hours high-quality labeled speech, 2,400+ hours weakly labelled speech, and about 10,000 hours unlabeled speech, with 22,400+ hours in total. The authors collected the data from YouTube and Podcast, which covers a variety of speaking styles, scenarios, domains, topics, and noisy conditions. An optical character recognition (OCR) based method is introduced to generate the audio/text segmentation candidates for the YouTube data on its corresponding video captions.
WSJ0-2mix - WSJ0-2mix is a speech recognition corpus of speech mixtures using utterances from the Wall Street Journal (WSJ0) corpus.
WHAM! (WSJ0 Hipster Ambient Mixtures) - The WSJ0 Hipster Ambient Mixtures ( WHAM! ) dataset pairs each two-speaker mixture in the wsj0-2mix dataset with a unique noise background scene. The noise audio was collected at various urban locations throughout the San Francisco Bay Area in late 2018. The environments primarily consist of restaurants, cafes, bars, and parks. Audio was recorded using an Apogee Sennheiser binaural microphone on a tripod between 1.0 and 1.5 meters off the ground.
YTTTS - The YouTube Text-To-Speech dataset is comprised of waveform audio extracted from YouTube videos alongside their English transcriptions.

Music

AAM: Artificial Audio Multitracks Dataset - This dataset contains 3,000 artificial music audio tracks with rich annotations. It is based on real instrument samples and generated by algorithmic composition with respect to music theory. It provides full mixes of the songs as well as single instrument tracks. The midis used for generation are also available. The annotation files include: Onsets, Pitches, Instruments, Keys, Tempos, Segments, Melody instrument, Beats, and Chords.
Acappella - Acappella comprises around 46 hours of a cappella solo singing videos sourced from YouTbe, sampled across different singers and languages. Four languages are considered: English, Spanish, Hindi and others.
ADD: audio-dataset-downloader - Simple Python CLI script for downloading N-hours of audio from Youtube, based on a list of music genres.
ADL Piano MIDI - The ADL Piano MIDI is a dataset of 11,086 piano pieces from different genres. This dataset is based on the Lakh MIDI dataset, which is a collection on 45,129 unique MIDI files that have been matched to entries in the Million Song Dataset.
Aligned Scores and Performances (ASAP) - ASAP is a dataset of aligned musical scores (both MIDI and MusicXML) and performances (audio and MIDI), all with downbeat, beat, time signature, and key signature annotations.
Bach Doodle - The Bach Doodle Dataset is composed of 21.6 million harmonizations submitted from the Bach Doodle. The dataset contains both metadata about the composition (such as the country of origin and feedback), as well as a MIDI of the user-entered melody and a MIDI of the generated harmonization. The dataset contains about 6 years of user entered music.
Bach Violin Dataset - A collection of high-quality public recordings of Bach's sonatas and partitas for solo violin (BWV 1001–1006).
CAL500 (Computer Audition Lab 500) - CAL500 (Computer Audition Lab 500) is a dataset aimed for evaluation of music information retrieval systems. It consists of 502 songs picked from western popular music. The audio is represented as a time series of the first 13 Mel-frequency cepstral coefficients (and their first and second derivatives) extracted by sliding a 12 ms half-overlapping short-time window over the waveform of each song.
CCMixter - CCMixter is a singing voice separation dataset consisting of 50 full-length stereo tracks from ccMixter featuring many different musical genres. For each song there are three WAV files available: the background music, the voice signal, and their sum.
ChMusic - ChMusic is a traditional Chinese music dataset for training model and performance evaluation of musical instrument recognition. This dataset cover 11 musical instruments, consisting of Erhu, Pipa, Sanxian, Dizi, Suona, Zhuiqin, Zhongruan, Liuqin, Guzheng, Yangqin and Sheng.
chongchong-free - Chongchong Piano Downloader is a software for free downloading of Chongchong piano score, which can obtain the link of the score, analyze the content of the score, and export the file.
ComMU - ComMU has 11,144 MIDI samples that consist of short note sequences created by professional composers with their corresponding 12 metadata. This dataset is designed for a new task, combinatorial music generation which generate diverse and high-quality music only with metadata through auto-regressive language model.
DALI - DALI: a large Dataset of synchronised Audio, LyrIcs and vocal notes.
DadaGP - DadaGP is a new symbolic music dataset comprising 26,181 song scores in the GuitarPro format covering 739 musical genres, along with an accompanying tokenized format well-suited for generative sequence models such as the Transformer. The tokenized format is inspired by event-based MIDI encodings, often used in symbolic music generation models. The dataset is released with an encoder/decoder which converts GuitarPro files to tokens and back.
DeepScores - Synthetic dataset of 300000 annotated images of written music for object classification, semantic segmentation and object detection. Based on a large set of MusicXML documents that were obtained from MuseScore, a sophisticated pipeline is used to convert the source into LilyPond files, for which LilyPond is used to engrave and annotate the images.
dMelodies - dMelodies is dataset of simple 2-bar melodies generated using 9 independent latent factors of variation where each data point represents a unique melody based on the following constraints: - Each melody will correspond to a unique scale (major, minor, blues, etc.). - Each melody plays the arpeggios using the standard I-IV-V-I cadence chord pattern. - Bar 1 plays the first 2 chords (6 notes), Bar 2 plays the second 2 chords (6 notes). - Each played note is an 8th note.
DISCO-10M - DISCO-10M is a music dataset created to democratize research on large-scale machine learning models for music.
Dizi - Dizi is a dataset of music style of the Northern school and the Southern School. Characteristics include melody and playing techniques of the two different music styles are deconstructed.
EMOPIA - A Multi-Modal Pop Piano Dataset For Emotion Recognition and Emotion-based Music Generation. EMOPIA (pronounced ‘yee-mò-pi-uh’) dataset is a shared multi-modal (audio and MIDI) database focusing on perceived emotion in pop piano music, to facilitate research on various tasks related to music emotion. The dataset contains 1,087 music clips from 387 songs and clip-level emotion labels annotated by four dedicated annotators.
ErhuPT (Erhu Playing Technique Dataset) - This dataset is an audio dataset containing about 1500 audio clips recorded by multiple professional players.
FMA - The Free Music Archive (FMA) is a large-scale dataset for evaluating several tasks in Music Information Retrieval. It consists of 343 days of audio from 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres. It provides full-length and high-quality audio, pre-computed features, together with track- and user-level metadata, tags, and free-form text such as biographies.
GiantMIDI-Piano - GiantMIDI-Piano is a classical piano MIDI dataset contains 10,855 MIDI files of 2,786 composers. The curated subset by constraining composer surnames contains 7,236 MIDI files of 1,787 composers.
Groove (Groove MIDI Dataset) - The Groove MIDI Dataset (GMD) is composed of 13.6 hours of aligned MIDI and (synthesized) audio of human-performed, tempo-aligned expressive drumming. The dataset contains 1,150 MIDI files and over 22,000 measures of drumming.
Jazz Harmony Treebank - This repository contains the Jazz Harmony Treebank, a corpus of hierarchical harmonic analyses of jazz chord sequences selected from the iRealPro corpus published on zenodo by Shanahan et al.
jazznet - jazznet: A Dataset of Fundamental Piano Patterns for Music Audio Machine Learning Research. This paper introduces the jazznet Dataset, a dataset of fundamental jazz piano music patterns for developing machine learning (ML) algorithms in music information retrieval (MIR). The dataset contains 162520 labeled piano patterns, including chords, arpeggios, scales, and chord progressions with their inversions, resulting in more than 26k hours of audio and a total size of 95GB.
JS Fake Chorales - A MIDI dataset of 500 4-part chorales generated by the KS_Chorus algorithm, annotated with results from hundreds of listening test participants, with 300 further unannotated chorales.
LAKH MuseNet MIDI Dataset - Full LAKH MIDI dataset converted to MuseNet MIDI output format (9 instruments + drums).
Los Angeles MIDI Dataset - SOTA kilo-scale MIDI dataset for MIR and Music AI purposes.
MAESTRO - The MAESTRO dataset contains over 200 hours of paired audio and MIDI recordings from ten years of International Piano-e-Competition. The MIDI data includes key strike velocities and sustain/sostenuto/una corda pedal positions. Audio and MIDI files are aligned with ∼3 ms accuracy and sliced to individual musical pieces, which are annotated with composer, title, and year of performance. Uncompressed audio is of CD quality or higher (44.1–48 kHz 16-bit PCM stereo).
MagnaTagATune - MagnaTagATune dataset contains 25,863 music clips. Each clip is a 29-seconds-long excerpt belonging to one of the 5223 songs, 445 albums and 230 artists. The clips span a broad range of genres like Classical, New Age, Electronica, Rock, Pop, World, Jazz, Blues, Metal, Punk, and more. Each audio clip is supplied with a vector of binary annotations of 188 tags.
Main Dataset for "Evolution of Popular Music: USA 1960–2010" - This is a large file (~20MB) called EvolutionPopUSA_MainData.csv, in comma-separated data format with column headers. Each row corresponds to a recording. The file is viewable in any text editor, and can also be opened in Excel or imported to other data processing programs.
MetaMIDI Dataset - We introduce the MetaMIDI Dataset (MMD), a large scale collection of 436,631 MIDI files and metadata. In addition to the MIDI files, we provide artist, title and genre metadata that was collected during the scraping process when available. MIDIs in (MMD) were matched against a collection of 32,000,000 30-second audio clips retrieved from Spotify, resulting in over 10,796,557 audio-MIDI matches.
MIR-1K - MIR-1K (Multimedia Information Retrieval lab, 1000 song clips) is a dataset designed for singing voice separation.
MSD (Million Song Dataset) - The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks. The core of the dataset is the feature analysis and metadata for one million songs, provided by The Echo Nest.
MTG-Jamendo Dataset - We present the MTG-Jamendo Dataset, a new open dataset for music auto-tagging. It is built using music available at Jamendo under Creative Commons licenses and tags provided by content uploaders. The dataset contains over 55,000 full audio tracks with 195 tags from genre, instrument, and mood/theme categories. We provide elaborated data splits for researchers and report the performance of a simple baseline approach on five different sets of tags: genre, instrument, mood/theme, top-50, and overall.
MTG-Jamendo - The MTG-Jamendo dataset is an open dataset for music auto-tagging. The dataset contains over 55,000 full audio tracks with 195 tags categories (87 genre tags, 40 instrument tags, and 56 mood/theme tags). It is built using music available at Jamendo under Creative Commons licenses and tags provided by content uploaders. All audio is distributed in 320kbps MP3 format.
Music Data Sharing Platform for Computational Musicology Research (CCMUSIC DATASET) - This platform is a multi-functional music data sharing platform for Computational Musicology research. It contains many music datas such as the sound information of Chinese traditional musical instruments and the labeling information of Chinese pop music, which is available for free use by computational musicology researchers.
Music Emotion Recognition (MER) - We present a data set for the analysis of personalized Music Emotion Recognition (MER) systems. We developed the Music Enthusiasts platform aiming to improve the gathering and analysis of the so-called “ground truth” needed as input to such systems.
MUSAN - MUSAN is a corpus of music, speech and noise. This dataset is suitable for training models for voice activity detection (VAD) and music/speech discrimination. The dataset consists of music from several genres, speech from twelve languages, and a wide assortment of technical and non-technical noises.
MusicNet - MusicNet is a collection of 330 freely-licensed classical music recordings, together with over 1 million annotated labels indicating the precise time of each note in every recording, the instrument that plays each note, and the note's position in the metrical structure of the composition. The labels are acquired from musical scores aligned to recordings by dynamic time warping. The labels are verified by trained musicians; we estimate a labeling error rate of 4%. We offer the MusicNet labels to the machine learning and music communities as a resource for training models and a common benchmark for comparing results.
MusicCaps - MusicCaps is a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts.
MuseData - MuseData is an electronic library of orchestral and piano classical music from CCARH. It consists of around 3MB of 783 files.
MUSDB18 - The MUSDB18 is a dataset of 150 full lengths music tracks (~10h duration) of different genres along with their isolated drums, bass, vocals and others stems. The dataset is split into training and test sets with 100 and 50 songs, respectively. All signals are stereophonic and encoded at 44.1kHz.
Music Topics and Metadata - This dataset provides a list of lyrics from 1950 to 2019 describing music metadata as sadness, danceability, loudness, acousticness, etc. We also provide some informations as lyrics which can be used to natural language processing.
Music genres dataset - Dataset of 1494 genres, each containing 200 songs.
Multimodal Sheet Music Dataset - MSMD is a synthetic dataset of 497 pieces of (classical) music that contains both audio and score representations of the pieces aligned at a fine-grained level (344,742 pairs of noteheads aligned to their audio/MIDI counterpart).
Nlakh - Nlakh is a dataset for Musical Instrument Retrieval. It is a combination of the NSynth dataset, which provides a large number of instruments, and the Lakh dataset, which provides multi-track MIDI data.
NSynth - NSynth is a dataset of one shot instrumental notes, containing 305,979 musical notes with unique pitch, timbre and envelope. The sounds were collected from 1006 instruments from commercial sample libraries and are annotated based on their source (acoustic, electronic or synthetic), instrument family and sonic qualities. The instrument families used in the annotation are bass, brass, flute, guitar, keyboard, mallet, organ, reed, string, synth lead and vocal. Four second monophonic 16kHz audio snippets were generated (notes) for the instruments.
NES-MDB (Nintendo Entertainment System Music Database) - The Nintendo Entertainment System Music Database (NES-MDB) is a dataset intended for building automatic music composition systems for the NES audio synthesizer. It consists of 5278 songs from the soundtracks of 397 NES games. The dataset represents 296 unique composers, and the songs contain more than two million notes combined. It has file format options for MIDI, score and NLM (NES Language Modeling).
Niko Chord Progression Dataset - The Niko Chord Progression Dataset is used in AccoMontage2. It contains 5k+ chord progression pieces, labeled with styles. There are four styles in total: Pop Standard, Pop Complex, Dark and R&B.
Opencpop - Opencpop , a publicly available high-quality Mandarin singing corpus, is designed for singing voice synthesis (SVS) systems. This corpus consists of 100 unique Mandarin songs , which were recorded by a professional female singer. All audio files were recorded with studio-quality at a sampling rate of 44,100 Hz in a professional recording studio environment.
OpenGufeng - A melody and chord progression dataset for Chinese Gufeng Music.
POP909 - POP909 is a dataset which contains multiple versions of the piano arrangements of 909 popular songs created by professional musicians. The main body of the dataset contains the vocal melody, the lead instrument melody, and the piano accompaniment for each song in MIDI format, which are aligned to the original audio files. Furthermore, annotations are provided of tempo, beat, key, and chords, where the tempo curves are hand-labelled and others are done by MIR algorithms.
RWC (Real World Computing Music Database) - The RWC (Real World Computing) Music Database is a copyright-cleared music database (DB) that is available to researchers as a common foundation for research. It contains around 100 complete songs with manually labeled section boundaries. For the 50 instruments, individual sounds at half-tone intervals were captured with several variations of playing styles, dynamics, instrument manufacturers and musicians.
Sangeet - An XML Dataset for Hindustani Classical Music. SANGEET preserves all the required information of any given composition including metadata, structural, notational, rhythmic, and melodic information in a standardized way for easy and efficient storage and extraction of musical information. The dataset is intended to provide the ground truth information for music information research tasks, thereby supporting several data-driven analysis from a machine learning perspective.
Slakh2100 - The Synthesized Lakh (Slakh) Dataset is a dataset for audio source separation that is synthesized from the Lakh MIDI Dataset v0.1 using professional-grade sample-based virtual instruments. This first release of Slakh, called Slakh2100, contains 2100 automatically mixed tracks and accompanying MIDI files synthesized using a professional-grade sampling engine. The tracks in Slakh2100 are split into training (1500 tracks), validation (375 tracks), and test (225 tracks) subsets, totaling 145 hours of mixtures.
SymphonyNet - SymponyNet is an open-source project aiming to generate complex multi-track and multi-instrument music like symphony. Our method is fully compatible with other types of music like pop, piano, solo music..etc.
The Lakh MIDI Dataset - The Lakh MIDI dataset is a collection of 176,581 unique MIDI files, 45,129 of which have been matched and aligned to entries in the Million Song Dataset. Its goal is to facilitate large-scale music information retrieval, both symbolic (using the MIDI files alone) and audio content-based (using information extracted from the MIDI files as annotations for the matched audio files).
The Italian Music Dataset - The dataset is built by exploiting the Spotify and SoundCloud APIs. It is composed of over 14,500 different songs of both famous and less famous Italian musicians. Each song in the dataset is identified by its Spotify id and its title. Tracks' metadata include also lemmatized and POS-tagged lyrics and, in the most of cases, ten musical features directly gathered from Spotify. Musical features include acousticness (float), danceability (float), duration_ms (int), energy (float), instrumentalness (float), liveness (float), loudness (float), speechiness (float), tempo (float) and valence (float).
Universal Music Symbol Classifier - A Python project that trains a Deep Neural Network to distinguish between Music Symbols.
URMP (University of Rochester Multi-Modal Musical Performance) - URMP (University of Rochester Multi-Modal Musical Performance) is a dataset for facilitating audio-visual analysis of musical performances. The dataset comprises 44 simple multi-instrument musical pieces assembled from coordinated but separately recorded performances of individual tracks. For each piece the dataset provided the musical score in MIDI format, the high-quality individual instrument audio recordings and the videos of the assembled pieces.
VGMIDI Dataset - VGMIDI is a dataset of piano arrangements of video game soundtracks. It contains 200 MIDI pieces labelled according to emotion and 3,850 unlabeled pieces. Each labelled piece was annotated by 30 human subjects according to the Circumplex (valence-arousal) model of emotion.
Virtuoso Strings - Virtuoso Strings is a dataset for soft onsets detection for string instruments. It consists of over 144 recordings of professional performances of an excerpt from Haydn's string quartet Op. 74 No. 1 Finale, each with corresponding individual instrumental onset annotations.
YM2413-MDB - YM2413-MDB is an 80s FM video game music dataset with multi-label emotion annotations. It includes 669 audio and MIDI files of music from Sega and MSX PC games in the 80s using YM2413, a programmable sound generator based on FM. The collected game music is arranged with a subset of 15 monophonic instruments and one drum instrument.

Sound Effect

Animal Sound Dataset - This data consisting of 875 animal sounds contains 10 types of animal sounds. This animal sounds dataset consists 200 cat, 200 dog, 200 bird, 75 cow, 45 lion, 40 sheep, 35 frog, 30 chicken, 25 donkey, 25 monkey sounds.
AudioSet - Audioset is an audio event dataset, which consists of over 2M human-annotated 10-second video clips. These clips are collected from YouTube, therefore many of which are in poor-quality and contain multiple sound-sources. A hierarchical ontology of 632 event classes is employed to annotate these data, which means that the same sound could be annotated as different labels. For example, the sound of barking is annotated as Animal, Pets, and Dog. All the videos are split into Evaluation/Balanced-Train/Unbalanced-Train set.
AudioCaps - AudioCaps is a dataset of sounds with event descriptions that was introduced for the task of audio captioning, with sounds sourced from the AudioSet dataset. Annotators were provided the audio tracks together with category hints (and with additional video hints if needed).
BBC sound effects - There are 33,066 sound effects in the BBC sound effects dataset, with text descriptions. Genre: Mainly ambient sound. Every audio has a natural textual description.
DCASE 2016 - DCASE 2016 is a dataset for sound event detection. It consists of 20 short mono sound files for each of 11 sound classes (from office environments, like clearthroat, drawer, or keyboard), each file containing one sound event instance. Sound files are annotated with event on- and offset times, however silences between actual physical sounds (like with a phone ringing) are not marked and hence “included” in the event.
Environmental Audio Datasets - This page tries to maintain a list of datasets suitable for environmental audio research. In addition to the freely available dataset, also proprietary and commercial datasets are listed here for completeness. In addition to the datasets, also some of the on-line sound services are listed at the end of the page.
ESC-50 - The ESC-50 dataset is a labeled collection of 2000 environmental audio recordings suitable for benchmarking methods of environmental sound classification. It comprises 2000 5s-clips of 50 different classes across natural, human and domestic sounds, again, drawn from Freesound.org.
FAIR-Play - FAIR-Play is a video-audio dataset consisting of 1,871 video clips and their corresponding binaural audio clips recording in a music room. The video clip and binaural clip of the same index are roughly aligned.
FSD50K (Freesound Database 50K) - Freesound Dataset 50k (or FSD50K for short) is an open dataset of human-labeled sound events containing 51,197 Freesound clips unequally distributed in 200 classes drawn from the AudioSet Ontology. FSD50K has been created at the Music Technology Group of Universitat Pompeu Fabra. It consists mainly of sound events produced by physical sound sources and production mechanisms, including human sounds, sounds of things, animals, natural sounds, musical instruments and more.
FSDnoisy18k - The FSDnoisy18k dataset is an open dataset containing 42.5 hours of audio across 20 sound event classes, including a small amount of manually-labeled data and a larger quantity of real-world noisy data. The audio content is taken from Freesound, and the dataset was curated using the Freesound Annotator. The noisy set of FSDnoisy18k consists of 15,813 audio clips (38.8h), and the test set consists of 947 audio clips (1.4h) with correct labels. The dataset features two main types of label noise: in-vocabulary (IV) and out-of-vocabulary (OOV). IV applies when, given an observed label that is incorrect or incomplete, the true or missing label is part of the target class set. Analogously, OOV means that the true or missing label is not covered by those 20 classes.
FUSS (Free Universal Sound Separation) - The Free Universal Sound Separation (FUSS) dataset is a database of arbitrary sound mixtures and source-level references, for use in experiments on arbitrary sound separation. FUSS is based on FSD50K corpus.
Knocking Sound Effects With Emotional Intentions - The dataset was recorded by the professional foley artist Ulf Olausson at the FoleyWorks studios in Stockholm on the 15th October, 2019. Inspired by previous work on knocking sounds. we chose five type of emotions to be portrayed in the dataset: anger, fear, happiness, neutral and sadness.
MIMII - Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection (MIMII) is a sound dataset of industrial machine sounds.
Mivia Audio Events Dataset - The MIVIA audio event data set is compose of a total of 6000 event for surveillance application, namely glass breaking, gun shot and scream. The 6000 event are divide into a training set (compose of 4200 event) and a test set (compose of 1800 event).
Pitch Audio Dataset (Surge synthesizer) - 3.4 hours of audio synthesized using the open-source Surge synthesizer, based upon 2084 presets included in the Surge package. These represent ``natural'' synthesis sounds---i.e.presets devised by humans. We generated 4-second samples playing at velocity 64 with a note-on duration of 3 seconds. For each preset, we varied only the pitch, from MIDI 21--108, the range of a grand piano. Every sound in the dataset was RMS-level normalized using the normalize package. There was no elegant way to dedup this dataset; however only a small percentage of presets (like drums and sound effects) had no perceptual pitch variation or ordering.
SoundingEarth - SoundingEarth consists of co-located aerial imagery and audio samples all around the world.
STARSS22 (Sony-TAu Realistic Spatial Soundscapes 2022) - The Sony-TAu Realistic Spatial Soundscapes 2022(STARSS22) dataset consists of recordings of real scenes captured with high channel-count spherical microphone array (SMA). The recordings are conducted from two different teams at two different sites, Tampere University in Tammere, Finland, and Sony facilities in Tokyo, Japan. Recordings at both sites share the same capturing and annotation process, and a similar organization.
ToyADMOS - ToyADMOS dataset is a machine operating sounds dataset of approximately 540 hours of normal machine operating sounds and over 12,000 samples of anomalous sounds collected with four microphones at a 48kHz sampling rate, prepared by Yuma Koizumi and members in NTT Media Intelligence Laboratories.
TUT Sound Events 2017 - The TUT Sound Events 2017 dataset contains 24 audio recordings in a street environment and contains 6 different classes. These classes are: brakes squeaking, car, children, large vehicle, people speaking, and people walking.
UrbanSound8K - Urban Sound 8K is an audio dataset that contains 8732 labeled sound excerpts (<=4s) of urban sounds from 10 classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, enginge_idling, gun_shot, jackhammer, siren, and street_music. The classes are drawn from the urban sound taxonomy. All excerpts are taken from field recordings uploaded to www.freesound.org.
VGG-Sound - A large scale audio-visual dataset. VGG-Sound is an audio-visual correspondent dataset consisting of short clips of audio sounds, extracted from videos uploaded to YouTube.
Visually Indicated Sounds - Materials make distinctive sounds when they are hit or scratched — dirt makes a thud; ceramic makes a clink. These sounds reveal aspects of an object's material properties, as well as the force and motion of the physical interaction.

Name		Name	Last commit message	Last commit date
Latest commit History 115 Commits
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AI Audio Datasets List (AI-ADL) 🎵

Speech

Music

Sound Effect

And more

About

Releases

Packages

License

WhiteFu/ai-audio-datasets-list

Folders and files

Latest commit

History

Repository files navigation

AI Audio Datasets List (AI-ADL) 🎵

Speech

Music

Sound Effect

And more

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Packages