This is a web visualization and listening project based on scatter-gl, Web Audio, and TensorFlow Hub AI sound models. Sound datasets are usually explored visually, mainly through spectrograms or other time-frequency representations. This project is an effort to show how quickly you can get an overview of a dataset by listening to it arranged in a similarity space.
Python preprocessing scripts are provided to handle a single long audio file or a dataset of sound clips stored in a single folder, such as ESC-50 or UrbanSound8K. The similarity space is generated with YAMNet, which is easily available through the TensorFlow Hub repository.
- ESC-50: 2000 clips in 50 sound categories.
- UrbanSound8K fold1: 875 clips in 10 sound categories.
- UrbanSound8K fold2: 888 clips in 10 sound categories.
- When you hover over a point or spectrogram image, you'll hear the clip and its metadata will appear on the left side of the screen.
- When you click a point, you'll hear the clip looped 4 times.
- Making a selection plays a random clip from the selected ones.
Note: You'll have to wait until the page loads the sound file before you can hear the clips.
The webpage needs 4 files to render the sound dataset.
- config.json: Stores the metadata of the dataset and the paths to the data files.
- projections.json: Stores the 3D projections of the YAMNet embeddings, the labels, and other useful metadata of the clips.
- sprite.jpg: The spritesheet image with the log-mel spectrogram that the YAMNet model uses for each clip of the dataset.
- audio.flac: The "spriteclip" audio with all the dataset clips merged.
The Python preprocessing script receives as input either a path to a folder that contains audio clips or a path to a long audio file.
Audio clips in a folder (e.g. ESC-50, UrbanSound8K)
cd preprocess
python preprocess.py -d <path_to_audio_folder>
This will try to:
- Load all the audio files in the folder.
- Extract a 0.96-second region (the YAMNet analysis window size) around the signal's maximum amplitude. Clips shorter than 0.96 seconds are padded with zeros.
- Merge all trimmed clips, and resample the merged audio to the sample rate the model expects (16 kHz).
- Get the YAMNet embeddings and log-mel spectrogram of the merged signal. Note: odd-index embeddings are discarded to get one embedding per clip, which avoids clip-wise aggregation and per-clip inference.
- Compute audio descriptors and parse labels from the clip filenames for the metadata.
- Reduce the dimensionality of the YAMNet embeddings (1024) to 3 components.
- Generate the spritesheet image, the sprite clip, and the projections file.
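The trimming and frame-selection steps above can be sketched as follows. This is a minimal illustration, not the code in preprocess.py: the function names `trim_around_peak` and `one_embedding_per_clip` are hypothetical, and the sketch assumes the window is centred on the peak sample. The even-index selection relies on YAMNet's 0.96 s window and 0.48 s hop, so in a merged signal of 0.96 s clips, frame `2*i` spans exactly clip `i` while odd frames straddle two clips.

```python
import numpy as np

SAMPLE_RATE = 16_000     # sample rate YAMNet expects (Hz)
WINDOW_SECONDS = 0.96    # YAMNet analysis window size


def trim_around_peak(signal, sample_rate=SAMPLE_RATE, seconds=WINDOW_SECONDS):
    """Return a fixed-length region around the loudest sample (hypothetical helper)."""
    signal = np.asarray(signal, dtype=np.float32)
    window = int(round(sample_rate * seconds))
    if len(signal) <= window:
        # Short clip: keep everything and pad the tail with zeros.
        return np.pad(signal, (0, window - len(signal)))
    peak = int(np.argmax(np.abs(signal)))
    # Assumption: centre the window on the peak, clamped to the signal bounds.
    start = min(max(peak - window // 2, 0), len(signal) - window)
    return signal[start:start + window]


def one_embedding_per_clip(embeddings, n_clips):
    """Keep even-indexed frames so each clip gets exactly one embedding."""
    return np.asarray(embeddings)[::2][:n_clips]
```

With these helpers, the merged signal would be `np.concatenate([trim_around_peak(c) for c in clips])`, and the array of frame embeddings returned by the model would be reduced with `one_embedding_per_clip(embeddings, len(clips))`.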
A label-parsing function and a label list have to be defined and passed as arguments to the process_clips_from_folder function. Examples are provided for the ESC-50 and UrbanSound8K datasets.
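As a rough idea of what such a label-parsing function might look like, here is a sketch based on the public filename conventions of both datasets (`<fsID>-<classID>-<occurrenceID>-<sliceID>.wav` for UrbanSound8K and `<fold>-<clipID>-<take>-<target>.wav` for ESC-50). The function names and signatures are assumptions for illustration; check the examples shipped with the project for the signature that process_clips_from_folder actually expects.

```python
from pathlib import Path

# UrbanSound8K class names, indexed by the classID field of the filename.
URBANSOUND8K_LABELS = [
    "air_conditioner", "car_horn", "children_playing", "dog_bark", "drilling",
    "engine_idling", "gun_shot", "jackhammer", "siren", "street_music",
]


def parse_urbansound8k_label(filename, labels=URBANSOUND8K_LABELS):
    """Map an UrbanSound8K filename to its class name (hypothetical helper)."""
    class_id = int(Path(filename).stem.split("-")[1])
    return labels[class_id]


def parse_esc50_target(filename):
    """Return the numeric target encoded in an ESC-50 filename (hypothetical helper)."""
    return int(Path(filename).stem.split("-")[3])
```

For example, `parse_urbansound8k_label("100032-3-0-0.wav")` yields `"dog_bark"`, and `parse_esc50_target("1-100032-A-0.wav")` yields `0`, which would then index into a 50-entry ESC-50 label list.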
cd preprocess
python preprocess.py -f <path_to_long_audio>
This will try to:
- Load the long audio.
- Resample it to the sample rate the model expects (16 kHz).
- Get the YAMNet embeddings and log-mel spectrogram of the signal.
- Compute audio descriptors and, for the metadata, generate display filenames from the starting second of each segment.
- Reduce the dimensionality of the YAMNet embeddings (1024) to 3 components.
- Generate the spritesheet image, the sprite clip, and the projections file.
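The filename-generation step for long audio can be illustrated with a small sketch. It assumes one segment per YAMNet frame at the model's default hop of 0.48 s; the function name, the `prefix` argument, and the naming pattern are hypothetical and may differ from what the script writes. Pass `hop_seconds=0.96` if only non-overlapping segments are kept.

```python
from typing import List


def segment_names(n_segments: int, hop_seconds: float = 0.48,
                  prefix: str = "segment") -> List[str]:
    """Build one display name per segment from its starting second (hypothetical helper)."""
    return [f"{prefix}_{i * hop_seconds:.2f}s" for i in range(n_segments)]


names = segment_names(3)
```

Here `names` holds one entry per segment, each labelled with the time at which that segment starts in the original recording.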
The generated files are stored in the data folder of the project. Once they are generated, you can pass the name of your dataset as a URL argument and it will be rendered.
Note: Remember that the space rendered in the page is a projection into 3 components that uses either UMAP (default) or t-SNE, each with its advantages and caveats. Please read "Understanding UMAP" and "How to Use t-SNE Effectively".
This project relies on the work done in GSoC 2021 with Orcasound on exploring sound datasets in embedding spaces. Big thanks to my mentors. Also, if you're a Spanish speaker, I recommend this course developed by Irán Roman, who introduced me to this exciting field of sound and ML and continues to teach me about it.