extract-jukebox-embeddings

A notebook for extracting embeddings from OpenAI's Jukebox model, following the approach described in Castellon et al. (2021) with the modifications adopted in Spotify's LLark paper:

Source: Output of the 36th layer of the Jukebox encoder
Original Jukebox encoding: 4800-dimensional vectors at 345 Hz
Chunking: audio is split into 25-second clips, the longest input Jukebox accepts, so embeddings are produced per clip; clips shorter than 25 seconds are padded before being passed through Jukebox
Approach: mean-pooling within 100 ms frames, resulting in:
- Downsampled frequency: 10 Hz
- Embedding size: roughly 1.2 × 10^6 values for a 25 s audio clip
- 2D array shape for a 25 s audio clip: [240, 4800]
This method retains temporal information while reducing the embedding size.
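
The chunking and pooling steps above can be sketched in NumPy as follows. This is a minimal illustration and not the notebook's actual code: the helper names `chunk_and_pad` and `mean_pool` are hypothetical, zero-padding is an assumption (the text only says clips are "padded"), and the raw length of 8625 frames is simply 25 s × 345 Hz. The toy arrays use a small sample rate and feature width so the example runs quickly.

```python
import numpy as np

CLIP_SECONDS = 25  # maximum clip length stated in this README


def chunk_and_pad(audio, sample_rate, clip_seconds=CLIP_SECONDS):
    """Split 1-D audio into fixed-length clips, zero-padding the last one.

    Zero-padding is an assumption; the README does not say how clips are padded.
    """
    clip_len = int(round(clip_seconds * sample_rate))
    n_clips = max(1, -(-len(audio) // clip_len))  # ceiling division
    padded = np.zeros(n_clips * clip_len, dtype=audio.dtype)
    padded[: len(audio)] = audio
    return padded.reshape(n_clips, clip_len)


def mean_pool(activations, n_frames):
    """Average a [T, D] array over n_frames near-equal consecutive time windows."""
    if n_frames > activations.shape[0]:
        raise ValueError("n_frames cannot exceed the number of time steps")
    windows = np.array_split(activations, n_frames, axis=0)
    return np.stack([w.mean(axis=0) for w in windows])


# Toy audio: 60 s at a hypothetical 100 Hz sample rate, giving three 25 s clips.
audio = np.ones(60 * 100, dtype=np.float32)
clips = chunk_and_pad(audio, sample_rate=100)
print(clips.shape)  # (3, 2500)

# Toy activations for one clip: 25 s * 345 Hz = 8625 time steps.
# D is 8 here to keep the array small; the real vectors are 4800-dimensional.
activations = np.random.rand(8625, 8)
pooled = mean_pool(activations, n_frames=240)  # 240 taken from the shape quoted above
print(pooled.shape)  # (240, 8)
```

With 4800-dimensional activations in place of the toy width, the same call yields the [240, 4800] array described above, which holds 240 × 4800 = 1,152,000 values.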

Having a Colab notebook for this gives us an easily reproducible environment and lets us take advantage of the inexpensive T4 GPUs that Colab offers.
