
Spectrogrand: Generating interesting audiovisuals for text prompts.


vijay-jaisankar/spectrogrand


Spectrogrand


Architecture Diagram


About the project

Spectrograms are visual representations of audio samples, often used in engineering applications as features for downstream tasks. We unlock the artistic value of spectrograms and use them in both scientific and artistic domains through Spectrogrand: a pipeline that generates interesting melspectrogram-driven audiovisuals from text topic prompts. We also build lightweight, domain-driven computational creativity assessment into the steps of the generation process.

The pipeline has the following steps:

  • We use audioldm2-music to generate multiple candidate house music songs for the topic text prompt. We then estimate each candidate's novelty relative to human-generated house music songs (collected from the HouseX dataset) and its value through its danceability score, calculated using Essentia. We select the song with the highest equiweighted score for our pipeline.
  • Then, we generate melspectrograms for the song as a whole, and for periodic chunks of the sample. These numerous images convey local intensity and temporal diversity scattered throughout different zones of the song.
  • We use the parent spectrogram to deduce the genre of the song. Our ResNet-101-based model with augmented train-time transforms is the current SOTA on the HouseX-full-image task 🥳
  • We then use stable-diffusion-xl-base-1.0 to generate candidate album covers for this song. The selected genre defines and augments the prompts by selecting base colours and descriptor words. We then estimate each candidate's value and surprisingness based on its aestheticness and on how likely it is to fool a strong custom classifier (trained on human-generated and AI-generated album covers) into believing that the candidate is human-generated. We select the image with the highest equiweighted score for our pipeline.
  • We then use magenta to perform arbitrary image style transfer between the selected album cover image and each song chunk's melspectrogram.
  • At the end of the pipeline, one can hence generate a static video and two spectrogram-driven audiovisual videos. As an additional feature ✨, we also support stable-video-diffusion-img2vid-xt to automatically generate a music video of arbitrary length conditioned on the chosen album cover image.
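
The selection and chunking steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the `Candidate` container, the function names, and the chunk length are all hypothetical, and the sketch assumes both creativity scores have already been computed and scaled to a comparable range (for example 0 to 1), so that an equal-weight average is meaningful.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical container for a generated song or album cover."""
    name: str
    novelty: float  # novelty (songs) or surprisingness (album covers)
    value: float    # e.g. danceability (songs) or aestheticness (covers)


def equiweighted_score(candidate: Candidate) -> float:
    """Average the two scores with equal weights.

    Assumes both scores are already on a comparable scale.
    """
    return 0.5 * candidate.novelty + 0.5 * candidate.value


def select_best(candidates: list[Candidate]) -> Candidate:
    """Return the candidate with the highest equiweighted score."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    return max(candidates, key=equiweighted_score)


def chunk_bounds(duration_s: float, chunk_s: float) -> list[tuple[float, float]]:
    """Split [0, duration_s] into consecutive (start, end) windows.

    Each window is chunk_s seconds long, except possibly the last one,
    which is truncated at the end of the song.
    """
    if duration_s <= 0 or chunk_s <= 0:
        raise ValueError("duration_s and chunk_s must be positive")
    n_chunks = math.ceil(duration_s / chunk_s)
    return [
        (float(i * chunk_s), float(min((i + 1) * chunk_s, duration_s)))
        for i in range(n_chunks)
    ]
```

For instance, `select_best` can be called once on the candidate songs and once on the candidate album covers, while `chunk_bounds` gives the time windows for which per-chunk melspectrograms would be computed. How the real pipeline normalises its scores or sizes its chunks is not stated here, so treat those details as placeholders.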

Key contributions


Getting Started

To run the pipeline on Kaggle, please review the instructions listed in the Kaggle data release and check out the notebook (also linked in the dataset).

To run the pipeline locally, please follow the steps detailed in this notebook.


Outputs of Spectrogrand


Prompt topic: Futuristic Spaceship

Static video

static.mp4

Dynamic videos

dynamic1.mp4
dynamic2.mp4

Prompt topic: Dystopian Robotic World

(💡 Inspiration from Twitter)

Static video

static.mp4

Dynamic videos

dynamic1.mp4
dynamic2.mp4

Prompt topic: Computer Vision

Static video

static.mp4

Dynamic videos

dynamic1.mp4
dynamic2.mp4

Acknowledgements and Contact Details

This project was done under the guidance of Prof. Dinesh Babu Jayagopi.

Corresponding email: vijay.jaisankar@iiitb.ac.in
