Watching a sound wave or unrelated images for a podcast published on a platform that supports video and images can be dull.
We propose asking a trio of AI models, bundled in a Unity project, to transcribe the audio to text and generate contextual images closely tied to the transcribed text.
We run two of the AI models locally, Whisper-Tiny and Stable Diffusion with its U-Net architecture; we access a third, ChatGPT, remotely via its API.
In a Unity scene we loop the AI models over each podcast audio section to generate the contextual images.
Image generation speed demo: Talkomic-tecshift-image-gen-speed.30fps.mp4
Watch The Trailer🎬
Talkomic_trailer.30fps.mp4
I am new to AI, keen to tinker and learn!💥 The prototype is a good starting point, a proof of concept to test the ability of AI models to help audio media extend its reach.
I am thrilled and truly grateful to Maurizio Raffone at Tech Shift F9 Podcast for trusting me to run a proof of concept of the Talkomic app prototype with the audio of a fantastic episode of his podcast.
- Watch The Complete Podcast with AI Images 📽️
- View and download the Podcast's AI Image Gallery🎨
- See the Podcast's AI Images in Augmented Reality😎 with the Tapgaze app
Finally, once the models have generated all the images, we upscale them from 512×512 to a crisper 2048×2048 resolution with the Real-ESRGAN AI model. Suggested implementation steps are in our blog.
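As a rough illustration, here is a minimal sketch of turning the upscaler's output back into a Unity texture. It assumes a 1×3×2048×2048 CHW float tensor with values in 0..1, a common layout for Real-ESRGAN ONNX exports; the class name is ours, not the repo's:

```csharp
// Sketch: convert an upscaled CHW float tensor (assumed 0..1 range) into a Texture2D.
using UnityEngine;

public static class TensorToTexture
{
    public static Texture2D ToTexture(float[] chw, int width = 2048, int height = 2048)
    {
        var tex = new Texture2D(width, height, TextureFormat.RGBA32, false);
        var pixels = new Color32[width * height];
        int plane = width * height; // size of one color channel plane

        for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
        {
            int i = y * width + x;
            byte r = (byte)(Mathf.Clamp01(chw[i]) * 255f);
            byte g = (byte)(Mathf.Clamp01(chw[plane + i]) * 255f);
            byte b = (byte)(Mathf.Clamp01(chw[2 * plane + i]) * 255f);
            pixels[i] = new Color32(r, g, b, 255);
        }

        // Note: Unity textures are bottom-up; a vertical flip may be needed
        // depending on the model's row order.
        tex.SetPixels32(pixels);
        tex.Apply();
        return tex;
    }
}
```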
This is a prototype repo for a proof of concept. Read the Talkomic app blog for the suggested steps to build the project in Unity:
- Convert the whisper-tiny text-transcription AI model to ONNX format using Olive
- Process chunked podcast audio for Whisper
- Make ChatGPT API requests (see the sketch after this list)
- Discussion of the Stable Diffusion model implementation in Unity
- Get crisper images with the Real-ESRGAN AI model
- Links to technical articles and much more
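As a hedged sketch, a ChatGPT request from a Unity coroutine might look like the following. The endpoint and JSON shape follow OpenAI's public chat-completions API; the class name, prompt text, and model choice are placeholders, not the repo's actual code:

```csharp
// Minimal sketch: send a transcribed section to the chat-completions API
// from a Unity coroutine and log the raw JSON response.
using System.Collections;
using UnityEngine;
using UnityEngine.Networking;

public class ChatgptRequestSketch : MonoBehaviour
{
    public IEnumerator RequestImageDescription(string apiKey, string transcribedText)
    {
        // Hand-built JSON body; a proper JSON serializer is safer in practice.
        string body = "{\"model\":\"gpt-3.5-turbo\",\"messages\":[{\"role\":\"user\",\"content\":\"" +
                      "Describe an image for this text: " + transcribedText.Replace("\"", "'") + "\"}]}";

        using var request = new UnityWebRequest("https://api.openai.com/v1/chat/completions", "POST");
        request.uploadHandler = new UploadHandlerRaw(System.Text.Encoding.UTF8.GetBytes(body));
        request.downloadHandler = new DownloadHandlerBuffer();
        request.SetRequestHeader("Content-Type", "application/json");
        request.SetRequestHeader("Authorization", "Bearer " + apiKey);

        yield return request.SendWebRequest();

        if (request.result == UnityWebRequest.Result.Success)
            Debug.Log(request.downloadHandler.text); // parse the image description from this JSON
        else
            Debug.LogError(request.error);
    }
}
```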
This project has been updated to build for Windows. See updates.
The AI models in the Unity project of this repo are powered by Microsoft's cross-platform OnnxRuntime.
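For reference, the basic OnnxRuntime C# pattern used with such models looks roughly like this; the model path, tensor name, and class name here are placeholders, not the repo's actual code:

```csharp
// Minimal OnnxRuntime usage sketch: create a session, feed a named input
// tensor, and read the output as a flat float array.
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

public static class OnnxExample
{
    public static float[] Run(string modelPath, float[] data, int[] shape, string inputName)
    {
        // In practice the session is created once and cached, not per call.
        using var session = new InferenceSession(modelPath);
        var tensor = new DenseTensor<float>(data, shape);
        var inputs = new List<NamedOnnxValue> { NamedOnnxValue.CreateFromTensor(inputName, tensor) };
        using var results = session.Run(inputs);
        return results.First().AsEnumerable<float>().ToArray();
    }
}
```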
Native dlls (OnnxRuntime, NAudio, etc.) required: the project should include the following packages in Visual Studio (tested in VS2022 v17.7.3) and the dlls in Unity's Assets/Plugins directory.
Clone and save the weights.pb weights file into Assets/StreamingAssets/Models/unet/. This step is also required for this repo's Release package (the file is too large to include). Should the model download be unavailable, try here.
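For reference, a sketch of how that weights file would be resolved at runtime via Unity's StreamingAssets path (the helper name is ours):

```csharp
// The weights.pb saved above is expected at this runtime path.
using System.IO;
using UnityEngine;

public static class ModelPaths
{
    public static string UnetWeights =>
        Path.Combine(Application.streamingAssetsPath, "Models/unet/weights.pb");
}
```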
Podcast Audio Section List Required: In script TalkomicManager.cs, at GenerateSummaryAndTimesAudioQueueAndDirectories(), create a list entry for each section of the podcast audio with the section_name and its start time in minutes:seconds.
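A hypothetical sketch of such a list; the struct, field names, and the section names/times are illustrative only, and the real entries for your podcast will differ:

```csharp
// Illustrative per-section list: one entry per podcast section with its
// name and start time (minutes:seconds).
using System.Collections.Generic;

public struct PodcastSection
{
    public string sectionName;
    public string startTime; // "minutes:seconds"

    public PodcastSection(string name, string time) { sectionName = name; startTime = time; }
}

public static class SectionListExample
{
    public static readonly List<PodcastSection> Sections = new List<PodcastSection>
    {
        new PodcastSection("Intro",      "0:00"),
        new PodcastSection("Guest chat", "2:15"),
        new PodcastSection("Wrap-up",    "28:40"),
    };
}
```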
Podcast Audio Chunks: The Whisper model is designed to work on audio samples of up to 30 s in duration. Hence we chunk each section's podcast audio into chunks of at most 30 seconds and load them as a queue into Whisper-tiny for each podcast section.
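A minimal sketch of the chunking idea, assuming we read the AudioClip's raw samples and split on a 30-second boundary; resampling to the 16 kHz mono that Whisper expects is omitted, and the class name is ours:

```csharp
// Split an AudioClip's samples into <=30 s chunks and queue them for Whisper.
using System.Collections.Generic;
using UnityEngine;

public static class AudioChunker
{
    public static Queue<float[]> ChunkClip(AudioClip clip, int maxSeconds = 30)
    {
        // clip.samples is per-channel, so total float count is samples * channels.
        var all = new float[clip.samples * clip.channels];
        clip.GetData(all, 0);

        int samplesPerChunk = clip.frequency * clip.channels * maxSeconds;
        var queue = new Queue<float[]>();

        for (int offset = 0; offset < all.Length; offset += samplesPerChunk)
        {
            int len = Mathf.Min(samplesPerChunk, all.Length - offset);
            var chunk = new float[len];
            System.Array.Copy(all, offset, chunk, 0, len);
            queue.Enqueue(chunk);
        }
        return queue;
    }
}
```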
AI Generated Images: Shown in the scene along with the transcribed text and the ChatGPT image description.
N.B. A black image is most likely caused by the not-safe-for-work filter being triggered.
Scene Control Input Variables:
Script: TalkomicManager.cs:
- pathToAudioFile: full path to the podcast audio file. The audio file must be in sync with the list of section names and start times created in coroutine GenerateSummaryAndTimesAudioQueueAndDirectories(). For example, in the case of the Tech Shift F9 E8 podcast, the sections were broken out by the host along with their start times. You need to create these sections for your own podcast file. Unity will chunk each section's audio into wav files of at most 30 seconds; each section, with as many 30-second chunks as it requires, is patched together and transcribed. Each transcribed section is then sent to ChatGPT to generate a description of an image for that section's text.
- custom_chatgpt_pre_prompt: Text prepended to the transcribed message sent to ChatGPT to guide ChatGPT's response.
- custom_diffuser_pre_prompt: Text prepended to the Stable Diffusion prompt for the image to be created. It guides the Stable Diffusion result.
- limitChatGPTResponseWordCount: Trims the prompt sent to Stable Diffusion to 50 words to avoid the prompt-length limit exception.
- maxChatgptRequestedResponseWords: Maximum number of words ChatGPT is asked to respond with.
- numStableDiffusionImages: Number of images generated from a single ChatGPT image-description prompt.
- steps: Number of Stable Diffusion denoising steps.
- ClassifierFreeGuidanceScaleValue: Stable Diffusion classifier-free guidance scale (see the sketch after this list).
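For context, standard classifier-free guidance blends the unconditional and text-conditioned noise predictions at each denoising step; a higher scale follows the prompt more closely at the cost of image diversity. A minimal sketch (our own helper, not the repo's code):

```csharp
// Standard classifier-free guidance step:
// guided = uncond + scale * (cond - uncond)
public static class GuidanceSketch
{
    public static float[] Apply(float[] uncond, float[] cond, float scale)
    {
        var guided = new float[uncond.Length];
        for (int i = 0; i < uncond.Length; i++)
            guided[i] = uncond[i] + scale * (cond[i] - uncond[i]);
        return guided;
    }
}
```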
ChatGPT Scriptable Object API Credentials and Request Arguments Data: Script ChatgptCreds.cs
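A hypothetical sketch of the shape such a ScriptableObject might take; the actual fields in ChatgptCreds.cs may differ:

```csharp
// Hypothetical credentials/request-arguments asset; field names are assumptions.
using UnityEngine;

[CreateAssetMenu(fileName = "ChatgptCreds", menuName = "Talkomic/Chatgpt Credentials")]
public class ChatgptCredsSketch : ScriptableObject
{
    public string apiKey;                  // set at runtime via the UI; avoid committing keys
    public string model = "gpt-3.5-turbo"; // request argument: model name
    public int maxTokens = 256;            // request argument: response length cap
    public float temperature = 0.7f;       // request argument: sampling temperature
}
```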
Key Additions:
- Audio file is loaded from the UI (see the sketch after this list)
- Project is a work in progress; audio sections must still be entered manually before the build. For a working demo, upload the audio sample sampleaudio.wav, which corresponds to the audio section entered in TalkomicManager.cs.
- ChatGPT credentials are entered in the UI at runtime
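A minimal sketch of loading a picked .wav into an AudioClip; the file path would come from a picker such as yasirkula's Simple File Browser (credited below), whose call we omit here, and the class name is ours:

```csharp
// Load a user-selected .wav from disk into an AudioClip.
using System.Collections;
using UnityEngine;
using UnityEngine.Networking;

public class AudioLoaderSketch : MonoBehaviour
{
    public IEnumerator LoadWav(string absolutePath, System.Action<AudioClip> onLoaded)
    {
        string uri = "file://" + absolutePath;
        using var request = UnityWebRequestMultimedia.GetAudioClip(uri, AudioType.WAV);
        yield return request.SendWebRequest();

        if (request.result == UnityWebRequest.Result.Success)
            onLoaded(DownloadHandlerAudioClip.GetContent(request));
        else
            Debug.LogError(request.error);
    }
}
```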
Windows runtime video demo:
Win-Demo-Talkomic.mp4
Unity version: Unity 2021.3.26f1.
This prototype has been tested in the Unity Editor and on a Windows 11 build.
Tested on a Windows 11 system with 64 GB RAM, an NVIDIA GeForce RTX 3090 Ti 24 GB GPU, and a 12th Gen Intel i9-12900K CPU (3400 MHz, 16 cores).
This project is licensed under the MIT License. See LICENSE.txt for more information.
Special thanks to Jason Gauci, co-host of the Programming Throwdown podcast, whose idea shared on the show served as inspiration for this prototype.
We also thank @sd-akashic, @Haoming02, and @Microsoft for helping us better understand the onnxruntime implementation in Unity,
ai-forever for the Real-ESRGAN git repo,
and yasirkula for the Simple File Browser.
If you find this helpful you can buy me a coffee :)