Watching a sound wave or unrelated images for a podcast published on a platform that supports video and images can be dull.
We propose asking a trio of AI models, bundled in a Unity project, to transcribe the audio to text and generate contextual images closely tied to the transcribed text.
We run two of the AI models locally, Whisper-Tiny and Stable Diffusion with its U-Net architecture; we access a third, ChatGPT, remotely via its API.
In a Unity scene we loop the AI models over each podcast audio section to generate the contextual images.
Image generation speed demo: Talkomic-tecshift-image-gen-speed.30fps.mp4
Watch The Trailer🎬
Talkomic_trailer.30fps.mp4
I am new to AI, keen to tinker and learn!💥 The prototype is a good starting point, a proof of concept to test the ability of AI models to help audio media extend its reach.
I am thrilled and truly grateful to Maurizio Raffone at Tech Shift F9 Podcast for trusting me to run a proof of concept of the Talkomic app prototype with the audio of a fantastic episode of his podcast.
- Watch The Complete Podcast with AI Images 📽️
- View and download the Podcast's AI Image Gallery🎨
- See the Podcast's AI Images in Augmented Reality😎 with the Tapgaze app
Finally, once the models have generated all the images, we upscale them from 512×512 to a crisper 2048×2048 resolution with the Real-ESRGAN AI model. Suggested implementation steps are in our blog.
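As a rough illustration, here is a minimal sketch of turning the upscaler's output back into a Unity texture. It assumes a 1×3×2048×2048 CHW float tensor with values in 0..1, a common layout for Real-ESRGAN ONNX exports; the class name is ours, not the repo's:

```csharp
// Sketch: convert an upscaled CHW float tensor (assumed 0..1 range) into a Texture2D.
using UnityEngine;

public static class TensorToTexture
{
    public static Texture2D ToTexture(float[] chw, int width = 2048, int height = 2048)
    {
        var tex = new Texture2D(width, height, TextureFormat.RGBA32, false);
        var pixels = new Color32[width * height];
        int plane = width * height; // size of one color channel plane

        for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
        {
            int i = y * width + x;
            byte r = (byte)(Mathf.Clamp01(chw[i]) * 255f);
            byte g = (byte)(Mathf.Clamp01(chw[plane + i]) * 255f);
            byte b = (byte)(Mathf.Clamp01(chw[2 * plane + i]) * 255f);
            pixels[i] = new Color32(r, g, b, 255);
        }

        // Note: Unity textures are bottom-up; a vertical flip may be needed
        // depending on the model's row order.
        tex.SetPixels32(pixels);
        tex.Apply();
        return tex;
    }
}
```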
This is a prototype repo for a proof of concept. Read the Talkomic app blog for the suggested steps to build the project in Unity:
- Convert the whisper-tiny text-transcription AI model to ONNX format using Olive
- Process chunked podcast audio for Whisper
- Make ChatGPT API requests (see the sketch after this list)
- Discussion of the Stable Diffusion model implementation in Unity
- Get crisper images with the Real-ESRGAN AI model
- Links to technical articles and much more
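As a hedged sketch, a ChatGPT request from a Unity coroutine might look like the following. The endpoint and JSON shape follow OpenAI's public chat-completions API; the class name, prompt text, and model choice are placeholders, not the repo's actual code:

```csharp
// Minimal sketch: send a transcribed section to the chat-completions API
// from a Unity coroutine and log the raw JSON response.
using System.Collections;
using UnityEngine;
using UnityEngine.Networking;

public class ChatgptRequestSketch : MonoBehaviour
{
    public IEnumerator RequestImageDescription(string apiKey, string transcribedText)
    {
        // Hand-built JSON body; a proper JSON serializer is safer in practice.
        string body = "{\"model\":\"gpt-3.5-turbo\",\"messages\":[{\"role\":\"user\",\"content\":\"" +
                      "Describe an image for this text: " + transcribedText.Replace("\"", "'") + "\"}]}";

        using var request = new UnityWebRequest("https://api.openai.com/v1/chat/completions", "POST");
        request.uploadHandler = new UploadHandlerRaw(System.Text.Encoding.UTF8.GetBytes(body));
        request.downloadHandler = new DownloadHandlerBuffer();
        request.SetRequestHeader("Content-Type", "application/json");
        request.SetRequestHeader("Authorization", "Bearer " + apiKey);

        yield return request.SendWebRequest();

        if (request.result == UnityWebRequest.Result.Success)
            Debug.Log(request.downloadHandler.text); // parse the image description from this JSON
        else
            Debug.LogError(request.error);
    }
}
```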
This project has been updated to build for Windows. See updates.
The AI models in the Unity project of this repo are powered by Microsoft's cross-platform OnnxRuntime.
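For reference, the basic OnnxRuntime C# pattern used with such models looks roughly like this; the model path, tensor name, and class name here are placeholders, not the repo's actual code:

```csharp
// Minimal OnnxRuntime usage sketch: create a session, feed a named input
// tensor, and read the output as a flat float array.
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

public static class OnnxExample
{
    public static float[] Run(string modelPath, float[] data, int[] shape, string inputName)
    {
        // In practice the session is created once and cached, not per call.
        using var session = new InferenceSession(modelPath);
        var tensor = new DenseTensor<float>(data, shape);
        var inputs = new List<NamedOnnxValue> { NamedOnnxValue.CreateFromTensor(inputName, tensor) };
        using var results = session.Run(inputs);
        return results.First().AsEnumerable<float>().ToArray();
    }
}
```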
Native dlls (OnnxRuntime, NAudio, etc.) required: the project should include the following packages in Visual Studio (tested in VS2022 v17.7.3) and the dlls in Unity's Assets/Plugins directory.
Clone and save the weights.pb weights file into Assets/StreamingAssets/Models/unet/. This step is also required for this repo's Release package (the file is too large to include). Should the model download be unavailable, try here.
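For reference, a sketch of how that weights file would be resolved at runtime via Unity's StreamingAssets path (the helper name is ours):

```csharp
// The weights.pb saved above is expected at this runtime path.
using System.IO;
using UnityEngine;

public static class ModelPaths
{
    public static string UnetWeights =>
        Path.Combine(Application.streamingAssetsPath, "Models/unet/weights.pb");
}
```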
Podcast Audio Section List Required: In script TalkomicManager.cs, at GenerateSummaryAndTimesAudioQueueAndDirectories(), create a list entry for each section of the podcast audio with the section_name and its start time in minutes:seconds.
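A hypothetical sketch of such a list; the struct, field names, and the section names/times are illustrative only, and the real entries for your podcast will differ:

```csharp
// Illustrative per-section list: one entry per podcast section with its
// name and start time (minutes:seconds).
using System.Collections.Generic;

public struct PodcastSection
{
    public string sectionName;
    public string startTime; // "minutes:seconds"

    public PodcastSection(string name, string time) { sectionName = name; startTime = time; }
}

public static class SectionListExample
{
    public static readonly List<PodcastSection> Sections = new List<PodcastSection>
    {
        new PodcastSection("Intro",      "0:00"),
        new PodcastSection("Guest chat", "2:15"),
        new PodcastSection("Wrap-up",    "28:40"),
    };
}
```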
Podcast Audio Chunks: The Whisper model is designed to work on audio samples of up to 30 s in duration. Hence we chunk each section's podcast audio into chunks of at most 30 seconds and load them as a queue into Whisper-tiny for each podcast section.
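A minimal sketch of the chunking idea, assuming we read the AudioClip's raw samples and split on a 30-second boundary; resampling to the 16 kHz mono that Whisper expects is omitted, and the class name is ours:

```csharp
// Split an AudioClip's samples into <=30 s chunks and queue them for Whisper.
using System.Collections.Generic;
using UnityEngine;

public static class AudioChunker
{
    public static Queue<float[]> ChunkClip(AudioClip clip, int maxSeconds = 30)
    {
        // clip.samples is per-channel, so total float count is samples * channels.
        var all = new float[clip.samples * clip.channels];
        clip.GetData(all, 0);

        int samplesPerChunk = clip.frequency * clip.channels * maxSeconds;
        var queue = new Queue<float[]>();

        for (int offset = 0; offset < all.Length; offset += samplesPerChunk)
        {
            int len = Mathf.Min(samplesPerChunk, all.Length - offset);
            var chunk = new float[len];
            System.Array.Copy(all, offset, chunk, 0, len);
            queue.Enqueue(chunk);
        }
        return queue;
    }
}
```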
AI Generated Images: Shown in the scene along with the transcribed text and the ChatGPT image description.
N.B. A black image is most likely caused by the not-safe-for-work filter being triggered.
Scene Control Input Variables:
Script: TalkomicManager.cs:
- pathToAudioFile: full path to the podcast audio file. The audio file must be in sync with the list of section names and start times created in coroutine GenerateSummaryAndTimesAudioQueueAndDirectories(). For example, in the case of the Tech Shift F9 E8 podcast, the sections were broken out by the host along with their start times. You need to create these sections for your own podcast file. Unity will chunk each section's audio into wav files of at most 30 seconds; each section, with as many 30-second chunks as it requires, is patched together and transcribed. Each transcribed section is then sent to ChatGPT to generate a description of an image for that section's text.
- custom_chatgpt_pre_prompt: Text prepended to the transcribed message sent to ChatGPT to guide ChatGPT's response.
- custom_diffuser_pre_prompt: Text prepended to the Stable Diffusion prompt for the image to be created. It guides the Stable Diffusion result.
- limitChatGPTResponseWordCount: Trims the prompt sent to Stable Diffusion to 50 words to avoid the prompt-length limit exception.
- maxChatgptRequestedResponseWords: Maximum number of words ChatGPT is asked to respond with.
- numStableDiffusionImages: Number of images generated from a single ChatGPT image-description prompt.
- steps: Number of Stable Diffusion denoising steps.
- ClassifierFreeGuidanceScaleValue: Stable Diffusion classifier-free guidance scale (see the sketch after this list).
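For context, standard classifier-free guidance blends the unconditional and text-conditioned noise predictions at each denoising step; a higher scale follows the prompt more closely at the cost of image diversity. A minimal sketch (our own helper, not the repo's code):

```csharp
// Standard classifier-free guidance step:
// guided = uncond + scale * (cond - uncond)
public static class GuidanceSketch
{
    public static float[] Apply(float[] uncond, float[] cond, float scale)
    {
        var guided = new float[uncond.Length];
        for (int i = 0; i < uncond.Length; i++)
            guided[i] = uncond[i] + scale * (cond[i] - uncond[i]);
        return guided;
    }
}
```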
ChatGPT Scriptable Object API Credentials and Request Arguments Data: Script ChatgptCreds.cs
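A hypothetical sketch of the shape such a ScriptableObject might take; the actual fields in ChatgptCreds.cs may differ:

```csharp
// Hypothetical credentials/request-arguments asset; field names are assumptions.
using UnityEngine;

[CreateAssetMenu(fileName = "ChatgptCreds", menuName = "Talkomic/Chatgpt Credentials")]
public class ChatgptCredsSketch : ScriptableObject
{
    public string apiKey;                  // set at runtime via the UI; avoid committing keys
    public string model = "gpt-3.5-turbo"; // request argument: model name
    public int maxTokens = 256;            // request argument: response length cap
    public float temperature = 0.7f;       // request argument: sampling temperature
}
```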
Key Additions:
- Audio file is loaded from the UI (see the sketch after this list)
- Project is a work in progress; audio sections must still be entered manually before the build. For a working demo, upload the audio sample sampleaudio.wav, which corresponds to the audio section entered in TalkomicManager.cs.
- ChatGPT credentials are entered in the UI at runtime
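A minimal sketch of loading a picked .wav into an AudioClip; the file path would come from a picker such as yasirkula's Simple File Browser (credited below), whose call we omit here, and the class name is ours:

```csharp
// Load a user-selected .wav from disk into an AudioClip.
using System.Collections;
using UnityEngine;
using UnityEngine.Networking;

public class AudioLoaderSketch : MonoBehaviour
{
    public IEnumerator LoadWav(string absolutePath, System.Action<AudioClip> onLoaded)
    {
        string uri = "file://" + absolutePath;
        using var request = UnityWebRequestMultimedia.GetAudioClip(uri, AudioType.WAV);
        yield return request.SendWebRequest();

        if (request.result == UnityWebRequest.Result.Success)
            onLoaded(DownloadHandlerAudioClip.GetContent(request));
        else
            Debug.LogError(request.error);
    }
}
```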
Windows runtime video demo:
Win-Demo-Talkomic.mp4
Unity version: Unity 2021.3.26f1.
This prototype has been tested in the Unity Editor and on a Windows 11 build.
Tested on a Windows 11 system with 64 GB RAM, an NVIDIA GeForce RTX 3090 Ti 24 GB GPU, and a 12th Gen Intel i9-12900K CPU (3400 MHz, 16 cores).
This project is licensed under the MIT License. See LICENSE.txt for more information.
Special thanks to Jason Gauci, co-host of the Programming Throwdown podcast, whose idea shared on the show served as inspiration for this prototype.
We also thank @sd-akashic, @Haoming02, and @Microsoft for helping us better understand the onnxruntime implementation in Unity,
ai-forever for the Real-ESRGAN git repo,
and yasirkula for the Simple File Browser.
If you find this helpful you can buy me a coffee :)