Voice Cloning and Realistic Video Generation

Introduction

Welcome to my Voice Cloning and Realistic Video Generation project. It creates near-real-time digital twins by combining voice cloning with video generation, producing lifelike digital replicas of individuals, complete with their voices, expressions, and speech.

Problem Statement

The challenge is to develop AI models with the following capabilities:

- Advanced Neural Architectures: I leveraged state-of-the-art deep learning techniques, including recurrent neural networks (RNNs), convolutional neural networks (CNNs), and generative adversarial networks (GANs), for voice cloning and realistic video generation.
- Expressiveness: My goal was to create models that could accurately convey a wide range of emotions, accents, and speaking styles. This enables expressive voice cloning and natural video generation from 2D images.
- Naturalness: I focused on making the generated voice clones sound completely natural and human-like. Additionally, I paid close attention to achieving precise lip-sync and realistic video corresponding to the cloned audio.
- Real-Time Nature: I built an ensemble of voice cloning and video generation models designed to operate in near real-time. This makes my solution suitable for various conversational AI applications.

Solution Architecture

My approach to solving this challenge involved two distinct components:

Voice Cloning and Text-to-Speech (TTS)

I used the Tortoise-TTS repository to implement both voice cloning and text-to-speech. This component lets users upload audio samples of the voice to clone and specify the text prompt to be spoken in that cloned voice.
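As a rough illustration of this component, the sketch below follows the usage pattern shown in the Tortoise-TTS README (load reference clips, call `tts_with_preset`, save at 24 kHz). It is not necessarily the code in TorTTS_API.ipynb. The function names `find_reference_clips` and `clone_voice`, the default output path, and the choice of the `fast` preset are my own assumptions.

```python
from pathlib import Path


def find_reference_clips(voice_dir):
    """Return the .wav files in voice_dir, sorted so runs are reproducible."""
    return sorted(Path(voice_dir).glob("*.wav"))


def clone_voice(text, voice_dir, out_path="generated.wav", preset="fast"):
    """Speak `text` in the voice of the clips found in `voice_dir`.

    Hypothetical helper: names and defaults are illustrative only.
    """
    clip_paths = find_reference_clips(voice_dir)
    if not clip_paths:
        raise ValueError(f"no .wav reference clips found in {voice_dir}")

    # Imported lazily so the helper above works without Tortoise installed.
    import torchaudio
    from tortoise.api import TextToSpeech
    from tortoise.utils.audio import load_audio

    # Tortoise expects reference clips resampled to 22.05 kHz.
    clips = [load_audio(str(path), 22050) for path in clip_paths]
    tts = TextToSpeech()
    speech = tts.tts_with_preset(text, voice_samples=clips, preset=preset)

    # Tortoise generates audio at 24 kHz.
    torchaudio.save(out_path, speech.squeeze(0).cpu(), 24000)
    return out_path
```

Calling `clone_voice("Hello there", "voices/me")` would write `generated.wav`, assuming the directory holds at least one reference clip and a GPU-backed environment such as Colab is available.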

Realistic Video Generation

To generate lifelike videos with precise lip-sync, I integrated the SadTalker repository. This component takes an input image and the audio file from the voice cloning step, and produces a video in which the face is lip-synced to that audio.
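One way to drive SadTalker from Python is to call its `inference.py` script as a subprocess. The sketch below assumes the `--driven_audio`, `--source_image`, and `--result_dir` options documented in the SadTalker README; check them against the version you have cloned. The helper names and the `sadtalker_dir` argument are hypothetical, and Vid_API.ipynb may do this differently.

```python
import subprocess
import sys


def build_sadtalker_command(source_image, driven_audio, result_dir="results"):
    """Build the argument list for SadTalker's inference script."""
    return [
        sys.executable,
        "inference.py",
        "--driven_audio", str(driven_audio),   # speech from the voice cloning step
        "--source_image", str(source_image),   # still image of the speaker
        "--result_dir", str(result_dir),       # where SadTalker writes the video
    ]


def run_sadtalker(source_image, driven_audio, sadtalker_dir, result_dir="results"):
    """Run inference from inside the cloned SadTalker checkout.

    Paths are resolved relative to `sadtalker_dir`; raises
    CalledProcessError if the script exits with a non-zero status.
    """
    command = build_sadtalker_command(source_image, driven_audio, result_dir)
    subprocess.run(command, cwd=sadtalker_dir, check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems when file names contain spaces.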

To handle the processing load and prevent crashes, I ran each component in its own Google Colab instance. Each instance serves its component through Flask and exposes it with ngrok, which gives the Streamlit application a public URL to call.
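The sketch below shows how a client such as the Streamlit app could chain the two services: send the text and voice sample to the TTS instance, then send the returned audio and the image to the video instance. Everything specific here is an assumption made for illustration: the placeholder URLs, the `/clone` and `/animate` routes, and the JSON-with-base64 payload format. The real routes are whatever the two notebooks define.

```python
import base64
import json
import urllib.request

# Hypothetical placeholders: replace with the ngrok URLs printed by each notebook.
TTS_API_URL = "https://tts-example.ngrok.io"
VIDEO_API_URL = "https://video-example.ngrok.io"


def endpoint(base_url, route):
    """Join a base URL and a route without doubling or dropping slashes."""
    return base_url.rstrip("/") + "/" + route.lstrip("/")


def encode(data):
    """Base64-encode bytes so they can travel inside a JSON body."""
    return base64.b64encode(data).decode("ascii")


def post_json(url, payload, timeout=600):
    """POST a JSON payload and return the raw response body as bytes."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def generate_video(text, voice_sample, image, send=post_json):
    """Chain the two services: text + voice sample -> audio -> video."""
    # Step 1: the TTS instance returns speech in the cloned voice.
    audio = send(
        endpoint(TTS_API_URL, "/clone"),  # hypothetical route
        {"text": text, "voice_sample": encode(voice_sample)},
    )
    # Step 2: the video instance lip-syncs the image to that audio.
    return send(
        endpoint(VIDEO_API_URL, "/animate"),  # hypothetical route
        {"image": encode(image), "audio": encode(audio)},
    )
```

The `send` parameter exists so the chaining logic can be exercised with a stand-in function while the Colab instances are offline.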

Getting Started

Prerequisites

Before you begin, ensure you have met the following requirements:

| Requirement | Version |
| ----------- | ------- |
| Python      | >= 3.6  |
| TensorFlow  | >= 2.0  |
| PyTorch     | >= 1.0  |
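If you want to check an environment against these minimums, a small script along the following lines could help. It is only a sketch: the helper names are made up, and it assumes TensorFlow and PyTorch are installed under the package names `tensorflow` and `torch`.

```python
import re
import sys

# Minimum versions from the table above, keyed by assumed package name.
MINIMUMS = {"tensorflow": "2.0", "torch": "1.0"}


def version_tuple(version):
    """Turn '2.1.0+cu118' into (2, 1, 0), ignoring any non-numeric suffix."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"cannot parse version: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, minimum):
    """Compare two version strings numerically, so '2' counts as '2.0'."""
    have, need = version_tuple(installed), version_tuple(minimum)
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


def find_problems(installed, python_version=sys.version_info[:2]):
    """Return a list of human-readable problems; empty means all good.

    `installed` maps package name -> installed version string.
    """
    problems = []
    if tuple(python_version) < (3, 6):
        problems.append("Python is older than 3.6")
    for package, minimum in MINIMUMS.items():
        version = installed.get(package)
        if version is None:
            problems.append(f"{package} is not installed")
        elif not meets_minimum(version, minimum):
            problems.append(f"{package} {version} is older than {minimum}")
    return problems
```

On Python 3.8 or newer, the `installed` mapping can be filled in with `importlib.metadata.version("torch")` and the equivalent call for `tensorflow`.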

Installation

To install the required dependencies, follow these steps:

Clone the repository:

git clone https://github.com/bruno-noir/voice-cloning-video-generation.git

Install the necessary packages:
```
pip install -r requirements.txt
```

Usage

To run the entire system, follow these steps:

1. Upload the TorTTS_API.ipynb notebook to one Colab instance and the Vid_API.ipynb notebook to another Colab instance.
2. Configure ngrok in both instances and note the public URL each one generates.
3. Enter the ngrok URLs generated in step 2 into the app.py file.
4. Launch the Streamlit application using the command: streamlit run app.py
5. In the Streamlit application, users can perform the following actions:
   - Upload a sample audio file (.wav) with a duration of 10 to 15 seconds for voice cloning.
   - Specify a text prompt for generating speech.
   - Upload an image (.png) of the person whose voice is to be cloned.

The system then generates a lip-synced video from the cloned audio and the uploaded image.
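Because the voice sample should be 10 to 15 seconds long, it can be worth checking the file before uploading it. Here is a minimal sketch using Python's standard `wave` module, which only reads uncompressed PCM .wav files. The function names and constants are illustrative and are not part of app.py.

```python
import wave

# Duration range recommended above for the voice sample.
MIN_SECONDS = 10.0
MAX_SECONDS = 15.0


def wav_duration_seconds(path):
    """Return the duration of a PCM .wav file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_valid_voice_sample(path):
    """True if the clip length falls inside the recommended range."""
    return MIN_SECONDS <= wav_duration_seconds(path) <= MAX_SECONDS
```

A clip that fails this check can be trimmed or re-recorded before it is sent to the voice cloning service.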

License

This project is licensed under the MIT License - see the LICENSE file for details.
