Single- and Multi-Speaker Cloned Voice Detection: From Perceptual to Learned Features

License · Python 3.8.0 · Code style: black

This is the repository for the paper *Single and Multi-Speaker Cloned Voice Detection: From Perceptual to Learned Features*, submitted to the 2023 IEEE International Workshop on Information Forensics and Security (WIFS 2023).

The provided source code includes implementations of both the single-speaker and multi-speaker pipelines. Note, however, that the dataset used in the experiments is not included in this repository. To replicate the experiments, you will need to create an analogous dataset of cloned voices using one or more voice-cloning architectures or providers.

The repository does provide code for data generation and adversarial laundering, tailored to one example provider, ElevenLabs. To run the pipeline on a new dataset, generate features from it and save them to disk (see the sketch below), then modify the relevant data-handling code to make it compatible with your dataset.

Please refer to the paper and the source code for more detailed instructions on how to use the code and conduct the experiments.
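As a starting point, the following is a minimal sketch of the feature-generation step, using the open-source `opensmile` Python package to extract spectral features for a folder of WAV files and save them to disk. The directory layout, feature set, and output path are illustrative assumptions and may differ from what `SmileFeatureGenerator.py` and `SavedFeatureLoader.py` expect.

```python
# Hedged sketch: extract openSMILE functionals for a folder of WAV files and
# save them to a single CSV. Paths and the feature set are illustrative
# assumptions; adapt them to SmileFeatureGenerator.py and your dataset.
from pathlib import Path

import opensmile
import pandas as pd

AUDIO_DIR = Path("data/my_cloned_voices")          # hypothetical dataset location
OUTPUT_CSV = Path("features/smile_features.csv")   # hypothetical output path

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

frames = []
for wav_path in sorted(AUDIO_DIR.glob("*.wav")):
    # process_file returns a one-row DataFrame of functionals for the clip
    features = smile.process_file(str(wav_path))
    features["filename"] = wav_path.name
    frames.append(features)

OUTPUT_CSV.parent.mkdir(parents=True, exist_ok=True)
pd.concat(frames).to_csv(OUTPUT_CSV, index=False)
print(f"Saved {len(frames)} feature rows to {OUTPUT_CSV}")
```

The learned (NVIDIA TitaNet) and handcrafted perceptual features would be generated analogously via `AudioEmbeddingsManager.py` and `CadenceModelManager.py`.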

Folder Structure

The repository is structured as follows:

| Folder | File | Description |
| --- | --- | --- |
| **Experiment Pipeline** | | |
| /src/ | run_pipeline_ljspeech.py | Runs the pipeline for single-voice (LJSpeech) experiments |
| /src/ | run_pipeline_multivoice.py | Runs the pipeline for multi-voice experiments |
| /src/packages/ | ExperimentPipeline.py | Class for running the experiment pipeline and logging results |
| /src/packages/ | ModelManager.py | Class for managing the final classification models |
| **Feature Generation** | | |
| /src/packages/ | AudioEmbeddingsManager.py | Class for managing learned features generated using NVIDIA TitaNet |
| /src/packages/ | SmileFeatureManager.py | Class for managing spectral features generated using openSMILE |
| /src/packages/ | SmileFeatureGenerator.py | Class for generating spectral features for collections of audio files and saving them to disk |
| /src/packages/ | SmileFeatureSelector.py | Class for selecting spectral features using sklearn.feature_selection |
| /src/packages/ | CadenceModelManager.py | Class for managing perceptual features generated using handcrafted techniques |
| /src/packages/ | CadenceUtils.py | Utility functions used by CadenceModelManager for generating features |
| /src/packages/ | BayesSearch.py | Class implementing Bayesian hyperparameter optimization for the perceptual model |
| /src/packages/ | SavedFeatureLoader.py | Helper function for loading generated features from disk during experiments |
| **Data Loaders** | | |
| /src/packages/ | LJDataLoader.py | Class for loading and handling the LJSpeech data for experiments |
| /src/packages/ | TIMITDataLoader.py | Class for loading and handling the TIMIT data for multi-voice experiments |
| **Data Generation** | | |
| /src/packages/ | BaseDeepFakeGenerator.py | Base class for processing data used for voice cloning |
| /src/packages/ | ElevenLabsDeepFakeGenerator.py | Class for generating deepfakes using the ElevenLabs API |
| /src/packages/ | AudioManager.py | Class for resampling audio files and performing adversarial laundering (a minimal laundering sketch follows this table) |
| **Misc** | | |
| . | README.md | Provides an overview of the project |
| . | conda_requirements.txt | Dependencies for creating the conda environment |
| . | pip_requirements.txt | Dependencies installed with pip |
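The adversarial laundering referenced above (and in the results below) combines additive noise and transcoding. The following is a minimal sketch of that idea; it is not the repository's `AudioManager.py` implementation, and the SNR, codec, and bitrate are illustrative assumptions.

```python
# Hedged sketch of adversarial laundering: add white noise at a target SNR,
# then round-trip the audio through a lossy codec with ffmpeg. This
# approximates the additive-noise + transcoding laundering described in the
# paper; it is not the repository's AudioManager code.
import subprocess

import numpy as np
import soundfile as sf


def add_noise(in_wav: str, out_wav: str, snr_db: float = 20.0) -> None:
    """Add Gaussian noise to an audio file at the given signal-to-noise ratio."""
    audio, sr = sf.read(in_wav)
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=audio.shape)
    sf.write(out_wav, audio + noise, sr)


def transcode(in_wav: str, out_wav: str, bitrate: str = "64k") -> None:
    """Transcode WAV -> MP3 -> WAV using ffmpeg to introduce codec artifacts."""
    mp3_path = out_wav + ".mp3"
    subprocess.run(["ffmpeg", "-y", "-i", in_wav, "-b:a", bitrate, mp3_path], check=True)
    subprocess.run(["ffmpeg", "-y", "-i", mp3_path, out_wav], check=True)


if __name__ == "__main__":
    # Hypothetical filenames for illustration.
    add_noise("clip.wav", "clip_noisy.wav", snr_db=20.0)
    transcode("clip_noisy.wav", "clip_laundered.wav")
```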

Data

An overview of the real and synthetic datasets used in our single-speaker (top) and multi-speaker (bottom) evaluations. The 91,700 WaveFake samples correspond to 13,100 samples for each of the seven vocoder architectures, hence the larger number of clips and longer total duration.

Single-speaker

| Type | Name | Clips (#) | Duration (sec) |
| --- | --- | --- | --- |
| Real | LJSpeech | 13,100 | 86,117 |
| Synthetic | WaveFake | 91,700 | 603,081 |
| Synthetic | ElevenLabs | 13,077 | 78,441 |
| Synthetic | Uberduck | 13,094 | 83,322 |

Multi-speaker

| Type | Name | Clips (#) | Duration (sec) |
| --- | --- | --- | --- |
| Real | TIMIT | 4,620 | 14,192 |
| Synthetic | ElevenLabs | 5,499 | 15,413 |

Publicly Available Data

  1. The LJ Speech 1.1 Dataset -- Data
  2. WaveFake: A Data Set to Facilitate Audio Deepfake Detection -- Paper, Data
  3. TIMIT Acoustic-Phonetic Continuous Speech Corpus -- Data

Commercial Voice Cloning Tools

  1. ElevenLabs (EL) -- https://beta.elevenlabs.io/
  2. UberDuck (UD) -- https://app.uberduck.ai/
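For reference, the sketch below shows how a cloned clip can be requested from the publicly documented ElevenLabs text-to-speech REST API; the repository's `ElevenLabsDeepFakeGenerator.py` wraps this API, but the endpoint, payload fields, and voice ID below are assumptions based on the public documentation at the time of writing and may have changed.

```python
# Hedged sketch: request synthetic speech from the ElevenLabs text-to-speech
# REST API and save the returned MP3. Endpoint, payload fields, and voice ID
# reflect the publicly documented API at the time of writing and may have
# changed; this is not the repository's ElevenLabsDeepFakeGenerator.py.
import os

import requests

API_KEY = os.environ["ELEVENLABS_API_KEY"]   # your API key
VOICE_ID = "your-cloned-voice-id"            # hypothetical voice ID

url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"
headers = {
    "xi-api-key": API_KEY,
    "Content-Type": "application/json",
    "Accept": "audio/mpeg",
}
payload = {
    "text": "The quick brown fox jumps over the lazy dog.",
    "voice_settings": {"stability": 0.5, "similarity_boost": 0.5},
}

response = requests.post(url, headers=headers, json=payload, timeout=60)
response.raise_for_status()

with open("cloned_clip.mp3", "wb") as f:
    f.write(response.content)
```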

Results

Single-speaker

Accuracies for personalized, single-speaker classification of unlaundered audio (top) and of audio subjected to adversarial laundering in the form of additive noise and transcoding (bottom). Dataset corresponds to ElevenLabs (EL), UberDuck (UD), and WaveFake (WF); Model corresponds to a linear (L) or non-linear (NL) classifier, trained either as a single (binary) classifier (real vs. synthetic) or as a multiclass classifier (real vs. specific synthesis architecture). Accuracy (%) is reported for synthetic and real audio; for the binary classifiers, the equal error rate (EER) is also reported (a sketch of the EER computation follows the table).

| Dataset | Model | Synthetic Acc. (Learned) | Synthetic Acc. (Spectral) | Synthetic Acc. (Perceptual) | Real Acc. (Learned) | Real Acc. (Spectral) | Real Acc. (Perceptual) | EER (Learned) | EER (Spectral) | EER (Perceptual) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Unlaundered, binary** | | | | | | | | | | |
| EL | single (L) | 100.0 | 99.2 | 78.2 | 100.0 | 99.9 | 72.5 | 0.0 | 0.5 | 24.9 |
| | single (NL) | 100.0 | 99.9 | 82.2 | 100.0 | 100.0 | 80.4 | 0.0 | 0.1 | 18.6 |
| UD | single (L) | 99.8 | 98.9 | 51.9 | 99.9 | 98.9 | 54.0 | 0.1 | 1.1 | 47.2 |
| | single (NL) | 99.7 | 99.2 | 54.4 | 99.9 | 99.0 | 56.5 | 0.2 | 0.9 | 44.5 |
| WF | single (L) | 96.5 | 78.4 | 57.8 | 97.1 | 82.3 | 45.6 | 3.3 | 19.7 | 48.5 |
| | single (NL) | 94.5 | 87.6 | 50.3 | 96.7 | 90.2 | 52.7 | 4.4 | 11.2 | 48.6 |
| EL+UD | single (L) | 99.7 | 94.8 | 63.4 | 99.9 | 97.1 | 60.3 | 0.2 | 4.2 | 37.9 |
| | single (NL) | 99.7 | 99.2 | 57.3 | 99.9 | 99.6 | 69.0 | 0.2 | 0.8 | 37.6 |
| EL+UD+WF | single (L) | 93.2 | 79.7 | 58.4 | 98.7 | 93.0 | 57.6 | 3.6 | 15.9 | 42.1 |
| | single (NL) | 91.2 | 90.6 | 53.1 | 99.0 | 94.1 | 64.7 | 4.1 | 7.9 | 41.6 |
| **Unlaundered, multiclass** | | | | | | | | | | |
| EL+UD | multi (L) | 99.9 | 96.6 | 61.0 | 100.0 | 94.6 | 35.7 | - | - | - |
| | multi (NL) | 99.7 | 98.3 | 65.6 | 100.0 | 97.2 | 43.2 | - | - | - |
| EL+UD+WF | multi (L) | 98.8 | 80.2 | 45.1 | 97.3 | 64.3 | 22.9 | - | - | - |
| | multi (NL) | 98.1 | 94.2 | 48.6 | 96.3 | 84.4 | 27.6 | - | - | - |
| **Laundered, binary** | | | | | | | | | | |
| EL | single (L) | 95.5 | 94.3 | 61.1 | 94.5 | 92.6 | 65.2 | 4.9 | 6.7 | 36.6 |
| | single (NL) | 96.0 | 96.2 | 70.4 | 95.4 | 95.6 | 69.6 | 4.1 | 4.1 | 30.1 |
| UD | single (L) | 95.4 | 81.1 | 61.4 | 91.8 | 84.3 | 44.7 | 6.3 | 17.3 | 46.7 |
| | single (NL) | 95.4 | 86.8 | 52.9 | 93.3 | 86.1 | 55.9 | 5.5 | 13.6 | 45.6 |
| WF | single (L) | 87.6 | 60.7 | 59.6 | 85.0 | 70.4 | 42.5 | 13.9 | 34.4 | 49.4 |
| | single (NL) | 83.6 | 77.1 | 51.4 | 85.6 | 76.7 | 53.9 | 15.3 | 23.1 | 47.3 |
| EL+UD | single (L) | 95.2 | 79.1 | 54.0 | 91.7 | 78.4 | 59.8 | 6.2 | 21.3 | 43.1 |
| | single (NL) | 94.8 | 86.1 | 55.2 | 93.3 | 90.0 | 62.4 | 6.0 | 12.0 | 41.4 |
| EL+UD+WF | single (L) | 83.7 | 70.9 | 50.6 | 88.6 | 72.9 | 59.7 | 13.2 | 28.2 | 44.8 |
| | single (NL) | 83.4 | 79.2 | 53.0 | 90.7 | 85.1 | 60.7 | 12.5 | 17.9 | 43.6 |
| **Laundered, multiclass** | | | | | | | | | | |
| EL+UD | multi (L) | 94.2 | 85.6 | 50.9 | 91.0 | 77.1 | 29.1 | - | - | - |
| | multi (NL) | 94.5 | 91.7 | 53.2 | 90.3 | 82.9 | 41.3 | - | - | - |
| EL+UD+WF | multi (L) | 89.8 | 65.4 | 35.3 | 83.1 | 44.3 | 26.2 | - | - | - |
| | multi (NL) | 88.8 | 78.8 | 39.8 | 82.1 | 63.0 | 28.6 | - | - | - |
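As a hedged illustration of how the classifier accuracies and EER values above can be reproduced from saved features, the sketch below trains a simple linear real-vs-synthetic classifier with scikit-learn and computes the EER from its scores. The feature file, label column, and classifier choice are illustrative assumptions, not the repository's `ModelManager.py` or `ExperimentPipeline.py` code.

```python
# Hedged sketch: train a linear real-vs-synthetic classifier on saved features
# and compute the equal error rate (EER) from its scores. The feature CSV,
# label column, and model choice are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Hypothetical feature file: one row per clip, a binary "label" column
# (1 = synthetic, 0 = real), a "filename" column, and feature columns.
data = pd.read_csv("features/smile_features_labeled.csv")
X = data.drop(columns=["label", "filename"]).to_numpy()
y = data["label"].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000)  # linear (L); swap in an MLP for non-linear (NL)
clf.fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]

# EER: the operating point where the false-positive and false-negative rates
# are (approximately) equal.
fpr, tpr, _ = roc_curve(y_test, scores)
fnr = 1 - tpr
eer_index = np.nanargmin(np.abs(fpr - fnr))
eer = (fpr[eer_index] + fnr[eer_index]) / 2

print(f"EER: {eer * 100:.1f}%")
```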

Multi-speaker

Accuracies for non-personalized, multi-speaker classification of unlaundered audio. Dataset corresponds to ElevenLabs (EL); Model corresponds to a linear (L) or non-linear (NL) classifier, trained either as a single (binary) classifier (real vs. synthetic) or as a multiclass classifier (real vs. specific synthesis architecture). Accuracy (%) is reported for synthetic and real audio; for the binary classifiers, the equal error rate (EER) is also reported.

| Dataset | Model | Synthetic Acc. (Learned) | Synthetic Acc. (Spectral) | Synthetic Acc. (Perceptual) | Real Acc. (Learned) | Real Acc. (Spectral) | Real Acc. (Perceptual) | EER (Learned) | EER (Spectral) | EER (Perceptual) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EL | single (L) | 100.0 | 94.2 | 83.8 | 99.9 | 98.3 | 86.9 | 0.0 | 3.0 | 1.3 |
| | single (NL) | 92.3 | 96.3 | 82.2 | 100.0 | 99.7 | 87.7 | 0.1 | 1.6 | 1.4 |

Research Group

School of Information and Department of Electrical Engineering and Computer Sciences at the University of California, Berkeley

This work was partially funded by a grant from the UC Berkeley Center for Long-Term Cybersecurity (CLTC), an award for open-source innovation from the Digital Public Goods Alliance and United Nations Development Programme, and an unrestricted gift from Meta.

Citation

Please cite the following paper if you use this code:

@misc{barrington2023single,
      title={Single and Multi-Speaker Cloned Voice Detection: From Perceptual to Learned Features}, 
      author={Sarah Barrington and Romit Barua and Gautham Koorma and Hany Farid},
      year={2023},
      eprint={2307.07683},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}
