As text-based speech editing becomes increasingly prevalent, the demand for unrestricted free-text editing continues to grow. However, existing speech editing techniques encounter significant challenges, particularly in maintaining intelligibility and acoustic consistency when dealing with out-of-domain (OOD) text. In this paper, we introduce DiffEditor, a novel speech editing model designed to enhance performance in OOD text scenarios through semantic enrichment and acoustic consistency. To improve the intelligibility of the edited speech, we enrich the semantic information of phoneme embeddings by integrating word embeddings extracted from a pretrained language model. Furthermore, we emphasize that inter-frame smoothing properties are critical for modeling acoustic consistency, and thus we propose a first-order loss function to promote smoother transitions at editing boundaries and enhance the overall fluency of the edited speech. Experimental results demonstrate that our model achieves state-of-the-art performance in both in-domain and OOD text scenarios.
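For intuition, the first-order loss mentioned above amounts to constraining frame-to-frame differences (deltas) of the predicted mel-spectrogram to match those of the reference, which encourages smooth transitions at editing boundaries. Below is a minimal sketch of such a loss in PyTorch; the function name, the L1 distance, and the tensor layout are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def first_order_smoothness_loss(pred_mel: torch.Tensor, ref_mel: torch.Tensor) -> torch.Tensor:
    """Hypothetical first-order loss: match the frame-to-frame deltas of the
    predicted mel-spectrogram to those of the reference.
    Assumed shape for both tensors: (batch, frames, mel_bins)."""
    pred_delta = pred_mel[:, 1:, :] - pred_mel[:, :-1, :]  # first-order difference along time
    ref_delta = ref_mel[:, 1:, :] - ref_mel[:, :-1, :]
    return F.l1_loss(pred_delta, ref_delta)
```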
This repo contains official PyTorch implementations of:
- DiffEditor: Enhancing Speech Editing with Semantic Enrichment and Acoustic Consistency Demo page | Code
This repo contains unofficial PyTorch implementations of:
- FluentSpeech: Stutter-Oriented Automatic Speech Editing with Context-Aware Diffusion Models (ACL 2023) Demo page
- CampNet: Context-Aware Mask Prediction for End-to-End Text-Based Speech Editing (ICASSP 2022) Demo page
- A3T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing (ICML 2022) Demo page | Official code
- EditSpeech: A text based speech editing system using partial inference and bidirectional fusion (ASRU 2021) Demo page
Our framework supports the following datasets:
- VCTK
You can download the VCTK dataset from the official website. Follow these steps:
- Visit the VCTK dataset download page.
- Download the dataset (VCTK-Corpus.tar.gz).
Extract the downloaded file to your desired directory. For example:
mkdir data/raw
tar -xzf VCTK-Corpus.tar.gz -C data/raw
Please install the latest numpy, torch and tensorboard first. Then run the following commands:
export PYTHONPATH=.
# install requirements.
pip install -U pip
pip install -r requirements.txt
sudo apt install -y sox libsox-fmt-mp3
Finally, install Montreal Forced Aligner following the link below:
https://montreal-forced-aligner.readthedocs.io/en/latest/
mkdir pretrained
mkdir pretrained/hifigan_hifitts
Download model_ckpt_steps_2168000.ckpt and config.yaml from https://drive.google.com/drive/folders/1n_0tROauyiAYGUDbmoQ__eqyT_G4RvjN?usp=sharing and put them in pretrained/hifigan_hifitts.
mkdir cache
Download the bert-base-multilingual-cased model directory from https://huggingface.co/google-bert/bert-base-multilingual-cased and put it in cache/bert-base-multilingual-cased.
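As a reference for the semantic-enrichment step described above, here is a minimal sketch of extracting contextual word embeddings from the downloaded BERT model with the HuggingFace `transformers` library. The word-level pooling step is an assumption, and this is not necessarily how the repo consumes the embeddings internally.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the model from the local cache directory prepared above.
tokenizer = AutoTokenizer.from_pretrained("cache/bert-base-multilingual-cased")
bert = AutoModel.from_pretrained("cache/bert-base-multilingual-cased")
bert.eval()

text = "I'd absolutely love to be at the world cup."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    hidden = bert(**inputs).last_hidden_state  # (1, num_subword_tokens, 768)
# A word-level embedding can then be obtained by averaging the sub-word token
# vectors belonging to each word (assumption; the repo may pool differently).
```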
# The default dataset is ``vctk``.
python data_gen/tts/base_preprocess.py --config egs/DiffEditor.yaml
bash data_gen/tts/run_mfa_train_align.sh vctk data_gen/tts/mfa_train_config_vctk.yaml
python data_gen/tts/base_binarizer.py --config egs/DiffEditor.yaml
# Example run for DiffEditor.
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --dir /path/to/your/DiffEditor --config egs/DiffEditor.yaml --exp_name DiffEditor --reset
We provide the data structure for inference in inference/example.csv. `text` and `edited_text` refer to the original text and the target text, respectively. `region` refers to the word index range (starting from 1) of the original text that you want to edit, and `edited_region` refers to the corresponding word index range in `edited_text`.
| id | item_name | text | edited_text | wav_fn_orig | edited_region | region |
|---|---|---|---|---|---|---|
| 0 | 1 | "I'd love to be at the world cup." | "I'd absolutely love to be at the world cup." | inference/audio_example/1.wav | [1,3] | [1,2] |
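The sketch below shows how an additional inference row with these columns could be written programmatically. The output file name and audio path are hypothetical, and the bracketed index ranges are stored as plain strings, matching the example above.

```python
import csv

# Columns follow inference/example.csv as described above.
fieldnames = ["id", "item_name", "text", "edited_text",
              "wav_fn_orig", "edited_region", "region"]

row = {
    "id": 1,
    "item_name": 2,
    "text": "I'd love to be at the world cup.",
    "edited_text": "I'd really love to be at the world cup.",
    "wav_fn_orig": "inference/audio_example/2.wav",  # hypothetical path
    "edited_region": "[1,3]",  # word indices (starting from 1) in the edited text
    "region": "[1,2]",         # word indices to replace in the original text
}

with open("inference/my_edits.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow(row)
```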
# run with example_en.csv
bash run_test.sh DiffEditor inference/example_en.csv en inference/raw_wav 1
The results are saved under the name DiffEditor_auto in the inference directory.
Example objective evaluation for DiffEditor. You can use the following objective evaluation metrics: MCD, STOI, and PESQ.
python eval/get_metrics.py
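`eval/get_metrics.py` is the script shipped with this repo. As a rough illustration of how STOI and PESQ are typically computed, here is a hedged sketch using the third-party `soundfile`, `librosa`, `pystoi`, and `pesq` packages (not necessarily what the script uses); the file paths are hypothetical, and MCD is omitted here because it additionally requires mel-cepstral extraction and DTW alignment.

```python
import librosa
import soundfile as sf
from pystoi import stoi
from pesq import pesq

# Hypothetical file paths; substitute the original recording and the edited output.
ref, sr = sf.read("inference/audio_example/1.wav")
deg, _ = sf.read("path/to/edited_output.wav")

# PESQ requires 8 kHz or 16 kHz input, so resample both signals first.
ref16 = librosa.resample(ref, orig_sr=sr, target_sr=16000)
deg16 = librosa.resample(deg, orig_sr=sr, target_sr=16000)
n = min(len(ref16), len(deg16))  # align lengths before scoring

print("STOI:", stoi(ref16[:n], deg16[:n], 16000, extended=False))
print("PESQ:", pesq(16000, ref16[:n], deg16[:n], "wb"))
```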
You can run the following command to launch a local web interface for editing:
python app.py
Any organization or individual is prohibited from using any technology mentioned in this paper to generate anyone's speech without their consent, including but not limited to government leaders, political figures, and celebrities. Failure to comply with this item may constitute a violation of copyright laws.
- If you find that `mfa_dict.txt`, `mfa_model.zip`, `phone_set.json`, or `word_set.json` are missing during inference, you need to run the preprocessing script in this repo to generate them. You can also download all of the files needed to run inference with the pre-trained model from https://drive.google.com/drive/folders/1BOFQ0j2j6nsPqfUlG8ot9I-xvNGmwgPK?usp=sharing and put them in `data/processed/vctk`.
- Please specify the MFA version as 2.0.0rc3.
- In order to reproduce the paper results, you need to use the test set listed in data/vctk_test_split.csv.
E-mail: 2120230617@mail.nankai.edu.cn