This issue was moved to a discussion.

How to Make Finetuning Dataset #69

Closed
fakerybakery opened this issue Nov 23, 2023 · 8 comments
Comments

@fakerybakery
Contributor

Hi, for the finetuning dataset, should we use Whisper -> Phonemizer to make it from a list of audio files?

@Kreevoz

Kreevoz commented Nov 23, 2023

The one thing that would be of a real utility value is to give users the option to provide non-phonemized audio.wav|transcript in normal english list files, and then handle the phonemization (and maybe caching) for them so it matches the requirements of StyleTTS2 exactly.

Splitting out the phonemization into its own utility function and not repeating it verbatim in the first 4 lines of every inference function definition also would make sense. Then it could be called from anywhere including at auto-dataset generation.
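A minimal sketch of that idea, assuming input list files of the form audio.wav|transcript and an output of audio.wav|phonemes|speaker. The three-field output layout and the default speaker id are assumptions and should be checked against Data/train_list.txt. The helper name phonemize_list_lines is hypothetical, and the phonemizer is passed in as a callable so the same function could be shared with the inference code.

```python
from typing import Callable, Dict, Iterable, List, Optional


def phonemize_list_lines(
    lines: Iterable[str],
    phonemize: Callable[[str], str],
    cache: Optional[Dict[str, str]] = None,
    speaker: str = "0",  # assumed single-speaker id
) -> List[str]:
    """Turn 'wav|plain text' lines into 'wav|phonemes|speaker' lines."""
    cache = {} if cache is None else cache
    out = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        wav, text = line.split("|", 1)
        text = text.strip()
        # Phonemize each distinct transcript only once.
        if text not in cache:
            cache[text] = phonemize(text)
        out.append(f"{wav}|{cache[text]}|{speaker}")
    return out
```

Passing the same cache dict across calls avoids re-phonemizing transcripts that were already seen.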

You can't blindly trust whisper's transcripts though. I've run a bunch of larger datasets over it and it makes enough insanely stupid mistakes at times (even with the large model) that you will negatively impact your training results if you don't fix it up by hand and with careful listening.
You definitely want to double-check the punctuation it generates also and terminate sentences properly.
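As a rough aid for that review pass, here is a small sketch that flags transcripts which do not end in sentence-terminating punctuation. It does not replace listening to the audio. The function name needs_review and the set of accepted terminators are assumptions, not something Whisper or StyleTTS2 prescribes.

```python
TERMINATORS = (".", "!", "?")  # assumed set of valid sentence endings


def needs_review(transcript: str) -> bool:
    """Return True if the transcript is empty or lacks terminal punctuation."""
    text = transcript.strip()
    # Ignore closing quotes or brackets that follow the final punctuation mark.
    core = text.rstrip("\"')”’")
    return not core or not core.endswith(TERMINATORS)
```

Anything flagged this way still has to be corrected by hand.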

@yl4579
Owner

yl4579 commented Nov 23, 2023

@Kreevoz One can use https://github.com/jaywalnut310/vits/blob/main/preprocess.py to generate the phonemes.

@devidw
Contributor

devidw commented Nov 23, 2023

working on a pipeline to easily allow building a compatible dataset, https://github.com/devidw/dswav

its a gradio ui that allows you to transcribe an input audio and have it be split into samples based on detected sentences and also builds required files for training

as @Kreevoz noted, whisper is a source of potential issues if its output isn't carefully checked

also splitting at sentences seems not ideal, since sometimes there will be artifacts at the end of the chunked audio samples, doing some sort of splitting based on silence would prob be the better approach
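A minimal sketch of silence-based splitting, assuming the audio is already loaded as a sequence of float samples. It computes RMS energy per frame and returns a split point in the middle of every interior silent run that is long enough. The function name and all parameters (frame_len, threshold, min_silent_frames) are hypothetical and would need tuning per recording; this is not how dswav works.

```python
import math
from typing import List, Sequence


def find_silence_splits(
    samples: Sequence[float],
    frame_len: int,
    threshold: float,
    min_silent_frames: int,
) -> List[int]:
    """Return sample indices at the midpoint of each long interior silence."""
    splits = []
    run_start = None  # first frame index of the current silent run
    n_frames = len(samples) // frame_len
    # One extra iteration acts as a sentinel that closes a trailing run.
    for i in range(n_frames + 1):
        silent = False
        if i < n_frames:
            frame = samples[i * frame_len:(i + 1) * frame_len]
            rms = math.sqrt(sum(s * s for s in frame) / frame_len)
            silent = rms < threshold
        if silent:
            if run_start is None:
                run_start = i
        else:
            long_enough = run_start is not None and i - run_start >= min_silent_frames
            # Leading and trailing silence is not a split point.
            if long_enough and run_start > 0 and i < n_frames:
                splits.append((run_start + i) * frame_len // 2)
            run_start = None
    return splits
```

Cutting at the midpoint of a pause leaves some silence on both sides of each sample, which should reduce the clipped word endings seen with sentence-based splitting.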

@fakerybakery
Contributor Author

Thanks for sharing your tool! Would you mind adding a license to it?

@devidw
Contributor

devidw commented Nov 23, 2023

sure, added @fakerybakery

@devidw
Contributor

devidw commented Nov 25, 2023

One can use jaywalnut310/vits@main/preprocess.py to generate the phonemes.

Hey @yl4579, thx for sharing this. I'm trying to replicate this in order to build custom fine-tune datasets, however, when I use the shared script the output looks different from the training data shared in this repo.

For example, for LJ015-0030, in Data/train_list.txt it's:

ðə bˈæŋk hɐdbɪŋ kəndˈʌktᵻd ˌɔn fˈɔls pɹˈɪnsɪpəlz ;

While when I look up the source text

The bank had been conducted on false principles;

And pipe it into the vits scripts, I get this:

- ðə bˈæŋk hɐdbɪŋ kəndˈʌktᵻd ˌɔn fˈɔls pɹˈɪnsɪpəlz ;
+ ðə bˈæŋk hɐdbɪn kəndˈʌktᵻd ˌɑːn fˈɑːls pɹˈɪnsɪpəlz;

That is with the default arguments, in this case english_cleaners2. With english_cleaners instead, it produces:

- ðə bˈæŋk hɐdbɪŋ kəndˈʌktᵻd ˌɔn fˈɔls pɹˈɪnsɪpəlz ;
+ ðə bæŋk hɐdbɪn kəndʌktᵻd ɑːn fɑːls pɹɪnsɪpəlz

Any idea why the output might look different to the one in the repo?

I guess it's quite important that we exactly match the formatting like you used, when we fine-tune on the shared checkpoints, right?
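To make comparisons like the ones above easier to read, here is a small sketch that diffs two phoneme strings token by token using the standard-library difflib module. The function name phoneme_token_diff is hypothetical. It splits on whitespace only, so a missing space before punctuation shows up as a changed token.

```python
import difflib
from typing import List, Tuple


def phoneme_token_diff(expected: str, actual: str) -> List[Tuple[str, str]]:
    """Return (expected_tokens, actual_tokens) pairs for every differing span."""
    a, b = expected.split(), actual.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs
```

An empty result means the two strings agree token for token.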

@Kreevoz

Kreevoz commented Nov 25, 2023

I can chime in on that. You need to modify the english_cleaners functions if you want to use them as they are. Remove the phonemization step from them and instead split that out and re-use the code from StyleTTS2.

StyleTTS2 does next to no text cleaning before the input texts are sent to the phonemizer. The phonemizer strips out most of the junk by itself (that's generally not what you want because you have little control over that).

This is what you'd need to phonemize correctly, taken from StyleTTS2:

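import phonemizer
from nltk.tokenize import word_tokenize  # requires the NLTK tokenizer data to be downloaded

# Build the espeak backend once and reuse it for every call.
global_phonemizer = phonemizer.backend.EspeakBackend(
    language='en-us', preserve_punctuation=True, with_stress=True)

def text_to_phonemes(text):
    """Phonemize one string the same way the StyleTTS2 inference code does."""
    text = text.strip()
    ps = global_phonemizer.phonemize([text])
    # Tokenizing and re-joining puts a space before punctuation, e.g. "pɹˈɪnsɪpəlz ;".
    ps = word_tokenize(ps[0])
    return ' '.join(ps)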

You then run the english_cleaners function on your input text first, followed by the text_to_phonemes. The result will be identical phonemization to the texts you see in the demo files that ship with StyleTTS2 (though you may find different punctuation if you keep the example cleaners from the other repo. If you want a 100% match, send the text directly to the text_to_phonemes step).

As for having to match the exact same formatting, that is not necessary. As long as you phonemize it in the same way you can preprocess the text differently. I've run a couple of finetunes over the past day and replaced the text cleaning with my own more aggressive implementation which adjusts punctuation to a format that the phonemizer likes and preserves.

It only matters that you have the exact same text cleaners at inference time too. The finetuned model will adapt to the new style of punctuation and formatting. That way it will also shed the habit of ignoring punctuation that the current pretrained models exhibit. That's because the datasets they were trained on don't have pauses in the audio when there is punctuation to indicate there should be. That behavior can be finetuned out of the model again to increase controllability.
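One way to guarantee that, sketched under assumptions: route both dataset generation and inference through a single front-end function, so the cleaner cannot drift between the two. The clean_text rules here (collapse whitespace, append a period when terminal punctuation is missing) are a hypothetical stand-in, not the cleaner described above, and make_frontend is an illustrative name.

```python
import re
from typing import Callable


def clean_text(text: str) -> str:
    """Hypothetical cleaner: normalize whitespace and ensure terminal punctuation."""
    text = re.sub(r"\s+", " ", text.strip())
    if text and text[-1] not in ".!?;:,":
        text += "."
    return text


def make_frontend(phonemize: Callable[[str], str]) -> Callable[[str], str]:
    """Bind the cleaner and phonemizer into one function used everywhere."""
    def frontend(text: str) -> str:
        return phonemize(clean_text(text))
    return frontend
```

The dataset builder and the inference script would then both import the same frontend rather than each repeating the steps.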

@devidw
Contributor

devidw commented Nov 25, 2023

Thanks a ton @Kreevoz 🙌

I just ran the input through the StyleTTS2 phonemes function, and it looks close.

https://gist.github.com/devidw/1bb5cd4d9d524218db22d6b0b10b6712

There is a minor difference in 2 phonemes tho I think, no idea where that might be coming from. Not sure if this could be something due to different version/os builds of espeak

Tested on

speak text-to-speech: 1.48.03 04.Mar.14 Data at: /opt/homebrew/Cellar/espeak/1.48.04_1/share/espeak-data

and

eSpeak text-to-speech: 1.48.03 04.Mar.14 Data at: /usr/lib/x86_64-linux-gnu/espeak-data

However, if the inference code is producing the same as in the custom dataset with the same function this should not be an issue I guess.
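A quick way to test that assumption on a given machine is to re-phonemize some dataset entries with the function the inference code uses and report any that no longer match. This is only a sketch: find_phoneme_drift is a hypothetical name, and it assumes the original transcripts are still available next to the stored phonemes.

```python
from typing import Callable, Iterable, List, Tuple


def find_phoneme_drift(
    entries: Iterable[Tuple[str, str, str]],
    phonemize: Callable[[str], str],
) -> List[Tuple[str, str, str]]:
    """entries are (wav, transcript, stored_phonemes) triples.

    Returns (wav, stored_phonemes, fresh_phonemes) for every mismatch.
    """
    drift = []
    for wav, transcript, stored in entries:
        fresh = phonemize(transcript)
        if fresh != stored:
            drift.append((wav, stored, fresh))
    return drift
```

An empty list suggests the local espeak build reproduces the dataset's phonemes for the checked entries.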

Repository owner locked and limited conversation to collaborators Nov 27, 2023
@yl4579 yl4579 converted this issue into discussion #92 Nov 27, 2023


4 participants