This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
Ability to Read Longer Audio (ie Audiobooks) #54
Comments
I think so far it’s not very good at narrating an entire audiobook because the training data doesn’t contain entire audiobooks. The training data consists purely of independent clips taken from amateur audiobook readings, rather than whole audiobooks. It won’t be like ElevenLabs, which is trained on professional audiobook datasets, as such data is usually not in the public domain. However, if we did have the data, training could easily be changed to use it, by conditioning on the previous style to sample the current style. This would probably reproduce the effect of ElevenLabs, especially for dialogue. The closest dataset to an entire audiobook is LJSpeech, but it’s entirely non-fiction, so it won’t be good for fiction reading (no dialogue), and it might produce unnatural intonation because each clip was treated independently during training. |
Hmm. Thanks. LibriVox seems like a good place to get public domain audiobooks. Are there any plans to add this capability in the future? |
LibriTTS is already taken from LibriVox, but for some reason it doesn’t contain complete audiobook narrations, only very fragmented clips taken from them. I don’t know why they removed a lot of clips. |
I feel like the quality would be lower if you trained it on an entire audiobook, right? I don't know, I guess it just feels like the longer the samples are the worse it will be (I might be wrong). Maybe we can use Tortoise TTS's splitting script with this? However, if it's possible to train a TTS model on long text without degrading quality, it shouldn't be too hard to write a script to scrape LibriVox based on readers (they have an API). I was able to make this dataset a while back using their API, but I didn't include readers at that time. |
No, we do have to train on audio clips, but the idea is that we condition the current style sampling on the previous text and style, so it will be more continuous and possibly also handle dialogue better (if the audio clips are split according to dialogue). It won’t work if we train on entire audio clips because we don’t have enough RAM. |
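As a rough illustration of the "condition on the previous style" idea, here is a minimal sketch that carries a style vector from one clip to the next and blends it with a freshly sampled one. This is an inference-time approximation, not the training change described above. `sample_style`, `blend`, `styles_for_clips`, and the `alpha` weight are all hypothetical names invented for this example; a real model would produce the style from the text and a reference.

```python
import random

def sample_style(text, rng, dim=4):
    """Hypothetical stand-in for a model's style sampler.

    A real sampler would depend on `text`; this one just returns a
    random vector so the example runs without a model.
    """
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

def blend(prev, cur, alpha):
    """Weighted average: alpha keeps the previous style, (1 - alpha) the new one."""
    return [alpha * p + (1 - alpha) * c for p, c in zip(prev, cur)]

def styles_for_clips(clips, alpha=0.7, dim=4, seed=0):
    """Return one style vector per clip, each conditioned on the one before it."""
    rng = random.Random(seed)
    prev = None
    styles = []
    for text in clips:
        cur = sample_style(text, rng, dim)
        if prev is not None:
            # Pull the new style toward the previous clip's style for continuity.
            cur = blend(prev, cur, alpha)
        styles.append(cur)
        prev = cur
    return styles

styles = styles_for_clips(["First sentence.", "Second sentence.", "Third sentence."])
```

With `alpha=0` every clip is styled independently (the current behaviour described in this thread); higher values make consecutive clips sound more alike.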
Hmm interesting! Are you planning to implement something like this in the future? |
Yeah, probably, but I don't think it'll be that simple. If the effort is more than trivial concatenation it could be a different project or paper, but for now the difference probably won't be big enough on the LibriTTS dataset because there is no dialogue. It's more useful if we can get some fiction audiobook datasets that are separated by characters. |
Hmm. Hypothetically, if there was a long audiobook dataset available, how difficult do you think it would be to implement? |
I implemented a basic long-text reader on the online demo by splitting text, but it isn't perfect yet. (update: I removed it because someone said it made it harder to clone with Docker) |
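For readers wondering what "splitting text" could look like, here is a minimal sketch that groups sentences into chunks under a character limit. It is not the code from the demo; `split_text` and `max_chars` are hypothetical names, and the sentence regex is a simple assumption that will mis-split abbreviations such as "Mr.".

```python
import re

def split_text(text, max_chars=200):
    """Group sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept whole as its own chunk.
    """
    # Split after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

chunks = split_text("One. Two. Three.", max_chars=9)
```

Each chunk can then be synthesized separately and the audio concatenated, which is why the result "isn't perfect": the chunks are still styled independently.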
I am fine with removing the long-text option, because I think that it should be a default setting in every task. |
Hi,
Might it be possible to implement a tqdm progress bar for longer text? This would make it possible to easily narrate entire audiobooks!
Thanks!
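A minimal sketch of what this request could look like, assuming the long text has already been split into chunks. `synthesize` is a hypothetical placeholder for the model's inference call, and `narrate` is a name invented for this example; only the `tqdm(iterable, desc=..., unit=...)` call is real tqdm API.

```python
try:
    from tqdm import tqdm
except ImportError:
    # Fall back to a plain iterator if tqdm is not installed.
    def tqdm(iterable, **kwargs):
        return iterable

def synthesize(chunk):
    """Hypothetical placeholder for the model's inference call.

    Returns one fake audio sample per character so the example runs
    without a model.
    """
    return [0.0] * len(chunk)

def narrate(chunks):
    """Synthesize each chunk in order, showing progress, and join the audio."""
    audio = []
    for chunk in tqdm(chunks, desc="Synthesizing", unit="chunk"):
        audio.extend(synthesize(chunk))
    return audio

audio = narrate(["First chunk of text.", "Second chunk of text."])
```

The progress bar advances once per chunk, so its granularity depends on how finely the text is split.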