5. Improve the EnCodec speech quality #10
Hi, I am new to vector-quantization-based TTS and VC, and I am planning to use your code for …
Hi, your question arrived just in time :). I just finished training a 3kbps S->A model and I'll push it today. The downside is that training is a bit more than 2x slower and it seems we need 2x more epochs to get good results. I trained it for 15 epochs (1.5x more than the 1.5kbps model) and it took me 15 hours on a single A100. So in the end it has much better audio quality but is not yet as good as the 1.5kbps model in terms of speech legibility (it sometimes mumbles). We could train on 6kbps directly if we reduced the window size to 15s. But before that I would love to check if we can fine-tune the EnCodec bottleneck only on speech, since the SPEAR TTS samples showed great speech quality at merely 1.5kbps. Could you share more about your "prompt based VC" idea? Sounds interesting.
Idea is "Prompt based Any to Any Voice Conversion" as it's Voice to Voice problem we don't required Another interesting use case of this model on low resource languages as Vall-e or Spear-TTS required lots of data to get perfect results but if some language have low amount of text and speaker's data then we can use TTS + Voice conversion to achieve voice cloning. We can use traditional TTS methods to clone any person's voice in target language and then pass that generated audio to VC module to paste target speaker voice over that, we can use Google TTS service as it that supports many languages to get text to speech and then pass that speech to VC module to paste target speaker's expression and voice. |
@jpc Suno just open-sourced a SPEAR-TTS-like model: …
@rishikksh20 Thanks for the explanation. Yeah, that sounds great and we definitely support that with our S->A model. One problem I see with this: I think the semantic tokens have to be trained with speech from the target language, otherwise they may lack the important phoneme representations. I have to test that with my current implementation. Thanks for the suno-ai link. Cool model, and they are a bit ahead of us (and have a better dataset). OTOH we have not scaled our approach yet (I trained 2-layer enc-dec models, 4 layers total, vs. their 12-layer decoder-only models) and we released the full training and data preparation code, not just the final inference model. I love that they verified an idea I also considered – to predict the 6kbps EnCodec tokens from the low-quality token stream (0.75kbps in their case).
@jpc Yes, the semantic tokenizer needs to be trained in a multilingual way, otherwise it will not have good pronunciation. Also, for a speech codec: https://github.com/yangdongchao/AcademiCodec
I trained 3 encoder-only transformers for quality enhancement (82902a6). They work perfectly on the training data, and I can also fine-tune them on high-quality recordings from LibriTTS and they get even better, but they completely fail to transfer to the domain of tokens generated by the autoregressive S2A model. I think if you read the recent Google and Microsoft papers carefully, they notice this train/inference mismatch as well. I plan to look into the SoundStorm architecture: it is quite similar in principle to how I designed my A2A enhancement models, but since it would be single-stage S2A it should solve the issues I am encountering.
Hi @jpc, I have implemented SoundStorm: https://github.com/rishikksh20/SoundStorm-pytorch
@rishikksh20 I would love to collaborate on this part. I have semantic and acoustic tokens extracted for a large part of LibriLight (and the whole of LibriTTS) that we could use for training. I also have 50 semantic tokens per second and 75 acoustic timesteps per second. Previously I used 150 acoustic tokens per second (2 quantization levels serialized), so I just repeated each semantic token 3 times. This is going to be a little bit more tricky. We can probably get away with inserting a dummy padding token between every 2 semantic tokens to align them (see the sketch below).
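A minimal sketch of that 3/2 alignment (the helper name and the `PAD` id are hypothetical, not taken from the repo):

```python
# Stretch a 50 tok/s semantic sequence to 75 steps/s by inserting one PAD
# token after every 2 semantic tokens (2 tokens -> 3 positions, a 3/2 ratio).
import torch

PAD = 1024  # assumed id just past the semantic codebook size

def pad_align_3_2(semantic: torch.Tensor, pad_id: int = PAD) -> torch.Tensor:
    """semantic: (T,) ids at 50 tok/s -> (T * 3 // 2,) ids at 75 steps/s."""
    assert semantic.shape[0] % 2 == 0, "expects an even number of tokens"
    pairs = semantic.view(-1, 2)                       # (T/2, 2)
    pad_col = torch.full((pairs.shape[0], 1), pad_id)  # one PAD per pair
    return torch.cat([pairs, pad_col], dim=1).reshape(-1)

print(pad_align_3_2(torch.arange(50)).shape)  # torch.Size([75]), i.e. 1 second
```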
Yeah, it sounds a bit complicated to me. I think 50 semantic tokens per second would be fine, but for acoustic tokens we can use … What are your thoughts on that?
@jpc I have completed my implementation of SoundStorm. Now we can directly drop SoundStorm into this repo; the only problem that I think remains is the sampling method.
@rishikksh20 Hey, I was thinking about how we could collaborate on this. Two weeks ago I did try to train SoundStorm but had trouble with the @lucidrains implementation, and yours wasn't finished yet. I have semantic and acoustic tokens (EnCodec, 8 quantizers) for 1300 hours of a single speaker extracted from LibriLight. Do you think it would make sense for you to try and train your SoundStorm implementation on that? I also have an 8k-hour multi-speaker subset and I am working on processing the rest of the LibriLight dataset.
Regarding the sampling – I ran some tests with a simple auto-regressive model, and this 3/2 padding method (insert one padding token between every two source tokens) worked quite well.
Hi @jpc, hope you are doing well! I have completed the implementation of SoundStorm and created a dataloader for 50 toks/s to 100 toks/s. I can also adopt your suggestion of 50 toks/s to 75 toks/s, but I need to test first whether my code is working or not.
I was thinking about uploading the dataset to Huggingface tomorrow. Would that work for you? |
yeah sure |
@jpc have you uploaded the data?
@rishikksh20 Hey, I am working on it right now. I got a bit sick last week and also underestimated the time it takes to clean, test, and compress/upload the whole thing. I ended up having almost 1TB of uncompressed EnCodec tokens (6kbps). I should finish it today. Do you have a Discord account or some other app where we could sync?
My Discord username is …
Hey, this was solved by combining several approaches: using the Vocos vocoder, training on data with 4 EnCodec quantizers, and implementing MusicGen-like time-shifting to cut the sequence length. Overall this made the voice quality very nice.
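For reference, a minimal sketch of what a MusicGen-style delay ("time-shifting") pattern looks like; the `PAD` id and helper name here are hypothetical illustrations, not the repo's actual code:

```python
# Codebook k is shifted right by k frames, so all K quantizer streams can be
# predicted in lock-step while the sequence length stays at T + K - 1 instead
# of the T * K of a fully serialized layout.
import torch

PAD = 1024  # assumed id outside the acoustic codebook range

def delay_pattern(codes: torch.Tensor, pad_id: int = PAD) -> torch.Tensor:
    """codes: (K, T) EnCodec tokens -> (K, T + K - 1) delayed tokens."""
    K, T = codes.shape
    out = torch.full((K, T + K - 1), pad_id, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]  # stream k starts k frames later
    return out

codes = torch.randint(0, 1024, (4, 75))  # 4 quantizers, 1 second at 75 Hz
print(delay_pattern(codes).shape)        # torch.Size([4, 78])
```

With 4 quantizers this keeps the modeled length at roughly T instead of 4T, which is where the sequence-length saving comes from.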
Right now the EnCodec speech quality at 1.5kbps is pretty terrible (far from what Google shows for their SoundStream-based codec). I am pretty sure the problem is caused by EnCodec being a universal sound codec, because the official samples for SoundStream at 1.5kbps sound quite similar (Lyra-v2 sounds even worse than that). That's why I suspect SPEAR TTS is based on an unreleased speech-only codec.
Since EnCodec has multi-rate capability, the overall model knows how to represent high-quality speech. The pretty good results we had for compressing the Whisper embeddings suggest we might get away with retraining just the quantization layer to reprioritize the bandwidth allocation and improve speech quality (at the cost of ignoring music and other audio).
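A hedged sketch of what that quantizer-only fine-tune could look like, assuming the `encodec` package from facebookresearch/encodec (speech-only dataset, losses, and training loop omitted):

```python
# Sketch only: freeze EnCodec's universal encoder and decoder so that only
# the quantization stage is adapted to speech at the 1.5kbps target.
from encodec import EncodecModel

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(1.5)  # kbps

for module in (model.encoder, model.decoder):
    for p in module.parameters():
        p.requires_grad = False

# Note: EnCodec's residual VQ codebooks are EMA-updated rather than trained by
# backprop, so "retraining the quantizer" (model.quantizer) in practice means
# re-estimating its codebooks from frozen-encoder latents of speech-only
# batches, plus the usual commitment loss.
```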