
Alltalkbeta #288

Merged
merged 9 commits into from
Oct 20, 2024

Conversation

IIEleven11

You'll see two scripts: compare_and_merge.py and expand_xtts.py.

I didn't do any integration with AllTalk, so these scripts are capable of running as-is, standalone.

Steps to use:

  1. Run start_finetune and check the "bpe_tokenizer" box to train a new tokenizer during transcription
  2. Begin transcription
  3. When transcription is complete you will have a bpe_tokenizer-vocab.json
  4. Open compare_and_merge.py and fill in the file paths for the base model files and the new vocab.
  5. Run compare_and_merge.py
  6. You now have an expanded_vocab.json.
  7. Open expand_xtts.py and fill in the file paths
  8. Run expand_xtts.py

You now have an expanded base XTTS v2 model, "expanded_model.pth", and its paired "expanded_vocab.json".
Remove the base XTTS v2 model at "/alltalk_tts/models/xtts/xttsv2_2.0.3/model.pth".
Remove the base "vocab.json" at "/alltalk_tts/models/xtts/xttsv2_2.0.3/vocab.json".
Place "expanded_model.pth" and "expanded_vocab.json" where the removed base model/vocab were, at "/alltalk_tts/models/xtts/xttsv2_2.0.3/", and rename them to "model.pth" and "vocab.json".

That's it. You can now begin fine-tuning.
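The file swap described above can be sketched in a few lines. This is a hypothetical helper, not part of the PR's scripts; the path assumes the default AllTalk layout:

```python
import shutil
from pathlib import Path

def install_expanded_model(model_dir: str, expanded_model: str, expanded_vocab: str) -> None:
    """Back up the base model.pth/vocab.json, then put the expanded pair in their place."""
    target = Path(model_dir)  # e.g. "alltalk_tts/models/xtts/xttsv2_2.0.3"
    for name in ("model.pth", "vocab.json"):
        base = target / name
        if base.exists():
            # Keep a backup rather than deleting the base files outright
            shutil.move(str(base), str(base) + ".bak")
    # Rename the expanded files to the names XTTS expects
    shutil.copy(expanded_model, target / "model.pth")
    shutil.copy(expanded_vocab, target / "vocab.json")
```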

You'll find each file commented with more detail about what's going on. finetune.py had an edit I was using to rotate the port: when using an online instance, if I have to end the script, the port can linger blocked, which causes the script to fail until I go in and change the port. Setting a port range fixes that issue. But I removed it, as it's beyond the scope of this specific PR. I can send it in another if that's something you want to implement.

@IIEleven11
Author

Ignore my finetune.py script changes. I reverted them.

So this solution worked with no slurred speech and no accent with the 2.0.2 model. I believe the accent with the 2.0.3 model was inherent to the base model and not specific to this solution.

You'll see a new custom_tokenizer.py. This script needs a txt file that's been run through the extract_dataset_for_tokenizer.py script, which removes the first and third columns from the CSVs. The output will be your new custom dataset's vocab.json. Use this with the compare_and_merge script, then the expand_xtts script, and begin training.
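The column-stripping step can be sketched as follows, assuming LJSpeech-style, pipe-delimited metadata rows (audio_path|text|normalized_text). This is an illustration of the idea, not the actual extract_dataset_for_tokenizer.py; adjust the delimiter and column index if your CSVs differ:

```python
import csv

def extract_text_for_tokenizer(metadata_csv: str, out_txt: str, delimiter: str = "|") -> None:
    """Drop the first (audio path) and third (normalized text) columns,
    keeping only the raw text column for tokenizer training."""
    with open(metadata_csv, encoding="utf-8") as f_in, \
         open(out_txt, "w", encoding="utf-8") as f_out:
        for row in csv.reader(f_in, delimiter=delimiter):
            if len(row) >= 2:
                f_out.write(row[1].strip() + "\n")
```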

As for the 2.0.3 model: it remains an unknown, and I fear it always will, as Coqui has exited the party. So it might be wise to revert the default model from 2.0.3 to 2.0.2.

I had to do a lot of learning here, so I am cautious and open to the possibility I missed something, especially with the creation of the new tokenizer. So if anyone has anything to point out, please do.

@erew123
Owner

erew123 commented Aug 3, 2024

Hi @IIEleven11

Sorry it's taken a while to respond; some days I'm busy elsewhere, and some days I wake up to 10+ messages to deal with before I even get to look at anything.

If I'm interpreting what you've said correctly, it works fine on the 2.0.2 model, but 2.0.3 goes a bit funny. The only differences I knew of with the 2.0.3 model were the 2x new languages they introduced, which I think were Hungarian and Korean: https://docs.coqui.ai/en/latest/models/xtts.html#updates-with-v2

But actually, they added 3x languages. Hindi was added too, but not documented anywhere apart from here (that I ever found): https://huggingface.co/coqui/XTTS-v2#languages

As there is no difference in the training setup that distinguishes the models (that I know of), do you think that means there is something different in the config.json or vocab.json that perhaps makes 2.0.3 funny to train?

Apologies for the questions; I'm just digging into the knowledge you've learned and wondering if I can think of anything that may help solve the puzzle.

That aside, thanks for all your work on this! I will test it soon. :)

@IIEleven11
Author

Yeah so check coqui-ai/TTS#3309 (comment).
They do acknowledge there was some regression there, specifically when adding new languages/speakers.

I am curious what would happen if we removed the non-English tokens from the vocab.json; they take up a very large amount of space. I would think it would allow for more English vocabulary and therefore a better English-speaking model. It will incur many requests asking for multilingual support, though.

The configs and vocabs for each version of the model are different: the 2.0.2 vocab has a smaller size and a smaller embedding layer. So they aren't compatible for inference or training without adjusting the architecture of the model.

There are a couple of other fine-tuning webuis that also default to 2.0.2, Daswer's fine-tuning webui for example.

But yeah, more testing of course. I only used it with a single dataset. I think allowing the community to go at it would be a good solution for now, as we can only really confirm with more testing. We are somewhat working blind with whatever information Coqui left behind.

@erew123
Owner

erew123 commented Aug 4, 2024

I can tell you why we both used the 2.0.2 model at the time of creating the interfaces. The 2.0.3 model had something bad/wrong in its released configuration (or something) that created very, very bad audio. The solution back then was to use 2.0.2; Coqui did resolve 2.0.3 eventually, however it was just easier to stick with 2.0.2 at the time rather than re-code.

@IIEleven11
Author

> I can tell you why we both used the 2.0.2 model at the time of creating the interfaces. The 2.0.3 model had something bad/wrong released in the models configuration (or something) that created very very bad audio. The solution back then was to use 2.0.2 and Coqui did resolve 2.0.3 eventually, however it was just easier to stick on 2.0.2 at the time, rather than re-code.

Ahh, I did see your comment back then, yeah. The accent within the voice could very well have been an error somewhere on my part; I don't want to remove that from the equation.

The 2.0.3 model has pros and cons. I think it has a greater ability to meet a wider range of people's needs than 2.0.2 because it does have a slightly bigger vocab. But this also means its potential is possibly lesser than 2.0.2's.

The big reason I'm hesitant to provide what I did to remove all but the English tokens in the vocab.json is that I am not confident I completely understood all the changes I made. While it did most certainly work, for some of it I just said "that looks right" and moved on. Training models is really complex, and I just want to make sure I'm not providing code that will give someone a harder time due to my ignorance.

@erew123
Owner

erew123 commented Aug 10, 2024

Hi @IIEleven11 Hope you are keeping well. Apologies for not catching up with you; it's been a busy week for me with quite a few requests/issues with lots of things.

Thanks for the updates above. Do you think it's now time for me to merge/test this out?

Thanks

@IIEleven11
Author

> Hi @IIEleven11 Hope you are keeping well. Apologies for not catching up with you, Its been a busy week for me with quite a few requests/issues with lots of things.
>
> Thanks for the updates above, do you think its now time for me to merge/test this out?
>
> Thanks

Yeah, I would really love it if another developer would look into this with me. I've been trying to essentially reverse-engineer Coqui's code and would love another mind to collaborate with.

I have tested it a few more times since then. Adding vocabulary works as expected.

One thing, though: I am trying to add a new special token, which is proving to be a bit more nuanced.

I would guess most users don't try to do this, though, so it shouldn't be a problem for now.
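For anyone curious, adding a special token to the shipped tokenizer file is mechanically simple with the Hugging Face tokenizers library. This is a hypothetical example, not code from this PR, and the model's embedding layers must still be expanded to match afterward:

```python
from tokenizers import Tokenizer

def add_special_token(vocab_path: str, token: str, out_path: str) -> int:
    """Add one special token (e.g. "[whisper]") to an XTTS BPE tokenizer file.

    Returns how many tokens were actually added (0 if it already existed).
    The checkpoint's text embedding/head rows must then grow by the same amount.
    """
    tok = Tokenizer.from_file(vocab_path)  # XTTS ships its BPE tokenizer as vocab.json
    added = tok.add_special_tokens([token])
    tok.save(out_path)
    return added
```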

@IIEleven11
Author

I also saw you were deep into the conversation at one point in some really old commits. Do you know anything about the loss of the ability to prompt-engineer the model between Tortoise and XTTS?

Things like "[joy] it's nice to meet you!" would generate an emotional, joyous sentence. Tortoise can do it. The XTTS v2 paid API could do it. But now we can't.

This is what I've been trying to solve. It would appear they removed this functionality from the open-source versions, and because the Tortoise and XTTS models are nearly identical, I believe we could put the pieces together to get it back.

@erew123
Owner

erew123 commented Aug 11, 2024

Hi @IIEleven11 Spent my morning cleaning up after spilling coffee all over my desk, computer, keyboard, wall, floor, etc. :/ So I lost a few hours of my day where I was hoping to respond properly, look into a few things, etc. How annoying!

Anyway, first off, I found this conversation earlier: coqui-ai/TTS#3704. I wonder if that may be of interest?

As for emotions, I didn't know they HAD implemented them at some point in the past, but it must have been on the roadmap according to this coqui-ai/TTS#3255, and I can see it on the roadmap coqui-ai/TTS#378 as "Implement emotion and style adaptation" in the as-yet-uncompleted "milestones along the way".

To add to all this, eginhard https://github.com/eginhard is currently maintaining TTS and the Coqui scripts. He is not someone who worked for Coqui (as I understand it); he is just passionate about TTS and the Coqui model. He also appears to be doing quite a bit of work on the trainers/finetuning https://github.com/idiap/coqui-ai-TTS/commits/dev/ (yet to be released). I'm not sure how involved he may want to be with another project, but I suspect he knows quite a bit about the trainer and probably knows/has figured out quite a bit about the model. Maybe he might be a good person for us to ask a few questions (should he have time). I suppose we could pose any questions there, if you agree that could be a good path?

@IIEleven11
Author

Awesome! Thanks for the leads. Yeah, that's a good idea.

I did just make a breakthrough, though, that kind of confirms some of my theories.

I trained an XTTS v2 model that can whisper, using a custom special token, "[whisper]". So I think this means we can technically make any special token, including ones for emotions.

The only difference being that Tortoise can just do many emotions, and those tokens are nowhere to be found within its vocab.json, yet it knows exactly how to handle them.

Anyway, my conclusion with this new tokenizer is that if people want to train new vocabulary, they need a significant amount of data. 4 or 5 hours only works partially: the model will lose the ability to generate certain sounds while gaining the ability to say others. This is negated with more data. It looks like somewhere around 15 to 20 hours, give or take, would be more ideal.

@erew123
Owner

erew123 commented Aug 13, 2024

Wow! Training it to emote, that's pretty cool!

Re your conclusion though, that sounds similar to what I read about training an entirely new language into the model without fully training all the other languages at the same time. I imagine you need a hell of a lot of compute to build out a base model for this.

@erew123
Owner

erew123 commented Aug 27, 2024

Hi @IIEleven11 Hope you are well. Apologies again; I'm struggling to get near code/deal with support at the moment. I don't want to air my life on the internet, however for the past few months I have had an ongoing situation that keeps me traveling, away from my own home and computer, providing help/care for a family member.

If you feel this should be merged in, I am happy to do so, as long as you feel it's bug-free. I can give it a run-through when possible and check all works.

If there is anything specific you would like me to look at or help you figure out, please give me a list of items and I will try to do so.

I will get to it as soon as I can.

All the best

@IIEleven11
Author

> Hi @IIEleven11 Hope you are well. Apologies again, Im struggling to get near code/deal with support at the moment. I dont want to air my life on the internet, however for the past few months, I have a ongoing situation that has me traveling+away from my own home and computer, providing help/care for a family member.
>
> If you feel this should be merged in, I am happy to do so, as long as you feel its bug free. I can give it a run through when possible and check all works.
>
> If there is anything specific you would like me to try look at or help you figure, please give me a list of items and I will try to do so.
>
> I will get to it as soon as I can.
>
> All the best

Oh sorry, actually I have an update for it that solves the model losing the ability to speak specific words: we need to freeze the base model, except the embedding layers, prior to training. After I push that, you could merge this, but it isn't integrated into your webui, so anyone who wants to use the process would need to run each script on its own. I could maybe work on integrating it with your code; I don't expect it to be too difficult (famous last words). I am just swamped with clients at the moment and am about to release my own personal project. If I can get to it, though, I will.

@erew123
Owner

erew123 commented Sep 22, 2024

Hi @IIEleven11

Hope you are keeping well! :)

I'm back for a few days before heading off again. Sorry I haven't gotten around to this. Turns out when you go away for a while, there is quite a backlog of things to deal with when you return!

Should I be pulling this merge in now and sending it live?

Thanks

@IIEleven11
Author

Sorry, yeah, I've been busy too. So I have done quite a bit of testing and the results are good. The asterisk, though, is that I did it with English and a single speaker; there will most certainly be nuances when fine-tuning with a different language. Also, I still haven't incorporated it into your interface. It's going to require a little bit of shuffling around and choosing which base model.

But if you do want to merge it and let people who are capable use the scripts as standalones for now, it should be fine. Maybe make a quick note in the UI that this whole process can still be a bit difficult to grasp. I tried to make it as automatic as possible, but the quality of their results is still going to depend on their dataset and how they curated it. I would maybe point them to this video first so they get a grasp of what they're actually doing. https://youtu.be/zduSFxRajkE?si=K2NF8V1wrR_RTfWH

@erew123
Owner

erew123 commented Oct 3, 2024

@IIEleven11 Still not had an opportunity to pull this in, test, etc. I'm still bouncing about like a ping-pong ball with my unwell family situation. What I have at least managed to do (without my main computer) is write a hell of a lot of documentation on the Wiki, to try to keep my requests for information/support down: https://github.com/erew123/alltalk_tts/wiki

I'm intending to pull down your updates on the finetuning and also do a larger section of the wiki on XTTS finetuning (probably mostly pulled from old written content and what's in finetuning, as well as linking to that video you gave above). If there is anything else you think I should include, LMK.

Honestly, sorry, and sorry for not pulling this in yet. It's just a case of getting time to properly test it, and as soon as I'm away for X days, I come back to 20+ emails from people on here (hence deciding it's time to write the wiki). I will get there, promise!!

@Mixomo

Mixomo commented Oct 6, 2024

Hello @IIEleven11 I'm moving my question from #362 to here.

Before proceeding with the question, I have read this thread and saw that you put some instructions at the beginning, and I don't know if they still apply.
On the other hand, I know that progress is being made on the tokenizer part, so no need for a quick reply, I'll just leave my message here so I can keep track of progress and future PRs and merges related to this topic.


My question is not about how it works per se, but to know whether AllTalk actually uses the newly trained BPE tokenizer at inference, or embeds it somehow in the vocab.json or in the weights.

From what I was seeing, at fine-tuning time AllTalk always uses the vocab.json of the base model (original or custom), and if I then manually point inference at the path of the custom BPE vocab, it gives me a mismatch error.
I don't even know if the reasoning I am trying to apply is correct.

Thank you very much in advance.

raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Xtts:
    size mismatch for gpt.text_embedding.weight: copying a param with shape torch.Size([6681, 1024]) from checkpoint, the shape in current model is torch.Size([1431, 1024]).
    size mismatch for gpt.text_head.weight: copying a param with shape torch.Size([6681, 1024]) from checkpoint, the shape in current model is torch.Size([1431, 1024]).
    size mismatch for gpt.text_head.bias: copying a param with shape torch.Size([6681]) from checkpoint, the shape in current model is torch.Size([1431]).


The trained tokenizer:
bpe_tokenizer-vocab.json

The used tokenizer:
vocab.json

P.S:
And it is not because of the file names: bpe_tokenizer-vocab.json or renaming it to vocab.json gives the same error.
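A quick way to see this kind of mismatch before training is to compare the tokenizer's vocab size against the checkpoint's text-embedding rows. This is a hypothetical diagnostic: it assumes the vocab.json is a Hugging Face tokenizers file and that the checkpoint nests its weights under a "model" key, and it ignores added tokens beyond the base BPE vocab for simplicity:

```python
import json
import torch

def vocab_vs_checkpoint_sizes(vocab_path: str, checkpoint_path: str) -> tuple:
    """Return (tokenizer vocab size, checkpoint text-embedding rows).

    A large gap between the two is what produces the size-mismatch
    RuntimeError on gpt.text_embedding.weight / gpt.text_head.weight.
    """
    with open(vocab_path, encoding="utf-8") as f:
        vocab_size = len(json.load(f)["model"]["vocab"])
    sd = torch.load(checkpoint_path, map_location="cpu")
    if "model" in sd:  # XTTS checkpoints nest the weights under "model"
        sd = sd["model"]
    return vocab_size, sd["gpt.text_embedding.weight"].shape[0]
```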

@IIEleven11
Author

> Hello @IIEleven11 I'm moving my question from #362 to here.
>
> Before proceeding with the question, I have read this thread and saw that you put some instructions at the beginning, and I don't know if they still apply.
> On the other hand, I know that progress is being made on the tokenizer part, so no need for a quick reply, I'll just leave my message here so I can keep track of progress and future PRs and merges related to this topic.
>
> My question is not about how it works per se, but to know if indeed all talk uses the BPE tokenizer that has been trained in the inference, or embeds it somehow in the vocab.json or in the weights?
>
> Since from what I was seeing, at the time of fine-tuning, all talk always uses the vocab.json of the base model (original or custom), and if then in the inference I manually point to the path of the vocab bpe custom, it gives me a missmatch error.
> I don't even know if the reasoning I am trying to apply is correct.
>
> Thank you very much in advance.
>
> raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format( RuntimeError: Error(s) in loading state_dict for Xtts: size mismatch for gpt.text_embedding.weight: copying a param with shape torch.Size([6681, 1024]) from checkpoint, the shape in current model is torch.Size([1431, 1024]). size mismatch for gpt.text_head.weight: copying a param with shape torch.Size([6681, 1024]) from checkpoint, the shape in current model is torch.Size([1431, 1024]). size mismatch for gpt.text_head.bias: copying a param with shape torch.Size([6681]) from checkpoint, the shape in current model is torch.Size([1431]).
>
> The trained tokenizer:
> bpe_tokenizer-vocab.json
>
> The used tokenizer:
> vocab.json
>
> P.S:
> And it is not because of the file names: bpe_tokenizer-vocab.json or renaming it to vocab.json gives the same error.

If you clone the branch I used to send the PR, those scripts should work for you.

As for your error: if you used the default process for training the new tokenizer, then the error you got is consistent with what this PR is attempting to fix.

This happens because the base model was not being expanded according to the new vocabulary, which results in the size mismatch you got.

The process is:

  1. Make a new vocab.json.
  2. Merge it with the base model's vocab.json.
  3. Freeze the base model except the embedding layers.
  4. Expand the model's embedding layers using the merged vocab.
  5. Begin fine-tuning with your newly expanded model and its vocab.
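The freeze and expand steps of that process can be sketched in PyTorch. The tensor names come from the size-mismatch error quoted earlier in the thread ("gpt.text_embedding.weight", "gpt.text_head.weight", "gpt.text_head.bias"); the function names and init scale are illustrative assumptions, not the actual expand_xtts.py code:

```python
import torch

def freeze_except_embeddings(model: torch.nn.Module) -> None:
    """Freeze everything except the text embedding and text head,
    so only the (newly added) vocabulary rows get trained."""
    trainable = ("gpt.text_embedding", "gpt.text_head")
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(trainable)

def expand_text_embeddings(state_dict: dict, new_vocab_size: int) -> dict:
    """Grow the embedding/head rows to the merged vocab size: old rows are
    copied over, new rows start randomly initialised and must be learned."""
    for key in ("gpt.text_embedding.weight", "gpt.text_head.weight"):
        old = state_dict[key]
        grown = torch.empty(new_vocab_size, old.shape[1])
        torch.nn.init.normal_(grown, std=0.02)  # assumed init scale
        grown[: old.shape[0]] = old
        state_dict[key] = grown
    bias = state_dict["gpt.text_head.bias"]
    grown_bias = torch.zeros(new_vocab_size)
    grown_bias[: bias.shape[0]] = bias
    state_dict["gpt.text_head.bias"] = grown_bias
    return state_dict
```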

@Mixomo

Mixomo commented Oct 7, 2024

@IIEleven11

UPDATE:
I followed all your instructions, and although both scripts and the training ran without errors, the inference ends up being noise.

bug.all.talk-1.mp4

trainer_0_log.txt

The only thing I can mention is that I modified the scripts so that they can handle UTF-8 files (since the language is Spanish and has accents).

https://gist.github.com/Mixomo/e6a82c6a373ed8a8925cc5eb12176d79

The base model was a custom one dedicated to Spanish, and while I'm not sure which exact version of XTTS v2 it was, I think it's the same; otherwise it wouldn't have let me train, right?

The version of Coqui AI TTS that I have is the new one that came out a few days ago. Maybe that is the reason? Should I go back to the previous version?

What I will do now is train with the original XTTS base model, to see if I get different results.

Thanks

expanded_vocab.json

@Mixomo

Mixomo commented Oct 7, 2024

UPDATE #2:

Training from the original base model worked; however, I notice that speech does not have the same flexibility as when training with the original tokenizer and the Spanish base model, as it skips words and/or syllables.

@IIEleven11
Author

> @IIEleven11
>
> UPDATE: I followed all your instructions, and without giving any errors in both scripts and training, the inference ends up being noise.
>
> bug.all.talk-1.mp4
> trainer_0_log.txt
>
> The only thing I can mention is that I modified the scripts so that they can handle utf-8 files (since the language is Spanish and has accents).
>
> https://gist.github.com/Mixomo/e6a82c6a373ed8a8925cc5eb12176d79
>
> The base model was a custom one dedicated to Spanish, and while I'm not sure what exact version of XTTS V2 it was, I think it's the same, otherwise it wouldn't have let me train, right?
>
> The version of Coqui AI TTS that I have is the new one that came out a few days ago, maybe that is the reason? Should I go back to the previous version?
>
> What I will do now is to train with the original XTTS base model, to see if I get different results.
>
> Thanks
>
> expanded_vocab.json

Yeah, I wouldn't attempt to train a new tokenizer from a model that isn't the base model. It's not impossible, but there would be other nuances you would need to address.

As for the new model you made, I am glad it worked for you, although there were some errors. What I guess is happening is that the model is getting new vocabulary but not enough data to train/learn that vocabulary, which results in what you're hearing. The answer to this problem is just to provide it with a significant amount of training data.

For reference, I trained a model with a special token, [whisper], where I gave it 40 hours of pure whispering. I had attempted it a few times prior with less data and got subpar results: it either had no idea what that token meant or would only work sometimes. So my theory is that you should be giving it somewhere between 30 to 40 hours or more to train on. I understand this is not a small number for the average person, but when we consider it relative to the amount of training data the base model had, and all other models in general, it is actually a very small number.

@erew123
Owner

erew123 commented Oct 14, 2024

@IIEleven11 I don't think I'm going to get a chance to test this for a while, so I'm happy just to pull it in. Obviously @Mixomo has tested it now and it clearly worked, so I'm sure it will be fine for most use cases. I had to put up this statement about my current situation, and I've been firefighting to try to deal with support issues on GitHub when I can.

I want to try to write some finetuning wiki material for people, probably a mix of the existing instructions, the video you linked, and I guess any other detail I should add. I can and have been writing the Wiki https://github.com/erew123/alltalk_tts/wiki as I can do that with just a laptop. @IIEleven11, if you have any thoughts on anything to add, let me know, but I'm going to give things 48 hours to calm down here on GitHub, and then I'm going to merge this in, assuming all is well and quiet again!

Thanks so much again!! :)

@erew123
Owner

erew123 commented Oct 14, 2024

@IIEleven11 Oh, not sure if this makes any sense to you or what you think about it: #368. I've not been able to look at this at all. I'm not suggesting you do anything, but if you have any thoughts on it, I'd be happy to hear them. Thanks

@erew123
Owner

erew123 commented Oct 14, 2024

@IIEleven11 Oh, and maybe this is something that idiap, who maintains the Coqui scripts and base Coqui code, needs to look at, rather than anything in the finetuning here...

@IIEleven11
Author

> @IIEleven11 Oh, not sure if this makes any sense to you or what you think about it #368 Ive not been able to look at this at all. Im not suggesting you do anything, but if you have any thoughts on it, Id be happy to hear them. Thanks

Hope all is well, man. No rush; life is life.

As far as teaching people how to train models: it's always more complex than it appears. The tokenizer video is great. I have another one on overfitting: https://www.youtube.com/watch?v=Gf5DO6br0ts. I've been training/finetuning models for a while now, and if I had to pick the single biggest factor in a quality model, it would be the dataset, by an extremely large margin. All of their time should be spent making sure it's pristine. As in: it's segmented well, has clear/noiseless audio, includes audio that spans the entire phonemic spectrum, has a Gaussian distribution of audio length/text, proper sample rate, etc.

I actually have a repo where I attempt to automate the dataset curation process. At its core it's a bit complex, but the idea is to abstract all of that away: https://github.com/IIEleven11/Automatic-Audio-Dataset-Maker.git. By default it spits out an XTTS v2 dataset format as well as a Hugging Face Hub dataset, so it should work for users right now, out of the box.

As for the Prodigy optimizer: I briefly looked it over, and while it appears to be a drop-in LR option that works with PyTorch, I highly doubt actually implementing it with all of the AllTalk models will be a simple task. It is just a way to automatically adjust the learning rate, and there's no guarantee it will be better than manually adjusting the learning rate or using a scheduler. But if it is a simple drop-in addition/improvement, then sure, why not?

@erew123 erew123 merged commit fbffe0a into erew123:alltalkbeta Oct 20, 2024
@erew123
Owner

erew123 commented Oct 20, 2024

Hi @IIEleven11 I've fiiiiiiiiinally pulled in the PR :) I had a few busy days and a suspicion I may have updated something between your PR and the code base at some point, so I just wanted to check that before pulling it in (it appears I hadn't made any changes).

I had to make 2x small changes and also added a line to make Gradio quiet about the fact there is a new Gradio version to update to, etc. bb314fa

Over the next few days, I'm hoping to get time to digest your suggestions for documentation and hopefully get something written, though I'm going to do a bit of catch-up first with other support requests, package version changes, etc. (probably going to test out PyTorch 2.4 and a couple of other things) and then get the documentation written.

Obviously merging the PR closes it, but I'll catch you back here (or feel free to catch me back here).

I just want to say thanks again for working on this, and thanks for being patient with me taking my time to merge the code in!
