Figure out the best training hyperparameters #1
experiment 2

For my 2nd experiment (the first one being the one on the README page), I:
Both of these moves appear to have been mistakes. My mixed dataset was highly imbalanced, with >70% of the speech coming from a single narrator (in a dataset of 100 speakers); this caused all voice outputs to be severely biased towards the most common speaker. I also observed much more noise in the resultant outputs, which might be due to the dataset, the higher learning rate, or the lack of other model fine-tunes. Might commit results later, but my conclusions here are:
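As a quick sanity check for that kind of imbalance, something like the sketch below works. It assumes a hypothetical LJSpeech-style train.txt where each line is path|transcript and the speaker name is the first folder in the clip path; adjust the parsing to your own layout.

```python
# Quick check for the kind of speaker imbalance described above.
# Assumes a hypothetical "wavs/<speaker>/<clip>.wav|transcript" metadata layout.
from collections import Counter

counts = Counter()
with open("train.txt", encoding="utf-8") as f:
    for line in f:
        path = line.split("|", 1)[0]
        speaker = path.split("/")[1]  # second path component = speaker folder
        counts[speaker] += 1

total = sum(counts.values())
for speaker, n in counts.most_common(10):
    print(f"{speaker:20s} {n:6d} clips  ({n / total:5.1%} of the dataset)")
```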
experiment 3

This was the one where I first used the Colab notebook. It went pretty well, which was surprising because the dataset had <200 samples. However, this only really worked because I manually adjusted a whole bunch of parameters down. That led me to develop automatic calculations for some parameters based on the dataset.

experiment 4

This was just a redo of the previous experiment with the new automatic parameter system. Worked well enough.
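The exact automatic calculations aren't written down in this thread, so the sketch below only illustrates the general idea of deriving parameters from dataset size; the heuristics (batch size as a fraction of the dataset, niter from a target epoch count, lr steps at even fractions of niter) are assumptions, not the formulas actually used.

```python
# Illustrative (assumed) heuristics for scaling training parameters to a small
# dataset; the real automatic calculation mentioned above may differ.
def suggest_params(n_train_samples: int, target_epochs: int = 20):
    # Keep the batch a small fraction of the dataset so one epoch is several steps.
    batch_size = max(4, min(64, n_train_samples // 8))
    steps_per_epoch = max(1, n_train_samples // batch_size)
    niter = steps_per_epoch * target_epochs
    # Step the learning rate down at 25/50/75% of training.
    gen_lr_steps = [int(niter * f) for f in (0.25, 0.5, 0.75)]
    return {"batch_size": batch_size, "niter": niter, "gen_lr_steps": gen_lr_steps}

print(suggest_params(180))  # {'batch_size': 22, 'niter': 160, 'gen_lr_steps': [40, 80, 120]}
```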
experiment 5

This was my 2nd attempt at a multispeaker training session. This time, I capped samples for every character at a maximum of 1000 lines (in the training DS). I learned a few things:
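For reference, the per-speaker cap described in experiment 5 only takes a few lines; this sketch assumes the same hypothetical path|transcript metadata layout as the imbalance check above.

```python
# Cap each speaker at a maximum number of lines in the training metadata.
# Assumes the same hypothetical "wavs/<speaker>/<clip>.wav|transcript" layout as above.
from collections import defaultdict

MAX_LINES_PER_SPEAKER = 1000
kept, seen = [], defaultdict(int)

with open("train.txt", encoding="utf-8") as f:
    for line in f:
        speaker = line.split("|", 1)[0].split("/")[1]
        if seen[speaker] < MAX_LINES_PER_SPEAKER:
            seen[speaker] += 1
            kept.append(line)

with open("train_capped.txt", "w", encoding="utf-8") as f:
    f.writelines(kept)
```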
experiment 6

Testing on a different dataset this time. Single speaker, female, emotional, fairly large dataset with maybe 1-2k samples. Now that I've gotten the validation metrics working, I can use those as graphs:

This was a disastrous outcome, and the voices were all garbled when I tested them. I don't know why; maybe the speaker is too different. I didn't change anything about the training process.
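If the validation metrics are being written to TensorBoard, the curves can also be pulled out programmatically for comparison across runs. This is a sketch under the assumption that the trainer logs a scalar tag like val_loss_text_ce to an event directory; the actual tag names and log path depend on your run.

```python
# Read a validation curve out of TensorBoard event files for plotting/comparison.
# The log directory and the scalar tag name below are assumptions; adjust them
# to whatever your training run actually produces.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

ea = EventAccumulator("experiments/my_run/tb_logger")  # hypothetical path
ea.Reload()
print(ea.Tags()["scalars"])  # list the scalar tags that actually exist

for event in ea.Scalars("val_loss_text_ce"):
    print(event.step, event.value)
```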
I've been running tests on small datasets (15 total samples) and I notice the result sounds stepped, like he's talking through a broken speaker. Even weirder, the higher the preset you go, the more it clears up and sounds good. I tried various combinations and believe it's mainly influenced by the autoregressive sample amount; not sure why it's getting this sort of effect.

Comparison (all are using the same seed and the same candidate is being compared):
Standard Preset - fine-tuned:
Standard Preset - Original:
Ultra-fast - fine-tuned (notice the hiss and stepping):
Ultra-fast - Original:
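For anyone wanting to reproduce this kind of preset comparison, here is a rough sketch using the upstream tortoise API (parameter names may differ slightly in this fork, and "myvoice" is a placeholder voice directory):

```python
# Compare presets on the same text/voice with a fixed seed, roughly mirroring the
# comparison above. Uses the upstream tortoise API; this fork may differ.
import torch
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

tts = TextToSpeech()
voice_samples, conditioning_latents = load_voice("myvoice")  # placeholder voice
text = "This is a preset comparison sentence."

for preset in ("ultra_fast", "standard"):
    torch.manual_seed(0)  # same seed for every preset
    gen = tts.tts_with_preset(
        text,
        voice_samples=voice_samples,
        conditioning_latents=conditioning_latents,
        preset=preset,
        k=1,  # compare the same single candidate
    )
    torchaudio.save(f"compare_{preset}.wav", gen.squeeze(0).cpu(), 24000)
```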
Training curves && params would be good. It probably overfit on the small amount of data included, which could be mitigated if I manage to fix the conditioning latents problem.
I wonder if this can be further fixed using AudioLDM once they release their audio super-resolution; voicefixer completely destroys the speech.
Regarding the diffusion model: I talked to the dev who wrote on Reddit that he retrained the VQVAE, and he said he didn't retrain the diffusion model at all.
btw, check out this thread where neonbjb discusses the gpt training
I'm aware of the cheater latents problem; I discuss the problems with fixing it here, but thanks for the link nonetheless.
I haven't checked it out; I'll go do that later.
Did he mean to recreate the VQVAE from scratch, or to fine-tune it?
I'm not sure, tbh. These configs were shared by neon in some random discussion a while back; they're different from the ones in the original DL repo. Perhaps they could help make sense of how he trained his GPT.
These are all very interesting... they look like the exact configs he used to train the actual tortoise model. This is the first time I've seen the real filepaths to his larger ocotillo-transcribed dataset. I can already see some errors I made regarding the diffusion model trainer, like layer drop or lr decay. This is good. Where was it from?
Regarding finding good hyperparameters, I think this might be useful.
I ran a bunch of experiments reducing lr (with its helpful bold comment "you should experiment with this value"). Reducing it seems to resolve the "stepped, like he's talking through a broken speaker kind of thing, even weirder, the higher the preset you go the more it clears up" situation. I found that values between 1e-7 and 5e-8 worked best (kinda hard to tell within that range which is best), avoiding both the unsmooth robot-like tonality of zero-shot (i.e., the original model) and the stepped sound of 1e-5. I'm using ~180 samples, a .85/.15 train/validate split, niter (I'm assuming this is "number of iterations" and synonymous with "steps") of 1800 so ~12 epochs, and then gen_lr_steps of [462, 924, 1386, 1618], so stepping down the lr every four epochs. At least, that's what I think I'm doing anyway (not an ML genius), and it sounds pretty good. I'm training on a pretty normal voice that isn't that far off of the libritts-ish voices, so it may not need as much training as other voices would. The thing that still plagues me is issue 237 in the original tortoise repo: repeats (so, an inference issue, not hparams). Posting on that in #61 to keep topics clean.
Sorry, ignore the last comment. I hadn't comprehended steps well enough (i.e., "one batch of batch_size is 1 unit/step" in the example yml). I had a batch size of 77 (so two batches per epoch with 154 training samples), so 1800 steps was hundreds of epochs. Interesting to experiment with low learning rate and lots of iterations, I guess; nothing good enough to recommend. Works much better with lr 1e-5 and 5 to 8 epochs. Are there units to the y-axis val_loss_text_ce? Is that just an arbitrary loss function? Trying to figure out if one can infer anything from the difference between it converging on, say, 1.31 in experiment 6 here versus on 4.4 here in one of my recent experiments (or other future graphs), or if it is just more about the shape of the curve.
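Spelling out that correction with the numbers quoted above:

```python
# Steps <-> epochs arithmetic from the numbers quoted above:
# 154 training samples, batch size 77, niter (total steps) 1800.
n_train, batch_size, niter = 154, 77, 1800

steps_per_epoch = n_train // batch_size   # 2 steps (batches) per epoch
epochs = niter / steps_per_epoch          # 900.0 epochs, far more than the ~12 first assumed
print(steps_per_epoch, epochs)
```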
Are you changing any temperature or top_p when using tortoise fast? So a lower learning rate works better?
Caveat that I'm just a hobbyist here, so my theoretical conceptions of these things are at an "I read a blog post about them" level. But I can report I have done experiments and can't discern any meaningful difference when moving the temperature or top_p dials (from .5 to .95 in each case). Or repetition_penalty or length_penalty, for that matter -- nothing. At first I thought low top_p made the sound more "boring" (less prosody), but listening again now I think maybe that's just a bias from having read the docs, which say that's what it does.
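For completeness, this is roughly how such a sweep can be run against the lower-level tts() call, which also exposes num_autoregressive_samples directly; parameter names follow the upstream tortoise API and may differ in this fork, and "myvoice" is again a placeholder.

```python
# Sweep sampling parameters on a fixed seed to hear their effect (or lack of it).
import itertools
import torch
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

tts = TextToSpeech()
voice_samples, conditioning_latents = load_voice("myvoice")  # placeholder voice
text = "A quick brown fox for parameter sweeps."

for temperature, top_p in itertools.product((0.5, 0.95), (0.5, 0.95)):
    torch.manual_seed(0)  # keep the seed fixed across settings
    gen = tts.tts(
        text,
        voice_samples=voice_samples,
        conditioning_latents=conditioning_latents,
        num_autoregressive_samples=96,
        temperature=temperature,
        top_p=top_p,
        repetition_penalty=2.0,
        length_penalty=1.0,
        k=1,
    )
    torchaudio.save(f"sweep_t{temperature}_p{top_p}.wav", gen.squeeze(0).cpu(), 24000)
```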
The numbers written in ./experiments/EXAMPLE_gpt.yml were picked completely at random! It is very likely the numbers can be better, so long as people are willing to test and see what works. Please post results here if you change any of the parameters, even if it completely fails!
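If it helps with reporting, the sketch below pulls out the parameters that keep coming up in this thread (niter, gen_lr_steps, lr, batch_size) from the config without assuming how EXAMPLE_gpt.yml nests them.

```python
# Print the commonly-tuned fields from the training config so results posted
# here can include the exact values used. Only descends into dict sections.
import yaml  # pip install pyyaml

with open("./experiments/EXAMPLE_gpt.yml", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

INTERESTING = ("niter", "gen_lr_steps", "lr", "batch_size")

def walk(node, prefix=""):
    """Recursively print any config keys matching the names discussed above."""
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}{key}"
            if key in INTERESTING:
                print(f"{path}: {value}")
            walk(value, path + ".")

walk(cfg)
```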