Accent classification training and inference script #2
Hey @MilanaShhanukova, the task is still really relevant! At the moment, we're mainly looking to reproduce the accent classifier from the paper we took inspiration from, in section 3.1.1. The accent classifier in the paper uses EdAcc, VCTK, and the English-accented subset of VoxPopuli! In terms of base model, they're using some kind of MMS-LID such as this one. Let me know if you'd like to tackle this! I'll probably dedicate some time over this week!
@ylacombe I would like to work on this. I'm also interested in SNR and pitch descriptions; let me know if you need help with this.
@ylacombe Hi, I've analysed the VCTK, EdAcc and Common Accents datasets. In total, there are 63 accents across them. To make an efficient classifier, we should match the overlapping classes: if no matching between classes is introduced, we end up with a growing number of near-duplicate classes. I suggest using the MMS language coverage; however, not all accents can be grouped in such a way. Please find two JSON files attached to this comment. raw2clean_accents.json is the file where I tried to group the present classes; in total, there are 31 classes, excluding the "I don't know" accent. However, there are some problems regarding this grouping.
language_coverage.json In addition, what do you think of a pipeline that uses the embeddings of the accent or language group as input to a simple classifier with a few feed-forward (FF) layers? It could be the first pipeline.
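A minimal sketch of the grouping step described above, assuming a raw-to-clean mapping in the spirit of the attached raw2clean_accents.json (the specific labels and merged class names below are hypothetical examples, not taken from the actual file):

```python
# Hypothetical excerpt of a raw2clean mapping: raw accent labels from
# VCTK / EdAcc / Common Accents collapsed into merged classes.
raw2clean = {
    "Scottish": "scotland",
    "Edinburgh": "scotland",
    "English": "england",
    "British": "england",
    "NewZealand": "new_zealand",
}

def clean_accent(raw_label, mapping):
    """Map a raw accent label to its merged class.

    Returns None for labels outside the mapping (e.g. the
    "I don't know" accent), so those samples can be dropped.
    """
    return mapping.get(raw_label)

labels = ["Scottish", "British", "I don't know"]
cleaned = [clean_accent(label, raw2clean) for label in labels]
print(cleaned)  # -> ['scotland', 'england', None]
```

Dropping unmapped labels (rather than keeping them as their own classes) is what keeps the class count at a manageable 31.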
Hey @MilanaShhanukova, thanks for the effort here! I think it's only fair to group the Scotland accents together, and to do the same with English and British. Note that we probably don't need to classify the accent of every sample in the ultimate pipeline, as we can have a mix of descriptions with and without accent. In terms of training, it sounds good to me. Will you use MMS-LID or something similar, with FFN layers on top of it to classify?
Hey @MilanaShhanukova, any update on this?
Hi @ylacombe! Sorry for the late answer, I got a bit busy last week. I added the VoxPopuli dataset and started to train the model with an IterableDataset; however, given the amount of data in the VCTK dataset this is not ideal, as full shuffling cannot be applied on the fly. I'm uploading the resampled datasets so they are faster to load.
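For context on the shuffling limitation mentioned above: streaming datasets can only do approximate, buffer-based shuffling (this is what `datasets.IterableDataset.shuffle(buffer_size=...)` does under the hood). A dependency-free sketch of the idea:

```python
import random

def buffered_shuffle(stream, buffer_size, seed=0):
    """Approximate shuffling for a streamed dataset: keep a fixed-size
    buffer and, once it is full, swap each incoming item with a random
    buffered one. Items never move further than the buffer allows, so
    this is weaker than a full shuffle over the whole dataset.
    """
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)
        else:
            idx = rng.randrange(buffer_size)
            # Yield a random buffered item, store the new one in its place.
            yield buffer[idx]
            buffer[idx] = item
    # Flush the remaining buffer in random order.
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=42))
print(sorted(shuffled) == list(range(10)))  # -> True
```

With a dataset like VCTK, where many consecutive samples come from the same speaker, a small buffer leaves batches speaker-correlated, which is why on-the-fly shuffling alone may not be enough.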
Hey @MilanaShhanukova, thanks for the update, let me know if I can help!
Hey @MilanaShhanukova, any luck in training the accent classifier? What test set are you using?
Hi, you could also use the CommonVoice dataset. The metadata includes information about the speaker's region, which likely correlates strongly with their accent.
@ylacombe
No worries @MilanaShhanukova, let me know when you have the validation ready!
@ylacombe Just an update: at this point the model shows good quality when distinguishing UK and US accents, but worse for Indian and others. I'm working on optimisations now and hope to have a better model by the end of the week. In addition, I'm now training a model in a contrastive mode, as the embeddings might also be useful in the future. If you have any ideas about it, let me know.
Hey @MilanaShhanukova, how did you compute the accent list? I've made some good progress myself. I've been re-using an accent classifier script we had been working on a few months ago (without success at the time). As you can see, this uses the EdAcc, VCTK, and English-accented VoxPopuli accent lists, which we had updated with Common Voice accents. Using the script above, I've managed to reach 80% accuracy, using 15% of the dataset as an evaluation set. You can find the training logs here. What do you think of uniting our efforts here?
@ylacombe Looks cool! May I ask which unfrozen layers give you the best results for this model? I currently update all projection, classifier, adapter, layer_norm, feature_projection and pos_conv_embed layers. It seems like your model will get better results overall; is there anything I can help with?
Thanks! I've only frozen the feature encoder part (at the very beginning)! Well, maybe you have a better accent-cleaning approach, or more data, or a better base model to train! Let me know if you think of something, and don't hesitate to share your approach!
Freezing only the feature encoder seems best, actually. One thing I noticed during training is that speaker identity and accent correlate a lot. Previously, I was also using the same speakers for validation; however, this gives unrepresentative results, so for testing I decided to take new speakers from the dataset https://www.kaggle.com/datasets/rtatman/speech-accent-archive/data. I assume it was formed from https://accent.gmu.edu/howto.php. Some audio files have background noise, but the overall quality is good. It includes more speakers from countries similar to those in VCTK. What dataset do you use for testing and evaluation now? Have you checked whether the same problem with speaker identity is present?
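To avoid the speaker-identity leakage described above, the train/validation split has to be made over speakers rather than over samples. A minimal sketch, assuming each sample carries a `"speaker"` field (the field name and the toy data are hypothetical):

```python
import random

def speaker_disjoint_split(samples, val_fraction=0.2, seed=0):
    """Split samples so that no speaker appears in both train and
    validation, preventing the classifier from scoring well by
    memorising speaker identity instead of accent.
    """
    speakers = sorted({s["speaker"] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(speakers)
    n_val = max(1, int(len(speakers) * val_fraction))
    val_speakers = set(speakers[:n_val])
    train = [s for s in samples if s["speaker"] not in val_speakers]
    val = [s for s in samples if s["speaker"] in val_speakers]
    return train, val

# Toy data: 20 samples from 5 speakers.
samples = [{"speaker": f"spk{i % 5}", "accent": "x"} for i in range(20)]
train, val = speaker_disjoint_split(samples)
train_spk = {s["speaker"] for s in train}
val_spk = {s["speaker"] for s in val}
print(train_spk & val_spk)  # -> set()
```

The same effect can be obtained with scikit-learn's `GroupShuffleSplit`, using the speaker ID as the group key.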
Hey @MilanaShhanukova, nice catch, let me check whether this changes something (it surely will, unfortunately).
Indeed, performance is much lower without speaker leakage! However, I also realised that if you limit the number of samples per speaker (say, 50 samples per speaker, still without speaker leakage), performance is much better, while still not being acceptable yet. Maybe you can try limiting the number of samples per speaker with your own training script?
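Capping samples per speaker, as suggested above, can be a simple one-pass filter; this keeps a few prolific speakers (VCTK records a lot of utterances per speaker) from dominating the accent classes. A sketch, again assuming a `"speaker"` field on each sample:

```python
from collections import Counter

def cap_per_speaker(samples, max_per_speaker=50):
    """Keep at most max_per_speaker samples for each speaker,
    preserving the original order of the remaining samples."""
    counts = Counter()
    kept = []
    for s in samples:
        if counts[s["speaker"]] < max_per_speaker:
            counts[s["speaker"]] += 1
            kept.append(s)
    return kept

# Toy data: speaker "a" is heavily over-represented.
samples = [{"speaker": "a"}] * 120 + [{"speaker": "b"}] * 30
capped = cap_per_speaker(samples, max_per_speaker=50)
print(Counter(s["speaker"] for s in capped))  # -> Counter({'a': 50, 'b': 30})
```

Taking the first N samples per speaker is the simplest policy; sampling N at random per speaker would avoid any ordering bias in the source dataset.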
BTW, the model doesn't have to be perfect; we'll probably use it only for speakers on which the model is really confident.
Hi, may I ask whether this task is still relevant? If so, what dataset and model should be used for accent classification? I would like to work on this task if possible.