
Accent classification training and inference script #2

Open
MilanaShhanukova opened this issue Apr 14, 2024 · 19 comments

@MilanaShhanukova

Hi, may I ask whether this task is still relevant? If so, what dataset and model should be used for the accent classification? I would like to work on this task if it is possible.

@ylacombe
Collaborator

Hey @MilanaShhanukova, the task is still really relevant!

At the moment, we're mainly looking to reproduce the accent classifier from the paper that inspired us, described in its section 3.1.1. The accent classifier in the paper uses EdAcc, VCTK, and the English-accented subset of VoxPopuli!

In terms of base model, they're using some kind of MMS-LID model, such as this one.

Let me know if you'd like to tackle this! I'll probably dedicate some time to it this week!

@MilanaShhanukova
Author

@ylacombe I would like to work on this.

I'm also interested in SNR and pitch descriptions; let me know if you need help with those.

@MilanaShhanukova
Author

@ylacombe Hi, I've analysed the VCTK, EdAcc, and Common Accents datasets. In total, they contain 63 accents. To build an efficient classifier, we should merge overlapping classes; without such a mapping, we'd end up with a growing number of near-duplicate classes. I suggest using the MMS language coverage, although not all accents can be grouped that way. Please find two JSON files attached to this comment. raw2clean_accents.json is the file where I tried to group the existing classes; it yields 31 classes, excluding the "I don't know" accent. However, this grouping raises a few questions:

  1. Scotland - Most of the datasets do not distinguish the type of Scottish accent (Gaelic or Scots), so these can be grouped into one Scottish accent, as they belong to the same language group.
  2. English and British - Both "English" and "British" accents are present; I assume they can be grouped together.
  3. European - Many individual European accents are present in the datasets, but there is also a large catch-all class called "European" (covering both Eastern and non-Eastern European speakers), which likely overlaps with the other classes.

language_coverage.json
raw2clean_accents.json
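The grouping above boils down to a raw-to-clean label mapping. A minimal sketch, with a handful of illustrative entries (the real mapping lives in the attached raw2clean_accents.json; the keys below are examples, not the full file):

```python
# Illustrative raw->clean accent mapping; the entries here are assumptions
# based on the discussion, not the actual contents of raw2clean_accents.json.
raw2clean = {
    "Scottish Gaelic": "Scottish",
    "Scots": "Scottish",
    "British": "English",
    "English": "English",
}

def clean_accent(raw_label, mapping):
    # Fall back to the raw label if no grouping is defined for it.
    return mapping.get(raw_label, raw_label)

print(clean_accent("British", raw2clean))  # English
print(clean_accent("Dutch", raw2clean))    # Dutch (passes through unchanged)
```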

In addition, what do you think of a pipeline that simply takes the embeddings of the accent or language group and feeds them into a simple classifier with a few feed-forward layers? It could be the first pipeline.
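The proposed pipeline could be sketched like this: pooled audio embeddings fed into a small feed-forward head. The embedding dimension and the 31-class output are assumptions based on the discussion above, not confirmed values.

```python
# Minimal sketch: a feed-forward classifier on top of precomputed
# accent/language embeddings. embed_dim=1280 and num_classes=31 are
# assumptions (31 cleaned accent classes from the grouping above).
import torch
import torch.nn as nn

class AccentFFN(nn.Module):
    def __init__(self, embed_dim=1280, hidden_dim=256, num_classes=31):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, embeddings):
        # embeddings: (batch, embed_dim) pooled representations
        return self.net(embeddings)

model = AccentFFN()
logits = model(torch.randn(4, 1280))
print(logits.shape)  # torch.Size([4, 31])
```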

@ylacombe
Collaborator

Hey @MilanaShhanukova, thanks for the effort here! I think it's only fair to group the Scottish accents together, and to do the same with English and British.
In terms of European, how many samples does it represent, and what percentage? We can simply ditch those samples if they're a small share of the overall dataset.

Note that we probably don't need to classify the accent of every sample in the ultimate pipeline, as we can have a mix of descriptions with and without accent.

In terms of training, that sounds good to me. Will you use MMS-LID or something similar, with FFN layers on top of it for classification?
Many thanks for your effort! Please let me know if you have any questions or doubts.

@ylacombe
Collaborator

ylacombe commented May 9, 2024

Hey @MilanaShhanukova, any update on this?
Let me know if you need help!

@MilanaShhanukova
Author

Hi @ylacombe! Sorry for the late answer, I got a bit busy last week. I've added the VoxPopuli dataset and started training the model with an IterableDataset; however, given the amount of data in the VCTK dataset, that's not really recommended, since full shuffling cannot be applied on the fly. I'm uploading the resampled datasets so they are faster to load.

@ylacombe
Collaborator

Hey @MilanaShhanukova thanks for the update, let me know if I can help!
BTW, I believe you can shuffle iterable datasets with a buffer size, according to the Datasets docs; would that be enough for your use case?

@ylacombe
Collaborator

Hey @MilanaShhanukova, any luck in training the accent classifier? What test set are you using?

@QajikHakobyan

Hi, you could also use the Common Voice dataset. The metadata includes information about the speaker's region, which likely correlates strongly with their accent.

@MilanaShhanukova
Author

@ylacombe
Hi, yes, I will upload the first model trained on the VCTK dataset tomorrow. Currently I am validating it on an external dataset. It took a bit more time than planned because of some problems with GPU availability.

@ylacombe
Collaborator

ylacombe commented Jun 5, 2024

No worries @MilanaShhanukova, let me know when you have the validation ready!
I was planning to take a look at the accent classification tomorrow, btw, so it's perfect timing if you've already got something ready!

@MilanaShhanukova
Author

@ylacombe just an update: so far the model shows good quality when distinguishing UK and US accents, but worse for Indian and others. I'm working on optimisations now and hope to have a better model by the end of the week. In addition, I'm now training a model in a contrastive mode, since the embeddings might also be useful in the future. If you have any ideas about it, let me know.

@ylacombe
Collaborator

ylacombe commented Jun 7, 2024

Hey @MilanaShhanukova, how did you compute the accent list?

I've made some good progress myself. I've been re-using an accent classifier script we worked on a few months ago (without success at the time).

As you can see, it uses the EdAcc, VCTK, and English-accented VoxPopuli accent lists, which we had updated with the Common Voice accents. Using the script above, I've managed to reach 80% accuracy, using 15% of the dataset as an evaluation set. You can find the training logs here.

What do you think of uniting our efforts here?

Accent             Perc. (%)
----------------------------
American             10.876
Australian           10.876
Canadian             10.876
English              10.876
German               10.876
Indian               10.876
Scottish              8.303
Irish                 8.074
South African         4.822
New Zealand           2.501
Chinese               1.199
Singaporean           0.945
Dutch                 0.881
Polish                0.819
Czech                 0.754
French                0.645
Italian               0.513
Hungarian             0.501
Malaysian             0.485
Finnish               0.476
Welsh                 0.457
Spanish               0.414
Eastern European      0.402
Romanian              0.368
Slovak                0.330
Jamaican              0.249
Estonian              0.221
Egyptian              0.200
Vietnamese            0.156
Lithuanian            0.136
Indonesian            0.131
Latin American        0.114
Catalan               0.109
Croatian              0.087
Nigerian              0.066
Bulgarian             0.063
Brazilian             0.053
Slovenian             0.053
Japanese              0.046
Thai                  0.039
Kenyan                0.033
Russian               0.026
Latvian               0.025
Ukrainian             0.018
Swedish               0.013
Serbian               0.008
Swiss                 0.004
Greek                 0.003
Norwegian             0.002

@MilanaShhanukova
Author

@ylacombe Looks cool! May I ask which unfrozen layers give you the best results for this model?

I currently update all projection, classifier, adapter, layer_norm, feature_projection and pos_conv_embed layers. It seems like your model will get better results overall; is there anything I can help with?

@ylacombe
Collaborator

ylacombe commented Jun 7, 2024

Thanks! I've only frozen the feature encoder part (at the very beginning)!

Well, maybe you have a better accent-cleaning approach, more data, or a better base model to train! Let me know if you think of something, and don't hesitate to share your approach!
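Freezing only the feature encoder could be sketched as below. A tiny random config is used so the snippet runs offline; in practice you would load an MMS-LID checkpoint with `from_pretrained(...)` instead. The config sizes and `num_labels=31` are illustrative assumptions.

```python
# Sketch: freeze only the convolutional feature encoder of a
# Wav2Vec2-style model; transformer layers and the classifier head
# stay trainable. The tiny config is illustrative, standing in for a
# real MMS-LID checkpoint loaded via from_pretrained().
from transformers import Wav2Vec2Config, Wav2Vec2ForSequenceClassification

config = Wav2Vec2Config(
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=31,  # assumed number of cleaned accent classes
)
model = Wav2Vec2ForSequenceClassification(config)

# Disable gradients for the feature encoder at the very beginning.
model.freeze_feature_encoder()

frozen = all(
    not p.requires_grad for p in model.wav2vec2.feature_extractor.parameters()
)
print(frozen)  # True
```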

@MilanaShhanukova
Author

@ylacombe

Freezing only the feature encoder does seem to work best. One thing I noticed during training is that speaker identity and accent correlate strongly. Previously, I was also using the same speakers for validation; however, that gives unrepresentative results, so for testing I decided to take new speakers from this dataset: https://www.kaggle.com/datasets/rtatman/speech-accent-archive/data. I assume it was built from https://accent.gmu.edu/howto.php. Some audio files have background noise, but the overall quality is good, and it includes speakers from countries similar to those in VCTK. What dataset do you use for testing and evaluation now? Have you checked whether the same speaker-identity problem is present?
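A speaker-disjoint split avoids the leakage described above: all of a speaker's samples land in either train or test, never both. A minimal pure-Python sketch; the `speaker_id` field name and the 15% fraction are assumptions.

```python
# Sketch of a speaker-disjoint train/test split to avoid speaker-identity
# leakage. `examples` is assumed to be a list of dicts with a
# "speaker_id" key; field names are illustrative.
import random

def split_by_speaker(examples, test_fraction=0.15, seed=42):
    speakers = sorted({ex["speaker_id"] for ex in examples})
    rng = random.Random(seed)
    rng.shuffle(speakers)
    n_test = max(1, int(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])
    train = [ex for ex in examples if ex["speaker_id"] not in test_speakers]
    test = [ex for ex in examples if ex["speaker_id"] in test_speakers]
    return train, test

data = [{"speaker_id": f"spk{i % 10}", "accent": "English"} for i in range(100)]
train, test = split_by_speaker(data)
# No speaker appears in both splits:
print({ex["speaker_id"] for ex in train} & {ex["speaker_id"] for ex in test})  # set()
```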

@ylacombe
Collaborator

ylacombe commented Jun 8, 2024

Hey @MilanaShhanukova, nice catch, let me check if this changes anything (it surely will, unfortunately).

@ylacombe
Collaborator

ylacombe commented Jun 8, 2024

Indeed, performance is much lower without speaker leakage!
So far, I've been splitting the training set and using 15% of it for testing. Without speaker leakage, the model doesn't seem to generalize that well.

However, I also realized that if you limit the number of samples per speaker, performance is much better (still without speaker leakage, say 50 samples per speaker), while still not being acceptable yet.

Maybe you can try limiting the number of samples per speaker with your own training script?
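Capping samples per speaker could be done with a simple first-N filter. A pure-Python sketch; the `speaker_id` field name and the cap of 50 are taken from the discussion, the rest is illustrative.

```python
# Sketch: keep at most `max_per_speaker` samples for each speaker, so no
# single speaker dominates training. Field names are illustrative.
from collections import defaultdict

def limit_per_speaker(examples, max_per_speaker=50):
    counts = defaultdict(int)
    kept = []
    for ex in examples:
        if counts[ex["speaker_id"]] < max_per_speaker:
            counts[ex["speaker_id"]] += 1
            kept.append(ex)
    return kept

data = [{"speaker_id": "spk1"}] * 80 + [{"speaker_id": "spk2"}] * 20
capped = limit_per_speaker(data, max_per_speaker=50)
print(len(capped))  # 70: spk1 capped at 50, spk2 keeps all 20
```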

@ylacombe
Collaborator

ylacombe commented Jun 8, 2024

BTW, the model doesn't have to be perfect; we'll probably only use it for speakers on which the model is really confident.
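Keeping only high-confidence predictions could look like this: accept a label only when its softmax probability clears a threshold, otherwise emit no accent description for that speaker. The threshold value is illustrative.

```python
# Sketch: only label a sample when the classifier's softmax probability
# exceeds a threshold; otherwise return None (no accent in the
# description). The 0.9 threshold is an illustrative assumption.
import math

def confident_label(logits, labels, threshold=0.9):
    # Numerically stable softmax over the raw logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best] if probs[best] >= threshold else None

labels = ["American", "English", "Indian"]
print(confident_label([8.0, 0.5, 0.1], labels))  # American (confident)
print(confident_label([1.0, 0.9, 0.8], labels))  # None (ambiguous)
```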
