
Inquiry Regarding Audio Spectrogram Transformer #128

Open · Ingram-lin opened this issue Apr 22, 2024 · 2 comments
Labels: question (Further information is requested)

@Ingram-lin

I am a graduate student from China, and our team recently had the privilege of studying your article on the Audio Spectrogram Transformer. We were truly impressed by the content and scope of your work, and it has sparked a great deal of interest within our team. Motivated by this, we replicated your work on the ESC-50 dataset. However, when we proceeded to fine-tune the model on our own dataset, we encountered several challenges, and we would greatly appreciate your guidance in navigating them.

1. Our dataset consists of 2,400 samples, and each audio clip is 4 seconds long. We set the audio_length parameter to 400 and timem to 80. We replaced the labels while keeping the rest of the recipe consistent with ESC-50. We downloaded a model pre-trained on AudioSet and followed the same process as for replicating ESC-50. We are pleased with the final result; the accuracy reaches 0.9. What surprises us, however, is that the average precision is only between 0.3 and 0.5. Why could this be?

2. We understand your work involves projecting spectrograms to embeddings (if our understanding is incorrect, please forgive us). After fine-tuning the model, we would like to process new speech data and obtain its embeddings. Could you please guide us on how to do this?

3. For example, if we want to fine-tune a pre-trained model on an English dataset and then validate the fine-tuned model on a Chinese dataset, can we set the training set to the English dataset and the validation set to the Chinese dataset during fine-tuning?

@YuanGongND (Owner) commented Apr 22, 2024

> 1. Our dataset consists of 2,400 samples, and each audio clip is 4 seconds long. We set the audio_length parameter to 400 and timem to 80. We replaced the labels while keeping the rest of the recipe consistent with ESC-50. We downloaded a model pre-trained on AudioSet and followed the same process as for replicating ESC-50. We are pleased with the final result; the accuracy reaches 0.9. What surprises us, however, is that the average precision is only between 0.3 and 0.5. Why could this be?

There could be many reasons, and I do not have the time (or the information) to debug this. One likely possibility: ESC-50 is balanced, so accuracy is a good metric there, but your dataset might be imbalanced. In that case the model is biased toward the majority class, and you would need to turn on class balancing, etc. Accuracy is not a good measure when the dataset is not balanced.
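To see how accuracy and average precision can diverge on an imbalanced set, here is a small self-contained sketch (the metrics are scikit-learn's standard ones; the toy data is synthetic):

import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

rng = np.random.default_rng(0)
# Toy binary problem: ~90% negatives, ~10% positives.
y_true = (rng.random(1000) < 0.1).astype(int)

# A weakly discriminative scorer: positives score only slightly higher.
scores = rng.random(1000) * 0.2 + y_true * 0.05
y_pred = (scores > 0.5).astype(int)  # thresholding predicts almost all negatives

print(accuracy_score(y_true, y_pred))           # ~0.9: looks great
print(average_precision_score(y_true, scores))  # much lower: the ranking is poor

Predicting the majority class gets ~0.9 accuracy for free, while average precision exposes that the positives are barely separated from the negatives.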

> 2. We understand your work involves projecting spectrograms to embeddings (if our understanding is incorrect, please forgive us). After fine-tuning the model, we would like to process new speech data and obtain its embeddings. Could you please guide us on how to do this?

If you wish to get the last-layer embedding, you should have the model return x right after this line (the model needs to be a trained model):

x = (x[:, 0] + x[:, 1]) / 2
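If you would rather not edit ast_models.py in place, a forward hook on the classification head captures the same tensor. A minimal sketch, assuming the head is exposed as the mlp_head attribute as in the repo's ast_models.py (check the attribute name against your version):

import torch

def extract_embedding(model, fbank):
    """Return the embedding fed into mlp_head for a trained ASTModel.
    fbank: (batch, n_frames, n_mels) log-mel spectrogram tensor."""
    captured = {}
    # inputs[0] to mlp_head is x right after x = (x[:, 0] + x[:, 1]) / 2,
    # i.e. the averaged cls/dist token embedding.
    handle = model.mlp_head.register_forward_hook(
        lambda mod, inputs, output: captured.update(emb=inputs[0].detach()))
    with torch.no_grad():
        model(fbank)
    handle.remove()
    return captured["emb"]  # shape (batch, 768) for the base model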

> 3. For example, if we want to fine-tune a pre-trained model on an English dataset and then validate the fine-tuned model on a Chinese dataset, can we set the training set to the English dataset and the validation set to the Chinese dataset during fine-tuning?

That is totally possible; you can simply prepare two datasets and replace the original datasets in our recipe. But you might lose some performance due to the training/test mismatch. This will be true for all models, not just AST.
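For concreteness, a minimal sketch of what "prepare two datasets" could look like. The {"data": [{"wav": ..., "labels": ...}]} JSON layout follows the data-preparation convention described in the AST README; the paths and label names below are made up:

import json

def write_datafile(items, out_path):
    # items: list of (wav_path, label_string) pairs in the repo's JSON layout
    payload = {"data": [{"wav": wav, "labels": labels} for wav, labels in items]}
    with open(out_path, "w") as f:
        json.dump(payload, f, indent=1)

# Hypothetical split: English clips for training, Chinese clips for evaluation.
write_datafile([("/data/en/clip_0001.wav", "command_stop")], "train_english.json")
write_datafile([("/data/zh/clip_0001.wav", "command_stop")], "eval_chinese.json")
# Then point the recipe's --data-train / --data-val arguments at these files
# (verify the flag names against the run script in your checkout).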

-Yuan

YuanGongND added the question label on Apr 22, 2024
@Ingram-lin (Author)

Thank you very much for your help. I have a new question: I want to extract 768-dimensional features from new speech data. I found this feature-extraction method (shown in the figure below). How can I perform feature extraction using my own fine-tuned model? Due to my limited technical skills, I don't know how to proceed. Can you help me?
[screenshot: feature-extraction code snippet]
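Putting the pieces from the earlier answer together, here is a hedged sketch of the full path: rebuild the model with the same arguments used for fine-tuning, load the saved weights, and read the embedding off the head with the hook helper sketched above. The ASTModel arguments and the "module." prefix handling follow the public repo's recipes but should be checked against your checkout; the checkpoint path and label_dim are placeholders:

import torch
from models import ASTModel  # ast_models.py; adjust the import to your checkout

# Rebuild with the same settings used during fine-tuning (placeholders here):
# label_dim = your number of classes, input_tdim = 400 for 4-second clips.
model = ASTModel(label_dim=10, input_tdim=400,
                 imagenet_pretrain=False, audioset_pretrain=False)

# The training recipe saves a DataParallel state dict, so strip "module.".
sd = torch.load("exp/your_run/models/best_audio_model.pth", map_location="cpu")
model.load_state_dict({k.replace("module.", ""): v for k, v in sd.items()})
model.eval()

fbank = torch.randn(1, 400, 128)  # stand-in for your (batch, frames, mels) input
emb = extract_embedding(model, fbank)  # hook helper from earlier in the thread
print(emb.shape)  # expected: torch.Size([1, 768]) for the base model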
