How can I make a prediction without using a manifest file? #2248
-
Dear NeMo team, how can I use a pre-trained speaker verification model to generate the embeddings of an audio file that I have previously loaded into memory (for example, with librosa), without using a manifest file? And how can I have the model return the embeddings in a variable instead of storing them on disk as a pickle file? If I wanted to do the same for the predictions of a classification model, would it work the same way? Many thanks
-
Hi @yogso,
I see what you would like to do. The answer is fairly simple, and it should be applicable to any of the ASR collections in NeMo. In general, we generate the PyTorch dataset based on the input manifest. If instead you would like to run inference on audio directly, then the input audio has to be read and passed through a collate function, which depends on the collection (ASR / speech commands / speaker recognition / VAD).
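For instance, here is a minimal sketch of that direct path for speaker verification. The checkpoint name speakerverification_speakernet, the file name my_audio.wav, and the assumption that EncDecSpeakerLabelModel.forward returns logits and embeddings are mine for illustration and may differ across NeMo versions:

```python
import librosa
import torch
from nemo.collections.asr.models import EncDecSpeakerLabelModel

# Pretrained speaker verification model (checkpoint name is an assumption;
# substitute whichever pretrained speaker model you normally use).
model = EncDecSpeakerLabelModel.from_pretrained(model_name="speakerverification_speakernet")
model.eval()

# Audio loaded directly into memory with librosa, no manifest involved (16 kHz mono).
audio, _ = librosa.load("my_audio.wav", sr=16000)

# Shape the data the way the dataset/collate step normally would:
# a [batch, time] signal tensor plus a [batch] length tensor.
audio_signal = torch.tensor(audio, dtype=torch.float32).unsqueeze(0).to(model.device)
audio_signal_len = torch.tensor([audio_signal.shape[1]]).to(model.device)

with torch.no_grad():
    # The speaker label model's forward is expected to return logits and embeddings.
    _, embs = model.forward(input_signal=audio_signal,
                            input_signal_length=audio_signal_len)

# The embedding stays in a Python variable instead of being written to a pickle file.
embedding = embs.squeeze(0).cpu().numpy()
```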
Now coming to the speaker verification collection: the collate processing function used is _fixed_seq_collate_fn. In _fixed_seq_collate_fn we limit the input audio signal to a max time_length (which can be found in the config), along with other basic processing, but if only a si…

Yes, it would be the same for classification, but for classification you may have to finetune on your known speaker labels and use the classification labels instead of the embs, as shown here. Remember to map the corresponding label indices. Hope this helps.
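As a rough sketch of both points, the fixed-length handling and using the logits for classification, under stated assumptions (a hypothetical finetuned checkpoint path, an assumed time_length of 8 seconds at 16 kHz, and a crop/repeat policy that only approximates what _fixed_seq_collate_fn actually does, so check the config and the collate code for your version):

```python
import librosa
import numpy as np
import torch
from nemo.collections.asr.models import EncDecSpeakerLabelModel

# A model finetuned on your own speaker labels (path is hypothetical).
model = EncDecSpeakerLabelModel.restore_from("finetuned_speaker_classifier.nemo")
model.eval()

audio, _ = librosa.load("my_audio.wav", sr=16000)

# Approximation of the fixed-length handling the collate step performs:
# crop long signals and repeat short ones up to time_length seconds.
time_length, sample_rate = 8.0, 16000
max_samples = int(time_length * sample_rate)
if len(audio) > max_samples:
    audio = audio[:max_samples]
elif len(audio) < max_samples:
    repeats = int(np.ceil(max_samples / len(audio)))
    audio = np.tile(audio, repeats)[:max_samples]

audio_signal = torch.tensor(audio, dtype=torch.float32).unsqueeze(0).to(model.device)
audio_signal_len = torch.tensor([audio_signal.shape[1]]).to(model.device)

with torch.no_grad():
    logits, _ = model.forward(input_signal=audio_signal,
                              input_signal_length=audio_signal_len)

# For classification, use the logits instead of the embeddings and map the
# predicted index back to your own label list (the mapping is yours to define).
predicted_idx = logits.argmax(dim=-1).item()
```

In short: for verification you keep the embs tensor in a variable, and for a finetuned classification model you take the argmax of the logits and look the index up in your own label mapping.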