How can I make a prediction without using a manifest file? #2248
-
Dear NeMo team, how can I use a pre-trained speaker verification model to generate the embeddings of an audio file that I have previously loaded into memory (for example, with librosa), without using a manifest file? And how can I have the model return the embeddings in a variable instead of storing them on disk as a pickle file? If I wanted to do the same for the predictions of a classification model, would it work the same way? Many thanks
-
Hi @yogso,
I see what you would like to do. The answer is fairly simple, and it should be applicable to any of the ASR collections in NeMo. In general, we generate the PyTorch dataset based on the input manifest. If instead you would like to run inference on audio directly, then the input audio has to be read and passed through a collate function, which depends on the collection (ASR / speech commands / speaker recognition / VAD).
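For instance, here is a minimal sketch of that direct path for speaker verification. The checkpoint name speakerverification_speakernet, the file name my_audio.wav, and the assumption that EncDecSpeakerLabelModel.forward returns logits and embeddings are mine for illustration and may differ across NeMo versions:

```python
import librosa
import torch
from nemo.collections.asr.models import EncDecSpeakerLabelModel

# Pretrained speaker verification model (checkpoint name is an assumption;
# substitute whichever pretrained speaker model you normally use).
model = EncDecSpeakerLabelModel.from_pretrained(model_name="speakerverification_speakernet")
model.eval()

# Audio loaded directly into memory with librosa, no manifest involved (16 kHz mono).
audio, _ = librosa.load("my_audio.wav", sr=16000)

# Shape the data the way the dataset/collate step normally would:
# a [batch, time] signal tensor plus a [batch] length tensor.
audio_signal = torch.tensor(audio, dtype=torch.float32).unsqueeze(0).to(model.device)
audio_signal_len = torch.tensor([audio_signal.shape[1]]).to(model.device)

with torch.no_grad():
    # The speaker label model's forward is expected to return logits and embeddings.
    _, embs = model.forward(input_signal=audio_signal,
                            input_signal_length=audio_signal_len)

# The embedding stays in a Python variable instead of being written to a pickle file.
embedding = embs.squeeze(0).cpu().numpy()
```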
Now coming to the speaker verification collection: the collate processing function used is _fixed_seq_collate_fn. In _fixed_seq_collate_fn we limit the input audio signal to a max time_length (which can be found in the config), along with other basic processing, but if only a si…

Yes, it would be the same for classification, but for classification you may have to finetune on your known speaker labels and use the classification labels instead of the embs, as shown here. Remember to map the corresponding label indices. Hope this helps.
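As a rough sketch of both points, the fixed-length handling and using the logits for classification, under stated assumptions (a hypothetical finetuned checkpoint path, an assumed time_length of 8 seconds at 16 kHz, and a crop/repeat policy that only approximates what _fixed_seq_collate_fn actually does, so check the config and the collate code for your version):

```python
import librosa
import numpy as np
import torch
from nemo.collections.asr.models import EncDecSpeakerLabelModel

# A model finetuned on your own speaker labels (path is hypothetical).
model = EncDecSpeakerLabelModel.restore_from("finetuned_speaker_classifier.nemo")
model.eval()

audio, _ = librosa.load("my_audio.wav", sr=16000)

# Approximation of the fixed-length handling the collate step performs:
# crop long signals and repeat short ones up to time_length seconds.
time_length, sample_rate = 8.0, 16000
max_samples = int(time_length * sample_rate)
if len(audio) > max_samples:
    audio = audio[:max_samples]
elif len(audio) < max_samples:
    repeats = int(np.ceil(max_samples / len(audio)))
    audio = np.tile(audio, repeats)[:max_samples]

audio_signal = torch.tensor(audio, dtype=torch.float32).unsqueeze(0).to(model.device)
audio_signal_len = torch.tensor([audio_signal.shape[1]]).to(model.device)

with torch.no_grad():
    logits, _ = model.forward(input_signal=audio_signal,
                              input_signal_length=audio_signal_len)

# For classification, use the logits instead of the embeddings and map the
# predicted index back to your own label list (the mapping is yours to define).
predicted_idx = logits.argmax(dim=-1).item()
```

In short: for verification you keep the embs tensor in a variable, and for a finetuned classification model you take the argmax of the logits and look the index up in your own label mapping.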