The questions below relate to training ASR models with NeMo, specifically the Conformer and Fast Conformer architectures.
I am investigating how the way training data is represented and loaded affects training, so that I can give our internal teams guidelines on how best to organize and handle their data.
Some people create manifest files in which each utterance references a WAV file together with an offset and a duration indicating where in the audio file the utterance is located. A single audio file can contain dozens of utterances, so the manifest has dozens of lines, one per utterance, each pointing to the same audio file with a different offset and duration (see the example manifest sketched after the questions below). To understand how such utterances are treated, I have two questions:
1. Will the complete audio file be read and copied to GPU RAM, or only the segment specified by the offset and duration?
2. Will the system open such a multi-utterance audio file just once and read the individual utterances from it, or will the file be opened and closed every time a single utterance is read?
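For concreteness, here is a sketch of the kind of manifest I mean. The keys are the usual NeMo JSON-lines manifest fields; the file name, timestamps, and transcripts are made up:

```json
{"audio_filepath": "recordings/meeting_01.wav", "offset": 0.0, "duration": 4.2, "text": "good morning everyone"}
{"audio_filepath": "recordings/meeting_01.wav", "offset": 4.2, "duration": 3.1, "text": "let us get started"}
{"audio_filepath": "recordings/meeting_01.wav", "offset": 7.3, "duration": 5.8, "text": "first item on the agenda"}
```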
Note that I am aware of the tarred-audio-file mechanism. Converting the data organization above to tarred audio files directly would not be efficient, however, because one would end up with dozens of copies of the same large audio file. One should either use tarred files with one file per utterance (with offset 0), or keep larger WAV files and use proper offset values in the manifest for each utterance. I have had good experiences with the former in terms of efficiency. For illustration: in one of my experiments, dynamic data bucketing with Lhotse combined with tarred audio files gave a 10x training speed-up compared to my original, traditional setup of manifest files and untarred audio (one file per utterance) without bucketing. So how data is loaded and organized matters a lot.
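For reference, the rough shape of the "one file per utterance" conversion looks like the sketch below. This is not NeMo or Lhotse code and not the exact NeMo tarred-dataset layout, just illustrative Python: it reads only the offset/duration slice of each large WAV (via soundfile), writes that slice as a small per-utterance WAV into a tar shard, and emits a new manifest entry with offset 0. The function name and file naming scheme are made up.

```python
import io
import json
import tarfile

import soundfile as sf  # reads only the requested frame range, not the whole file


def segment_manifest_to_tar(manifest_in: str, tar_out: str, manifest_out: str) -> None:
    """Cut offset/duration segments out of large WAVs into a per-utterance tar shard."""
    with tarfile.open(tar_out, "w") as tar, \
            open(manifest_in) as fin, open(manifest_out, "w") as fout:
        for i, line in enumerate(fin):
            entry = json.loads(line)
            # Read just the [offset, offset + duration) slice of the source WAV.
            with sf.SoundFile(entry["audio_filepath"]) as f:
                start = int(entry["offset"] * f.samplerate)
                frames = int(entry["duration"] * f.samplerate)
                f.seek(start)
                audio = f.read(frames)
                sr = f.samplerate
            # Store the slice as its own small WAV inside the tar shard.
            buf = io.BytesIO()
            sf.write(buf, audio, sr, format="WAV")
            buf.seek(0)
            name = f"utt_{i:06d}.wav"
            info = tarfile.TarInfo(name=name)
            info.size = buf.getbuffer().nbytes
            tar.addfile(info, buf)
            # New manifest line: same text and duration, but offset 0 in its own file.
            entry.update(audio_filepath=name, offset=0.0)
            fout.write(json.dumps(entry) + "\n")
```

In practice one would split the output over multiple shards rather than a single tar, but the idea is the same: each utterance becomes its own small audio object that can be read sequentially from the archive.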
I look forward to learning about your thoughts and experience on this topic. At the moment, I am particularly interested in better understanding how and what data gets loaded into GPU RAM.