As currently implemented, the CLI is unusable with a medium-size model on a long audio file (e.g. an hour or two).
The main problem is that after transcription the Whisper model is not properly garbage-collected and the CUDA cache is not cleared, so there is not enough memory to hold both the Whisper model and the alignment model at the same time.
This is a significant issue, since many people have consumer RTX 3000-series cards with only 6 GB of VRAM.
A quick fix is to avoid loading all models at once and to properly free the Whisper model before moving on to the next steps:
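A minimal sketch of that fix (all names here are hypothetical, not whisperx's actual API). The key pattern is: drop every reference to the Whisper model, force a collection, and only then load the alignment model. With a real PyTorch model you would additionally call `torch.cuda.empty_cache()` after `gc.collect()` so the caching allocator returns its blocks to the driver; the stand-in class below just demonstrates that the object is actually finalized before the next load.

```python
import gc

class FakeWhisperModel:
    """Stand-in for a loaded Whisper model; __del__ runs when it is freed."""
    freed = False

    def __del__(self):
        FakeWhisperModel.freed = True

def transcribe_then_align(audio_path):
    model = FakeWhisperModel()
    # ... run transcription with `model` on audio_path ...
    segments = ["segment 1", "segment 2"]

    # Free the Whisper model BEFORE loading the alignment model:
    del model     # drop the only remaining reference
    gc.collect()  # collect any reference cycles still keeping it alive
    # With PyTorch, also release cached VRAM here:
    #   torch.cuda.empty_cache()

    # Only now is there room for the alignment model on a 6 GB card:
    # align_model = load_align_model(...)  # hypothetical next step
    return segments

transcribe_then_align("audio.wav")
```

After the `del` + `gc.collect()` pair, `FakeWhisperModel.freed` is `True`, i.e. the model was truly released before the alignment step rather than lingering until the process exits.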