When transcribing a Hindi audio file, I get invalid Unicode characters in the output.
I have noticed this with many Hindi files.
I used the whisper-large-v2 model for inference on CPU, and I have seen the same issue when running inference on GPU as well.
My guess is that the Whisper model's token output (BPE-encoded) is not being mapped correctly to Unicode characters.
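
To illustrate what I suspect is happening, here is a minimal standard-library sketch. It does not call Whisper at all. It assumes the tokenizer works on UTF-8 bytes and that a token boundary can fall in the middle of a multi-byte Devanagari character; the split position `split_at` and the sample word are made up for the demonstration. If the two byte chunks are decoded separately, the result contains U+FFFD replacement characters, while decoding the joined bytes recovers the text.

```python
# Hypothetical illustration only: this is not Whisper's actual decoding code.
text = "नमस्ते"
raw = text.encode("utf-8")  # each Devanagari code point here takes 3 bytes

# Assumed token boundary: one byte into the second character.
split_at = 4
left, right = raw[:split_at], raw[split_at:]

# Decoding each chunk on its own cannot reassemble the split character,
# so the decoder substitutes U+FFFD for the orphaned bytes.
piecewise = (
    left.decode("utf-8", errors="replace")
    + right.decode("utf-8", errors="replace")
)

# Joining the bytes first and decoding once recovers the original text.
joined = (left + right).decode("utf-8", errors="replace")

print(repr(piecewise))
print(repr(joined))
```

If this guess is right, the invalid characters should appear wherever a segment or timestamp boundary cuts through a character, which would explain why it affects many Hindi files regardless of CPU or GPU.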