SpeakerIdentificationDemo.java does not print out any hypotheses #31
It does print a hypothesis for the speaker. To suppress the log you can redirect it to null: java -cp ... 2> /dev/null, then you will see the output. The second speaker is silence, so the result for it is empty. I haven't decided what to do about that yet; maybe we will ignore silence on speaker intervals first of all.
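As an alternative to redirecting stderr on the command line, here is a minimal sketch of switching the logging off from inside Java before invoking the demo. The wrapper class is just a placeholder; the only assumption is that sphinx4 logs through java.util.logging, so turning off the root logger leaves the demo's own System.out printouts visible.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietSpeakerIdDemo {
    public static void main(String[] args) throws Exception {
        // Silence java.util.logging (which sphinx4 uses) so only the demo's
        // own printouts reach the console.
        Logger.getLogger("").setLevel(Level.OFF);

        // Placeholder call into the demo entry point; adjust to the actual class/arguments.
        edu.cmu.sphinx.demo.speakerid.SpeakerIdentificationDemo.main(args);
    }
}
```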
Thanks for the response Nickolay, you're absolutely right. That printout got lost in the logging output for me.

A quick question about the silence issue, given the current state of things: is there a way to predict which speaker will correspond to the silence intervals (aside from having no results in the transcription for those intervals)? To be more specific, let's say I have 4 participants in a conversation that I transcribe, and Sphinx identifies 5 speakers (each with several segments of speaking). I would assume they would be created in the "order of appearance"; is that correct? So if the conversation begins with Speaker1 talking and is then followed by a short interval of silence, the silence would effectively be Speaker2.

Also, based on your experience, how likely is it for a single effective speaker to be labeled as multiple speakers in Sphinx (because of varying background noise and channel conditions)? That is, if I'm the only one talking, is Sphinx likely to erroneously identify various speakers if there is significant variation in the background noise?

I apologize if this doesn't belong in the issues; if so, please tell me where to move this question/discussion.

Edit: one more quick question while I'm at it. I realize that the speaker identification/diarization in Sphinx is done mainly so that one can run speaker adaptation, and thus hopefully get somewhat better accuracy from the acoustic model. Having said that, how do you feel Sphinx's speaker segmentation compares with a dedicated speaker diarization package like LIUM? If I wanted to integrate this feature, would I be able to use Sphinx's SpeakerIdentification exclusively, or do you think there would be some benefit from using LIUM's tools (LIUM_SpkDiarization)?
Yes
Currently we don't have that; more advanced toolkits like LIUM Diarization have specific tools and steps to detect silence.
Yes, this happens pretty frequently. Our approach is not the best one around. For adaptation it's enough to have about 20 seconds of speech to improve accuracy, so such misclassification is not really harmful on long recordings.
LIUM tools are certainly more advanced, but they are not easy to use and also have algorithmic flaws. That's why we started our own speakerid part, but it is very far from being complete.
Thank you for the detailed follow-up response, Nickolay. In the future, if there are any more "open-ended" questions pertaining to Sphinx, where is the most appropriate place to start such threads?
You are welcome to post a message to our forums on SourceForge, create an issue here, join the IRC channel #cmusphinx on freenode, or contact me directly at nshmyrev@gmail.com.
Hello Nickolay, thank you for fixing the issue pertaining to the "singular matrix" exception. I can verify that it's no longer an issue; the speaker segments are being properly calculated and dumped to the console with printSpeakerIntervals().
However, my concern now is that speakerAdaptiveDecoding() never seems to print any hypotheses. A quick look at the demo code tells me that it should be iterating over the many speaker segments and running the (adapted) recognition. I can definitely see it iterate many times (loading the language/acoustic/etc. models) and run the speedTracker, but I never see any hypothesis output.
In addition, the speedTracker reports the transcription time as "0.00 X realtime", leading me to believe that recognition never actually runs after the models are loaded.
I'm running the demo "as is", on the in-package /edu/cmu/sphinx/demo/speakerid/test.wav file, but I'd be happy to try this out on some of my own media if you think it would help.
I've been using Sphinx4 for over half a year now, and under the right conditions I feel it generally works pretty well. I was hoping to get a bit of an accuracy boost from the speaker adaptation.
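For reference, this is roughly the kind of decode-and-print loop I would expect to see for each segment, written against the high-level sphinx4 API. The model paths, the WAV-header skip, and decoding the whole bundled test file rather than a single speaker interval are my assumptions, not necessarily what the demo actually does.

```java
import java.io.InputStream;

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;

public class HypothesisCheck {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // Stock en-us resources; adjust to whatever models the demo actually loads.
        configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
        configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
        configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

        StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);

        // The demo would feed only the audio of one speaker segment here; decoding the
        // whole bundled test file is enough to confirm hypotheses get printed at all.
        InputStream stream =
                HypothesisCheck.class.getResourceAsStream("/edu/cmu/sphinx/demo/speakerid/test.wav");
        stream.skip(44); // skip the RIFF/WAV header so only raw samples are decoded

        recognizer.startRecognition(stream);
        SpeechResult result;
        while ((result = recognizer.getResult()) != null) {
            System.out.println("Hypothesis: " + result.getHypothesis());
        }
        recognizer.stopRecognition();
    }
}
```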