Fix vague language codes causing wrong recognition results #136
autosub uses the same language codes for both src_language and dst_language, but those codes are not specific enough for the API to identify the language. According to the speech-to-text docs and the translate docs, the Speech-to-Text API language codes differ from the Translation API language codes. Even the Simplified Chinese version of the docs differs from the English version, which makes this quite troublesome.
You can see the difference in the Chinese language codes between these two docs, and it really matters in some cases.
By the way, although autosub still uses the old version of the Google API for its requests, Google has replaced the old docs with new ones. In my tests, described later in this post, at least some of the new codes worked better than the old ones.
In this case, Google won't tell you that your language code is vague and refuse to recognize the speech. Instead, it recognizes the speech using a localized variant of the language. For example, the spoken variants of Chinese include Cantonese, which is used in Hong Kong, and Mandarin, which is the official language of mainland China. When someone uses the arguments
-S zh-CN -D zh-CN
or
-S zh -D zh
(I modified constant.py to test this), like the codes in the English docs, to recognize Mandarin Chinese from a Hong Kong IP, the speech is mistakenly recognized as Cantonese. This was also reported in #112 (although in Chinese).
So I modified constant.py and __init__.py to use the new version of the language codes. I didn't test the translation API, but I think it is usable since the docs describe the usage above. I also fixed the logic bug that occurs when -S is given and -D is not. I hope you can review it, and I much appreciate your work on autosub.
Below is the test:
Sorry to offend you, but I took a screenshot of the bug mentioned in #87.
Hong Kong IP confirmation.
zh-TW is the Taiwan variant of Mandarin; at least orally, they are almost the same.
I changed the audio to a different, English one to rule out the concern that Hong Kong is simply a bad place for Google speech-to-text recognition.
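To illustrate the split between speech-to-text codes and translation codes described above, here is a minimal sketch. It is not the code in this PR: the table names, the function validate_codes, and the handful of entries are hypothetical and chosen only for illustration, and the codes should be checked against the current Google docs before use.

```python
# Illustrative tables only; these are NOT autosub's actual constants.
# Speech-to-text codes name a specific spoken variant.
SPEECH_CODES = {
    "cmn-Hans-CN": "Mandarin (Simplified, China)",
    "cmn-Hant-TW": "Mandarin (Traditional, Taiwan)",
    "yue-Hant-HK": "Cantonese (Traditional, Hong Kong)",
    "en-US": "English (United States)",
}

# Translation codes name a written language, so they are coarser.
TRANSLATION_CODES = {
    "zh-CN": "Chinese (Simplified)",
    "zh-TW": "Chinese (Traditional)",
    "en": "English",
}


def validate_codes(src, dst=None):
    """Check src against the speech table and dst against the translation table.

    Assumption for this sketch: dst=None means no target code was supplied,
    so there is nothing to check for it.
    """
    if src not in SPEECH_CODES:
        raise ValueError("unsupported or vague speech code: %r" % (src,))
    if dst is not None and dst not in TRANSLATION_CODES:
        raise ValueError("unsupported translation code: %r" % (dst,))
    return src, dst
```

With separate tables, a vague code such as zh is rejected up front instead of being silently resolved to a regional variant by the API.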