Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use ctcdecode in native client (Fixes #1668) #1679
Use ctcdecode in native client (Fixes #1668) #1679
Changes from 11 commits
c34fc5b
0002d0f
e72f079
770d742
3cc9b37
b29d0ab
200e61e
70ff71c
58b25da
cf57453
db3a36c
fa6434b
68258e3
23497fd
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still lacks some trivial factorization ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, I did the review commit by commit, and this is fixed by a newer one, so disregard that comment (github flagged it as "outdated")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This creates a strange dependency between
Alphabet
andctc_beam_search_decoder
, both know about this magic number.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see where ctc_beam_search_decoder knows about this magic number? It's supposed to not match any real label if there's no space label in the alphabet. Although now I realize that there's actually a subtle bug here, since space_label_ is unsigned, this is assigning a missing space label to UINT_MAX-1, which could break for an alphabet that's exactly UINT_MAX long :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about it I don't think this is a big deal as languages without a space label will always use character-based language models.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it true that "languages without a space label will always use character-based language models"? For example Thai has an alphabet but words can also can be written without spaces. I think Javanese is the same way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The -2 constant would be a problem for a language that uses a word-based language model and has an alphabet that is 2^32-2 characters long. AFAIK, the CJK languages are the only ones that could possibly get close to that many characters, and they would all probably use character-level LMs. Unicode defines close to 100,000 CJK ideograms, so even then we still have a lot of margin. I can add a separate "has_space" flag to remove the in-band signaling here, but I don't think it would be a problem anytime soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why switch from
std::unordered_map
tostd::vector
? Just curious.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ctcdecode code uses a
vector<string>
as the alphabet representation so I initially was just passing Alphabet's vector to it, but eventually I made it use the Alphabet class directly instead. Butstd::unordered_map
is unnecessary here as avector
is sufficient and faster/leaner.