Use ctcdecode in native client (Fixes #1668) #1679
Conversation
Force-pushed d42530f to ca35e2c (compare).
All tests are green, modulo some infra issues.
native_client/BUILD (outdated)
@@ -52,6 +52,8 @@ tf_cc_shared_object(
    linkopts = select({
        "//tensorflow:darwin": [],
        "//conditions:default": [
            "-ldl",
Wouldn't we need it on macOS as well?
Works fine on my machine, and it passes tests. I don't know why :)
That looks good, but I'm wondering about improving future-proofability.
@star3858 and @ybg7955 A PR is not the place for such comments. Please do not add non-review comments to a PR in the future.
Just a few little changes for this PR.
However, for the final PR we should remove some temporary code such as the
label_to_str_.push_back("*");
and data/smoke_test/vocab.trie.ctcdecode
pointing to `v0.2.0-prod-ctcdecode` and the like.
}

private:
  size_t size_;
  std::unordered_map<unsigned int, std::string> label_to_str_;
  unsigned int space_label_;
  std::vector<std::string> label_to_str_;
Why switch from std::unordered_map to std::vector? Just curious.
The ctcdecode code uses a vector<string> as the alphabet representation, so I initially was just passing Alphabet's vector to it, but eventually I made it use the Alphabet class directly instead. But std::unordered_map is unnecessary here, as a vector is sufficient and faster/leaner.
@@ -17,29 +18,29 @@ class Alphabet {
  Alphabet(const char *config_file) {
    std::ifstream in(config_file, std::ios::in);
    unsigned int label = 0;
    space_label_ = -2;
This creates a strange dependency between Alphabet and ctc_beam_search_decoder; both know about this magic number.
I don't see where ctc_beam_search_decoder knows about this magic number? It's supposed to not match any real label if there's no space label in the alphabet. Although now I realize that there's actually a subtle bug here, since space_label_ is unsigned, this is assigning a missing space label to UINT_MAX-1, which could break for an alphabet that's exactly UINT_MAX long :)
Thinking about it I don't think this is a big deal as languages without a space label will always use character-based language models.
Is it true that "languages without a space label will always use character-based language models"? For example, Thai has an alphabet, but words can also be written without spaces. I think Javanese is the same way.
The -2 constant would be a problem for a language that uses a word-based language model and has an alphabet that is 2^32-2 characters long. AFAIK, the CJK languages are the only ones that could possibly get close to that many characters, and they would all probably use character-level LMs. Unicode defines close to 100,000 CJK ideograms, so even then we still have a lot of margin. I can add a separate "has_space" flag to remove the in-band signaling here, but I don't think it would be a problem anytime soon.
@@ -320,55 +320,24 @@ ModelState::infer(const float* aMfcc, unsigned int n_frames, vector<float>& logits)
char*
ModelState::decode(vector<float>& logits)
{
  const int top_paths = 1;
  const int cutoff_top_n = 40;
Why this magic number?
This is just the default value used in ctcdecode.
Hey -- maintainer of parlance/ctcdecode here. Is there a way that we can structure this so that future improvements either in the Mozilla fork or in the parlance fork can be easily shared?
Hi @ryanleary, thanks for your work on ctcdecode! Would you be interested in merging the support for caching the FST on disk? That's the main change I had to make; the rest is simple adaptations of the function signatures. I definitely intend to upstream any bug fixes or improvements we make to the decoder.
Sure, I'm happy to add any new features/improvements, as well as expose additional bindings as it makes sense.
…ssed" This reverts commit b29d0ab.
All comments should be addressed. I decided to add the functionality to build the trie at runtime on the native client only, to limit the exposure.
LGTM
@@ -48,6 +50,7 @@ bool ProcessArgs(int argc, char** argv)
  {"lm", required_argument, nullptr, 'l'},
  {"trie", required_argument, nullptr, 'r'},
  {"audio", required_argument, nullptr, 'w'},
  {"run_very_slowly_without_trie_I_really_know_what_Im_doing", no_argument, nullptr, 999},
😃
native_client/BUILD (outdated)
  "-Wl,-Bsymbolic",
  "-Wl,-Bsymbolic-functions",
  "-Wl,-export-dynamic",
  "-l:libstdc++.a",
This still lacks some trivial factorization?
Where?
Sorry, I did the review commit by commit, and this is fixed by a newer one, so disregard that comment (GitHub flagged it as "outdated").
native_client/ctcdecode/scorer.cpp (outdated)
@@ -45,10 +45,10 @@ Scorer::Scorer(double alpha,

Scorer::~Scorer() {
  if (language_model_ != nullptr) {
    delete static_cast<lm::base::Model*>(language_model_);
    delete language_model_;
can we make sure there's no memory leak?
You mean by using std::unique_ptr?
for example :)
LGTM. If we can fix any memory issue before landing, that's perfect; otherwise we can check and fix it later.
@ryanleary Hello Ryan, sorry to hijack this PR, but it looks like we have some contributor willing to spend time building stuff for Windows, and currently hitting some roadblocks on
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
This PR switches the native client code (and clients that use it) to use ctcdecode instead of the TensorFlow decoder. It does not change any of the training code, nor does it remove the TensorFlow decoder; that will be done as part of #1675. With this PR there's a disconnect between training code and clients, and DeepSpeech.py WER reports, for example, won't match the results of our clients. On the other hand, it does let us get the benefits of ctcdecode out to users more quickly.
I recommend reviewing commit by commit. They all build on top of each other and the code builds and works at any given commit in this branch. Here's a breakdown:
I'm not asking for review yet because I want to see how this does on the tests.