Use ctcdecode in native client (Fixes #1668) #1679
Conversation
Force-pushed d42530f to ca35e2c (compare).
All tests are green, modulo some infra issues.
native_client/BUILD (outdated)
@@ -52,6 +52,8 @@ tf_cc_shared_object(
    linkopts = select({
        "//tensorflow:darwin": [],
        "//conditions:default": [
            "-ldl",
Wouldn't we need it on macOS as well?
Works fine on my machine, and it passes tests. I don't know why :)
That looks good, but I'm wondering about improving future-proofability.
@star3858 and @ybg7955 A PR is not the place for such comments. Please do not add non-review comments to a PR in the future.
Just a few little changes for this PR.
However, for the final PR we should remove some temporary code such as the
label_to_str_.push_back("*");
and data/smoke_test/vocab.trie.ctcdecode
pointing to `v0.2.0-prod-ctcdecode` and the like.
}

private:
  size_t size_;
  std::unordered_map<unsigned int, std::string> label_to_str_;
  unsigned int space_label_;
  std::vector<std::string> label_to_str_;
Why switch from std::unordered_map to std::vector? Just curious.
The ctcdecode code uses a vector<string> as the alphabet representation, so I initially was just passing Alphabet's vector to it, but eventually I made it use the Alphabet class directly instead. But std::unordered_map is unnecessary here, as a vector is sufficient and faster/leaner.
@@ -17,29 +18,29 @@ class Alphabet {
  Alphabet(const char *config_file) {
    std::ifstream in(config_file, std::ios::in);
    unsigned int label = 0;
    space_label_ = -2;
This creates a strange dependency between Alphabet and ctc_beam_search_decoder; both know about this magic number.
I don't see where ctc_beam_search_decoder knows about this magic number? It's supposed to not match any real label if there's no space label in the alphabet. Although now I realize that there's actually a subtle bug here, since space_label_ is unsigned, this is assigning a missing space label to UINT_MAX-1, which could break for an alphabet that's exactly UINT_MAX long :)
Thinking about it I don't think this is a big deal as languages without a space label will always use character-based language models.
Is it true that "languages without a space label will always use character-based language models"? For example, Thai has an alphabet, but words can also be written without spaces. I think Javanese is the same way.
The -2 constant would be a problem for a language that uses a word-based language model and has an alphabet that is 2^32-2 characters long. AFAIK, the CJK languages are the only ones that could possibly get close to that many characters, and they would all probably use character-level LMs. Unicode defines close to 100,000 CJK ideograms, so even then we still have a lot of margin. I can add a separate "has_space" flag to remove the in-band signaling here, but I don't think it would be a problem anytime soon.
@@ -320,55 +320,24 @@ ModelState::infer(const float* aMfcc, unsigned int n_frames, vector<float>& logits)
char*
ModelState::decode(vector<float>& logits)
{
  const int top_paths = 1;
  const int cutoff_top_n = 40;
Why this magic number?
This is just the default value used in ctcdecode.
Hey -- maintainer of parlance/ctcdecode here. Is there a way that we can structure this so that future improvements either in the Mozilla fork or in the parlance fork can be easily shared?
Hi @ryanleary, thanks for your work on ctcdecode! Would you be interested in merging the support for caching the FST on disk? That's the main change I had to make; the rest is simple adaptations of the function signatures. I definitely intend to upstream any bug fixes or improvements we make to the decoder.
Sure, I'm happy to add any new features/improvements, as well as expose additional bindings as it makes sense.
…ssed" This reverts commit b29d0ab.
All comments should be addressed. I decided to add the functionality to build the trie at runtime on the native client only, to limit the exposure.
LGTM
@@ -48,6 +50,7 @@ bool ProcessArgs(int argc, char** argv)
  {"lm", required_argument, nullptr, 'l'},
  {"trie", required_argument, nullptr, 'r'},
  {"audio", required_argument, nullptr, 'w'},
  {"run_very_slowly_without_trie_I_really_know_what_Im_doing", no_argument, nullptr, 999},
😃
native_client/BUILD (outdated)
  "-Wl,-Bsymbolic",
  "-Wl,-Bsymbolic-functions",
  "-Wl,-export-dynamic",
  "-l:libstdc++.a",
This still lacks some trivial factorization?
Where?
Sorry, I did the review commit by commit, and this is fixed by a newer one, so disregard that comment (GitHub flagged it as "outdated").
native_client/ctcdecode/scorer.cpp (outdated)
@@ -45,10 +45,10 @@ Scorer::Scorer(double alpha,

Scorer::~Scorer() {
  if (language_model_ != nullptr) {
    delete static_cast<lm::base::Model*>(language_model_);
    delete language_model_;
can we make sure there's no memory leak?
You mean by using std::unique_ptr?
for example :)
LGTM. If we can fix any memory issue before landing, that's perfect; otherwise we can check and fix it later.
@ryanleary Hello Ryan, sorry to hijack this PR, but it looks like we have some contributor willing to spend time building stuff for Windows, and currently hitting some roadblocks on
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
This PR switches the native client code (and clients that use it) to use ctcdecode instead of the TensorFlow decoder. It does not change any of the training code, nor does it remove the TensorFlow decoder; that will be done as part of #1675. With this PR there's a disconnect between training code and clients, and DeepSpeech.py WER reports, for example, won't match the results of our clients. On the other hand, it does let us get the benefits of ctcdecode out to users more quickly.
I recommend reviewing commit by commit. They all build on top of each other and the code builds and works at any given commit in this branch. Here's a breakdown:
I'm not asking for review yet because I want to see how this does on the tests.