
Remove old CTC decoder (Fixes #1675) #1696

Merged
merged 13 commits into master on Nov 12, 2018

Conversation

@reuben (Contributor) commented Nov 3, 2018

  • Remove old decoder code and build system integration
  • Remove references in code and tests
  • Update documentation

@reuben reuben force-pushed the remove-old-ctc branch 14 times, most recently from 8d022cc to da29a5c Compare November 5, 2018 01:42
@reuben (Contributor, Author) commented Nov 5, 2018

@lissyx I'm having trouble getting the ctcdecode package to build on ARM64, getting relocation errors in libm.a when doing the final linking of the Python extension. I see that in native_client/BUILD we link against libstdc++.a, is that a requirement on our ARM platforms? The strange thing is, it's building fine on ARMv7...

@lissyx (Collaborator) commented Nov 5, 2018

> @lissyx I'm having trouble getting the ctcdecode package to build on ARM64, getting relocation errors in libm.a when doing the final linking of the Python extension. I see that in native_client/BUILD we link against libstdc++.a, is that a requirement on our ARM platforms? The strange thing is, it's building fine on ARMv7...

They should share the same linking strategy, but maybe you lack some -PIE or something like that?
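For context, a generic way to force position-independent code throughout such a build might look like the sketch below. This is an assumption-laden illustration only: whether the ctcdecode Makefile actually honors these environment variables, and whether PIC is the root cause here, is not established in this thread.

```shell
# Sketch only (assumption): if the final shared-object link reports
# relocation errors, one common workaround is to compile every object as
# position-independent code before setup.py runs the link step.
export CFLAGS="${CFLAGS:-} -fPIC"
export CXXFLAGS="${CXXFLAGS:-} -fPIC"
echo "CFLAGS=${CFLAGS}"
```

As the thread goes on to show, the actual failure here turned out to be a library path problem rather than a missing PIC flag.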

@@ -1,7 +1,7 @@
build:
template_file: test-linux-opt-base.tyml
dependencies:
- "linux-amd64-ctc-opt"
- "linux-amd64-cpu-opt"
@lissyx (Collaborator) commented on the diff:

I don't think that's a good idea: the completion time of linux-amd64-cpu-opt is much higher, so it's going to delay everything behind it. We should re-use linux-amd64-ctc-opt to build the new Python wheel instead.

@reuben (Contributor, Author):

The problem is that it's not one new Python wheel but several, one for each architecture/Python version combination, so I tried to leverage the existing infrastructure for the CPU builds.

@lissyx (Collaborator):

Sure, but IMHO it's still better to do that outside; we can easily hack around do_deepspeech_python_build.

@reuben (Contributor, Author) commented Nov 5, 2018

Initially the compiler error mentioned -fPIC, but compiling with that flag made no difference; it still fails with relocation issues in libm.a.

@reuben (Contributor, Author) commented Nov 5, 2018

I think the problem is that definitions.mk doesn't cover everything needed to get a build working on ARM; it's missing some of what was added in the tensorflow CROSSTOOL file, like pointing to a different linker.

@lissyx (Collaborator) commented Nov 5, 2018

> I think the problem is that definitions.mk doesn't cover all that's needed to get a build working on ARM, as it's missing some of the stuff that was added in the tensorflow CROSSTOOL file like pointing to a different linker.

More simply, it's just unable to find the proper libm because of absolute paths in the ld scripts. Quick hack that works for me:

diff --git a/native_client/ctcdecode/Makefile b/native_client/ctcdecode/Makefile
index 14cc591..39c5e63 100644
--- a/native_client/ctcdecode/Makefile
+++ b/native_client/ctcdecode/Makefile
@@ -4,6 +4,8 @@ include ../definitions.mk

 NUM_PROCESSES ?= 1

+LDFLAGS_NEEDED += $(RASPBIAN)/lib/aarch64-linux-gnu/libm-2.24.so
+
 all: bindings

 clean:
@@ -16,4 +18,4 @@ bindings:
        AS=$(AS) CC=$(CC) CXX=$(CXX) LD=$(LD) CFLAGS="$(CFLAGS) $(CXXFLAGS)" LDFLAGS="$(LDFLAGS_NEEDED)" $(PYTHON_PATH) $(NUMPY_INCLUDE) python ./setup.py build_ext --num_processes $(NUM_PROCESSES) $(PYTHON_PLATFORM_NAME) $(SETUP_FLAGS)
        find temp_build -type f -name "*.o" -delete
        AS=$(AS) CC=$(CC) CXX=$(CXX) LD=$(LD) CFLAGS="$(CFLAGS) $(CXXFLAGS)" LDFLAGS="$(LDFLAGS_NEEDED)" $(PYTHON_PATH) $(NUMPY_INCLUDE) python ./setup.py bdist_wheel --num_processes $(NUM_PROCESSES) $(PYTHON_PLATFORM_NAME) $(SETUP_FLAGS)
-       rm -rf temp_build
\ No newline at end of file
+       rm -rf temp_build
diff --git a/native_client/ctcdecode/setup.py b/native_client/ctcdecode/setup.py
index e6b3855..ec31f9b 100644
--- a/native_client/ctcdecode/setup.py
+++ b/native_client/ctcdecode/setup.py
@@ -89,10 +89,6 @@ LIBS = []
 ARGS = ['-O3', '-DNDEBUG', '-DKENLM_MAX_ORDER=6', '-std=c++11',
         '-Wno-unused-local-typedef', '-Wno-sign-compare']

-if os.environ['TARGET'] == 'rpi3-armv8':
-    ARGS.append('-static-libstdc++')
-
-
 decoder_module = Extension(
     name='ds_ctcdecoder._swigwrapper',
     sources=['swigwrapper.i'] + FILES + glob.glob('*.cpp'),

@reuben (Contributor, Author) commented Nov 5, 2018

We have liftoff! Thanks @lissyx!

@reuben reuben force-pushed the remove-old-ctc branch 6 times, most recently from 53ed610 to c5d6c6d Compare November 8, 2018 15:31
@reuben (Contributor, Author) commented Nov 9, 2018

In 38b5447 I've updated the docs and added some convenience code in util/taskcluster.py to get the URL of the ds_ctcdecoder package for easy installation.
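The convenience code itself isn't reproduced in this thread. Purely as an illustration of the idea, a wheel-URL builder might look like the sketch below; the function name, base URL, and tags are all made up here, and the real helper in util/taskcluster.py may be structured quite differently.

```python
# Hypothetical sketch of a decoder-wheel URL builder; the real helper lives
# in util/taskcluster.py and its interface may differ.
def decoder_wheel_url(base_url, version, py_tag, platform):
    # Wheel filenames follow PEP 427: name-version-pytag-abitag-platform.whl
    wheel = "ds_ctcdecoder-{0}-{1}-{1}m-{2}.whl".format(version, py_tag, platform)
    return "{0}/{1}".format(base_url.rstrip("/"), wheel)

url = decoder_wheel_url("https://example.invalid/artifacts",
                        "0.4.0", "cp36", "linux_x86_64")
print(url)
# -> https://example.invalid/artifacts/ds_ctcdecoder-0.4.0-cp36-cp36m-linux_x86_64.whl
```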

@reuben (Contributor, Author) commented Nov 9, 2018

@tilmankamp I ended up removing the COORD global entirely, as it's only used in the train function now. The globals-related code now lives in util/config.py, and the coordinator remains in util/coordinator.py.

@reuben reuben changed the title from "WIP Remove old CTC decoder" to "Remove old CTC decoder (Fixes #1675)" on Nov 9, 2018
@kdavis-mozilla (Contributor) left a review:

The majority of the PR is fine.

However, there are a few places where some substitutions, e.g. cluster to C.cluster, seem to have been missed, and a few other small nits that should be changed, e.g. from X import * to from X import a, b, c.

Review comments (since resolved) were left on: native_client/ctcdecode/scorer.cpp, native_client/ctcdecode/scorer.h, DeepSpeech.py, evaluate.py, util/taskcluster.py, and util/coordinator.py.
@reuben (Contributor, Author) commented Nov 11, 2018

Everything should be addressed now.

@kdavis-mozilla (Contributor):
LGTM

@reuben reuben force-pushed the remove-old-ctc branch 3 times, most recently from 0a38df2 to bd71f83 Compare November 12, 2018 16:07
@reuben reuben merged commit d125acf into master Nov 12, 2018
@reuben reuben deleted the remove-old-ctc branch November 12, 2018 17:55
@JRMeyer (Contributor) commented Nov 15, 2018

@reuben, does this PR mean we don't have to generate the old trie model (i.e. from v0.3.0) for training and new trie model for testing?

@reuben (Contributor, Author) commented Nov 15, 2018 via email

@JRMeyer (Contributor) commented Nov 17, 2018

@reuben I'm loading in an older checkpoint (v0.3.0), and I get the following error:

KeyError: 'CTCBeamSearchDecoderWithLM'

After some searching, I think this is related to a missing libctc_decoder_with_kenlm.so file (c.f. #1306, #950)

The following is the script I'm running to load the checkpoint:

import tensorflow as tf
from tensorflow.python.tools import freeze_graph
from ds_ctcdecoder import ctc_beam_search_decoder, Scorer

sess = tf.Session()
sess.run(tf.global_variables_initializer())

# Touch tf.contrib.rnn so the contrib RNN ops get registered
tf.contrib.rnn

# Load the native client library so its custom ops are registered
tf.load_op_library('../tmp/native_client/libdeepspeech.so')

new_saver = tf.train.import_meta_graph('../v030/ckpt/model.v0.2.0.meta')
new_saver.restore(sess, '../v030/ckpt/model.v0.2.0')

Is this error a by-product of this PR? Are the older checkpoints on the release page not forward compatible after this PR?

-josh

@reuben (Contributor, Author) commented Nov 17, 2018

This works fine for me:

python -u DeepSpeech.py --checkpoint_dir v0.3_checkpoint --epoch -1 --train_files data/ldc93s1/ldc93s1.csv --dev_files data/ldc93s1/ldc93s1.csv --test_files data/ldc93s1/ldc93s1.csv
Preprocessing ['data/ldc93s1/ldc93s1.csv']
Preprocessing done
Preprocessing ['data/ldc93s1/ldc93s1.csv']
Preprocessing done
W Parameter --validation_step needs to be >0 for early stopping to work
I STARTING Optimization
I Training epoch 250473...
I Training of Epoch 250473 - loss: 50.891705
I FINISHED Optimization - training time: 0:00:30
100% (1 of 1) |##########################################################################################################################################################| Elapsed Time: 0:00:00 Time:  0:00:00
Preprocessing ['data/ldc93s1/ldc93s1.csv']
Preprocessing done
Computing acoustic model predictions...
100% (1 of 1) |##########################################################################################################################################################| Elapsed Time: 0:00:01 Time:  0:00:01
Decoding predictions...
100% (1 of 1) |##########################################################################################################################################################| Elapsed Time: 0:00:00 Time:  0:00:00
Test - WER: 0.363636, CER: 9.000000, loss: 17.909977
--------------------------------------------------------------------------------
WER: 0.363636, CER: 9.000000, loss: 17.909977
 - src: "she had your dark suit in greasy wash water all year"
 - res: "she had a dark suit and grease with water all year"
--------------------------------------------------------------------------------

@reuben (Contributor, Author) commented Nov 17, 2018

(On latest master)

@JRMeyer (Contributor) commented Nov 17, 2018

Strange, but good news.

@lock (bot) commented Jan 2, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Jan 2, 2019