[MXNET-807] Support integer label type in ctc_loss operator #12468
Conversation
src/operator/contrib/ctc_loss-inl.h
Outdated
PackLabelByLength(labels, in_data[kLabelLength].get<xpu, 1, DType>(s),
                  &packed_labels, &label_lengths);
} else {
  exceed_cudnn_limit = LabelTensorToPackedVector(labels, param_.blank_label == 0?0:-1,
Some formatting issues with whitespaces and indentation
Sorry, what exactly was the issue? make lint seems to pass.
Sorry, should be 0 ? 0 : -1.
@apeforest Could you rebase your change on latest master?
@apeforest If the ctc_loss operator is complete, we should consider moving this to operator/nn and out of contrib.
@lebeg @samskalicky I have merged from master and the check now passes. Please review the PR again. Thanks
Hi @szha, @samskalicky asks if we could move this operator from contrib to mxnet regular. Do you have any suggestion? Thanks!
@Jerryzcn @zhiheng-huang you're likely interested in this.
Please get rid of all indentation changes.
@szha The extra indentation in the block was due to the addition of the macro MSHADOW_TYPE_SWITCH on line 258. Other places were due to make lint failures.
src/operator/contrib/ctc_loss-inl.h
Outdated
enum CTCLossOpForwardResource { kTempSpace };
enum CTCLossOpInputs { kData, kLabel };
enum CTCLossOpOutputs { kOut, kGrad };
enum CTCLossOpForwardResource { kTempSpace };
Which case does this part fall into? How were we able to check it in without breaking the master build if it's either of the cases?
Sorry, overlooked these lines. I have removed them.
src/operator/contrib/ctc_loss-inl.h
Outdated
Tensor<xpu, 3, real_t> data_grad_computed =
    out_data[ctc_loss::kGrad].get<xpu, 3, real_t>(s);
    out_data[ctc_loss::kGrad].get<xpu, 3, real_t>(s);
same here.
Removed. However, this 4-space indentation seems to violate the Google C++ style guide. Are we ignoring it in the lint?
https://google.github.io/styleguide/cppguide.html
Actually, there already are tests.
@@ -244,6 +244,64 @@ def assert_match(inputs, x, y, threshold, is_ascend=False):
    assert_match([[0.5, 0.6], [0.1, 0.2], [0.3, 0.4]], [1, -1, 0], [2, 0], 1e-12, False)
    assert_match([[0.5, 0.6], [0.1, 0.2], [0.3, 0.4]], [-1, 0, 1], [1, 2], 100, True)

def test_ctc_loss_op():
CTC loss tests can be found at https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_operator.py#L4500, and integration tests at https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_loss.py#L186. Test cases are from hand-calculated examples.
Feel free to add test cases for large labels there.
Thanks for the reference. Moving the unit test there.
loss = mx.nd.contrib.ctc_loss(data=data, label=label)
loss = mx.nd.make_loss(loss)
expected_output = [9.604521, 7.096151, 4.906869, 5.5237527, 5.9895644, 5.584548,
                   5.528411, 5.765914, 6.740701, 5.2625823]
This testing strategy (i.e., comparing the output from random inputs and labels under a fixed seed against recorded output) is not meaningful and does not guarantee anything. It merely increases the line coverage.
Did not notice the unit test in test_operator.py. I have removed this one.
Also, since this is still using the legacy op interface, would you mind adopting the new operator interface for this?
@szha This PR is to fix the unsupported integer label type. If we were to refactor this operator altogether, I would prefer to do it in another PR. What do you think? Also, as @samskalicky suggested, do you think it is mature enough to move this operator from contrib to regular? If so, we can create another ticket to perform this migration together with the refactoring.
label_len = 10
num_classes = 6000
x = np.random.uniform(size=(seq_len, batch_size, num_classes))
y = np.random.randint(0, num_classes, size=(batch_size, label_len))
Again, this does not seem like a good way of testing this.
Any suggestion on how to test the large number of classes? I could compare this with the WarpCTC implementation's result if that can be treated as golden.
Make a small example, calculate the value, and test for that, like in any other CTC tests. Since this is for testing the type, the batch size and sequence lengths are irrelevant.
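To illustrate the kind of hand-calculated check the reviewer describes, here is a sketch (plain NumPy/Python, not MXNet code) that computes CTC loss for a tiny example with the standard forward (alpha) recursion and cross-checks it against a brute-force sum over all alignments. Assumptions: blank occupies index 0 (the blank_label='first' convention), and the helper names are my own.

```python
import itertools
import math

import numpy as np

BLANK = 0  # assumption: blank occupies class index 0

def ctc_loss(probs, label):
    """CTC negative log-likelihood via the standard forward (alpha) recursion.

    probs: (T, C) array of per-step class probabilities; label: class indices.
    """
    T, _ = probs.shape
    ext = [BLANK]
    for c in label:
        ext += [c, BLANK]              # interleave blanks: b, l1, b, l2, b, ...
    S = len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, BLANK]
    alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a += alpha[t - 1, s - 1]
            # the skip transition is allowed unless the symbol is blank or repeats
            if s > 1 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]
    return -math.log(alpha[T - 1, S - 1] + alpha[T - 1, S - 2])

def brute_force_loss(probs, label):
    """Sum path probabilities over every length-T alignment collapsing to label."""
    T, C = probs.shape
    total = 0.0
    for path in itertools.product(range(C), repeat=T):
        collapsed = [k for k, _ in itertools.groupby(path)]  # drop repeats first
        collapsed = [k for k in collapsed if k != BLANK]     # then drop blanks
        if collapsed == list(label):
            total += float(np.prod([probs[t, k] for t, k in enumerate(path)]))
    return -math.log(total)

probs = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3],
                  [0.1, 0.2, 0.7]])
# Both methods should agree on this hand-checkable example.
assert abs(ctc_loss(probs, [1, 2]) - brute_force_loss(probs, [1, 2])) < 1e-9
print(ctc_loss(probs, [1, 2]))  # -log(0.429)
```

The brute force is only feasible for tiny T and C, but it gives an independently verifiable expected value for a unit test, which is exactly what a fixed-seed comparison against recorded output does not.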
The label type is tested in line 4520. This test case is to test the large number of classes that caused the crash reported in issue #10995.
@with_seed(1)
def test_ctc_loss_with_large_classes():
    ctx = default_context()
    batch_size = 1024
How does this help to verify the correctness?
I simply used the example reported in the original issue to make sure this fix addressed that.
The issue that needs testing is the type of the labels, so a large batch size doesn't seem helpful or necessary for verifying the correctness.
Also, tests with fixed seed are treated as a test quality issue and are being eliminated right now.
The label type is tested in line 4520. This test case is to test the large number of classes that caused the crash reported in issue #10995.
OK, then make a test for it. Batch size is still not relevant, is it?
It is not the batch_size in training. It is the size of the vocabulary. We need this variable to create the 3D tensor.
Updated the variable name and removed the fixed seed.
No, it really is not the vocabulary size, regardless of how you name it. Please check the API doc and see its usage.
Sorry for my misunderstanding the API. I have updated the unit tests based on your suggestion. Please review it again. Thanks!
data = mx.nd.array(x, ctx=ctx)
label = mx.nd.array(y, ctx=ctx)
loss = mx.nd.contrib.ctc_loss(data=data, label=label)
assert loss.asnumpy().shape[0] == m
Why are you testing for shape?
Just to test that the operator does not crash with a large number of classes.
This test does not crash on the master branch without the change either.
That's true. This unit test is not to test my fix. It is to test an earlier PR, #11834, which did not include a unit test but was merged somehow.
Thanks for that. Still, the batch size is unnecessarily large. Why not make the test run faster? Also, there's still no test that covers the loss of precision problem that the integer label type solves, which is part of your fix. Would you mind adding that please?
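As background on the precision point raised here (my illustration, not part of the PR): float32 carries a 24-bit significand, so label indices stored as float32 stop being exactly representable beyond 2**24, which is one concrete reason an integer label type is safer.

```python
import numpy as np

# float32 has a 24-bit significand, so not every integer above 2**24 is
# representable; a large label id stored as float32 can silently change.
print(int(np.float32(2**24)))      # 16777216 is representable exactly
print(int(np.float32(2**24 + 1)))  # 16777217 rounds to 16777216
```

Vocabulary sizes in the thousands are well below this limit, but an integer label type removes the concern entirely rather than relying on the values staying small.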
Updated the batch size to 2.
Created a ticket https://issues.apache.org/jira/browse/MXNET-912 to move this operator from contrib to regular.
The ctc_loss operator yields different results in Python 3 and Python 2 environments, breaking the newly added unit test. I am investigating the root cause.
…2468)

* Support integer type in ctc_loss
* Support any data type in ctc_loss operator
* Enable integer type in labels and fix lint errors
* Fix compilation error in GPU
* Add unit tests
* Undo indentation
* Undo blank line
* Undo blank line
* Add unit test for large number of classes
* move unit tests to test_operator.py per reviewer advice
* update unit test
* update unit test
* update unit test using random seed
* Update unit test
* Fix unit test difference Python2 and Python3
Description
This PR fixes part of the issues in #10995. It supports integer types in labels, just as WarpCTC does.
Also added a unit test for the large-class issue fixed by another PR, #11834.