
[MXNET-807] Support integer label type in ctc_loss operator #12468

Merged: szha merged 17 commits into apache:master from apeforest's bugfix/ctc_loss_integer branch on Sep 12, 2018

Conversation

apeforest (Contributor) commented on Sep 6, 2018

Description

This PR fixes part of the issues in #10995: it supports an integer label type, just as WarpCTC does.
It also adds a unit test for the large-class issue fixed by an earlier PR, #11834.
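
For illustration, here is a minimal sketch of calling the operator with an integer label tensor once this change is in place (shapes and values are illustrative, not taken from this PR):

import mxnet as mx

# data: (sequence_length, batch_size, alphabet_size)
data = mx.nd.random.uniform(shape=(20, 2, 6000))

# With the default blank_label='first', 0 is the blank/padding value and real
# labels range from 1 to alphabet_size - 1. The labels can now be passed as an
# integer NDArray instead of having to be cast to float.
label = mx.nd.array([[1, 2, 3, 0, 0],
                     [4, 5, 0, 0, 0]], dtype='int32')

loss = mx.nd.contrib.ctc_loss(data=data, label=label)
print(loss.shape)  # one loss value per batch element: (2,)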

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
    • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
    • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
    • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
    • For user-facing API changes, the API doc string has been updated.
    • For new C++ functions in header files, their functionality and arguments are documented.
    • For new examples, a README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
    • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with this change

Changes

  • Support integer label types
  • Added unit tests for the ctc_loss operator covering CPU, GPU, and integer labels (a rough sketch of such a test follows below)
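
As a rough sketch (not the PR's actual test code), such a check could compare the loss computed from float and integer label dtypes on a given context, assuming integer dtypes such as int32 are accepted after this change:

import numpy as np
import mxnet as mx

def check_ctc_loss_label_dtype(ctx):
    # Small, fixed-size inputs: (seq_len=10, batch=2, alphabet=5); 0 pads the labels.
    data = mx.nd.random.uniform(shape=(10, 2, 5), ctx=ctx)
    labels_np = np.array([[2, 3, 0], [1, 4, 2]])
    reference = mx.nd.contrib.ctc_loss(
        data=data, label=mx.nd.array(labels_np, ctx=ctx, dtype='float32')).asnumpy()
    int_loss = mx.nd.contrib.ctc_loss(
        data=data, label=mx.nd.array(labels_np, ctx=ctx, dtype='int32')).asnumpy()
    # Integer and float labels should yield the same loss values.
    np.testing.assert_allclose(int_loss, reference, rtol=1e-5, atol=1e-5)

check_ctc_loss_label_dtype(mx.cpu())
# check_ctc_loss_label_dtype(mx.gpu(0))  # when a GPU build is available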

Comments

PackLabelByLength(labels, in_data[kLabelLength].get<xpu, 1, DType>(s),
&packed_labels, &label_lengths);
} else {
exceed_cudnn_limit = LabelTensorToPackedVector(labels, param_.blank_label == 0?0:-1,

Contributor:

Some formatting issues with whitespaces and indentation

Contributor Author:

Sorry, what exactly was the issue? make lint seems to pass.

lebeg (Contributor), Sep 6, 2018:

Sorry, should be 0 ? 0 : -1

lebeg (Contributor) commented on Sep 6, 2018

@apeforest Could you rebase your change on latest master?

samskalicky (Contributor):

@apeforest If the ctc_loss operator is complete, we should consider moving this to operator/nn and out of contrib.

apeforest (Contributor Author):

@lebeg @samskalicky I have merged from master and the check now passed. Please review the PR again. Thanks

apeforest (Contributor Author):

Hi @szha, @samskalicky asked if we could move this operator from contrib to the regular operator namespace. Do you have any suggestions? Thanks!

szha (Member) commented on Sep 6, 2018

@Jerryzcn @zhiheng-huang you're likely interested in this.

szha (Member) requested changes on Sep 6, 2018:

Please get rid of all indentation changes.

apeforest (Contributor Author):

@szha The extra indentation in the block comes from wrapping it in the MSHADOW_TYPE_SWITCH macro on line 258. The other indentation changes were made to fix make lint failures.

enum CTCLossOpForwardResource { kTempSpace };
enum CTCLossOpInputs { kData, kLabel };
enum CTCLossOpOutputs { kOut, kGrad };
enum CTCLossOpForwardResource { kTempSpace };

Member:

Which case does this part fall into? How were we able to check it in without breaking the master build, if it is either of those cases?

Contributor Author:

Sorry, overlooked these lines. I have removed them.


Tensor<xpu, 3, real_t> data_grad_computed =
out_data[ctc_loss::kGrad].get<xpu, 3, real_t>(s);
out_data[ctc_loss::kGrad].get<xpu, 3, real_t>(s);

Member:

same here.

Contributor Author:

Removed. However, this 4-space indentation seems to violate the Google C++ style guide. Are we ignoring it in the lint?
https://google.github.io/styleguide/cppguide.html

szha dismissed their stale review on September 6, 2018, 17:56: "noise in changes was removed."

szha (Member) left a comment:

Actually, there already are tests.

@@ -244,6 +244,64 @@ def assert_match(inputs, x, y, threshold, is_ascend=False):
assert_match([[0.5, 0.6], [0.1, 0.2], [0.3, 0.4]], [1, -1, 0], [2, 0], 1e-12, False)
assert_match([[0.5, 0.6], [0.1, 0.2], [0.3, 0.4]], [-1, 0, 1], [1, 2], 100, True)

def test_ctc_loss_op():

Member:

Feel free to add test cases for large labels there.

Contributor Author:

Thanks for the reference. Moving the unit test there.

loss = mx.nd.contrib.ctc_loss(data=data, label=label)
loss = mx.nd.make_loss(loss)
expected_output = [9.604521, 7.096151, 4.906869, 5.5237527, 5.9895644, 5.584548,
5.528411, 5.765914, 6.740701, 5.2625823]

Member:

This testing strategy (comparing output computed from random inputs and labels under a fixed seed against recorded values) is not meaningful and does not guarantee anything. It merely increases line coverage.

Contributor Author:

I did not notice the existing unit test in test_operator.py. I have removed this one.

szha (Member) commented on Sep 6, 2018

Also, since this is still using the legacy op interface, would you mind adopting the new operator interface for this?

apeforest (Contributor Author):

@szha This PR is to fix the unsupported integer label type. If we were to refactor this operator altogether, I would prefer to do it in another PR. What do you think?

Also, as @samskalicky suggested, do you think this operator is mature enough to be moved from contrib to regular? If so, we can create another ticket to perform this migration together with the refactoring.

label_len = 10
num_classes = 6000
x = np.random.uniform(size=(seq_len, batch_size, num_classes))
y = np.random.randint(0, num_classes, size=(batch_size, label_len))

Member:

again this does not seem like a good way of testing this.

Contributor Author:

Any suggestions for testing the large number of classes? I could compare the result with the WarpCTC implementation if that can be treated as the golden reference.

Member:

Make a small example, calculate the value, and test for that, like in any other CTC test. Since this is for testing the type, the batch size and sequence lengths are irrelevant.

Contributor Author:

The label type is tested on line 4520. This test case covers the large number of classes that caused the crash reported in issue #10995.

@with_seed(1)
def test_ctc_loss_with_large_classes():
ctx = default_context()
batch_size = 1024

Member:

How does this help to verify the correctness?

Contributor Author:

I simply used the example reported in the original issue to make sure this fix addressed that.

Member:

The issue that needs testing is the type of the labels, so a large batch size doesn't seem helpful or necessary for verifying the correctness.

Member:

Also, tests with fixed seed are treated as a test quality issue and are being eliminated right now.

Contributor Author:

The label type is tested on line 4520. This test case covers the large number of classes that caused the crash reported in issue #10995.

Member:

OK, then make a test for it. Batch size is still not relevant, is it?

Contributor Author:

It is not the batch_size in training. It is the size of the vocabulary. We need this variable to create the 3D tensor.

Contributor Author:

Updated the variable name and removed the fixed seed.

Member:

No, it really is not the vocabulary size, regardless of how you name it. Please check the API doc and see how it is used.

Contributor Author:

Sorry for misunderstanding the API. I have updated the unit tests based on your suggestion. Please review them again. Thanks!

data = mx.nd.array(x, ctx=ctx)
label = mx.nd.array(y, ctx=ctx)
loss = mx.nd.contrib.ctc_loss(data=data, label=label)
assert loss.asnumpy().shape[0] == m

Member:

Why are you testing for shape?

apeforest (Contributor Author), Sep 7, 2018:

Just to test that the operator does not crash with a large number of classes.

Member:

This test does not crash on the master branch without the change either.

apeforest (Contributor Author), Sep 7, 2018:

That's true. This unit test is not to test my fix. It is to test an earlier PR #11834 which did not include a unit test but was merged somehow.

Member:

Thanks for that. Still, the batch size is unnecessarily large. Why not make the test run faster? Also, there's still no test that covers the loss of precision problem that the integer label type solves, which is part of your fix. Would you mind adding that please?
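
For reference, a minimal illustration of the precision problem being described (values chosen for illustration only, not taken from this PR):

import numpy as np

# float32 has a 24-bit significand, so class indices above 2**24 are not all
# exactly representable and two distinct labels can collapse to the same value.
idx = 2**24 + 1  # 16777217
print(np.float32(idx) == np.float32(idx - 1))  # True  -> distinct classes collide
print(np.int32(idx) == np.int32(idx - 1))      # False -> integer labels stay distinct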

Contributor Author:

Updated the batch size to 2.

apeforest (Contributor Author):

Created a ticket https://issues.apache.org/jira/browse/MXNET-912 to move this operator from contrib to regular.

apeforest (Contributor Author):

The ctc_loss operator yields different results in Python 3 and Python 2 environments, breaking the newly added unit test. I am investigating the root cause.

szha merged commit dcf4e12 into apache:master on Sep 12, 2018
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request Sep 19, 2018
[MXNET-807] Support integer label type in ctc_loss operator (#12468)

* Support integer type in ctc_loss

* Support any data type in ctc_loss operator

* Enable integer type in labels and fix lint errors

* Fix compilation error in GPU

* Add unit tests

* Undo indentation

* Undo blank line

* Undo blank line

* Add unit test for large number of classes

* move unit tests to test_operator.py per reviewer advice

* update unit test

* update unit test

* update unit test using random seed

* Update unit test

* Fix unit test difference Python2 and Python3
apeforest deleted the bugfix/ctc_loss_integer branch on January 7, 2020, 22:49