
add a compiler flag to use int64 as tensor size #14570

Merged (37 commits) on Apr 23, 2019

Conversation

@apeforest (Contributor) commented Mar 29, 2019

Description

This PR fixes #14496.

The performance degradation was introduced by PR #11742. I verified the degradation in the transpose operator using the script below:

import mxnet as mx
import time
import numpy as np

# Benchmark transpose on cubic tensors of increasing size; use fewer
# iterations for the larger (slower) sizes.
sizes = [10, 50, 100, 200, 500]
iters = [10000, 1000, 500, 200, 20]
times = []
for size in range(len(sizes)):
    data = []
    s = sizes[size]
    print(s)
    for i in range(iters[size]):
        x = mx.nd.ones((s, s, s))
        mx.nd.waitall()              # make sure setup is finished before timing
        start = time.time()
        y = mx.nd.transpose(x, (2, 0, 1))
        mx.nd.waitall()              # block until the async transpose completes
        data.append((time.time() - start) * 1000)
        # print(data[-1])
    times.append(data)

print('mxnet version: %s' % mx.__version__)
for s in range(len(sizes)):
    print('--------------------')
    print('size: %s' % str(sizes[s]))
    print('p50: %4.2f ms' % np.percentile(times[s], 50))
    print('p90: %4.2f ms' % np.percentile(times[s], 90))
    print('p99: %4.2f ms' % np.percentile(times[s], 99))

Changing the index_t type from int64_t back to int32_t consistently reduces the 50th-percentile runtime of transpose at size 500 from 5000 ms to 2400 ms.

Changing the data type in the transpose expression alone (https://github.com/dmlc/mshadow/blob/master/mshadow/extension/transpose.h#L70) also reduces the 50th-percentile runtime at size 500, to 2600 ms.

I therefore conclude that the performance degradation is caused by the higher cost of arithmetic on 64-bit integers compared with 32-bit integers.

To further isolate the arithmetic operations, I tested them with a small standalone program. The runtime results are shown below:

| Operation | Result    | 32-bit time (ms) | 64-bit time (ms) |
|-----------|-----------|------------------|------------------|
| Add       | 49995000  | 1359             | 1971             |
| Add + Mul | 349965000 | 1196             | 3477             |
| Add + Div | 7137858   | 2878             | 8499             |

(Both integer widths produce identical results; only the runtime differs.)
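
The benchmark program itself is not included above. A minimal sketch of this kind of comparison (my reconstruction under assumed loop bounds, not the author's exact code; the inner bound of 10000 matches the reported checksum 0 + 1 + ... + 9999 = 49995000, while the repetition count is an illustrative guess):

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>

// Time a simple add-reduction with a given integer width.
template <typename T>
void BenchAdd(const char* label) {
  volatile T sink = 0;  // volatile keeps the optimizer from deleting the loops
  auto start = std::chrono::steady_clock::now();
  for (int rep = 0; rep < 100000; ++rep) {
    T result = 0;
    for (T i = 0; i < 10000; ++i) result += i;  // sum 0..9999 = 49995000
    sink = result;
  }
  auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                std::chrono::steady_clock::now() - start).count();
  std::printf("result = %lld\n%s time in ms %lld\n",
              static_cast<long long>(sink), label, static_cast<long long>(ms));
}

int main() {
  BenchAdd<int32_t>("Add 32-bit");
  BenchAdd<int64_t>("Add 64-bit");
  return 0;
}
```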

I can think of three solutions to this problem:
(1) Add a compilation flag to choose the data type for tensor size (this PR)
(2) Add an environment variable to choose the data type for tensor size at runtime
(3) Choose the data type for tensor size at runtime based on the size of the tensor

Given the expression templates used in mshadow for operators, either (2) or (3) would require significant changes to the mshadow library. (1) can serve as a quick fix for the performance degradation reported in several issues: #14496, #13928, #14563, #14569. A sketch of what the flag selects is shown below.
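
Concretely, the flag amounts to selecting the index typedef at compile time. A minimal sketch (the macro name MSHADOW_INT64_TENSOR_SIZE is used here for illustration; see the PR diff for the exact spelling):

```cpp
#include <cstdint>

// Solution (1): pick the tensor index type when the library is built.
#if MSHADOW_INT64_TENSOR_SIZE == 1
typedef int64_t index_t;   // large-tensor builds: sizes may exceed 2^31 - 1
#else
typedef int32_t index_t;   // default builds: keep the faster 32-bit arithmetic
#endif
```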

Any other suggestion is appreciated.

This PR also depends on dmlc/mshadow#371


@apeforest (Contributor, Author)

@pengzhao-intel @TaoLv Please help review. Thanks!

@abhinavs95 (Contributor)

@mxnet-label-bot add [pr-work-in-progress, backend]

@marcoabreu added labels Backend (Issues related to the backend of MXNet) and pr-work-in-progress (PR is still work in progress) on Mar 29, 2019
@TaoLv (Member) commented Mar 30, 2019

Do we need to change CI to cover this new flag? Do we need to add it to the runtime feature set? @larroy

@wkcn (Member) commented Mar 30, 2019

Thank you for the quick fix!

I think we can first apply solution (3), 'Choose data type for tensor size at runtime based on the size of the tensor', in the performance bottlenecks.

@pengzhao-intel (Contributor)

Agree with adding the new flag to switch to int64 when users explicitly know they want to use large arrays, as in DGL. For the normal case, int32 is enough and avoids the performance issue.

@TaoLv do we need to switch off MKL-DNN when int64 is enabled, before MKL-DNN 1.0?

@wkcn would you mind describing the advantage of (3)?

@wkcn (Member) commented Mar 30, 2019

@pengzhao-intel Yes. For users, it is confusing to have two versions of MXNet: mxnet and mxnet-largetensor. In other words, users would need to rebuild from source whenever they want to use large tensors, which is too complex.

Besides, when a program needs both large-tensor support and high performance on small tensors, only solution (3) meets the requirement.

We can implement (3) as follows:

// Dispatch on the tensor size: use a 32-bit index type when the size
// fits, and fall back to a 64-bit index type otherwise.
#define KERNEL_LAUNCH_TYPE_SWITCH(N, ...)  \
  do {                                     \
    if ((N) <= INT32_MAX) {                \
      typedef int32_t IndexType;           \
      { __VA_ARGS__ }                      \
    } else {                               \
      typedef int64_t IndexType;           \
      { __VA_ARGS__ }                      \
    }                                      \
  } while (0)
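
For illustration, a hypothetical element-wise kernel using this macro might look like the following (the Copy function and its signature are invented for this example, not taken from the PR; the macro above is assumed to be in scope):

```cpp
#include <cstdint>

// Hypothetical usage of KERNEL_LAUNCH_TYPE_SWITCH: the loop body is
// instantiated once per index width, and the cheap 32-bit version runs
// whenever the element count fits in int32.
void Copy(const float* in, float* out, int64_t n) {
  KERNEL_LAUNCH_TYPE_SWITCH(n, {
    for (IndexType i = 0; i < static_cast<IndexType>(n); ++i) {
      out[i] = in[i];
    }
  });
}
```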

@apeforest (Contributor, Author)

@wkcn Thanks for the suggestion. This macro works for certain simple operators, but I found it difficult to apply across all operators given the differing architectures of their implementations. E.g., the transpose operator currently uses the expression-template pattern in mshadow, and applying this macro to operators in mshadow would be major surgery. Besides, int64_t is used as the type of the data members in the tensor data structure, so this approach either would not change the tensor's internal methods or would require static casts. Any other thoughts on this?
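
To illustrate the difficulty (a toy sketch, not mshadow code): in an expression template the index type is baked into the expression's member types at compile time, so a runtime switch would force every expression, and everything composed with it, to be instantiated for both widths:

```cpp
#include <cstdint>

// Toy expression template: the index type is part of the type itself.
// A runtime size check cannot change IndexT after the expression has
// been instantiated and composed into a larger expression tree.
template <typename IndexT>
struct TransposeExp {
  const float* src;
  IndexT height, width;           // data members use IndexT directly
  float Eval(IndexT y, IndexT x) const {
    return src[x * height + y];   // index arithmetic happens in IndexT
  }
};
```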

@TaoLv (Member) left a comment

Please add the new flag to CI and make sure we have appropriate unit tests for it. Thank you.

@larroy (Contributor) left a comment

I'm not a big fan of using a signed type for dimensions, but I find it non-blocking.

As this has been clarified, we need a signed type.

@samskalicky (Contributor)

@larroy the dimension can be -1 if it's unknown, or when -1 refers to the last dimension. So we do have to use signed types.

@larroy (Contributor) commented Apr 8, 2019

@reminisce why aren't we using ndim=1 for scalars?

@reminisce (Contributor)

@larroy Tensors with ndim=1 are not scalars, but 1D tensors, aka vectors.

@apeforest changed the title from "[WIP] use a compile flag to use int64 tensor size" to "use a compile flag to use int64 tensor size" on Apr 9, 2019
@apeforest (Contributor, Author) commented Apr 20, 2019

> @apeforest Thank you. That sounds reasonable. But I remember I couldn't find the nightly test report for large tensor before. Do you know where we can read it once this PR is merged?

@TaoLv, I think the nightly tests are here: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/NightlyTests/activity

@marcoabreu Could you please confirm? Thanks!

@apeforest added the pr-awaiting-review label (PR is waiting for code review) and removed pr-work-in-progress on Apr 20, 2019
@apeforest (Contributor, Author)

@TaoLv Do you have any more concerns about this PR? Thanks!

@apeforest (Contributor, Author)

> Are we checking for narrowing / overflow since we have this additional wider type for dimensions?

This PR itself does not narrow or widen the data type for dimensions. It simply adds a compiler flag to control it.

@apeforest (Contributor, Author)

@eric-haibin-lin @yuxihu @samskalicky @anirudh2290 @reminisce Please let me know if your concerns have been addressed. Thanks!

@samskalicky (Contributor) left a comment

LGTM

@marcoabreu (Contributor)

LGTM from a CI perspective

@TaoLv (Member) commented Apr 23, 2019

Thank you for the fix, @apeforest. Now it's approved. Just to copy some comments from the mshadow PR here: the int64 data type only affects the definition of some variables and the total size of an NDArray. We won't pass any NDArray with dim[x] > INT32_MAX into an operator, so it will not break the existing definitions of the BLAS and MKL-DNN APIs.
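
In other words, every individual dimension still fits a 32-bit integer even when the total element count does not. A small illustration with invented shape values:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  // Each dimension fits comfortably in int32, so per-dimension BLAS and
  // MKL-DNN arguments are unaffected by the flag...
  int32_t dims[3] = {2048, 2048, 1024};
  // ...but the total element count, 2048 * 2048 * 1024 = 2^32, overflows
  // int32, so the NDArray's total size must be held in a 64-bit type.
  int64_t total = static_cast<int64_t>(dims[0]) * dims[1] * dims[2];
  std::printf("total = %lld (INT32_MAX = %d)\n",
              static_cast<long long>(total), INT32_MAX);
  return 0;
}
```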

@apeforest (Contributor, Author)

@marcoabreu @TaoLv @anirudh2290 Could you please help merge it if it's good to ship? Thanks!

@apeforest added the pr-awaiting-merge label (Review and CI is complete. Ready to Merge) and removed pr-awaiting-review on Apr 23, 2019
@yuxihu (Member) left a comment

LGTM!

'Test Large Tensor Size: CPU': {
  node(NODE_LINUX_CPU) {
    ws('workspace/large_tensor-cpu') {
      utils.unpack_and_init('cpu_int64', mx_cmake_lib)
A contributor left a comment:

I guess this should have been ubuntu_cpu_int64.

The nightly tests are failing because no one is publishing this stash: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/NightlyTestsForBinaries/detail/master/292/pipeline

As you can see from "Build / CPU: USE_INT64_TENSOR_SIZE", it publishes ubuntu_cpu_int64.

@edisongustavo mentioned this pull request Apr 26, 2019
marcoabreu pushed a commit that referenced this pull request Apr 26, 2019

The PR #14570 has introduced a bug where the "stash name" is not matching:

- It is **packed** as `ubuntu_cpu_int64`
- It is **unpacked** as `cpu_int64`
access2rohit pushed a commit to access2rohit/incubator-mxnet that referenced this pull request May 14, 2019 (same stash-name fix as above).
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
* use a compile flag to use int64 tensor size

* use personal mshadow repo

* update data type

* update make config

* change size_t to index_t and add documentation

* update mshadow submodule to master

* fix compilation warning

* fix compiler warning

* fix compiler warning

* fix compiler warning

* fix compiler warning

* fix compiler error

* change nnvm::Tuple to mxnet::Tuple

* fix compiler warning

* fix compiler warning

* fix compiler warning

* fix compiler warning

* fix compiler warning

* fix lint

* update CI runtime_functons

* update runtime function

* correct runtime_functions

* udpate runtime functions

* add nightly test for large tensor

* update Jenkins files to test new compiler flag

* fix CI

* add runtime feature detect for the compiler flag

* change build from make to cmake

* fix CI

* move tests to nightly
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019 (same stash-name fix as above).
@TaoLv TaoLv mentioned this pull request Aug 7, 2019
Merging this pull request may close: performance degradation from 1.3.1 to 1.4.0 (#14496).