add a compiler flag to use int64 as tensor size #14570
Conversation
@pengzhao-intel @TaoLv Please help to review. Thanks!
@mxnet-label-bot add [pr-work-in-progress, backend]
Do we need to change CI to cover this new flag? Do we need to add it to the runtime feature set? @larroy
Thank you for the quick fix! I think we can first use solution (3), 'Choose data type for tensor size at runtime based on the size of the tensor', to address the performance bottleneck.
Agree with adding the new flag to switch to int64 when users explicitly know they want to use large arrays, like DGL. For the normal case, int32 is enough and has no performance issue. @TaoLv do we need to switch off MKLDNN when int64 is enabled, before MKL-DNN 1.0? @wkcn would you mind explaining the advantage of (3)?
@pengzhao-intel Yes. For users, it is confusing that there would be two versions of mxnet: mxnet and mxnet-largetensor. In other words, users would need to rebuild from source when they want to use large tensors, which is too complex. Besides, when a program needs both large-tensor support and high performance for small tensors, only solution (3) can meet the requirement. We can implement (3) as follows:

```cpp
#define KERNEL_LAUNCH_TYPE_SWITCH(N, ...)  \
  do {                                     \
    if ((N) <= INT32_MAX) {                \
      typedef int32_t IndexType;           \
      { __VA_ARGS__ }                      \
    } else {                               \
      typedef int64_t IndexType;           \
      { __VA_ARGS__ }                      \
    }                                      \
  } while (0)
```
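For what it's worth, a hypothetical usage sketch of the proposed macro (assuming the macro above is in scope; `FillKernel` and `Fill` are invented names for illustration, not MXNet code):

```cpp
#include <cstdint>

// Hypothetical element-wise kernel, templated on the index type that
// KERNEL_LAUNCH_TYPE_SWITCH selects at runtime.
template <typename IndexType, typename DType>
void FillKernel(DType* out, IndexType n, DType value) {
  for (IndexType i = 0; i < n; ++i) {
    out[i] = value;
  }
}

// Dispatcher: tensors with no more than INT32_MAX elements run the
// int32_t instantiation and keep the cheaper 32-bit index arithmetic.
template <typename DType>
void Fill(DType* out, int64_t n, DType value) {
  KERNEL_LAUNCH_TYPE_SWITCH(n,
    FillKernel(out, static_cast<IndexType>(n), value);
  );
}
```

Note the cast at the dispatch boundary: callers holding an int64_t size must static_cast it to the selected IndexType, which is exactly the friction discussed in the next comment.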
@wkcn Thanks for the suggestion. This macro works well for certain simple operators, but I found it difficult to apply to all operators given the varied architecture of operator implementations. E.g., the transpose operator currently uses the expression-template pattern in mshadow, and applying this macro to operators in mshadow looks like major surgery. Besides, int64_t is used as the type of data members in the tensor data structure, so this approach does not directly change the internal methods of tensor and would require static casts. Any other thoughts on this?
Please add the new flag to CI and make sure we have appropriate unit tests for it. Thank you.
I'm not a big fan of using a signed type for dimension, but I find it non-blocking.
As this has been clarified, we need a signed type.
@larroy the dimension can be -1 if it's unknown, or when -1 refers to the last dimension. So we do have to use signed types.
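A minimal illustration of that point (the names here are illustrative, not MXNet's actual definitions):

```cpp
#include <cstdint>

typedef int64_t dim_t;         // signed, so -1 can serve as a sentinel
const dim_t kUnknownDim = -1;  // "unknown": shape inference fills this in later

// -1 can also address the last axis, as in Python-style negative indexing.
inline int NormalizeAxis(int axis, int ndim) {
  return axis < 0 ? axis + ndim : axis;
}
```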
@reminisce why aren't we using ndim=1 for scalars?
@larroy Tensors with
@TaoLv, I think the nightly tests are here: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/NightlyTests/activity @marcoabreu Could you please confirm? Thanks!
@TaoLv Do you have any more concerns for this PR? Thanks!
This PR itself is not narrowing/widening the data type for dimension. It simply adds a compiler flag to control this. |
@eric-haibin-lin @yuxihu @samskalicky @anirudh2290 @reminisce Please let me know if your concerns have been addressed. Thanks!
LGTM
LGTM from a CI perspective
Thank you for the fix, @apeforest. Now it's approved. Just to copy some comments from the mshadow PR here: the int64 data type only affects the definition of some variables and the total size of an NDArray. We won't pass any NDArray with dim[x] > INT32_MAX into an operator, so it will not break the existing definitions of the BLAS and MKL-DNN APIs.
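A minimal sketch of that invariant (illustrative code, not MXNet's actual checks): the total element count may exceed INT32_MAX, while every individual dimension must still fit in 32 bits before it is handed to a BLAS or MKL-DNN call.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Verify the invariant before passing shape information to 32-bit APIs.
inline int64_t CheckedTotalSize(const std::vector<int64_t>& shape) {
  int64_t total = 1;
  for (int64_t d : shape) {
    assert(d >= 0 && d <= INT32_MAX);  // each dim[x] must fit in int32
    total *= d;  // the total size may legitimately exceed INT32_MAX
  }
  return total;
}
```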
@marcoabreu @TaoLv @anirudh2290 could you please help to merge it if it’s good to ship? Thanks |
LGTM!
```groovy
'Test Large Tensor Size: CPU': {
  node(NODE_LINUX_CPU) {
    ws('workspace/large_tensor-cpu') {
      utils.unpack_and_init('cpu_int64', mx_cmake_lib)
```
I guess this should have been `ubuntu_cpu_int64`.
The nightly tests are failing because no one is publishing this stash: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/NightlyTestsForBinaries/detail/master/292/pipeline
As you can see from "Build / CPU: USE_INT64_TENSOR_SIZE", it publishes `ubuntu_cpu_int64`.
The PR #14570 has introduced a bug where the "stash name" does not match:
- It is **packed** as `ubuntu_cpu_int64`
- It is **unpacked** as `cpu_int64`
* use a compile flag to use int64 tensor size
* use personal mshadow repo
* update data type
* update make config
* change size_t to index_t and add documentation
* update mshadow submodule to master
* fix compilation warning
* fix compiler warning
* fix compiler warning
* fix compiler warning
* fix compiler warning
* fix compiler error
* change nnvm::Tuple to mxnet::Tuple
* fix compiler warning
* fix compiler warning
* fix compiler warning
* fix compiler warning
* fix compiler warning
* fix lint
* update CI runtime_functons
* update runtime function
* correct runtime_functions
* udpate runtime functions
* add nightly test for large tensor
* update Jenkins files to test new compiler flag
* fix CI
* add runtime feature detect for the compiler flag
* change build from make to cmake
* fix CI
* move tests to nightly
Description
This PR fixes #14496.
The performance degradation was introduced by PR #11742. I have verified the performance degradation in one of the operators, transpose, using a script below:
Changing the `index_t` type from `int64_t` to `int32_t` consistently reduces the 50th-percentile runtime of transpose on a tensor of size 500 from 5000ms to 2400ms. Changing the data type in the operator alone (https://github.com/dmlc/mshadow/blob/master/mshadow/extension/transpose.h#L70) also reduces the 50th-percentile runtime for size 500 to 2600ms.
I therefore conclude that the performance degradation is caused by the runtime cost of integer arithmetic mixing 32-bit and 64-bit integers.
To further experiment with the arithmetic operations alone, I tested using a small program here. The runtime results are shown below:
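The actual program and its results are only linked above; as a rough sketch of this kind of microbenchmark (illustrative, not the author's code), one can time the same loop instantiated with 32-bit and 64-bit index types:

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>

// Time a loop whose index arithmetic runs on the given IndexType.
template <typename IndexType>
double LoopTime(IndexType n) {
  int64_t checksum = 0;  // 64-bit accumulator so the sum itself cannot overflow
  auto start = std::chrono::steady_clock::now();
  for (IndexType i = 0; i < n; ++i) {
    checksum += i / 3 + i % 5;  // division/modulo on the index type under test
  }
  std::chrono::duration<double> dt = std::chrono::steady_clock::now() - start;
  std::printf("checksum=%lld ", static_cast<long long>(checksum));  // defeat dead-code elimination
  return dt.count();
}

int main() {
  const int64_t kN = 500000000;  // 5e8 iterations; also fits in int32_t
  std::printf("int32 loop: %.3fs\n", LoopTime<int32_t>(static_cast<int32_t>(kN)));
  std::printf("int64 loop: %.3fs\n", LoopTime<int64_t>(kN));
  return 0;
}
```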
I can think of three solutions to this problem:
(1) Add a compilation flag to choose data types for tensor size (This PR)
(2) Add an environment variable to choose data type for tensor size at runtime
(3) Choose data type for tensor size at runtime based on the size of the tensor
Given the expression templates used in mshadow for operators, either (2) or (3) requires significant changes in the mshadow library. (1) can be used as a quick solution to fix the performance degradation reported in several issues: #14496, #13928, #14563, #14569
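For reference, approach (1) amounts to selecting `index_t` once at compile time. A minimal sketch of the mechanism is below; the macro name mirrors the flag wired through dmlc/mshadow#371, but treat the exact spelling as an assumption:

```cpp
#include <cstdint>

// Sketch of approach (1): pick the index type once, at compile time.
#if MSHADOW_INT64_TENSOR_SIZE == 1
typedef int64_t index_t;  // large-tensor builds: sizes may exceed 2^31 - 1
#else
typedef int32_t index_t;  // default builds: keep fast 32-bit index arithmetic
#endif
```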
Any other suggestion is appreciated.
This PR also depends on dmlc/mshadow#371
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
Comments