add a compiler flag to use int64 as tensor size #14570
Conversation
@pengzhao-intel @TaoLv Please help to review. Thanks!
@mxnet-label-bot add [pr-work-in-progress, backend]
Do we need to change CI to cover this new flag? Do we need to add it to the runtime feature set? @larroy
Thank you for the quick fix! I think we can first use solution (3), 'Choose data type for tensor size at runtime based on the size of the tensor', to address the performance bottleneck.
Agree with adding the new flag to switch to int64 when users explicitly know they want to use large arrays, like DGL. For the normal case, int32 is enough and has no performance issue. @TaoLv do we need to switch off MKLDNN when int64 is enabled, before MKL-DNN 1.0? @wkcn would you mind explaining the advantage of (3)?
@pengzhao-intel Yes. For users, it is confusing that there would be two versions of mxnet: mxnet and mxnet-largetensor. In other words, users would need to rebuild from source when they want to use large tensors, which is too complex. Besides, when a program needs both large-tensor support and high performance for small tensors, only solution (3) can meet the requirement. We can implement (3) as follows:

```cpp
#define KERNEL_LAUNCH_TYPE_SWITCH(N, ...)  \
  do {                                     \
    if ((N) <= INT32_MAX) {                \
      typedef int32_t IndexType;           \
      { __VA_ARGS__ }                      \
    } else {                               \
      typedef int64_t IndexType;           \
      { __VA_ARGS__ }                      \
    }                                      \
  } while (0)
```
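For what it's worth, a hypothetical usage sketch of the proposed macro (assuming the macro above is in scope; `FillKernel` and `Fill` are invented names for illustration, not MXNet code):

```cpp
#include <cstdint>

// Hypothetical element-wise kernel, templated on the index type that
// KERNEL_LAUNCH_TYPE_SWITCH selects at runtime.
template <typename IndexType, typename DType>
void FillKernel(DType* out, IndexType n, DType value) {
  for (IndexType i = 0; i < n; ++i) {
    out[i] = value;
  }
}

// Dispatcher: tensors with no more than INT32_MAX elements run the
// int32_t instantiation and keep the cheaper 32-bit index arithmetic.
template <typename DType>
void Fill(DType* out, int64_t n, DType value) {
  KERNEL_LAUNCH_TYPE_SWITCH(n,
    FillKernel(out, static_cast<IndexType>(n), value);
  );
}
```

Note the cast at the dispatch boundary: callers holding an int64_t size must static_cast it to the selected IndexType, which is exactly the friction discussed in the next comment.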
@wkcn Thanks for the suggestion. This macro works well for certain simple operators, but I found it difficult to apply to all operators given the varied architecture of operator implementations. E.g., the transpose operator currently uses the expression-template pattern in mshadow, and applying this macro to operators in mshadow looks like major surgery. Besides, int64_t is used as the type of data members in the tensor data structure, so this approach does not directly change the internal methods of tensor and would require static casts. Any other thoughts on this?
Please add the new flag to CI and make sure we have appropriate unit tests for it. Thank you.
I'm not a big fan of using a signed type for dimension, but I find it non-blocking.
As this has been clarified, we need a signed type.
@larroy the dimension can be -1 if it's unknown, or when -1 refers to the last dimension. So we do have to use signed types.
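A minimal illustration of that point (the names here are illustrative, not MXNet's actual definitions):

```cpp
#include <cstdint>

typedef int64_t dim_t;         // signed, so -1 can serve as a sentinel
const dim_t kUnknownDim = -1;  // "unknown": shape inference fills this in later

// -1 can also address the last axis, as in Python-style negative indexing.
inline int NormalizeAxis(int axis, int ndim) {
  return axis < 0 ? axis + ndim : axis;
}
```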
@reminisce why aren't we using ndim=1 for scalars?
@larroy Tensors with
@TaoLv, I think the nightly tests are here: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/NightlyTests/activity @marcoabreu Could you please confirm? Thanks!
@TaoLv Do you have any more concerns for this PR? Thanks!
This PR itself is not narrowing/widening the data type for dimension. It simply adds a compiler flag to control this. |
@eric-haibin-lin @yuxihu @samskalicky @anirudh2290 @reminisce Please let me know if your concerns have been addressed. Thanks!
LGTM
LGTM from a CI perspective
Thank you for the fix, @apeforest. Now it's approved. Just to copy some comments from the mshadow PR here: the int64 data type only affects the definition of some variables and the total size of an NDArray. We won't pass any NDArray with dim[x] > INT32_MAX into an operator, so it will not break the existing definitions of the BLAS and MKL-DNN APIs.
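A minimal sketch of that invariant (illustrative code, not MXNet's actual checks): the total element count may exceed INT32_MAX, while every individual dimension must still fit in 32 bits before it is handed to a BLAS or MKL-DNN call.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Verify the invariant before passing shape information to 32-bit APIs.
inline int64_t CheckedTotalSize(const std::vector<int64_t>& shape) {
  int64_t total = 1;
  for (int64_t d : shape) {
    assert(d >= 0 && d <= INT32_MAX);  // each dim[x] must fit in int32
    total *= d;  // the total size may legitimately exceed INT32_MAX
  }
  return total;
}
```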
@marcoabreu @TaoLv @anirudh2290 could you please help to merge it if it’s good to ship? Thanks |
LGTM!
```groovy
'Test Large Tensor Size: CPU': {
  node(NODE_LINUX_CPU) {
    ws('workspace/large_tensor-cpu') {
      utils.unpack_and_init('cpu_int64', mx_cmake_lib)
```
I guess this should have been `ubuntu_cpu_int64`.
The nightly tests are failing because no one is publishing this stash: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/NightlyTestsForBinaries/detail/master/292/pipeline
As you can see from "Build / CPU: USE_INT64_TENSOR_SIZE", it publishes `ubuntu_cpu_int64`.
The PR #14570 has introduced a bug where the "stash name" does not match:
- It is **packed** as `ubuntu_cpu_int64`
- It is **unpacked** as `cpu_int64`
* use a compile flag to use int64 tensor size
* use personal mshadow repo
* update data type
* update make config
* change size_t to index_t and add documentation
* update mshadow submodule to master
* fix compilation warning
* fix compiler warning
* fix compiler warning
* fix compiler warning
* fix compiler warning
* fix compiler error
* change nnvm::Tuple to mxnet::Tuple
* fix compiler warning
* fix compiler warning
* fix compiler warning
* fix compiler warning
* fix compiler warning
* fix lint
* update CI runtime_functons
* update runtime function
* correct runtime_functions
* udpate runtime functions
* add nightly test for large tensor
* update Jenkins files to test new compiler flag
* fix CI
* add runtime feature detect for the compiler flag
* change build from make to cmake
* fix CI
* move tests to nightly
Description
This PR fixes #14496.
The performance degradation was introduced by PR #11742. I have verified the performance degradation in one of the operators, transpose, using a script below:
Changing the `index_t` type from `int64_t` to `int32_t` consistently reduces the 50th-percentile runtime of transpose on a tensor of size 500 from 5000ms to 2400ms. Changing the data type in the operator alone (https://github.com/dmlc/mshadow/blob/master/mshadow/extension/transpose.h#L70) also reduces the 50th-percentile runtime for size 500 to 2600ms.
I therefore conclude that the performance degradation is caused by the runtime cost of integer arithmetic mixing 32-bit and 64-bit integers.
To further experiment with the arithmetic operations alone, I tested using a small program here. The runtime results are shown below:
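The actual program and its results are only linked above; as a rough sketch of this kind of microbenchmark (illustrative, not the author's code), one can time the same loop instantiated with 32-bit and 64-bit index types:

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>

// Time a loop whose index arithmetic runs on the given IndexType.
template <typename IndexType>
double LoopTime(IndexType n) {
  int64_t checksum = 0;  // 64-bit accumulator so the sum itself cannot overflow
  auto start = std::chrono::steady_clock::now();
  for (IndexType i = 0; i < n; ++i) {
    checksum += i / 3 + i % 5;  // division/modulo on the index type under test
  }
  std::chrono::duration<double> dt = std::chrono::steady_clock::now() - start;
  std::printf("checksum=%lld ", static_cast<long long>(checksum));  // defeat dead-code elimination
  return dt.count();
}

int main() {
  const int64_t kN = 500000000;  // 5e8 iterations; also fits in int32_t
  std::printf("int32 loop: %.3fs\n", LoopTime<int32_t>(static_cast<int32_t>(kN)));
  std::printf("int64 loop: %.3fs\n", LoopTime<int64_t>(kN));
  return 0;
}
```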
I can think of three solutions to this problem:
(1) Add a compilation flag to choose data types for tensor size (This PR)
(2) Add an environment variable to choose data type for tensor size at runtime
(3) Choose data type for tensor size at runtime based on the size of the tensor
Given the expression templates used in mshadow for operators, either (2) or (3) requires significant changes in the mshadow library. (1) can be used as a quick solution to fix the performance degradation reported in several issues: #14496, #13928, #14563, #14569
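For reference, approach (1) amounts to selecting `index_t` once at compile time. A minimal sketch of the mechanism is below; the macro name mirrors the flag wired through dmlc/mshadow#371, but treat the exact spelling as an assumption:

```cpp
#include <cstdint>

// Sketch of approach (1): pick the index type once, at compile time.
#if MSHADOW_INT64_TENSOR_SIZE == 1
typedef int64_t index_t;  // large-tensor builds: sizes may exceed 2^31 - 1
#else
typedef int32_t index_t;  // default builds: keep fast 32-bit index arithmetic
#endif
```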
Any other suggestion is appreciated.
This PR also depends on dmlc/mshadow#371
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
Comments