This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Use single-bit for mask in dropout operator #16735

Open
apeforest wants to merge 75 commits into master from perf/dropout-mask

Conversation

apeforest (Contributor) commented Nov 6, 2019

Description

Use a single bit per element in the dropout mask to reduce memory consumption.
This PR fixes #15968.
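
For context, here is a minimal, illustrative sketch of the bit-packed mask idea (serial, with made-up names; it is not the PR's actual kernel, which runs under MXNet's RNG and kernel-launch machinery):

#include <cstdint>
#include <random>

// Illustrative only: store the keep/drop decision for each element in one bit
// of a uint8_t mask (n/8 bytes) instead of one full DType element per value.
// `mask` is assumed to be zero-initialized and (n + 7) / 8 bytes long.
void dropout_forward_bitmask(const float* in, float* out, uint8_t* mask,
                             int64_t n, float pkeep, std::mt19937* rng) {
  std::uniform_real_distribution<float> uniform(0.f, 1.f);
  for (int64_t i = 0; i < n; ++i) {
    const bool keep = uniform(*rng) < pkeep;      // Bernoulli(pkeep) draw
    const int64_t byte = i / 8;                   // which mask byte
    const int offset = static_cast<int>(i % 8);   // which bit in that byte
    if (keep) {
      mask[byte] |= static_cast<uint8_t>(1U << offset);
      out[i] = in[i] / pkeep;                     // inverted-dropout scaling
    } else {
      out[i] = 0.f;
    }
  }
}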

Performance tests are run using the script below:

#!/usr/bin/python
import mxnet as mx
from mxnet import nd

from benchmark.opperf.utils.benchmark_utils import run_performance_test

mx.random.seed(17)
context = mx.cpu()
res = run_performance_test(nd.Dropout, run_backward=True, dtype='float32', ctx=context,
                           inputs=[
                               {"data" : (1024, 1024), "cudnn_off" : "False"}
                           ],
                           warmup=20, runs=100, profiler='native')
print(res)

Results:

Build flavor     | fwd time (master) | fwd time (PR) | bwd time (master) | bwd time (PR) | memory (master) | memory (PR)
CPU w/ MKL BLAS  | 0.19              | 0.74          | 0.06              | 0.17          | 6291.45         | 2162.68
CPU w/o MKL BLAS | 0.73              | 0.89          | 0.07              | 0.11          | 6291.45         | 4259.83
GPU w/ cuDNN     | 0.16              | 0.15          | 0.11              | 0.11          | 4194.30         | 2162.68
GPU w/o cuDNN    | 4.36              | 2.64          | 0.13              | 0.11          | 4194.30         | 2162.68

Time measured in Python (test script from #13896):

Build         | master  | this PR
GPU w/ cuDNN  | 25.9 ms | 25.8 ms
GPU w/o cuDNN | 1.34 s  | 1.35 s
CPU w/ MKL    | 262 ms  | 337 ms
CPU w/o MKL   | 359 ms  | 426 ms

@eric-haibin-lin @TaoLv @PatricZhao @ptrendx @roywei please help review.

@apeforest changed the base branch from benchmark to master on November 6, 2019 06:43
@apeforest changed the title from "Use single-bit for mask in dropout operator" to "[DO NOT MERGE] Use single-bit for mask in dropout operator" on Nov 6, 2019
@apeforest changed the title from "[DO NOT MERGE] Use single-bit for mask in dropout operator" to "Use single-bit for mask in dropout operator" on Dec 22, 2019
src/operator/nn/dropout.cc — outdated review thread, resolved
eric-haibin-lin (Member) left a comment:

@TaoLv @PatricZhao can someone review the CPU changes?

src/operator/nn/dropout.cc — outdated review thread, resolved
});
// mask_out is set per bit position,
// therefore a bitwise shift needs to be performed here
auto maskIdx = i / 8;
Member:

maskIdx -> mask_idx

Member:

Same comment for offset, val

apeforest (Contributor, Author):

will do

bool maskVal = mshadow_op::threshold_eq::Map<real_t>(rand_num, pkeep);
if (maskVal) {
// set bit
mask_out[maskIdx] |= 1U << maskOffset;
Member:

Will this lead to a race condition if the same maskIdx is being set by multiple threads? Should each thread handle at least 8 bits?

apeforest (Contributor, Author):

Good catch. I was thinking of setting the step to 8 but forgot to update it in the macro.

apeforest (Contributor, Author):

After checking into it more, I found that ideally this should not happen, because RandGenerator<xpu>::kMinNumRandomPerThread is 64 and therefore by design the step size inside LaunchRNG should be a multiple of 8. But when I looked into that piece of code again, it looks like there is a bug in calculating the step. Please review my latest change in src/operator/random/sampler.h and let me know if it makes sense. Thanks.
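
To make the race-freedom argument concrete, here is a hedged sketch (illustrative names, plain OpenMP rather than MXNet's LaunchRNG) of why a step covering 8 consecutive elements avoids the issue: each iteration owns exactly one byte of the mask, so no two threads ever write the same byte.

#include <cstdint>
#include <random>

// Sketch only: each (potentially parallel) iteration handles 8 consecutive
// elements, i.e. exactly one byte of mask_out, so every mask byte has a
// single writer and no read-modify-write race can occur.
void dropout_forward_blocked(const float* in, float* out, uint8_t* mask_out,
                             int64_t n, float pkeep) {
  const int64_t nbytes = (n + 7) / 8;
  #pragma omp parallel for
  for (int64_t byte = 0; byte < nbytes; ++byte) {
    // Per-iteration RNG keeps this sketch self-contained and thread-safe;
    // the real operator draws from RandGenerator with per-thread state.
    std::mt19937 rng(static_cast<uint32_t>(byte) + 1u);
    std::uniform_real_distribution<float> uniform(0.f, 1.f);
    uint8_t packed = 0;
    for (int64_t i = byte * 8; i < n && i < byte * 8 + 8; ++i) {
      const int offset = static_cast<int>(i - byte * 8);
      if (uniform(rng) < pkeep) {
        packed |= static_cast<uint8_t>(1U << offset);
        out[i] = in[i] / pkeep;
      } else {
        out[i] = 0.f;
      }
    }
    mask_out[byte] = packed;  // single writer per mask byte
  }
}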

Member:

Is this for loop parallelized?

Member:

In general I do not recommend writing code this way. There is no documentation nor guarantee that kMinNumRandomPerThread will always be greater than 8 in the future, nor does the dropout operator document any assumption about the value of kMinNumRandomPerThread. The code is delicate and will break if some contributor changes kMinNumRandomPerThread to a value like 4. If there is any such assumption, we should add an explicit check so that it won't be broken in the future.

apeforest (Contributor, Author):

Fair point. I will refactor this piece of code.
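
For concreteness, the explicit check asked for above might be encoded as a compile-time guard next to the kernel, along these lines (a hypothetical fragment, not part of the PR, assuming kMinNumRandomPerThread keeps its current name and remains a compile-time constant):

// Hypothetical guard: fail the build if anyone ever changes
// kMinNumRandomPerThread to a value that breaks the 8-elements-per-mask-byte
// assumption of the bit-packed dropout mask.
static_assert(RandGenerator<xpu>::kMinNumRandomPerThread % 8 == 0,
              "bit-packed dropout mask assumes each RNG thread handles "
              "a multiple of 8 elements");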

src/operator/nn/dropout-inl.h — outdated review thread, resolved
});
// mask_out is set per bit position,
// therefore a bitwise shift needs to be performed here
auto maskIdx = i / 8;
Member:

Will this lead to a race condition?

apeforest (Contributor, Author):

See comment above.

eric-haibin-lin (Member) left a comment:

potential race condition

TaoLv (Member) left a comment:

It will help save memory, but I'm curious to know the performance impact.

src/operator/nn/dropout-inl.h — outdated review threads, resolved (3)
tests/python/unittest/test_operator.py — outdated review thread, resolved
@apeforest force-pushed the perf/dropout-mask branch 2 times, most recently from 4457579 to 78a40d5, on December 26, 2019 22:54
TaoLv (Member) commented Jan 4, 2020

@apeforest Thank you for the nice work! Do you have any numbers to share?

  • memory usage of a model in which dropout workspace used to be a problem?
  • operator performance benchmark?

eric-haibin-lin (Member):
For GPT-2, the memory usage goes from 30GB to 26GB. For BERT, it goes from 26GB to 23GB. I didn't notice much difference in training throughput.

apeforest (Contributor, Author) commented Jan 10, 2020

@TaoLv Thanks for your review. I ran operator profiling using benchmark.opperf.utils.benchmark_utils.run_performance_test. The result shows a speedup in the forward pass but some degradation in the backward pass.

w/ this change:

[{'Dropout': [{'avg_time_forward_Dropout': 1.3266, 'max_storage_mem_alloc_cpu/0': 4259.8398, 'avg_time_backward_Dropout': 0.2682, 'inputs': {'data': (1024, 1024), 'p': 0.5}}]}]

w/o this change:

[{'Dropout': [{'avg_time_forward_Dropout': 1.7864, 'max_storage_mem_alloc_cpu/0': 6291.4561, 'avg_time_backward_Dropout': 0.1836, 'inputs': {'data': (1024, 1024), 'p': 0.5}}]}]

@apeforest force-pushed the perf/dropout-mask branch 2 times, most recently from 38c021a to 3874110, on January 10, 2020 18:17
TaoLv (Member) commented Jan 12, 2020

@apeforest Thank you for testing it out. Given that memory is not always a concern, can we make the bit mask an option for dropout?

eric-haibin-lin (Member):
@TaoLv I don't think adding an option is necessary. can we improve the backward kernel?

TaoLv (Member) commented Jan 13, 2020

@apeforest Could you please also test the operator performance with USE_BLAS=mkl?

pengzhao-intel (Contributor):
> I'm ok with the result. @TaoLv any concern?

> It's still 1.36x slower. I will take another look today.

> If sacrificing performance (to some extent) can help improve usability, I think we need to consider the trade-off.

> As I mentioned above, I'm not taking this as a general usability issue. I don't think we want to sacrifice the performance on CPU while memory size is not a concern there. @pengzhao-intel

It will be a concern because of the performance drop, since we have been working on model training recently.
I didn't follow up on all the discussions in the thread. One quick question:
Does the slowdown come from more computation in the new algorithm or from a sub-optimal implementation?

roywei (Member) left a comment:

The cuDNN part LGTM. One concern is that the speed reported by the profiler is quite different from what was measured on the Python side here.

Let's make sure we know the end-to-end performance impact in Python.

apeforest (Contributor, Author) commented Feb 14, 2020

@roywei Using the test script in #13896:

Build         | runtime (before)  | runtime (after)
CPU w/ MKL    | 262 ms ± 1.2 ms   | 337 ms ± 12.5 ms
CPU w/o MKL   | 359 ms ± 241 µs   | 426 ms ± 222 µs
GPU w/ cuDNN  | 25.9 ms ± 202 µs  | 25.8 ms ± 183 µs
GPU w/o cuDNN | 1.34 s ± 5.83 ms  | 1.35 s ± 13.1 ms

Using a Python timer to measure CPU performance with MKL:

This PR:

[{'Dropout': [{'avg_time_Dropout': 1.1714265774935484, 'p50_time_Dropout': 1.1715246364474297, 'p90_time_Dropout': 1.190436165779829, 'p99_time_Dropout': 1.2154309218749404, 'inputs': {'data': (1024, 1024)}}]}]

Master:

[{'Dropout': [{'avg_time_Dropout': 0.6394564639776945, 'p50_time_Dropout': 0.6996351294219494, 'p90_time_Dropout': 1.045508868992329, 'p99_time_Dropout': 1.59036863129586, 'inputs': {'data': (1024, 1024)}}]}]

TaoLv (Member) commented Feb 14, 2020

Does the avg_time_Dropout include backward time? @apeforest

auto mask_idx = i >> 3; // div 8;
uint8_t mask_offset = i & 7; // mod 8
bool mask_val = maskptr[mask_idx] & (1U << mask_offset);
ingradptr[i] = outgradptr[i] * mask_val * pk_1;
Member:

Let's also use blocking in the backward path:

    const int blk_size = 64;
    const int nblk = count / blk_size;

#pragma omp parallel for num_threads(nthr) schedule(static, 8)
    for (index_t b = 0; b < nblk; ++b) {
      for (index_t k = 0; k < blk_size; ++k) {
        index_t i = b * blk_size + k;
        auto mask_idx = i >> 3;  // div 8;
        uint8_t mask_offset = i & 7;  // mod 8
        bool mask_val = maskptr[mask_idx] & (1U << mask_offset);
        ingradptr[i] = outgradptr[i] * mask_val * pk_1;
      }
    }

    // tail
    if (nblk * blk_size < count) {
      for (index_t i = nblk * blk_size; i < count; ++i) {
        auto mask_idx = i >> 3;  // div 8;
        uint8_t mask_offset = i & 7;  // mod 8
        bool mask_val = maskptr[mask_idx] & (1U << mask_offset);
        ingradptr[i] = outgradptr[i] * mask_val * pk_1;
      }
    }
  }

apeforest (Contributor, Author):

Sure

apeforest (Contributor, Author), Feb 14, 2020:

After more thought, I think we actually don't need to do blocking in the backward pass, as there is no write to maskptr and hence no cache eviction or race condition.

Member:

We're writing to ingradptr. We also want the elements in one cache line to be handled by one OpenMP thread. With the original parallelization, a cache line is loaded but only one element in it is handled by the current thread; the next thread needs to load the same cache line to handle the next element.

apeforest (Contributor, Author):

However, there is no read from ingradptr, so this is not a case of false sharing, right? I tried this blocking and didn't notice any noticeable performance gain.

apeforest (Contributor, Author):
> Does the avg_time_Dropout include backward time? @apeforest

Yes, it includes backward time, as run_backward is set to True.

apeforest (Contributor, Author) commented Feb 14, 2020

> Does the slowdown come from more computation in the new algorithm or from a sub-optimal implementation?

@PatricZhao The slowdown comes from extra computation in the new algorithm when Dropout uses the MKL implementation. MKL already computes the mask, but stores each mask value as an integer. The new algorithm simply repackages this int32-based mask into a bit-based mask, which introduces the extra runtime. Ideally, MKL dropout would be enhanced to store the mask using bits, but that requires modifying the VSL APIs.
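
To illustrate what that repackaging step costs, here is a hedged sketch of the extra pass being described (names are illustrative; the actual code lives in the MKL path of the dropout operator):

#include <cstddef>
#include <cstdint>

// Sketch only: VSL has already produced one int per element (non-zero = keep).
// The bit-mask version makes an extra pass over those N ints to compress them
// into N/8 bytes, which is the additional runtime discussed above.
void pack_int_mask_to_bits(const int32_t* int_mask, uint8_t* bit_mask, size_t n) {
  for (size_t i = 0; i < n; ++i) {
    const size_t byte = i / 8;
    const int offset = static_cast<int>(i % 8);
    if (offset == 0) bit_mask[byte] = 0;  // start a fresh, cleared byte
    if (int_mask[i] != 0) bit_mask[byte] |= static_cast<uint8_t>(1U << offset);
  }
}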

TaoLv (Member) commented Feb 15, 2020

> Does the slowdown come from more computation in the new algorithm or from a sub-optimal implementation?

The new implementation increases memory load and adds bit-wise operations, so a performance slowdown is expected.

pengzhao-intel (Contributor):
> Does the slowdown come from more computation in the new algorithm or from a sub-optimal implementation?

> The new implementation increases memory load and adds bit-wise operations, so a performance slowdown is expected.

What algorithm is used in TF and PyTorch?

TaoLv (Member) commented Feb 15, 2020

> What algorithm is used in TF and PyTorch?

@pengzhao-intel I don't think TF has a fused dropout operator. It's implemented with several small operators; see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/nn_ops.py#L4456. So the backward path should go through the backward of these small operators, hence no bit mask there.

For PyTorch, I see there is a fused one: https://github.com/pytorch/pytorch/blob/master/tools/autograd/templates/Functions.cpp#L634. The mask tensor should either be Boolean or have a type compatible with grad. So no bit mask either.

> In the ideal case, it would be to enhance MKL dropout to store the mask using bits. But it requires modification of the VSL APIs.

@apeforest, so far there is no dropout functionality in MKL or MKL-DNN. Here we just use VSL to generate random values. So even if we generated a bit mask there, it would add the extra computation of mask_val * 1.0 / pkeep, which today can be reused from the forward path.

eric-haibin-lin (Member) left a comment:

> The new implementation increases memory load and adds bit-wise operations, so a performance slowdown is expected.

Why does it increase memory load?

Is there any plan for MKLDNN to support fast dropout with a bit mask like cuDNN? I think reducing memory consumption is quite important. CPUs do not have a memory capacity issue, but it will be one for most GPUs and ASICs. I'd push for an efficient implementation from MKLDNN in the long term.

apeforest (Contributor, Author) commented Feb 15, 2020

> Does the slowdown come from more computation in the new algorithm or from a sub-optimal implementation?

> The new implementation increases memory load and adds bit-wise operations, so a performance slowdown is expected.

The memory load is actually reduced, even in the MKL case, right? Please refer to the test results in the PR description.

TaoLv (Member) commented Feb 16, 2020

> Why does it increase memory load?

If there are N elements, then per the Bernoulli distribution generation in VSL we still need to allocate memory and write N*4 bytes to it. To generate the bit mask, we then need to load those N*4 bytes back and write N/8 bytes of bits.

apeforest (Contributor, Author):
> Why does it increase memory load?

> If there are N elements, then per the Bernoulli distribution generation in VSL we still need to allocate memory and write N*4 bytes to it. To generate the bit mask, we then need to load those N*4 bytes back and write N/8 bytes of bits.

The memory for the bit mask is not extra memory: N*sizeof(DType) was already used for the mask in the master branch: https://github.com/apache/incubator-mxnet/blob/master/src/operator/nn/dropout.cc#L124

So for the MKL dropout case, the master branch uses N*4 + N*sizeof(DType) bytes of mask/workspace vs. N*4 + N/8 bytes in this PR. This memory reduction is verified by the MXNet profiler results reported in the PR description.
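
Evaluating just those two expressions for the benchmark shape in the PR description (1024 x 1024 float32) gives a feel for the saving; note this counts only the mask/workspace terms, so it will not match the profiler totals exactly:

#include <cstdint>
#include <cstdio>

int main() {
  const int64_t N = 1024 * 1024;               // elements in the benchmark tensor
  const int64_t dtype_bytes = 4;               // sizeof(float)
  const int64_t master_bytes = N * 4 + N * dtype_bytes;  // int mask + DType mask
  const int64_t pr_bytes = N * 4 + N / 8;                // int mask + 1-bit mask
  std::printf("mask/workspace, master : %.2f KB\n", master_bytes / 1024.0);  // 8192.00 KB
  std::printf("mask/workspace, this PR: %.2f KB\n", pr_bytes / 1024.0);      // 4224.00 KB
  return 0;
}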

eric-haibin-lin (Member):
@PatricZhao @TaoLv what do you suggest as the resolution? If CPU performance is a concern, shall we add an environment variable to control the behavior? Do you agree that in the long term we want to push for a dropout API in MKLDNN with a 1-bit mask?

apeforest (Contributor, Author):
Given your concern about the performance degradation in the case of MKL dropout, I have disabled this feature when MKL dropout is used. Please review the PR again and let me know if you think this is good to go. Thanks!

TaoLv (Member) left a comment:

Thank you for the turnaround, @apeforest. It looks good to me in general, but I notice that some dropout test cases are failing. I can approve once they are fixed. Thanks!

apeforest (Contributor, Author) commented Feb 22, 2020

Hi @TaoLv and @PatricZhao, I reverted my last commit, "Do not use bit-mask when MKL dropout is used."

It makes the code too brittle and also requires very complicated logic to check memory allocation at runtime. Here are the main reasons:

(1) MKL dropout support is currently not complete. It does not work if the input data type is smaller than int32, and it does not support the broadcast option (when the option axes is specified). This limitation forces a check at runtime, which is not possible in the InferShape function.

E.g. in this function, I would need to check if the dtype is greater than int32 in order to use a different shape for MKL Dropout.
https://github.com/apache/incubator-mxnet/pull/16735/files#diff-74c4dc433970c5df31a5e2c4b57c8d71R127

(2) Having different Dropout engines at runtime (based on data type and …) may cause inconsistency in the mixed-precision case. Introducing another difference in mask memory allocation complicates this even further.

I think we should focus on enhancing MKL Dropout so that it (1) supports all the cases that non-MKL dropout supports, and (2) supports the bit mask.

Please let me know what you think. Thanks!

Lin

eric-haibin-lin (Member) left a comment:

There's an RFC for 1-bit dropout in MKLDNN which we can leverage: oneapi-src/oneDNN#656 (comment)

@eric-haibin-lin added the pr-awaiting-response label (PR is reviewed and waiting for contributor to respond) on Jul 28, 2020
sxjscience (Member):
Is there anyone who can take a look at this PR?

Labels: pr-awaiting-response (PR is reviewed and waiting for contributor to respond)
Projects: none yet
Development: successfully merging this pull request may close the issue "1 bit mask for Dropout"
7 participants