
Add force_deterministic option for sparse embedding #9882

Merged
merged 18 commits into apache:master from the fix-embedding-unique branch on Mar 1, 2018

Conversation

@eric-haibin-lin (Member) commented Feb 25, 2018

Description

(reopen of #9846)
Add a force_deterministic option for contrib.SparseEmbedding. The option guarantees a deterministic weight gradient during the backward pass. With force_deterministic=True, the backward pass is 50% slower on a p2 instance and 80% slower on a p3 instance than with force_deterministic=False.

  • The benchmark script is at the bottom
  • The changes in indexing_op-inl.cuh are simply a refactoring of the original code
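
As a quick illustration, here is a minimal usage sketch based on the benchmark script at the bottom (the shapes are arbitrary, and note that the review below suggests renaming the parameter, so the final name may differ):

import mxnet as mx

# build a SparseEmbedding symbol with the deterministic backward enabled
word = mx.sym.var('data')
weight = mx.sym.var('embed_weight', stype='row_sparse')
embed = mx.sym.contrib.SparseEmbedding(data=word, weight=weight,
                                       input_dim=1000, output_dim=16,
                                       name='embed', force_deterministic=True)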

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Comments

# the benchmark script also requires other files under example/rnn/word_lm

import argparse
import logging
import time
import mxnet as mx
from data import Corpus, CorpusIter

parser = argparse.ArgumentParser(description='SparseEmbedding backward benchmark on the PennTreeBank corpus')
parser.add_argument('--data', type=str, default='./data/ptb.',
                    help='location of the data corpus')
parser.add_argument('--batch_size', type=int, default=128,
                    help='batch size')
parser.add_argument('--bptt', type=int, default=35,
                    help='sequence length')
parser.add_argument('--dim', type=int, default=1024*1024,
                    help='input_dim of the embedding')
parser.add_argument('--force', action='store_true',
                    help='enable force_deterministic')

if __name__ == '__main__':
    head = '%(asctime)-15s %(message)s'
    logging.basicConfig(level=logging.DEBUG, format=head)
    args = parser.parse_args()
    logging.info(args)
    ctx = mx.gpu()
    batch_size = args.batch_size
    bptt = args.bptt
    # data: one batch of word ids from the corpus, flattened to 1-D
    corpus = Corpus(args.data)
    train_data = CorpusIter(corpus.train, batch_size, bptt)
    data = [train_data.next().data[0].reshape((-1,)).astype('int64')]
    # symbol: sparse embedding with an optional deterministic backward
    word = mx.sym.var('data')
    weight = mx.sym.var('embed_weight', stype='row_sparse')
    embed = mx.sym.contrib.SparseEmbedding(data=word, weight=weight, input_dim=args.dim,
                                           output_dim=512, name='embed', force_deterministic=args.force)
    grad_req = {'data': 'null', 'embed_weight': 'write'}
    exe_test = embed.simple_bind(ctx, grad_req=grad_req, data=(data[0].shape[0],))
    arg_map = dict(zip(embed.list_arguments(), exe_test.arg_arrays))
    # init data and weight
    arg_map["data"][:] = data[0].astype('float32')
    arg_map["embed_weight"][:] = 1
    grad = mx.nd.ones(exe_test.outputs[0].shape).copyto(ctx)
    exe_test.forward()
    # warm up
    for i in range(10):
        exe_test.backward([grad])
    mx.nd.waitall()
    # time 10000 backward passes
    a = time.time()
    for i in range(10000):
        exe_test.backward([grad])
    mx.nd.waitall()
    b = time.time()
    print(b - a)

@marcoabreu (Contributor)

Considering the big performance impact, would it make sense to print a prominent warning message making the user aware of the speed reduction?

const DType* ograd,
const nnvm::dim_t row_length,
const nnvm::dim_t num_threads_per_row,
const int SZ) {
Member

I think SZ should be used as a template argument, combined with this kind of loop: https://github.com/dmlc/mshadow/blob/master/mshadow/cuda/tensor_gpu-inl.cuh#L662-L668

const dim_t ograd_offset = idx * row_length;
const dim_t out_offset = row_id * row_length;
for (int i = feature_start; i < feature_end; i++) {
out[out_offset + i] += ograd[ograd_offset + i];
Member

Would it be faster if we used local storage for the values of out[out_offset + i] and wrote them back after finishing the loop? Something like:

if (tid == 0 || sorted_data[tid - 1] != sorted_data[tid]) {
   out_local[...] = out[...] 
   do {
      UPDATE_LOCAL(out_local, ograd)
   } while(...)
   out[...] = out_local[...]
}

using nnvm::dim_t;
if (req == kNullOp) return;
CHECK_EQ(req, kWriteTo) << "SparseEmbedding layer doesn't support "
<< "weight gradient calculation with req != write";
Member

For the Embedding layer, enabling AddTo in the backward pass is essential to the training speed of RNNs (because we use the same embedding at all timesteps). I think we need to support kAddTo in the sparse embedding layer (maybe in another PR).

Member Author

Thanks for bringing this up. For embedding:

  • Usually the inputs are concatenated before being passed to Embedding, so the embedding is only computed once and no "addto" req is required.
  • "addto" req is usually not supported for sparse gradients because it requires re-allocating memory, which is expensive.
    Maybe we can revisit supporting "addto" req later (a rough sketch of the "write" vs "addto" semantics follows).
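
As a rough NumPy analogy (not MXNet code) of the "write" vs "addto" distinction: an RNN that looks up the shared embedding once per timestep needs the per-timestep weight gradients to accumulate, which is exactly what "addto" provides and what kWriteTo would overwrite.

import numpy as np

vocab, dim = 5, 3
seq = [1, 3, 1, 0]                       # word ids, one lookup per timestep
weight_grad = np.zeros((vocab, dim))     # shared gradient buffer for embed_weight
ograd = np.ones((len(seq), dim))         # upstream gradient per timestep

for t, word_id in enumerate(seq):
    # "addto" semantics: each lookup adds into the shared buffer;
    # "write" semantics would discard the earlier contributions
    weight_grad[word_id] += ograd[t]

print(weight_grad[1])                    # word 1 occurs twice -> [2. 2. 2.]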

int input_dim;
int output_dim;
int dtype;
bool force_deterministic;
Contributor

force_deterministic -> deterministic

Member Author

Will update it

.add_enum("int32", mshadow::kInt32)
.describe("Data type of weight.");
DMLC_DECLARE_FIELD(force_deterministic).set_default(false)
.describe("Force the gradient computation to be executed according to a deterministic order.");
Contributor

Explain that this is slower?

Member Author

Sure

MSHADOW_TYPE_SWITCH(ograd.type_flag_, DType, {
MSHADOW_IDX_TYPE_SWITCH(output.aux_type(kIdx), RType, {
// temp resource declarations
dim_t* lookup_table = NULL;
Member

Can this huge chunk of code be pulled out into a template function so that it's steppable in the debugger?

Member Author

Sure. I'll update it.

@@ -103,13 +162,125 @@ void SparseEmbeddingOpForwardRspImpl<gpu>(const OpContext& ctx,
}
}

inline void SparseEmbeddingOpBackwardDeterministicRspImpl(const OpContext& ctx,
Member

Are the deterministic/nondeterministic versions divergent for more than 50% of their code or can they be combined somewhat? Looks kind of hard to maintain.

Member Author

Unfortunately they use totally different kernels (a rough sketch of the deterministic path follows the two lists):
non-deterministic:

  • mark row idx
  • prefix sum
  • add_grad_atomic_add

deterministic:

  • copy
  • range
  • sort
  • unique
  • add_grad_deterministic
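
For readers following the kernels, here is a rough NumPy analogy (not the actual CUDA code) of the deterministic path: sort the looked-up ids, find the unique segments, and accumulate each row's gradient in one fixed order into a zero-initialized output, instead of issuing atomic adds whose summation order is unspecified.

import numpy as np

num_rows, row_length = 6, 4
data = np.array([3, 1, 3, 5, 1, 3])             # word ids seen in the forward pass
ograd = np.ones((len(data), row_length))        # gradient w.r.t. the embedding output
grad = np.zeros((num_rows, row_length))         # weight gradient, zero-initialized

order = np.argsort(data, kind='stable')         # "sort", remembering original positions
sorted_ids = data[order]
seg_start = np.flatnonzero(np.r_[True, sorted_ids[1:] != sorted_ids[:-1]])  # "unique" segment starts
seg_end = np.r_[seg_start[1:], len(sorted_ids)]

for s, e in zip(seg_start, seg_end):
    row_id = sorted_ids[s]
    # accumulate this row's contributions in one fixed (sorted) order
    grad[row_id] += ograd[order[s:e]].sum(axis=0)

print(grad)   # rows 1, 3, 5 accumulate 2, 3 and 1 gradient rows respectively; other rows stay zero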

Kernel<mark_lookup_table, gpu>::Launch(s, nnr, lookup_table, grad_row_idx);

// accumulate gradients
DType* grad_data = output.data().dptr<DType>();
Member

Should we set it to zero?

Member Author

Yes. I should not have removed it. Will update

tid++;
} while (tid < data_size && sorted_data[tid - 1] == sorted_data[tid]);
for (int i = 0; i < num_features; i++) {
out[out_offset + i] = acc[i];
Member

Which one should be correct, out[out_offset + i] = acc[i]; or out[out_offset + i] += acc[i];?

Member Author

should be += instead

@eric-haibin-lin (Member Author)

@marcoabreu added warning msg.

@piiswrong merged commit b8b869f into apache:master Mar 1, 2018
jinhuang415 pushed a commit to jinhuang415/incubator-mxnet that referenced this pull request Mar 30, 2018
* refactor embed backward kernelcallker

* pass unit test

* refactor

* fix dim bug

* add unique impl

* remove old op

* remove unused kernel

* Revert "remove unused kernel"

This reverts commit 948c5a3.

* Revert "remove old op"

This reverts commit 5d1cd64.

* fix kernellaucnher

* add force_determ option

* add doc

* fix lint

* update test

* CR comments

* lint

* set grad to be 0s initially

* add warning
@eric-haibin-lin deleted the fix-embedding-unique branch May 9, 2018 18:05
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018