[WIP] Fix non-determinism in sparse embedding #9846

eric-haibin-lin · 2018-02-21T09:10:01Z

Description

The original GPU sparse embedding operator uses atomic add, which yields non-deterministic gradients because of the limited precision of fp32 and the non-deterministic execution order. This PR replaces atomic add with sorting to guarantee determinism.
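To illustrate why accumulation order matters, here is a minimal pure-Python sketch. It is not the CUDA kernel from this PR: `f32`, `scatter_add`, and `scatter_add_sorted` are hypothetical helpers that emulate fp32 rounding with `struct` and replay the adds in a chosen order, assuming each gradient entry is added to the row named by its index.

```python
import struct

def f32(x):
    """Round a Python float to the nearest fp32 value (emulates GPU fp32 storage)."""
    return struct.unpack('f', struct.pack('f', x))[0]

def scatter_add(indices, grads, order):
    """Add grads[i] into row indices[i], visiting entries in `order`.

    `order` stands in for the arbitrary order in which atomic adds may land.
    """
    rows = {}
    for i in order:
        r = indices[i]
        rows[r] = f32(rows.get(r, 0.0) + grads[i])
    return rows

def scatter_add_sorted(indices, grads):
    """Fix the order by sorting on (row index, original position) before adding."""
    order = sorted(range(len(indices)), key=lambda i: (indices[i], i))
    return scatter_add(indices, grads, order)

# Three gradient entries that all target row 3.
indices = [3, 3, 3]
grads = [1e8, -1e8, 1.0]

print(scatter_add(indices, grads, [0, 1, 2]))  # one possible execution order
print(scatter_add(indices, grads, [0, 2, 1]))  # another possible execution order
print(scatter_add_sorted(indices, grads))      # same order on every run
```

In this example the first order yields 1.0 for row 3 and the second yields 0.0, because 1e8 + 1 rounds back to 1e8 in fp32. The sorted variant always reduces in the same order, so it returns the same value on every run.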

Tested with example/rnn/word_lm/train.py. With a fixed seed, SparseEmbedding and Embedding produce the same loss.

The fix makes the backward pass ~50% slower than the atomic_add implementation, as measured by the script at the end. Further optimization is possible by using cub::Unique instead of cub::InclusiveSum to generate the lookup table.
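The two ways of building the lookup table can be sketched in plain Python. This rests on an assumption the description does not spell out: that the lookup table maps each touched row id to its position in the compact row_sparse gradient. The function names are hypothetical, and the loops only mimic what cub::InclusiveSum and cub::Unique would compute on the GPU.

```python
from itertools import accumulate

def lookup_via_inclusive_sum(indices, num_rows):
    """Flag every touched row, then inclusive-scan the flags over all rows.

    The work is proportional to num_rows (the embedding input_dim).
    """
    flags = [0] * num_rows
    for idx in indices:
        flags[idx] = 1
    prefix = list(accumulate(flags))  # inclusive prefix sum over all rows
    row_ids = [r for r in range(num_rows) if flags[r]]
    lookup = {r: prefix[r] - 1 for r in row_ids}
    return row_ids, lookup

def lookup_via_unique(sorted_indices):
    """Drop consecutive duplicates from already-sorted indices.

    The work is proportional to the number of indices, not num_rows.
    """
    row_ids = []
    for idx in sorted_indices:
        if not row_ids or row_ids[-1] != idx:
            row_ids.append(idx)
    lookup = {r: pos for pos, r in enumerate(row_ids)}
    return row_ids, lookup

indices = [7, 2, 7, 5, 2]
print(lookup_via_inclusive_sum(indices, num_rows=10))
print(lookup_via_unique(sorted(indices)))
```

Both functions return the same row ids and positions. Under the stated assumption, the unique-based variant avoids scanning every row of a large embedding table, which is the kind of saving the suggested optimization would target.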

@ZiyueHuang @sxjscience

Checklist

Essentials

Passed code style checking (make lint)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
Code is well-documented:
For user-facing API changes, API doc string has been updated.
For new C++ functions in header files, their functionalities and arguments are documented.
For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

Feature1, tests, (and when applicable, API doc)
Feature2, tests, (and when applicable, API doc)

Comments

If this change is a backward incompatible change, why must this change be made.
Interesting edge cases to note here

# The benchmark script also requires the other files under example/rnn/word_lm.

import argparse
import logging
import time

import mxnet as mx
from data import Corpus, CorpusIter

parser = argparse.ArgumentParser(description='PennTreeBank LSTM Language Model')
parser.add_argument('--data', type=str, default='./data/ptb.',
                    help='location of the data corpus')
parser.add_argument('--batch_size', type=int, default=128,
                    help='batch size')
parser.add_argument('--bptt', type=int, default=35,
                    help='sequence length')
parser.add_argument('--dim', type=int, default=1024*1024,
                    help='input dimension (number of rows) of the embedding weight')

if __name__ == '__main__':
    # args
    head = '%(asctime)-15s %(message)s'
    logging.basicConfig(level=logging.DEBUG, format=head)
    args = parser.parse_args()
    logging.info(args)
    ctx = mx.gpu()
    batch_size = args.batch_size
    bptt = args.bptt

    # data: take a single batch of token ids and flatten it
    corpus = Corpus(args.data)
    train_data = CorpusIter(corpus.train, batch_size, bptt)
    data = train_data.next().data[0].reshape((-1,)).astype('int64')

    # symbol: sparse embedding with a row_sparse weight
    word = mx.sym.var('data')
    weight = mx.sym.var('embed_weight', stype='row_sparse')
    embed = mx.sym.contrib.SparseEmbedding(data=word, weight=weight, input_dim=args.dim,
                                           output_dim=512, name='embed')
    # only the weight gradient is needed
    grad_req = {'data': 'null', 'embed_weight': 'write'}
    exe_test = embed.simple_bind(ctx, grad_req=grad_req, data=(data.shape[0],))
    arg_map = dict(zip(embed.list_arguments(), exe_test.arg_arrays))

    # init data
    arg_map["data"][:] = data.astype('float32')
    print(data)
    # output gradient fed to backward
    grad = mx.nd.ones(exe_test.outputs[0].shape, ctx=ctx)
    # weight
    weight = arg_map["embed_weight"]
    weight[:] = 1
    exe_test.forward()

    # warm up
    for i in range(10):
        exe_test.backward([grad])
    mx.nd.waitall()

    # time 10000 backward passes; waitall() blocks until the async GPU work finishes
    start = time.time()
    for i in range(10000):
        exe_test.backward([grad])
    mx.nd.waitall()
    end = time.time()
    print(end - start)

* refactor embed backward kernelcallker
* pass unit test
* refactor
* fix dim bug
* add unique impl
* remove old op
* remove unused kernel

eric-haibin-lin requested a review from cjolivier01 as a code owner February 21, 2018 09:10

Fix embedding (#194)

9e80d58


eric-haibin-lin force-pushed the fix-embedding branch from 948c5a3 to 9e80d58 Compare February 21, 2018 09:47

fix lint

c0b42ec

eric-haibin-lin changed the title ~~Fix non-determinism in sparse embedding~~ [WIP] Fix non-determinism in sparse embedding Feb 21, 2018

eric-haibin-lin closed this Feb 25, 2018

eric-haibin-lin mentioned this pull request Feb 25, 2018

Add force_deterministic option for sparse embedding #9882

Merged

7 tasks

eric-haibin-lin mentioned this pull request Apr 14, 2018

Add NEWS and README #10545

Merged

eric-haibin-lin deleted the fix-embedding branch September 18, 2018 23:33
