Add force_deterministic option for sparse embedding #9882
Conversation
Considering the big performance impact, would it make sense to print a prominent warning message making the user aware of the speed reduction?
src/operator/tensor/indexing_op.cu
Outdated
const DType* ograd,
const nnvm::dim_t row_length,
const nnvm::dim_t num_threads_per_row,
const int SZ) {
I think SZ should be used as a template argument, which is combined with this kind of loop: https://github.com/dmlc/mshadow/blob/master/mshadow/cuda/tensor_gpu-inl.cuh#L662-L668
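For illustration, a rough sketch of that suggestion with assumed kernel and parameter names (not the PR's actual kernel): making SZ a compile-time template parameter lets the per-thread buffer live in registers and lets the compiler fully unroll the loop, following the mshadow pattern linked above.

template<int SZ, typename DType>
__global__ void CopyRowUnrolled(DType* dst, const DType* src, const int row_length) {
  DType buf[SZ];  // compile-time size, so it can stay in registers
  const int offset = blockIdx.x * row_length;
  #pragma unroll
  for (int i = 0; i < SZ; ++i) {
    const int col = threadIdx.x + i * blockDim.x;
    buf[i] = (col < row_length) ? src[offset + col] : DType(0);
  }
  #pragma unroll
  for (int i = 0; i < SZ; ++i) {
    const int col = threadIdx.x + i * blockDim.x;
    if (col < row_length) dst[offset + col] = buf[i];
  }
}
// instantiated per SZ, e.g. CopyRowUnrolled<4, float><<<num_rows, 256>>>(dst, src, row_length);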
src/operator/tensor/indexing_op.cu
Outdated
const dim_t ograd_offset = idx * row_length;
const dim_t out_offset = row_id * row_length;
for (int i = feature_start; i < feature_end; i++) {
  out[out_offset + i] += ograd[ograd_offset + i];
Would it be faster to use local storage to hold the values of out[out_offset + i] and write them back only after the loop finishes? Roughly like this (a fuller sketch follows the pseudocode):
if (tid == 0 || sorted_data[tid - 1] != sorted_data[tid]) {
  out_local[...] = out[...]
  do {
    UPDATE_LOCAL(out_local, ograd)
  } while (...)
  out[...] = out_local[...]
}
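Expanded into a fuller, still hedged CUDA sketch with illustrative names. It assumes row_length <= SZ and, for simplicity, indexes out by the raw row id rather than the compacted row index used in the PR. One thread owns each run of equal row ids, accumulates into registers, and touches global memory only once per element at the start and end of the run.

template<int SZ, typename DType, typename IType>
__global__ void AccumulateSortedRuns(DType* out, const DType* ograd,
                                     const IType* sorted_data,
                                     const IType* original_idx,
                                     const int row_length, const int data_size) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid >= data_size) return;
  // only the first thread of each run of equal row ids does the accumulation
  if (tid == 0 || sorted_data[tid - 1] != sorted_data[tid]) {
    const int out_offset = static_cast<int>(sorted_data[tid]) * row_length;
    DType acc[SZ];  // local (register) accumulator; assumes row_length <= SZ
    for (int i = 0; i < row_length; ++i) acc[i] = out[out_offset + i];  // read the row once
    do {
      const int ograd_offset = static_cast<int>(original_idx[tid]) * row_length;
      for (int i = 0; i < row_length; ++i) acc[i] += ograd[ograd_offset + i];
      ++tid;
    } while (tid < data_size && sorted_data[tid - 1] == sorted_data[tid]);
    for (int i = 0; i < row_length; ++i) out[out_offset + i] = acc[i];  // write the row back once
  }
}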
src/operator/tensor/indexing_op.cu
Outdated
using nnvm::dim_t;
if (req == kNullOp) return;
CHECK_EQ(req, kWriteTo) << "SparseEmbedding layer doesn't support "
                        << "weight gradient calculation with req != write";
For the Embedding layer, enabling AddTo in the backward pass is essential to the training speed of RNNs (because we use the same embedding at every time step). I think we need to support kAddTo in the sparse embedding layer (maybe in another PR).
Thanks for bringing this up. For embedding:
- Usually the inputs are concatenated before being passed to Embedding, so the gradient is only calculated once and no "addto" req is required.
- The "addto" req is usually not supported for sparse gradients because it requires re-allocating memory, which is expensive.
Maybe we can revisit supporting the "addto" req later.
src/operator/tensor/indexing_op.h
Outdated
int input_dim;
int output_dim;
int dtype;
bool force_deterministic;
force_deterministic -> deterministic
Will update it
src/operator/tensor/indexing_op.h
Outdated
.add_enum("int32", mshadow::kInt32)
.describe("Data type of weight.");
DMLC_DECLARE_FIELD(force_deterministic).set_default(false)
.describe("Force the gradient computation to be executed according to a deterministic order.");
Explain that this is slower?
Sure
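For illustration, a hedged sketch of how the field description could call out the slowdown. The parameter name (using the deterministic rename suggested above) and exact wording in the final PR may differ.

DMLC_DECLARE_FIELD(deterministic).set_default(false)
.describe("Force the backward gradient of the embedding weight to be computed "
          "in a deterministic order. This makes results reproducible across runs "
          "but is noticeably slower than the default non-deterministic kernel.");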
src/operator/tensor/indexing_op.cu
Outdated
MSHADOW_TYPE_SWITCH(ograd.type_flag_, DType, {
  MSHADOW_IDX_TYPE_SWITCH(output.aux_type(kIdx), RType, {
    // temp resource declarations
    dim_t* lookup_table = NULL;
Can this huge chunk of code be pulled out into a template function so that it's steppable in the debugger?
Sure. I'll update it.
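A hedged sketch of the shape such a refactor could take inside indexing_op.cu (which already pulls in the needed mxnet/mshadow headers); the helper name and signature are assumptions, not the PR's final code.

template<typename DType, typename RType>
void SparseEmbeddingDeterministicBackwardImpl(mshadow::Stream<gpu>* s,
                                              const DType* ograd, const RType* data,
                                              DType* grad_data, RType* grad_row_idx,
                                              const nnvm::dim_t row_length,
                                              const nnvm::dim_t data_size) {
  // temp resource allocation, copy/range/sort/unique, and the add_grad kernel
  // launches move in here, so the whole path is steppable as one function
}

// the type switches then become a thin dispatch:
// MSHADOW_TYPE_SWITCH(ograd.type_flag_, DType, {
//   MSHADOW_IDX_TYPE_SWITCH(output.aux_type(kIdx), RType, {
//     SparseEmbeddingDeterministicBackwardImpl<DType, RType>(s, ...);
//   });
// });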
src/operator/tensor/indexing_op.cu
Outdated
@@ -103,13 +162,125 @@ void SparseEmbeddingOpForwardRspImpl<gpu>(const OpContext& ctx,
  }
}

inline void SparseEmbeddingOpBackwardDeterministicRspImpl(const OpContext& ctx, |
Are the deterministic/nondeterministic versions divergent for more than 50% of their code or can they be combined somewhat? Looks kind of hard to maintain.
Unfortunately they use totally different kernels (a rough sketch of the deterministic pipeline follows the lists):
non-deterministic:
- mark row idx
- prefix sum
- add_grad_atomic_add
deterministic:
- copy
- range
- sort
- unique
- add_grad_deterministic
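A hedged host-side sketch of that deterministic pipeline, using Thrust here purely for brevity; the PR uses mxnet's own kernels and temp-space sort, and the names below are illustrative.

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/unique.h>

// data: the input row ids of the embedding lookup, one per looked-up row
size_t DeterministicBackwardPipeline(const thrust::device_vector<int64_t>& data,
                                     thrust::device_vector<int64_t>* sorted_data,
                                     thrust::device_vector<int64_t>* original_idx,
                                     thrust::device_vector<int64_t>* unique_rows) {
  *sorted_data = data;                                            // copy
  original_idx->resize(data.size());
  thrust::sequence(original_idx->begin(), original_idx->end());   // range 0..N-1
  // sort so that duplicate row ids form contiguous runs, keeping the
  // original positions so the matching ograd rows can be found later
  thrust::sort_by_key(sorted_data->begin(), sorted_data->end(),
                      original_idx->begin());
  *unique_rows = *sorted_data;                                     // unique row ids
  auto end = thrust::unique(unique_rows->begin(), unique_rows->end());
  // an add_grad_deterministic-style kernel then walks each sorted run in order
  return static_cast<size_t>(end - unique_rows->begin());
}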
Kernel<mark_lookup_table, gpu>::Launch(s, nnr, lookup_table, grad_row_idx);

// accumulate gradients
DType* grad_data = output.data().dptr<DType>(); |
Should we set it to zero?
Yes. I should not have removed it. Will update
src/operator/tensor/indexing_op.cu
Outdated
  tid++;
} while (tid < data_size && sorted_data[tid - 1] == sorted_data[tid]);
for (int i = 0; i < num_features; i++) {
  out[out_offset + i] = acc[i];
Which one should be correct: out[out_offset + i] = acc[i]; or out[out_offset + i] += acc[i];?
It should be += instead.
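So, per the exchange above, the write-back presumably accumulates instead of overwriting (hedged sketch of the fix):

for (int i = 0; i < num_features; i++) {
  out[out_offset + i] += acc[i];  // accumulate into the (zero-initialized) gradient
}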
@marcoabreu added warning msg.
2c97d35 to 86f3833
* refactor embed backward kernel caller
* pass unit test
* refactor
* fix dim bug
* add unique impl
* remove old op
* remove unused kernel
* Revert "remove unused kernel" (this reverts commit 948c5a3)
* Revert "remove old op" (this reverts commit 5d1cd64)
* fix kernel launcher
* add force_determ option
* add doc
* fix lint
* update test
* CR comments
* lint
* set grad to be 0s initially
* add warning
Description
(reopen of #9846)
Add the force_deterministic option for contrib.SparseEmbedding. The option guarantees a deterministic gradient during the backward pass. The backward pass with force_deterministic=True is 50% slower on a p2 instance and 80% slower on a p3 instance compared to force_deterministic=False.
indexing_op-inl.cuh is simply a refactoring of the original code.
Checklist
Essentials
- Passed code style checking (make lint)
Changes
Comments