[MKL-DNN] MKL-DNN RNN backward path enhancement #17183
Conversation
We compared the performance between 1cfaf3c and this PR (c91c9e1) on LSTM and GRU. The speedup ratios indicate that this patch improves Backward performance, while the performance of some forward paths has decreased significantly and requires further investigation. Test configuration: num_layers=1, batch_size=64, seq_len=50, input_dim=512.
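The benchmark script itself is not part of this conversation; a minimal Gluon sketch of such a forward/backward timing run could look like the following. The layer count, batch size, sequence length, and input dimension are taken from the configuration above; the hidden size, warm-up count, and timed-iteration count are assumptions.

```python
# Minimal sketch of an LSTM forward/backward timing loop (not the actual script
# behind the numbers above). num_layers, batch_size, seq_len, and input_dim follow
# the configuration quoted in the comment; hidden size and iteration counts are assumptions.
import time
import mxnet as mx
from mxnet import autograd, gluon

num_layers, batch_size, seq_len, input_dim = 1, 64, 50, 512
hidden_size = 512  # assumed; not stated in the comment

rnn = gluon.rnn.LSTM(hidden_size, num_layers=num_layers)
rnn.initialize()
x = mx.nd.random.uniform(shape=(seq_len, batch_size, input_dim))  # layout 'TNC'

def fwd_bwd():
    with autograd.record():
        out = rnn(x)
    out.backward()
    mx.nd.waitall()  # block on async execution so the timing is meaningful

for _ in range(10):      # warm-up iterations
    fwd_bwd()

start = time.time()
for _ in range(100):     # timed iterations
    fwd_bwd()
print("avg fwd+bwd: %.3f ms" % ((time.time() - start) / 100 * 1e3))
```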
@zixuanweeei Please remove [WIP] from the title and retrigger CI. @ciyongch please help to review. Thanks!
@zixuanweeei Do you have any clue about the perf degradation on the forward pass? The PR also did some refactoring to the common code of the forward and backward paths.
Another perf test with more warm-up loops compared the performance between 89fe1f6 and 622d843. The refactoring that affects the forward pass is related to reorder, but reorder is only executed during initialization at the first iteration, and the RNN primitives may take more time to stabilize. I think we can get this PR into master first to settle the gradient-explosion problem, and then into the 1.6.x release.
LGTM
* Flush memory before RNN backward primitive
* Add gluon rnn unit test for gradients check
* Cache reorder
* Re-write rnn supporting check
* Update OpSignature.AddSign to avoid potential hash collision for rnn-packed memory
  Get the data type from mkldnn memory descriptor when setting grad handle
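The gradient-check unit test named above lives in the PR's test suite; as an illustration only, a minimal sketch of such a check might look like the following (the hidden size, shapes, and tolerance are assumptions, not the actual test code).

```python
# Illustration only -- a minimal Gluon RNN gradient check in the spirit of the
# unit test named above; the real test added by this PR uses its own shapes and
# tolerances (the values here are assumptions).
import numpy as np
import mxnet as mx
from mxnet import autograd, gluon

lstm = gluon.rnn.LSTM(16, num_layers=1)
lstm.initialize()

x = mx.nd.random.uniform(shape=(5, 4, 8))  # (seq_len, batch, input_dim)
x.attach_grad()
with autograd.record():
    y = lstm(x)
y.backward()

grad = x.grad.asnumpy()
assert np.all(np.isfinite(grad))   # no NaN/Inf in the input gradient
assert np.abs(grad).max() < 1e3    # loose bound; exploding gradients (#17086) would exceed it
```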
Thanks, @ciyongch. @TaoLv @pengzhao-intel Please take another look. I rebased the current branch on the official master, so we can merge it once CI passes.
LGTM
Will merge the PR after CI passes.
* [MKLDNN] mkldnn RNN operator enhancement (#17075)
  * mkldnn rnn operator enhancement
    `add` operation support
    Rename AddTo
    Add MXNET_USE_MKLDNN_RNN env
    Add Env var for switching to naive RNN impl and naive add/copy impl
  * Re-run CI, op:test_reduce failed on Unix-CPU
  * Rerun CI, Python2 CPU on Unix-CPU timeout
* MKL-DNN RNN backward path enhancement (#17183)
  * Flush memory before RNN backward primitive
  * Add gluon rnn unit test for gradients check
  * Cache reorder
  * Re-write rnn supporting check
  * Update OpSignature.AddSign to avoid potential hash collision for rnn-packed memory
    Get the data type from mkldnn memory descriptor when setting grad handle
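The MXNET_USE_MKLDNN_RNN variable mentioned in the #17075 commit list switches between the MKL-DNN fused RNN and the naive implementation. A small usage sketch follows; the value semantics here (1 = MKL-DNN, 0 = naive) and the need to set it before MXNet runs any RNN op are assumptions.

```python
# Sketch of toggling the fused MKL-DNN RNN via the MXNET_USE_MKLDNN_RNN variable
# introduced in #17075. Assumption: 1 (default) selects the MKL-DNN implementation
# and 0 falls back to the naive implementation; set it before any RNN op runs.
import os
os.environ["MXNET_USE_MKLDNN_RNN"] = "0"

import mxnet as mx
from mxnet import gluon

lstm = gluon.rnn.LSTM(512, num_layers=1)
lstm.initialize()
out = lstm(mx.nd.ones((50, 64, 512)))  # (seq_len, batch, input_dim)
print(out.shape)
```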
Description
This PR mainly aims to fix the gradient explosion reported in #17086. Other enhancements are also included, as listed in the Changes section.
Checklist
Changes
Comments
@ciyongch @TaoLv @pengzhao-intel