
Change RNN OP to stateful #14476

Merged: 30 commits merged into apache:master on Apr 13, 2019

Conversation

@lihaofd (Contributor) commented Mar 20, 2019

Description

This PR refactors the RNN OP to be stateful by using FStatefulCompute.
@pengzhao-intel, @TaoLv , @ciyongch

Feature changes

New features

  • Support Stateful RNN OP registration

Unit-test changes

  • Use existing unit test cases such as tests/python/unittests/test_operator.py to check consistency with LSTMCell/GRUCell/RNNCell outputs

Checklist

  • Passed code style checking (make lint).
  • All changes have test coverage.
  • Code is well-documented.

@TaoLv (Member) commented Mar 20, 2019

@szha @sbodenstein Please help to review.

@TaoLv requested review from szha and sxjscience on Mar 20, 2019 at 02:58
@szha (Member) commented Mar 20, 2019

cc @stephenrawls since the work is related.

@pengzhao-intel (Contributor) commented

@szha @stephenrawls we are not 100% sure about the refactoring of the GPU part. Please help to review it; suggestions and contributions are highly appreciated.

We can add the related folks to the collaboration list for the code refactoring.

@pinaraws commented

@mxnet-label-bot add[pr-awaiting-review]

@marcoabreu added the pr-awaiting-review (PR is waiting for code review) label on Mar 20, 2019
@stephenrawls (Contributor) commented

This will definitely impact my cuDNN var-length PR. It's been on the back burner for the last couple of weeks, but I've been meaning to push it through soon.

If this is going to get merged first I can just update my PR to work with this change.

I'm not entirely familiar with the code base. Can you let me know what making RNN a stateful operator means? I'm not sure if this is something that I need to worry about in my PR or not.

Mostly it looks like this is just taking the cuDNN-specific header file and putting all of that code in the generic rnn-inl.h header file?

@szha (Member) commented Mar 20, 2019

@stephenrawls This is to change the RNN op from a legacy operator to a stateful operator. A stateful operator supports carrying arbitrary state from the forward computation to the backward computation. This makes it easy to pass state such as the cuDNN reserve space from the forward op to the backward op.
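
For readers unfamiliar with the interface, here is a minimal sketch of how a stateful operator is registered through MXNet's nnvm attributes. This is simplified; RNNState and RNNStatefulCompute are illustrative stand-ins, not necessarily the exact names merged in this PR:

    #include <mxnet/op_attr_types.h>

    // Illustrative state object; the real op keeps cuDNN descriptors,
    // the reserve space, cached shapes, etc. in here.
    struct RNNState {
      // ...
    };

    NNVM_REGISTER_OP(RNN)
    // Create the state once; the same OpStatePtr is handed to both the
    // forward and the backward compute functions.
    .set_attr<FCreateOpState>("FCreateOpState",
      [](const NodeAttrs& attrs, Context ctx,
         const std::vector<TShape>& in_shape,
         const std::vector<int>& in_type) {
        return OpStatePtr::Create<RNNState>();
      })
    // Unlike FCompute, FStatefulCompute receives the OpStatePtr, so
    // anything stashed during forward is visible during backward.
    .set_attr<FStatefulCompute>("FStatefulCompute<cpu>", RNNStatefulCompute<cpu>)
    .set_attr<FStatefulCompute>("FStatefulCompute<gpu>", RNNStatefulCompute<gpu>);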

@lihaofd (Contributor, Author) commented Mar 25, 2019

CI is blocked by #14507.

src/operator/rnn-inl.h (outdated review thread, resolved)
- DType* cx_ptr = NULL;
- DType* cy_ptr = NULL;
+ DType * cx_ptr = NULL;
+ DType * cy_ptr = NULL;
Member comment:

DType* cx_ptr looks better and makes the style more consistent.

Author reply:

fixed

src/operator/rnn-inl.h (outdated review thread, resolved)
@sxjscience (Member) commented

Conducted one round of review; found some minor problems.

src/operator/rnn-inl.h (outdated review thread, resolved)
@szha (Member) commented Apr 9, 2019

@lihaofd thanks for the fix. Could you rebase to the current master? The flaky tests might have been disabled on master already.

src/operator/rnn-inl.h (outdated review thread, resolved)
src/operator/rnn.cc (outdated review thread, resolved)
@szha (Member) left a review comment:

Had an offline discussion with @eric-haibin-lin regarding the specific point of using the storage handler for managing temporary space. Conclusions:

  • This is not good practice, as an operator should not be able to interact with the storage manager.
  • A good alternative is to create an NDArray for the same purpose; the operator can then rely on its RAII property for life-cycle management (see the sketch after this comment).

However, we realize that this is a bad pattern that already exists in the current code, so we would not want to impose the work of correcting it on the PR author. @lihaofd let me know if you prefer to correct this point (which would be really helpful!) or if you'd prefer to merge as is (which we can understand too).
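
A minimal sketch of the NDArray-based alternative described above, assuming the member name workspace_byte_ and the GPU context from the code quoted later in this review (illustrative, not code from the PR):

    // Instead of: temp_space_ = Storage::Get()->Alloc(workspace_byte_,
    //                                                 Context::GPU(s->dev_id));
    // hold an NDArray member; its destructor returns the memory
    // automatically (RAII), so the op never calls Free() itself.
    NDArray temp_space_ = NDArray(mshadow::Shape1(workspace_byte_),
                                  Context::GPU(s->dev_id),
                                  false,             // allocate eagerly
                                  mshadow::kUint8);  // raw byte buffer
    char* workspace = static_cast<char*>(temp_space_.data().dptr_);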

@lihaofd (Contributor, Author) commented Apr 12, 2019

@szha I suggest we merge the current PR and keep enhancing it in the future. Thanks.

@eric-haibin-lin (Member) left a review comment:

Thanks for the contribution. This is a significant amount of work to refactor all of this code. Please check my comments. While my intent is not to block this PR from merging, I hope to get some clarifications. I have no objection to merging the PR if there is urgency or a dependency.

3,
dim_w));

// Query weight layout
Member comment:

What are these?

Member reply:

cuDNN provides an interface for querying the layout of the fused weight matrix. The current operator hard-codes the layout instead of relying on this interface, since the layout requirement rarely changes.
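
For reference, a sketch of the cuDNN v7-era query interface being referred to; the handle/descriptor variable names are assumptions, and this is an illustration rather than code from the PR:

    // For a given layer and linear-layer ID (gate), cuDNN reports where
    // that weight sub-matrix lives inside the fused weight blob w_ptr.
    cudnnFilterDescriptor_t mat_desc;
    CUDNN_CALL(cudnnCreateFilterDescriptor(&mat_desc));
    void* mat_ptr = nullptr;
    CUDNN_CALL(cudnnGetRNNLinLayerMatrixParams(
        handle, rnn_desc,
        /*pseudoLayer=*/0,     // which layer (and direction) to query
        x_desc, w_desc, w_ptr,
        /*linLayerID=*/0,      // which gate's weight matrix
        mat_desc, &mat_ptr));  // out: descriptor and pointer into w_ptr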

// Allocate the reserve space
reserve_space_ = Storage::Get()->Alloc(reserve_space_byte_, Context::GPU(s->dev_id));
// Allocate the temp space
temp_space_ = Storage::Get()->Alloc(workspace_byte_, Context::GPU(s->dev_id));
Member comment:

If I understand it correctly, temp_space_ can be requested by adding a kTempSpace entry to the resource request (checking dev_type in FResourceRequestEx), instead of calling the StorageManager directly?

Member reply:

Yes, it was done that way in the previous implementation. Given that the storage manager is already exposed because of the reserve space, I requested that the temp space be handled consistently. It also has the benefit that the space can be shared between forward and backward, since the size doesn't change.
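
For context, a minimal sketch of the resource-request alternative mentioned above (in MXNet the enum is ResourceRequest::kTempSpace; the registration shown is illustrative, not the code this PR merged):

    NNVM_REGISTER_OP(RNN)
    .set_attr<FResourceRequestEx>("FResourceRequestEx",
      [](const NodeAttrs& attrs, const int dev_mask,
         const DispatchMode dispatch_mode) {
        std::vector<ResourceRequest> request;
        if (dev_mask == mshadow::gpu::kDevMask) {
          // Ask the engine for scratch space instead of calling
          // Storage::Get()->Alloc() inside the operator.
          request.emplace_back(ResourceRequest::kTempSpace);
        }
        return request;
      });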

dep.push_back(out_data[rnn_enum::kStateOut]);
dep.push_back(out_grad[rnn_enum::kStateOut]);
/*
index description
Member comment:

We can probably make some enums like RNNOpInputs for these indices.

Author reply:

It depends on param.mode == rnn_enum::kLstm and param.state_outputs: the index order differs across RNN types and between with/without state output (see the sketch below).
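
To illustrate the point, a hypothetical enum in the spirit of the suggestion (names are illustrative, not from the PR):

    // Hypothetical fixed index enum:
    enum class RNNBwdInput { kData = 0, kParams, kState, kCellState };
    // The problem the author points out: kCellState only exists when
    // param.mode == rnn_enum::kLstm, and the state-output gradients are
    // only present when param.state_outputs is true, so the position of
    // every later entry shifts; a single fixed enum cannot describe all
    // of the mode/state_outputs combinations.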

@szha merged commit 1c49e40 into apache:master on Apr 13, 2019
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request on Jun 23, 2019:
* change RNN OP to stateful

* retrigger the ci

* fix windows compile issue

* move cudnnrnn class into rnn-inl.h

* fix some bugs

* fix bug about gpu case like tests/python/gpu/test_gluon_gpu.test_rnn_layers_fp32 etc

* fix gpu compile issue of unix-gpu and windows-gpu

* print log for test

* fix GPU NO CUDNN for unix-gpu case

* print log

* remove print log and make gpu case has same error message as master when USE_CUDA=1 USE_CUDNN=0

* fix typo bug

* retrigger the ci

* retrigger the ci

* retrigger the ci

* fix comments

* retrigger the ci

* fix comments

* retrigger the ci

* fix comments

* retrigger the ci

* retrigger the ci

* retrigger the ci

* retrigger the ci

* fix comments