RNN + LSTM Layers #3948
Merged
3 commits merged into BVLC:master on Jun 2, 2016

Conversation

jeffdonahue
Contributor

This PR includes the core functionality (with minor changes) of #2033 -- the RNNLayer and LSTMLayer implementations (as well as the parent RecurrentLayer class) -- without the COCO data downloading/processing tools or the LRCN example.

Breaking off this chunk for merge should make users who are already using these layer types on their own happy, without adding a large review/maintenance burden for the examples (which have already broken multiple times due to changes in the COCO data distribution format...). On the other hand, without any example on how to format the input data for these layers, it will be fairly difficult to get started, so I'd still like to follow up with at least a simple sequence example for official inclusion in Caffe (maybe memorizing a random integer sequence -- I think I have some code for that somewhere) soon after the core functionality is merged.

There's still at least one documentation TODO: I added expose_hidden to allow direct access (via bottoms/tops) to the recurrent model's 0th timestep and Tth timestep hidden states, but didn't add anything to the list of bottoms/tops -- still need to do that. Otherwise, this should be ready for review.
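
For a sense of what a net using these layers might look like, here is a rough net_spec sketch; it is not part of this PR, the shapes and num_output are illustrative, and the time-major T x N data layout plus the cont indicator follow the discussion later in this thread.

import caffe
from caffe import layers as L

T, N, D = 20, 16, 128  # timesteps, independent streams per batch, feature dimension (illustrative)

n = caffe.NetSpec()
# Time-major inputs: data is T x N x D, cont is T x N (cont = 0 marks the start of a sequence).
n.data, n.cont = L.Input(shape=[dict(dim=[T, N, D]), dict(dim=[T, N])], ntop=2)
n.lstm = L.LSTM(n.data, n.cont, recurrent_param=dict(num_output=256))

with open('lstm_example.prototxt', 'w') as f:
    f.write(str(n.to_proto()))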

weiliu89 added a commit to weiliu89/caffe that referenced this pull request Apr 7, 2016

/**
* @brief An abstract class for implementing recurrent behavior inside of an
* unrolled network. This Layer type cannot be instantiated -- instaed,
Contributor

typo: "instaed"

@shelhamer added the focus label Apr 8, 2016
weiliu89 added a commit to weiliu89/caffe that referenced this pull request Apr 9, 2016
@weiliu89

It doesn't work with the current net_spec.py. Specifically: 1) it fails when using L.LSTM() or L.RNN(), since only RecurrentParameter is defined in caffe.proto; 2) it fails when using L.Recurrent(), since RecurrentLayer is not registered (it is an abstract class).

I worked around it with a simple hack, adding the following to the param_name_dict() function in net_spec.py:

# map both the LSTM and RNN layer types onto the shared 'recurrent' parameter name
param_names += ['recurrent', 'recurrent']
param_type_names += ['LSTM', 'RNN']

@shelhamer
Member

@weiliu89 the recurrent parameter for these layers, like the convolution parameter for DeconvolutionLayer, is defined in net spec by naming it directly:

import caffe
from caffe import layers as L

n = caffe.NetSpec()
...
n.lstm = L.LSTM(n.data, recurrent_param=dict(num_output=10))
...

Whether to map these shared parameter types as you suggest here, or as suggested for DeconvolutionLayer in #3954, could be handled by a separate PR, since recurrent layers are not the only instance of this.

const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);

/// @brief A helper function, useful for stringifying timestep indices.
virtual string int_to_str(const int t) const;
Member

It's a little surprising to see a helper like this show up in the recurrent layer, but if there weren't any use for it elsewhere then I suppose it could live here. That said, there is already format.hpp and its format_int() function that was added for cross-platform compatibility in b72b031. How about making use of that instead?

@shelhamer
Member

LGTM overall—my only comments were about comments and naming (and that one int -> string function). @longjon are you done with your review?

@ajtulloch
Contributor

Looks great. Thanks for this @jeffdonahue. We've been using a variant of this for a while and it has performed great.

One thing we can additionally PR/gist (if it's useful) is a wrapper around the LSTM layer that allows arbitrary-length (batched) forward propagation, which came in handy when doing inference on arbitrary-length sequences (relaxing the constraint around T_ while preserving memory efficiency for the forward pass by reusing activations across timesteps).

@jeffdonahue force-pushed the recurrent-layer branch 3 times, most recently from bbd33d2 to 4b6c835 on April 26, 2016 23:52
@jeffdonahue
Contributor Author

@shelhamer @longjon thanks for the review! Fixed as suggested.

@ajtulloch glad to hear it's been working for you guys, thanks for looking it over! I'm not sure I understand the idea of the wrapper though. I think this implementation should be able to do what you're saying -- memory efficient forward propagation over arbitrarily long sequences -- by feeding in T_=1 (1xNx...) data to the RecurrentLayer and setting cont=0 at the first timestep of the sequence, then cont=1 through the end (then starting over with cont=0 at the start of the next sequence). This should reuse the activation memory as you mentioned (using just O(N) memory rather than O(NT)). (In fact, this capability is the point of having the cont input in the first place.) Maybe your wrapper is a friendly interface that handles all the bookkeeping for this? In that case it definitely sounds like it would be helpful. Or maybe I'm totally misunderstanding?
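
To make the bookkeeping concrete, here is a rough pycaffe sketch of that single-timestep streaming pattern; the prototxt/caffemodel paths and the 'data', 'cont', and 'lstm' blob names are placeholders, not part of this PR.

import numpy as np
import caffe

# Hypothetical single-timestep net: "data" is 1 x N x D, "cont" is 1 x N.
net = caffe.Net('lstm_stream.prototxt', 'lstm.caffemodel', caffe.TEST)

def stream_forward(net, sequence):
    """Feed one timestep at a time; the layer carries its hidden state across calls.

    sequence is an iterable of (N x D) arrays for a single batch of N streams.
    """
    outputs = []
    for t, x in enumerate(sequence):
        net.blobs['data'].data[0] = x
        # cont = 0 resets the hidden state at the sequence start; cont = 1 carries it over.
        net.blobs['cont'].data[0] = 0 if t == 0 else 1
        net.forward()
        # 'lstm' is the assumed name of the LSTM layer's output blob.
        outputs.append(net.blobs['lstm'].data[0].copy())
    return np.stack(outputs)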

@ajtulloch
Contributor

ajtulloch commented May 3, 2016

@jeffdonahue yeah, the only contribution was allowing variable-T_ inputs while still batching the i2h transform -- this was substantially faster than the approach you describe (T_ = 1 and looping, which I initially did), IIRC ~3x for some of our models. It costs a bit more memory (NxT_xD for the i2h activations, but only NxD for the h/c states for arbitrary T_), while saving the NxT_xD for the h/c states. https://gist.github.com/ajtulloch/2b7a98de642df934456001de238ed5c7 is the CPU impl -- it's a bit niche so I wouldn't advocate pulling it at all, but it might be handy for someone who hits this issue in the future.

@jeffdonahue
Contributor Author

Ah -- batching the input transformation regardless of sequence length indeed makes sense. Thanks in advance for posting the code!

niketanpansare pushed a commit to niketanpansare/systemml that referenced this pull request May 10, 2016
@MinaRe

MinaRe commented May 13, 2016

Dear all,

I have a very big matrix (rows are IDs and columns are labels), and I was wondering how I can do the training in Caffe with just fully connected layers?

Thanks a lot.

niketanpansare pushed a commit to niketanpansare/systemml that referenced this pull request May 13, 2016
@dangweili

When will this be merged?

@yshean

yshean commented May 21, 2016

Has anyone successfully merged @jeffdonahue's caffe:recurrent-layer with BVLC's caffe:master? Why does the assertion CHECK_EQ(2 + num_recur_blobs + static_input_, unrolled_net_->num_inputs()); fail during make runtest?

[----------] 9 tests from LSTMLayerTest/0, where TypeParam = caffe::CPUDevice<float>
[ RUN      ] LSTMLayerTest/0.TestForward
F0521 03:29:55.683001  5650 recurrent_layer.cpp:142] Check failed: 2 + num_recur_blobs + static_input_ == unrolled_net_->num_inputs() (4 vs. 0) 

myfavouritekk added a commit to myfavouritekk/caffe that referenced this pull request May 24, 2016
RNN + LSTM Layers

* jeffdonahue/recurrent-layer:
  Add LSTMLayer and LSTMUnitLayer, with tests
  Add RNNLayer, with tests
  Add RecurrentLayer: an abstract superclass for other recurrent layer types
aralph added a commit to aralph/caffe that referenced this pull request Jun 1, 2016
…LSTM Layers BVLC#3948' by jeffdonahue for BVLC/caffe master.
@jeffdonahue
Contributor Author

Thanks again for the reviews everyone. Sorry for the delays -- wanted to do some additional testing, but I'm now comfortable enough with this to merge.

@jeffdonahue merged commit 58b10b4 into BVLC:master on Jun 2, 2016
@ajtulloch
Contributor

Very nice work @jeffdonahue.

@naibaf7
Member

naibaf7 commented Jun 2, 2016

@jeffdonahue
Now also available on the OpenCL branch.

@jakirkham

Any plans for a release?

@antran89
Contributor

antran89 commented Jun 7, 2016

Could you add a link to a working tutorial/example on using these layers? It would make things easier for new learners. I know you have one somewhere.

yjxiong pushed a commit to yjxiong/caffe that referenced this pull request Jun 15, 2016
@wenwei202

Great work!!! @jeffdonahue I used https://github.com/jeffdonahue/caffe/tree/recurrent-rebase-cleanup/ as the example and ran ./examples/coco_caption/train_language_model.sh. The code I used is BVLC master. It converges well at the beginning but diverges after iteration 2399, as follows:

I0630 15:15:16.417166 23801 solver.cpp:228] Iteration 2397, loss = 61.5563
I0630 15:15:16.417196 23801 solver.cpp:244] Train net output #0: cross_entropy_loss = 3.13294 (* 20 = 62.6589 loss)
I0630 15:15:16.417207 23801 sgd_solver.cpp:106] Iteration 2397, lr = 0.1
I0630 15:15:16.533344 23801 solver.cpp:228] Iteration 2398, loss = 61.561
I0630 15:15:16.533375 23801 solver.cpp:244] Train net output #0: cross_entropy_loss = 3.13485 (* 20 = 62.6971 loss)
I0630 15:15:16.533386 23801 sgd_solver.cpp:106] Iteration 2398, lr = 0.1
I0630 15:15:16.655758 23801 solver.cpp:228] Iteration 2399, loss = 61.5369
I0630 15:15:16.655824 23801 solver.cpp:244] Train net output #0: cross_entropy_loss = 2.98118 (* 20 = 59.6236 loss)
I0630 15:15:16.655838 23801 sgd_solver.cpp:106] Iteration 2399, lr = 0.1
I0630 15:15:16.776641 23801 solver.cpp:228] Iteration 2400, loss = 78.3731
I0630 15:15:16.776676 23801 solver.cpp:244] Train net output #0: cross_entropy_loss = 87.3366 (* 20 = 1746.73 loss)
I0630 15:15:16.776690 23801 sgd_solver.cpp:106] Iteration 2400, lr = 0.1
I0630 15:15:16.892026 23801 solver.cpp:228] Iteration 2401, loss = 95.2123
I0630 15:15:16.892060 23801 solver.cpp:244] Train net output #0: cross_entropy_loss = 87.3365 (* 20 = 1746.73 loss)
I0630 15:15:16.892071 23801 sgd_solver.cpp:106] Iteration 2401, lr = 0.1
I0630 15:15:17.007628 23801 solver.cpp:228] Iteration 2402, loss = 112.041
I0630 15:15:17.007663 23801 solver.cpp:244] Train net output #0: cross_entropy_loss = 87.3365 (* 20 = 1746.73 loss)
I0630 15:15:17.007675 23801 sgd_solver.cpp:106] Iteration 2402, lr = 0.1
I0630 15:15:17.123337 23801 solver.cpp:228] Iteration 2403, loss = 128.873
I0630 15:15:17.123373 23801 solver.cpp:244] Train net output #0: cross_entropy_loss = 87.3365 (* 20 = 1746.73 loss)
I0630 15:15:17.123384 23801 sgd_solver.cpp:106] Iteration 2403, lr = 0.1
I0630 15:15:17.239030 23801 solver.cpp:228] Iteration 2404, loss = 145.734
I0630 15:15:17.239061 23801 solver.cpp:244] Train net output #0: cross_entropy_loss = 87.3365 (* 20 = 1746.73 loss)
I0630 15:15:17.239074 23801 sgd_solver.cpp:106] Iteration 2404, lr = 0.1

Any suggestions?

@UsamaShafiq91

@jeffdonahue I am new to Caffe. Do you have any example of how to use the RNN layer?
Any help will be appreciated.

@agethen

agethen commented Jul 26, 2016

@jeffdonahue May I ask for a clarification?
Suppose we have an encoder-decoder structure with two RNN/LSTM layers: the encoder reads features X, the decoder outputs its state H, and the encoder's state is copied to the decoder by setting expose_hidden: true and connecting the blobs.

I can see in RecurrentLayer::Reshape that the recur_input_blobs share their data with the bottom blobs, but they do not share their diff (unlike the top blobs)! Can the hidden state/cell state gradient then still flow backward from the decoder to the encoder, or is this a misunderstanding on my side?
Thank you very much!

@wenwei202

Hello, what makes it necessary to switch the dimension order of the bottom blob from N x T x ... to T x N x ...? With this layout, the batch_size in the prototxt is actually the number of unrolled timesteps, right?
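
For anyone tripped up by the layout, a small sketch of going from batch-major (N, T, ...) arrays to the time-major (T, N, ...) order these layers consume; the array names and sizes are illustrative, and the cont semantics follow the earlier discussion in this thread.

import numpy as np

N, T, D = 16, 20, 128  # streams, timesteps, feature dimension (illustrative)
data_nt = np.random.randn(N, T, D).astype(np.float32)

# Reorder to time-major: timestep t of every stream lives at data_tn[t].
data_tn = np.ascontiguousarray(data_nt.transpose(1, 0, 2))  # shape (T, N, D)

# Sequence-continuation indicator: 0 at the first timestep of each stream, 1 afterwards.
cont = np.ones((T, N), dtype=np.float32)
cont[0, :] = 0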

fxbit pushed a commit to Yodigram/caffe that referenced this pull request Sep 1, 2016
@ayushchopra96

Hi @jeffdonahue @weiliu89,
Is there support for accessing C (cell state) and H (hidden state) at each timestep?
I need this to simulate an attention mechanism.

Thanks in advance.
