Improve / Fix Weight Sharing #1211

Open
4 of 8 tasks
shelhamer opened this issue Oct 2, 2014 · 15 comments
Comments

@shelhamer (Member) commented Oct 2, 2014

Weight sharing as-is relies on a weight owner with which shared layers share their parameter blobs. This poses a few problems relating to the loss, loading and saving parameters, and weight initialization, which are listed here so they can be addressed.
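For reference, a minimal sketch of how sharing is declared in the old prototxt format used elsewhere in this thread: layers that name their parameter blobs identically share them, and the first layer to declare a given name becomes the owner of that blob. Layer and parameter names below are made up for illustration.

layers {
  name: "ip_left"
  type: INNER_PRODUCT
  bottom: "data_left"
  top: "ip_left"
  param: "shared_ip_w"   # same name in both layers => one shared weight blob
  param: "shared_ip_b"   # same name => one shared bias blob
  inner_product_param { num_output: 500 }
}
layers {
  name: "ip_right"
  type: INNER_PRODUCT
  bottom: "data_right"
  top: "ip_right"
  param: "shared_ip_w"
  param: "shared_ip_b"
  inner_product_param { num_output: 500 }
}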

@jeffdonahue @longjon

@jeffdonahue (Contributor)

Fix the resuming / fine-tuning issue for shared weights; see #959 (comment). Done in #594 as it turns out.

I just pushed a unit test for resuming from saved weights (4dc5bd0). It passes as expected, but fails when cherry-picked from 8dac339, before #594 was merged. Glad this was magically fixed, thanks @longjon!

@ducha-aiki (Contributor)

Would you consider tied weights as well? I have tried to implement them myself, but with the current weight sharing scheme it seemed too complicated.

@rodrigob (Contributor) commented Oct 6, 2014

@ducha-aiki what is the difference between tied weights and shared weights?

@shelhamer I can look into dying if fillers are defined where parameters are shared, if you tell me what the "Caffe way of dying" is (LOG(FATAL) and then what?).
Also, as an example, for InnerProductLayer, can you share the bias without sharing the product weights?

@ducha-aiki (Contributor)

@rodrigob Tied weights are used in autoencoders. If the encoder weights are W, then the decoder weights are W^T, i.e. the transpose.
https://groups.google.com/forum/#!topic/theano-users/QilEmkFvDoE
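For concreteness, the standard tied-weight autoencoder (a sketch in LaTeX; s is an arbitrary nonlinearity and the biases are left untied):

h = s(W x + b_h), \qquad \hat{x} = s(W^\top h + b_v)

so only the single matrix W is learned, and the decoder reuses its transpose.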

@shelhamer (Member, Author)

@ducha-aiki @rodrigob autoencoder-style shared weights are already possible with Caffe weight sharing if the blobs are shared with PERMISSIVE dimensionality checking: https://github.com/BVLC/caffe/blob/master/src/caffe/proto/caffe.proto#L273-L281 and the transpose shape is defined in the deconv layers.

While blobs can be shared permissively, so that they have the same total number of elements but different dimensions, this doesn't cover everything for W, W^T pairs, since the inner product weights with inputs and outputs swapped aren't the transpose.
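The PERMISSIVE check is requested per parameter. As a sketch, in the ParamSpec syntax of later caffe.proto versions (names here are illustrative), each sharing layer would carry something like:

param {
  name: "encdec_w"          # same name in every sharing layer => one shared blob
  share_mode: PERMISSIVE    # only require equal element counts, not equal shapes
}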

@ducha-aiki (Contributor)

@shelhamer but the weights are in a different order in the transposed matrix. I will check again, but when I tried it, it did not work.

@jeffdonahue (Contributor)

Yeah, it would not work for pairs of inner product layers where the weights are transposed (using permissive would probably give very bad results). It would require a little bit of additional implementation -- probably the easiest would be to add a "transposed weights" option to the inner product layer so that the layer pair could use the same weight matrix.
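A hypothetical sketch of what that could look like for a 784-to-1000-to-784 encoder/decoder pair (the transpose flag is the option being proposed and did not exist in InnerProductParameter at this time; parameter names are made up, and the biases are deliberately left untied):

layers {
  name: "encode1"
  bottom: "data"             # 784-D input, e.g. 28x28 MNIST
  top: "encode1"
  type: INNER_PRODUCT
  param: "tied_w"            # 1000x784 weight matrix, owned by this layer
  param: "encode1_b"
  inner_product_param { num_output: 1000 }
}
layers {
  name: "decode1"
  bottom: "encode1"
  top: "decode1"
  type: INNER_PRODUCT
  param: "tied_w"            # same blob, used as its transpose
  param: "decode1_b"
  inner_product_param {
    num_output: 784
    transpose: true          # the proposed "transposed weights" option
  }
}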

@ducha-aiki (Contributor)

@jeffdonahue This is easy. The real problem is the diffs, since they have not only a different shape but also a different number of elements.

@jeffdonahue (Contributor)

What? Why would the diffs be a different number of elements? I think I'm missing something...

@ducha-aiki (Contributor)

@jeffdonahue Because the size of the diff == the size of the output.
An example from the MNIST autoencoder:
name: "MNISTAutoencoder"
input: "data"
input_dim: 1
input_dim: 1
input_dim: 28
input_dim: 28
layers {
  bottom: "data"
  top: "encode1"
  name: "encode1"
  type: INNER_PRODUCT
  inner_product_param {
    num_output: 1000
  }
}
layers {
  bottom: "encode1"
  top: "decode1"
  name: "decode1"
  type: INNER_PRODUCT
  inner_product_param {
    num_output: 784
  }
}

@jeffdonahue (Contributor)

Right, the encode1 weights are 1000x784 (producing 1000D outputs from 784D inputs) and the decode1 weights have the transposed dimensions, 784x1000 (producing 784D outputs from 1000D inputs). The weight gradients have the same dimensions as the weights, by definition.
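Spelling that out for the tied case (a sketch in LaTeX for a single example; y = W x is encode1, \hat{x} = W^\top y is decode1, and \delta_{enc}, \delta_{dec} are the corresponding top diffs):

\frac{\partial L}{\partial W}\Big|_{encode1} = \delta_{enc}\, x^\top \in \mathbb{R}^{1000 \times 784}
\frac{\partial L}{\partial W}\Big|_{decode1} = (\delta_{dec}\, y^\top)^\top = y\, \delta_{dec}^\top \in \mathbb{R}^{1000 \times 784}

Both contributions have the shape of the shared weight blob, so they can simply be accumulated into its diff.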

@shelhamer (Member, Author)

We should keep #1659 in mind too.

@yosipk commented Mar 18, 2015

Mocha has a TiedInnerProductLayer [http://mochajl.readthedocs.org/en/latest/user-guide/layers/computation-layer.html#TiedInnerProductLayer, source: https://github.com/pluskid/Mocha.jl/blob/master/src/layers/tied-inner-product.jl]. I guess Caffe could be similar, along the lines of @jeffdonahue's suggestion to add a "transposed weights" option to the inner product layer.

@raingo commented Jun 13, 2015

Do we have an update on these?

Shared weights are very important for recurrent nets.

@Jim61C commented Apr 28, 2017

Hi, do we have an update on the 7th problem mentioned above?

"Only the owner should initialize weights. Currently unnecessary work and memory is expended filling all weights, and then these are discarded to share with the weight owners."

I am currently facing a memory problem with multiple FC layers that share weights. I believe it is because, even though I share weights between those FC layers, each of them is still initialized and takes extra memory when the network is created. Any idea on a workaround for this would be greatly appreciated!

Thanks!
