Exponential Moving Average #3696

Closed
soheilb opened this issue Sep 6, 2016 · 24 comments

Comments

@soheilb

soheilb commented Sep 6, 2016

It is often useful to keep a second copy of the model being trained that maintains an exponential moving average (EMA) of the model weights. This can result in improved performance on the validation/test set and is already available in TensorFlow: https://www.tensorflow.org/versions/r0.10/api_docs/python/train.html#ExponentialMovingAverage
It would also be a useful feature for training with Keras.

See the "Model Ensembles" section here for an intuitive description: http://cs231n.github.io/neural-networks-3/

I implemented this for my Keras experiments using a callback function:
https://gist.github.com/soheilb/c5bf0ba7197caa095acfcb69744df756

Feedback is very welcome. I can submit a pull request later if implementing this as a callback function turns out to be a good enough solution.
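
For readers unfamiliar with the idea, here is a minimal sketch of the update rule, not the code from the gist. It assumes the weights are flattened into a plain list of floats; the function name `ema_update` and the decay values are illustrative choices.

```python
def ema_update(shadow, weights, decay=0.999):
    """Return the new shadow weights after one EMA step.

    shadow and weights are equal-length lists of floats, a flattened
    stand-in for the model's weight tensors.
    """
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]


# Toy "training run": the raw weight oscillates around 1.0 and the
# shadow copy smooths the oscillation out.
weights = [0.0]
shadow = list(weights)  # initialise the shadow copy from the current weights
for step in range(1, 201):
    weights = [1.0 + (0.5 if step % 2 else -0.5)]  # noisy iterate
    shadow = ema_update(shadow, weights, decay=0.9)

print(weights, shadow)
```

In this toy run the shadow value ends up much closer to 1.0 than the last raw iterate, which is the effect the averaged copy is meant to provide.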

@soheilb soheilb changed the title Callback function to implement Exponential Moving Average Exponential Moving Average Sep 6, 2016
@kuza55
Contributor

kuza55 commented Sep 6, 2016

Thanks for adding this, I learnt something today :)

What's the performance of this like, given that you compute the weighted average on the CPU?

On one hand doing this on GPU is probably faster, but adding 50% GPU memory usage (double weights, but no extra gradients) is also not always great. One option could be having a consume_less parameter the same way Recurrent layers do to decide whether to run this on CPU or GPU.

@fchollet
Collaborator

fchollet commented Sep 6, 2016

This technique is called Polyak averaging. We could call it PolyakAveragingCheckpointer or something like that. The proper way to do it is to keep an EMA of the weights that is updated after every batch. It should be implemented in a way similar to the moving averages in the BN layer. It isn't entirely clear whether it can be achieved (efficiently) with a simple callback. The update should be made part of the main graph run.

It would definitely be a valuable addition to Keras. Have you thought about allowing saving after a number of batches, instead of after every epoch? That can be useful when each epoch is very long. The same is true for the regular model checkpointer.

@soheilb
Author

soheilb commented Sep 6, 2016

@kuza55 Sure :) I tested this on Theano only, and the transfer should happen on the GPU if the GPU is selected as the device. Model parameters in the Keras Theano backend are created as shared variables and reside on the GPU, so the copy is done on the GPU. As you guessed, I am noticing very little overhead when running on the GPU, but I have only tried it on LSTMs and this needs further testing.
Moreover, the consume_less parameter for recurrent layers does not decide whether operations run on the CPU or the GPU; that is decided by the device flag in Theano. It only selects implementations that are optimized for different devices (e.g. you can set consume_less to "cpu" and still run the layer on the GPU).

@soheilb
Author

soheilb commented Sep 6, 2016

@fchollet Thanks for the comment. If you look at my gist, I am indeed updating the parameters after every batch, as you mentioned. I only save the EMA copy to disk after every epoch, which can be changed to save after a number of batches as you suggested. I will check the BN layer later today to see how I can make the update part of the main graph run. Any other thoughts?

@kuza55
Contributor

kuza55 commented Sep 6, 2016

I'm not sure if I don't understand Theano or you misunderstood my comments, but my reading of your code is that you use K.batch_get_value to read all the weights into Python, and then compute the weighted average on the CPU, then transfer the weights back to the GPU, whereas you could use K.update etc to run the updates on the GPU without needing to bring the weights onto the CPU.

@soheilb
Author

soheilb commented Sep 7, 2016

Oh, I see your point. You are right; I was mainly thinking about the set_value() function.

I am not sure we can add the moving average update operation to the main graph when using the callback approach. The training function for Keras models is compiled before any callback functions are called (see _make_train_function in engine/training.py).

One option to bring the update into the main graph would be to change the compile function of the Model class and append the new update operations to self.updates. We should also take care of transferring the moving-averaged weights to the original model weights at the end of training. I am not sure what the API should look like in that case.

@crowsonkb
Contributor

crowsonkb commented Sep 10, 2016

Thanks for this!

In section 7.2 of the Adam paper they suggest using the beta_2 (default 0.999) parameter as the 'momentum parameter' in an EWMA of weights, and reference Polyak. They also give the initialization bias correction (like that used in Adam itself for momentum). I didn't understand it when I first read it and it was due to comments here that I worked it out. I had great success yesterday applying Polyak+Adam to image synthesis from convnets (neural style) - the results had far less noise than any unaveraged optimizer I tried. I tried to incorporate it into Keras' Adam optimizer and failed, since I couldn't figure out how to get gradients from something other than the averaged weights.
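
As a rough sketch of the averaging with initialization bias correction described above: the average starts at zero, is updated as `avg = beta2 * avg + (1 - beta2) * theta`, and is divided by `1 - beta2 ** t` to undo the bias toward zero. The generator name `bias_corrected_ema` and the scalar iterates are assumptions made for illustration.

```python
def bias_corrected_ema(iterates, beta2=0.999):
    """Yield the bias-corrected EMA after each iterate.

    avg_t = beta2 * avg_{t-1} + (1 - beta2) * theta_t, with avg_0 = 0
    hat_t = avg_t / (1 - beta2 ** t)
    """
    avg = 0.0
    for t, theta in enumerate(iterates, start=1):
        avg = beta2 * avg + (1.0 - beta2) * theta
        yield avg / (1.0 - beta2 ** t)


# With constant iterates the corrected estimate recovers the constant
# from the very first step, whereas the raw average would start near zero.
corrected = list(bias_corrected_ema([2.0] * 5))
print(corrected)
```

Without the division, the first few averaged values would be dominated by the zero initialization, which matters most when beta2 is close to 1.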

@kuza55
Contributor

kuza55 commented Sep 10, 2016

So, I'm thinking the best way to implement this in the main graph would be to create a layer that replaces your output layer.

The PolyakAveragingLayer would be a simple identity pass through in terms of data, but would examine the _keras_history of its input, finding all the layers that go into it, and extracting the trainable_weights property, allocate the shadow weights and register an update_op.

The question of where to allocate the weights remains; in TF there is tf.Operation.device which would let us allocate it wherever the original is, not sure about Theano.

I'm not sure how this should integrate with validation (should these weights be used for validation?) or inference (should these be used for inference by default); or should there just be a special function on the layer to save the model to a file similar to _make_mv_model ?

@fchollet
Collaborator

The proper way to do this is most likely to add a polyak_averaging option in compile, which would create EMA ops to be run as part of the main graph call.

When calling a test-time function, we would extract the model's weights, replace them with the EMA weights, run the predictions, then set back the initial weights.
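
The swap-and-restore step could look roughly like the sketch below. It only assumes an object with `get_weights()` and `set_weights()` methods; `swapped_weights` and `ToyModel` are hypothetical names, and `ToyModel` is a stand-in for testing, not a Keras class.

```python
from contextlib import contextmanager


@contextmanager
def swapped_weights(model, ema_weights):
    """Temporarily load ema_weights into model, restoring the originals on exit."""
    original = model.get_weights()
    model.set_weights(ema_weights)
    try:
        yield model
    finally:
        # Runs even if prediction raises, so training weights are never lost.
        model.set_weights(original)


class ToyModel:
    """Hypothetical stand-in exposing get_weights/set_weights and predict."""

    def __init__(self, weights):
        self._weights = list(weights)

    def get_weights(self):
        return list(self._weights)

    def set_weights(self, weights):
        self._weights = list(weights)

    def predict(self, x):
        return sum(w * x for w in self._weights)


model = ToyModel([1.0, 2.0])
with swapped_weights(model, [0.5, 0.5]) as m:
    ema_prediction = m.predict(2.0)  # computed with the averaged weights
restored = model.get_weights()       # training weights are back in place
```

Putting the restore in a `finally` clause is a design choice that keeps the training weights safe if evaluation fails partway through.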

@fchollet
Collaborator

Note that Polyak averaging in general is useful and widespread enough that it does deserve a compile option.

@crowsonkb
Contributor

Polyak averaging proper takes an equal-weighted average of every past iterate - it's not an EMA. (See Polyak's paper.) I came across the EMA idea in the Adam paper. When I implemented it for image synthesis I just picked Polyak's formulation because it was easier for me to implement and with only a few hundred iterations the difference was negligible.

Likewise it seems there are two different things people do with the averaged weights: either they use them for validation and inference or they use them as part of an ensemble with the unaveraged weights and therefore want to validate and save both.
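
To make the contrast with an EMA concrete, here is a small sketch of the equal-weighted running average, written incrementally so that no past iterates need to be stored. The name `polyak_update` and the list-of-floats representation are illustrative assumptions.

```python
def polyak_update(avg, weights, t):
    """Running equal-weighted mean after seeing the t-th iterate (t starts at 1)."""
    return [a + (w - a) / t for a, w in zip(avg, weights)]


iterates = [[1.0], [2.0], [3.0], [4.0]]
avg = [0.0]  # overwritten at t = 1, so the initial value does not matter
for t, w in enumerate(iterates, start=1):
    avg = polyak_update(avg, w, t)

print(avg)  # the plain mean of all four iterates
```

Unlike the EMA, every iterate contributes equally here, so early (and usually worse) weights are never forgotten; over a few hundred iterations the two can behave similarly, as noted above.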

@soheilb
Author

soheilb commented Sep 11, 2016

@crowsonkb Thanks for pointing to the relevant section of the Adam paper.

@fchollet I like the polyak_averaging option in compile, maybe accepting three values: 0 (default, no averaging), 1 (EMA with a decay parameter), 2 (simple average). Should we compile a function separate from train_function to make sure the EMA ops are performed only after the weights are updated by the optimizer? Any thoughts regarding @kuza55's comment on device usage for these operations? In the TF implementation, EMA operations are performed on the same device as the original variables to reduce communication bandwidth.

@kuza55
Contributor

kuza55 commented Sep 11, 2016

So, I sketched out an implementation of Polyak averaging as part of model.compile ( master...kuza55:polyak ) and I don't particularly like it, since the actual need for it to be in the core seems pretty minimal. The main reason to put it in the core would be so that it can easily get all the weights and register its update op, and so that it can switch out the weights for validation/inference.

Here is an alternate proposal: A Model wrapper that takes a model, makes a copy with unshared weights (how? maybe via the model de/serialization code?), and returns a model that wraps these two models, sets its own update ops and uses the K.learning_phase() property to decide whether it should run the learning model and then update the Polyak model, or whether it should run the Polyak model, or run both and sum the output. You can then easily use it on sub-graphs too.

@stale stale bot added the stale label May 23, 2017
@stale

stale bot commented May 23, 2017

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs, but feel free to re-open it if needed.

@joeyearsley
Contributor

@fchollet Is this in the roadmap for the future?

@stale stale bot removed the stale label Jun 21, 2017
@alalbiol

alalbiol commented Jul 3, 2017

I also find this feature very interesting. In the meantime, I found an intermediate solution for my needs.
Since the averaged model is only useful at the end of training, I only start averaging the weights N steps into training, which reduces the overhead.

In my case I use a train_on_batch loop because my data does not fit in memory, so it is really easy for me to apply the averaging after the first N steps, for an additional number of steps.

Another advantage of this approach (for me) is that I do not have much GPU memory, so averaging the parameters on the CPU helps.
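
A sketch of this delayed-averaging loop, under stated assumptions: the comment does not say which kind of average is used, so an EMA is assumed here, and `weight_stream` is a hypothetical stand-in for reading the model's weights after each training step (for example via `model.get_weights()` after `train_on_batch`).

```python
def late_averaging(weight_stream, start_step, decay=0.99):
    """Consume per-step weights; begin the EMA only once start_step is reached.

    Returns the averaged weights, or None if start_step was never reached.
    """
    shadow = None
    for step, weights in enumerate(weight_stream, start=1):
        if step < start_step:
            continue  # early iterates are skipped entirely, costing nothing
        if shadow is None:
            shadow = list(weights)  # first averaged step: copy the weights
        else:
            shadow = [decay * s + (1.0 - decay) * w
                      for s, w in zip(shadow, weights)]
    return shadow


# The first two (poor) iterates never enter the average.
stream = [[10.0], [10.0], [1.0], [1.0]]
averaged = late_averaging(stream, start_step=3)
print(averaged)
```

Because the shadow copy here is an ordinary Python list, it lives in host memory rather than on the GPU, which matches the memory argument made above.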

@stale

stale bot commented Oct 1, 2017

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.

@stale stale bot added the stale label Oct 1, 2017
@stale stale bot closed this as completed Oct 31, 2017
@lishen

lishen commented Dec 13, 2017

Stumbled upon this issue while searching online. This is a really useful feature. There is already a user-implemented callback here: https://github.com/alno/kaggle-allstate-claims-severity/blob/master/keras_util.py. I hope this can be officially supported.

@SebiSebi

Any updates on this one? Would be really helpful :)

@alexhernandezgarcia

I think this is one of the most important features that Keras is missing!

@hfawaz

hfawaz commented Jun 4, 2019

Yes this is important indeed.

@veqtor

veqtor commented Jul 9, 2019

I think this is almost required for many GAN architectures such as BigGAN and StyleGAN, so it would be close to a requirement for using tf.keras in GAN research.

@Squadrick

Squadrick commented Aug 29, 2019

It's been implemented in TensorFlow Addons: tfa.optimizers.MovingAverage.

Compatible with tf.keras and TF 2.0.

@veqtor

veqtor commented Sep 10, 2019

Great stuff, can an EMA-model exist alongside the model being trained somehow? (in certain GANs such as StyleGAN you train the Discriminator on the EMA of the Generator and vice-versa)
