Avoiding duplicated computations by having a single observable model #1855
Conversation
I'm looking into the first point, decoupling the computation of the observables from their masking and loss. Some questions @goord. Currently in
What does this rotation do? And is it possible to change this to (at the cost of concatenating masks inside the subsequent loss layer):
Also, I've probably seen this before, but I'm still confused why there is both a mask applied directly to the observables, in
Yes, these rotations are triggered by the 'data_transformation_tr', which is used if you represent the experimental data in a covariance-diagonal basis I guess. I'm not sure when this is actually used, and I'm not sure whether this code path is properly tested in the trvl-mask-layers branch...
The mask in LossInvCovmat is not used for masking training/validation, I think.
This is a rewrite of line 289 on master. I don't know why the diagonal basis is not used for the experimental output layer, perhaps @scarlehoff or @RoyStegeman can explain it to us. If you look at n3fit_data.py you can see that in the diagonal basis the training and validation covmats are being masked and then inverted, but the full covmat inverse (inv_true) is computed in the old basis.
Because when they were separated it didn't really matter, and it is decoupled from training/validation (the idea of diagonalising is to be able to do the split while removing the correlations between training and validation within a dataset).
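As a minimal numpy sketch of that idea (illustrative only, not the n3fit code): rotating a dataset to the eigenbasis of its covariance matrix diagonalises the covariance, so a tr/val split applied to the rotated points no longer cuts across correlations within the dataset.

import numpy as np

# Toy covariance matrix for one dataset (symmetric, positive definite).
rng = np.random.default_rng(0)
a = rng.normal(size=(5, 5))
covmat = a @ a.T + 5 * np.eye(5)

# Eigendecomposition: the columns of v define the rotation to the diagonal basis.
w, v = np.linalg.eigh(covmat)

# In the rotated basis the covariance is diagonal, i.e. the rotated points are
# uncorrelated, and a random tr/val mask can be applied point by point.
covmat_diag = v.T @ covmat @ v
assert np.allclose(covmat_diag, np.diag(w))

tr_mask = rng.random(5) > 0.25
print(tr_mask, np.diag(covmat_diag)[tr_mask])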
Hm I don't fully understand, but is it ok to uniformize this? I now calculate all observables without any mask once, so using the same settings, and then mask the folds and the tr/val split afterwards.
You can try the diag-DIS runcard to check the observable rotation: DIS_diagonal_l2reg_example.yml.txt
Seems to work fine, and gives the same results as trvl-mask-layers.
The chi2 (should not) depend on the diagonalization, since the total covmat is only used to report the total chi2; nobody cared about diagonalising that because it was not needed.
Yes, see above.
Ok perfect, thanks :)
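For reference, assuming the rotation R to the diagonal basis is orthogonal (as it is for an eigendecomposition of the covariance matrix), the total chi2 is indeed invariant:

\chi^2 = (D - T)^T C^{-1} (D - T) = \big(R(D - T)\big)^T \big(R C R^T\big)^{-1} \big(R(D - T)\big)

since R^T R = 1, so rotating data, predictions and covariance matrix together leaves the reported chi2 unchanged.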
@scarlehoff @Radonirinaunimi Right now I'm thinking that to remove the repeated calculation of observables, the easiest is to combine the training and validation models into one model that computes both of their losses, adding a
Does that sound ok? Of course it would be nicer to instead just have one model and 3 different losses, but that will take longer to implement.
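A rough Keras sketch of such a combined model (all names and shapes here are made up for illustration; in n3fit the losses are themselves layers, but the point is a single forward pass feeding both a training and a validation output):

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

n_points = 20
x = layers.Input(shape=(4,), name="xgrid")
pdf = layers.Dense(16, activation="tanh", name="pdf")(x)
observable = layers.Dense(n_points, name="observable")(pdf)  # computed once

# Complementary boolean masks for the tr/val split of the same observable.
tr_mask = np.random.rand(n_points) > 0.25
val_mask = ~tr_mask
tr_out = layers.Lambda(lambda o: tf.boolean_mask(o, tr_mask, axis=1), name="tr")(observable)
val_out = layers.Lambda(lambda o: tf.boolean_mask(o, val_mask, axis=1), name="val")(observable)

combined = Model(inputs=x, outputs=[tr_out, val_out])
# Zero weight for the validation output: it is reported every epoch but does
# not drive the optimisation.
combined.compile(optimizer="adam", loss="mse", loss_weights=[1.0, 0.0])

Calling combined.fit(x_data, [y_tr, y_val]) would then evaluate both losses in the same forward pass, with only the training one entering the gradient.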
I don't understand what you mean. The easiest way looks much more complex to me since you need to filter out things and any bug there will "break" the validation.
Also, I'm not completely sure you can achieve your goal here? You need to compute everything twice for every epoch just the same.
What I mean is we would have one model of the form (say we only have the DEUTERON observable). This, I think, is what requires the least changes. I haven't worked it all out yet, but in the end the
In this PR I've already decoupled the computation of the observable from the masking+loss, that was quite simple and gives identical results. The tricky part is how to use that to avoid this repeated computation of the observable (and PDF).
Yes, but your goal is to reduce the number of calls. However, you will need to call the model once for the optimization. So there is no way to avoid the repeated computation of the observable.
Ah, I hadn't thought about that, you're right that conventionally the validation at step t is computed after training for t steps. My proposal would have a shift by one step (epoch) with respect to this convention, in effect computing the validation losses from step t-1 at step t. But I don't think that's a big deal, right? Changes should be tiny from one step to the next.
It is a big deal because that tiny change can move you from a physical to an unphysical situation by means of positivity, integrability and probably even normalisation. But also, in general, it's simply not correct, since the epoch at which you wanted to stop was the previous one.
True, but that should be easy to solve. Just save the weights at every step and when the stopping condition hits, instead of just stopping, revert to the previous epoch.
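A sketch of what that revert could look like, as a hypothetical Keras callback (not the existing Stopping object): keep a copy of the previous epoch's weights and restore it when the stopping condition fires.

import tensorflow as tf

class RevertOneStep(tf.keras.callbacks.Callback):
    """Store the weights of the previous epoch and restore them on stop."""

    def __init__(self):
        super().__init__()
        self._previous_weights = None
        self._current_weights = None

    def on_epoch_end(self, epoch, logs=None):
        # Shift the stored copies: what was "current" becomes "previous".
        self._previous_weights = self._current_weights
        self._current_weights = [w.copy() for w in self.model.get_weights()]

    def revert(self):
        # Called by the stopping logic: go back exactly one epoch.
        if self._previous_weights is not None:
            self.model.set_weights(self._previous_weights)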
Check the speedup you would get in a fit and whether it becomes much more complicated. The forward pass is not the heaviest part of the calculation and a 50% speedup there won't translate to a speedup of the entire fit. So unless it makes a big difference I'm strongly against adding this. It adds a lot of bug cross-section and complicates quite a bit the use of custom optimizers.
If the final code is not very complex I'd be happy with it. From what you explain in the comments it looks complicated, especially the idea of the internal filtering of losses. The tensorboard profile is not enough to know what the effect on the fit would be (especially if it is old, many things have changed in the meanwhile). Note that you will still need to wait and check positivity, check the chi2, etc. Btw, going back to the experimental model: note that the comparison data in the experimental model is different from training/validation, so it doesn't really matter how you do it, you need to recompute the losses for that one.
The bulk of the work is done, and I've tested that the speedup is indeed about 33% (tested with 1 replica on CPU with NNPDF40_nnlo_as_01180_1000.yml, but it should be the same for e.g. 100 replicas on GPU, as it just skips about 33% of the work). Outcomes on a small runcard I've tested are identical training chi2s, identical validation chi2s but shifted 1 step, and identical final chi2s (train/val/exp) after reverting the weights one step, which was trivial to do (only at the cost of storing 2 copies of the weights, but they are tiny). The structure now is that we have one
Note that all losses can be given all observables as input, Keras will just select only the one it needs. It needs some minor tweaks and more testing, but before going into that I would like to know if you agree now with this approach broadly speaking @scarlehoff.

timing script:

import time
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
import numpy as np
# A tiny branch and a heavy (300k-unit) branch sharing the same set of inputs.
i_small = Input(shape=(1,))
i_big = Input(shape=(1000,))
d_small = Dense(1)(i_small)
d_large = Dense(300_000)(i_big)
d2_large = Dense(300_000)(i_big)
d3_large = Dense(1)(d2_large)
# Full model (both outputs), small model (only the cheap output, both inputs),
# and smallest model (only the cheap output and input).
model = Model(inputs=[i_small, i_big], outputs=[d_small, d3_large])
model_small = Model(inputs=[i_small, i_big], outputs=[d_small])
model_smallest = Model(inputs=[i_small], outputs=[d_small])
model.compile(optimizer="adam", loss="mse")
model_small.compile(optimizer="adam", loss="mse")
model_smallest.compile(optimizer="adam", loss="mse")
x_small = np.random.rand(1000, 1)
x_big = np.random.rand(1000, 1000)
y_small = np.random.rand(1000, 1)
y_big = np.random.rand(1000, 1)
def timefit(model, xs, ys):
    # Time a single one-epoch fit call.
    start = time.time()
    model.fit(xs, ys)
    end = time.time()
    print(f"Time for fit: {end-start:.5} s")
# Baseline: only the small branch and input.
timefit(model_smallest, x_small, y_small)
# Both inputs are fed, but only the small output (and target) is fitted.
timefit(model_small, [x_small, x_big], y_small)
# Full model: the heavy branch is computed and fitted as well.
timefit(model, [x_small, x_big], [y_small, y_big])
It does look better than I expected. I cannot commit to having a look before late next week though. What is the chance of separating this from trvl-mask-layers? If that would be too much, can we focus on finalizing / merging #1788 in the next few weeks and then go back here?
Rebasing to master would be difficult, but waiting for trvl-mask-layers to be merged is fine, that should have priority anyway.
Then let's do that. Next week I'll have a look (is it finished?) at trvl-mask-layers. And then we go back to this one.
Perfect! Indeed it is.
Goal
The goal of this PR is to speed up the code by a factor of 2 through a refactoring that avoids redoing the same computations.
Currently there are separate training and validation models.
At every training step the validation model is run from scratch on x inputs, while the only difference with the training model is in the final masking just before computing the loss.
This will hopefully also improve readability. From an ML point of view the current naming is very confusing. Instead of having a training model and a validation model, we can have a single observable model, and on top of that a training and a validation loss (just talking about names here; they may still be MetaModels).
The same holds of course for the experimental model, except that there is no significant performance cost there. But for consistency and readability let's try to treat that on the same footing.
This PR branches off of trvl-mask-layers because that PR changes the masking. That one should be merged before this one.
Current implementation
Models creation
The models are constructed in ModelTrainer._model_generation, specifically in the function _pdf_injection, which is given the pdfs, a list of observables and a corresponding list of masks.
For the different "models", both the values of the masks and the list of observables change, as not all models use all observables, in particular the positivity and integrability ones.
This function just calls the observables on the pdfs with the mask as argument.
Each observable's call method, defined here, does two steps: 1. compute the observable, 2. apply the mask and compute the loss.
Models usage
Once they are created, the training model is, obviously, used for training here.
The validation model is used to initialize the Stopping object. The only thing that happens there is that its compute_losses method is called. Similarly for the experimental model, where it is called directly in the ModelTrainer (here).
Changes proposed
1. Decouple the computation of the observable from the masking and loss, which currently both happen in the ObservableWrapper class. Just remove those parts from ObservableWrapper, and create perhaps an ObservableLoss layer that does this.
2. Compute all observables once in a single observables_model.
3. Apply the training masks and losses on top of observables_model, to create a model identical to the current training model.
4. Replace self.experimental["model"].compute_losses() with experimental_loss(observables).
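As an illustrative sketch of what such an ObservableLoss layer could look like (names and signatures are placeholders, not the n3fit API; the real chi2 uses the masked inverse covariance matrix as in LossInvCovmat):

import tensorflow as tf
from tensorflow.keras import layers

class ObservableLoss(layers.Layer):
    """Mask the predictions and compute a chi2-like loss against the data.

    Placeholder for the masking + loss step that currently lives inside the
    observable's call method.
    """

    def __init__(self, data, invcovmat, mask, **kwargs):
        super().__init__(**kwargs)
        self.data = tf.constant(data, dtype=tf.float32)            # central values
        self.invcovmat = tf.constant(invcovmat, dtype=tf.float32)  # inverse of the masked covmat
        self.mask = tf.constant(mask, dtype=tf.bool)               # tr/val/fold mask

    def call(self, observable):
        # observable: (batch, n_points) predictions computed once upstream
        diff = tf.boolean_mask(observable - self.data, self.mask, axis=1)
        return tf.einsum("bi,ij,bj->b", diff, self.invcovmat, diff)

The training, validation and experimental losses would then be three instances of this kind of layer applied to the same observables_model output, differing only in their mask, data and covariance matrix.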