Avoiding duplicated computations by having a single observable model #1855

Draft · APJansen wants to merge 2 commits into master from merge-observable-splits

Conversation

APJansen (Collaborator)

Goal

The goal of this PR is to speed up the code by a factor of 2 through a refactoring that avoids redoing the same computations.
Currently there are separate training and validation models.
At every training step the validation model is run from scratch on the x inputs, while its only difference from the training model is the final masking just before computing the loss.

This will hopefully also improve readability. From an ML point of view the current naming is very confusing. Instead of having a training model and a validation model, we can have a single observable model, and on top of that a training and a validation loss. (Just talking about names here; they may still be MetaModels.)

The same holds of course for the experimental model, except that there is no significant performance cost there. But for consistency and readability let's try to treat that on the same footing.

This PR branches off of trvl-mask-layers because that PR changes the masking. That one should be merged before this one.

Current implementation

Models creation

The models are constructed in ModelTrainer._model_generation.
Specifically in the function _pdf_injection, which is given the pdfs, a list of observables and a corresponding list of masks.
Between the different "models", not only the values of the masks but also the list of observables changes, as not all models use all observables, in particular the positivity and integrability ones.
This function just calls the observables on the pdfs with the mask as argument.
And each observable's call method, defined here, does two steps: 1. compute the observable, 2. apply the mask and compute the loss.
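
Schematically, the current coupling of those two steps looks like the sketch below. All names here (fk_table, invcovmat and friends) are illustrative stand-ins, not the actual n3fit API:

# Minimal sketch of the current coupling: a single call both computes the
# observable and applies the mask + loss (illustrative names only).
def observable_with_loss(pdf_values, fk_table, mask, invcovmat, data):
    obs = fk_table @ pdf_values        # step 1: compute the observable
    diff = (obs - data)[mask]          # step 2a: apply the tr/val mask
    return diff @ invcovmat @ diff     # step 2b: chi2-like loss on the masked points

# toy shapes: 3 data points predicted from a 5-point "pdf"
import numpy as np
pdf_values, fk_table = np.ones(5), np.random.rand(3, 5)
data, mask = np.random.rand(3), np.array([True, False, True])
print(observable_with_loss(pdf_values, fk_table, mask, np.eye(2), data))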

Models usage

Once they are created, the training model is, obviously, used for training here.
The validation model is used to initialize the Stopping object; the only thing that happens there is that its compute_losses method is called. Similarly for the experimental model, where compute_losses is called directly in the ModelTrainer (here).

Changes proposed

  1. Decouple the masking and loss computation from the ObservableWrapper class: remove those parts from ObservableWrapper and perhaps create an ObservableLoss layer that does this.
  2. Apply this pure observable class to the pdfs, for all observables, to create an observables_model.
  3. Create 3 loss models that take all observables as input, apply the masking and selection, and compute the losses.
  4. For the training one, put it on top of the observables_model, to create a model identical to the current training model.
  5. Add the output of the observables_model to the output list of this training model, so these outputs can be reused.
  6. The validation and experimental models can then be discarded; instead we have validation and experimental losses applied to the output of the observables_model, so e.g. self.experimental["model"].compute_losses() can be replaced with experimental_loss(observables). (See the sketch below.)
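
In terms of the toy functions sketched above (again illustrative names only, not the actual layer classes), the decoupling would look like:

# Proposed decoupling: the observable is computed once, the masking and loss
# move into separate loss "heads".
def pure_observable(pdf_values, fk_table):
    return fk_table @ pdf_values                 # no mask, no loss

def masked_loss(obs, mask, invcovmat, data):
    diff = (obs - data)[mask]                    # masking/selection now lives here
    return diff @ invcovmat @ diff

# obs = pure_observable(pdf_values, fk_table)    # shared by all three losses
# training_loss     = masked_loss(obs, mask_tr,  invcov_tr,  data)
# validation_loss   = masked_loss(obs, mask_val, invcov_val, data)
# experimental_loss = masked_loss(obs, full_mask, invcov_exp, data)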

@APJansen (Collaborator, Author)

I'm looking into the first point, decoupling the computation of the observables from their masking and loss. Some questions for @goord:

Currently in _generate_experimental_layer this happens:

  1. observables computed
  2. masks applied one by one
  3. masked observables concatenated
  4. some rotation
  5. loss computed

What does this rotation do?

And is it possible to change this to the following, at the cost of concatenating masks inside observable_generator?
In the observable layer:

  1. observables computed
  2. UNmasked observables concatenated
  3. (rotation?)

subsequent loss layer:

  1. masks applied to concatenated observables
  2. loss computed
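
(The reordering relies on masking commuting with concatenation; a quick illustrative numpy check, with made-up shapes:)

import numpy as np
obs_a, obs_b = np.random.rand(5), np.random.rand(3)
mask_a = np.array([1, 0, 1, 1, 0], dtype=bool)
mask_b = np.array([0, 1, 1], dtype=bool)
masked_then_concatenated = np.concatenate([obs_a[mask_a], obs_b[mask_b]])
concatenated_then_masked = np.concatenate([obs_a, obs_b])[np.concatenate([mask_a, mask_b])]
assert np.allclose(masked_then_concatenated, concatenated_then_masked)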

Also, I've probably seen this before, but I'm still confused about why a mask is applied both directly to the observables in _generate_experimental_layer and also inside the LossInvcovmat itself?

@goord (Collaborator) commented Nov 27, 2023

Yes, these rotations are triggered by the 'data_transformation_tr', which is used if you represent the experimental data in a covariance-diagonal basis, I guess. I'm not sure when this is actually used, and I'm not sure whether this code path is properly tested in the trvl-mask-layers branch...

@goord (Collaborator) commented Nov 27, 2023

The mask in LossInvCovmat is not used for masking training/validation I think.

@APJansen (Collaborator, Author)

@goord Why does the experimental output have rotation=None while the others have rotation=obsrot, around here? Is that intentional? It interferes a bit with how I thought the observable computation would decouple from the masked loss.

@goord (Collaborator) commented Jan 12, 2024

> @goord Why does the experimental output have rotation=None while the others have rotation=obsrot, around here? Is that intentional? It interferes a bit with how I thought the observable computation would decouple from the masked loss.

This is a rewrite of line 289 in master. I don't know why the diagonal basis is not used for the experimental output layer; perhaps @scarlehoff or @RoyStegeman can explain it to us.

If you look at n3fit_data.py you can see that in the diagonal basis, the training and validation covmats are being masked and then inverted, but the full covmat inverse (inv_true) is computed in the old basis.

@scarlehoff (Member)

Because when they were separated it didn't really matter, and it is decoupled from training/validation (the idea of diagonalising is to be able to do the split while removing the correlations between training and validation within a dataset).

@APJansen (Collaborator, Author)

Hm I don't fully understand, but is it ok to uniformize this? I now calculate all observables without any mask once, so using the same settings, and then mask the folds and the tr/val split afterwards.
It's passing all the tests and also giving identical results for the main runcard.

@goord (Collaborator) commented Jan 12, 2024

> Hm I don't fully understand, but is it ok to uniformize this? I now calculate all observables without any mask once, so using the same settings, and then mask the folds and the tr/val split afterwards. It's passing all the tests and also giving identical results for the main runcard.

You can try the diag-DIS runcard to check the observable rotation: DIS_diagonal_l2reg_example.yml.txt

@APJansen (Collaborator, Author)

Seems to work fine, and gives the same results as trvl-mask-layers.

@scarlehoff (Member)

> I don't fully understand

The chi2 does not (should not) depend on the diagonalization. Since the total covmat is only used to report the total chi2, nobody cared about diagonalising it because it was not needed.

> but is it ok to uniformize this?

Yes because see above.

@APJansen (Collaborator, Author)

Ok perfect, thanks :)

@APJansen (Collaborator, Author)

@scarlehoff @Radonirinaunimi
Positivity is included in the validation model. I remember we discussed this before, and if I remember correctly there was some disagreement on whether this was necessary or not, is that right?
If I remove it, I get an error from this line, which can be fixed by changing fitstate.validation to fitstate._training, after which it runs normally (though I haven't done any comparisons).

Right now I'm thinking that, to remove the repeated calculation of observables, the easiest approach is to combine the training and validation models into one model that computes both of their losses, adding a "_tr" or "_val" postfix and filtering as appropriate when summing to get the final train/val losses. The experimental one can stay separate, as the performance loss there is negligible.

Does that sound ok?

Of course it would be nicer to instead just have one model and 3 different losses, but that will take longer to implement.

@scarlehoff (Member)

I don't understand what you mean. The easiest way looks much more complex to me since you need to filter out things and any bug there will "break" the validation.

@scarlehoff (Member)

Also, I'm not completely sure you can achieve your goal here?

You need to compute everything twice for every epoch just the same.

@APJansen (Collaborator, Author)

What I mean is we would have one model of the form (say we only have the DEUTERON observable) x -> pdf -> DEUTERON -> (DEUTERON_tr, DEUTERON_val), where the tuple holds two separate layers containing the respective losses, and DEUTERON by itself is the full observable, without any mask (the computation up to and including that layer is the one we don't want to repeat).

This, I think, is what requires the fewest changes. I haven't worked it all out yet, but in the end the Stopping class won't need a separate validation model, and we don't need this compute_losses in MetaModel; it should just receive all the losses, both train and val, inside the history dict. And the default_loss defined above MetaModel would need to check whether the key ends in "_tr" and give 0 otherwise.
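
A toy version of that structure (purely illustrative, not n3fit code; the layer names, shapes, fake data and the loss_weights-based filtering are all assumptions used to show the shape of the idea):

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Lambda
from tensorflow.keras.models import Model

ndata = 6
data = tf.constant(np.random.rand(ndata), dtype=tf.float32)   # fake DEUTERON data
mask_tr = tf.constant([1., 1., 0., 1., 0., 1.])                # hypothetical tr/val split

x = Input(shape=(3,), name="x")
pdf = Dense(8, activation="tanh", name="pdf")(x)               # stand-in for the PDF network
deuteron = Dense(ndata, name="DEUTERON")(pdf)                  # full observable, computed once

def chi2_head(mask, name):
    # masked sum of squared residuals, standing in for the masked loss layers
    return Lambda(lambda o: tf.reduce_sum(mask * (o - data) ** 2, axis=-1, keepdims=True), name=name)

loss_tr = chi2_head(mask_tr, "DEUTERON_tr")(deuteron)
loss_val = chi2_head(1.0 - mask_tr, "DEUTERON_val")(deuteron)
model = Model(x, [loss_tr, loss_val])

# The outputs already are the losses; the "_tr" filter is expressed here through
# loss_weights, so the "_val" value is still reported in the history but does not
# contribute to the optimisation.
model.compile(
    optimizer="adam",
    loss=lambda y_true, y_pred: tf.reduce_mean(y_pred),
    loss_weights={name: 1.0 if name.endswith("_tr") else 0.0 for name in model.output_names},
)

dummy = np.zeros((16, 1))
history = model.fit(np.random.rand(16, 3), {"DEUTERON_tr": dummy, "DEUTERON_val": dummy}, verbose=0)
print(history.history.keys())  # contains both DEUTERON_tr_loss and DEUTERON_val_loss

Here the "_tr"/"_val" filtering is expressed through loss_weights rather than a modified default_loss; that is just one way to keep the validation values visible in the history while excluding them from the gradient.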

@APJansen (Collaborator, Author)

In this PR I've already decoupled the computation of the observable from the masking+loss, that was quite simple and gives identical results. The tricky part is how to use that to avoid this repeated computation of the observable (and PDF).

@scarlehoff (Member) commented Jan 12, 2024

> where the tuple holds two separate layers containing the respective losses

Yes, but your goal is to reduce the number of calls. However, you will need to call the model once for the optimization.
Then the weights are updated.
Then you call it a second time to compute the validation for that step.
Then you check whether the positivity passes and whatnot.

So there is no way to avoid the repeated computation of the observable.

@APJansen (Collaborator, Author)

Ah, I hadn't thought about that; you're right that conventionally the validation at step t is computed after training for t steps. My proposal would introduce a shift of one step (epoch) with respect to this convention, in effect computing the validation losses from step t-1 at step t. But I don't think that's a big deal, right? Changes should be tiny from one step to the next.
This should give a 50% speedup (not 100%, because the validation only has a forward pass while the training also has a backward pass); I think that's very much worth it.
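
For reference, a rough cost model behind these numbers (assuming a backward pass costs about as much as a forward pass):

forward = backward = 1.0
current = forward + backward + forward    # training step plus a separate validation forward pass
proposed = forward + backward             # validation reuses the training forward pass
print(f"work saved:      {1 - proposed / current:.0%}")   # ~33%
print(f"throughput gain: {current / proposed - 1:.0%}")   # ~50%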

@scarlehoff (Member)

It is a big deal because that tiny change can move you from a physical to an unphysical situation by means of positivity, integrability and probably even normalisation.

But also, in general, it's simply not correct since the epoch at which you wanted to stop was the previous one.

@APJansen (Collaborator, Author)

True, but that should be easy to solve: just save the weights at every step and, when the stopping condition hits, instead of just stopping, revert to the previous epoch.
The weights are tiny, so I think that shouldn't be a problem.
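
Something along these lines, assuming the standard Keras get_weights/set_weights interface (class and method names here are illustrative, not the actual Stopping code):

# Illustrative rollback of one epoch on stopping.
class RollbackStopping:
    def __init__(self, model):
        self.model = model
        self.previous_weights = None       # weights are small, so keeping a copy is cheap

    def on_epoch_end(self, stop_now):
        if stop_now and self.previous_weights is not None:
            # the stopping criterion refers to the previous epoch, so revert to it
            self.model.set_weights(self.previous_weights)
            return True
        self.previous_weights = self.model.get_weights()
        return False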

@scarlehoff (Member)

Check the speed up you would get in a fit and whether it doesn't become much more complicated.

The forward pass is not the heaviest part of the calculation and a 50% speedup there won't translate into a speedup of the entire fit. So unless it makes a big difference I'm strongly against adding this. It adds a lot of bug cross-section and complicates the use of custom optimizers quite a bit.

@APJansen (Collaborator, Author)

This old tensorboard profile illustrates the speedup. The gaps will be mostly removed by epochs-to-batches. Of what remains, the validation step takes more than 50% of the time of the training step.
[tensorboard profile screenshot]

While I didn't think of the issue you mentioned, I still think it should be possible with minimal addition of complexity (it removes some and adds some).
I'll have a go and then we can see if it's worth it.

@scarlehoff (Member) commented Jan 12, 2024

If the final code is not very complex I'd be happy with it. From what you explain in the comments it looks complicated, especially the idea of the internal filtering of losses.

The tensorboard profile is not enough to know what the effect on the fit would be (especially since it is old; many things have changed in the meantime). Note that you will still need to wait and check positivity, check the chi2, etc.

Btw, going back to the experimental model. Note that the comparison data in the experimental model is different from training/validation so it doesn't really matter how you do it, you need to recompute the losses for that one.

@APJansen force-pushed the merge-observable-splits branch 4 times, most recently from a8e57c8 to 3d26ce0 on January 19, 2024 13:26
@APJansen (Collaborator, Author)

The bulk of the work is done, and I've tested that the speedup is indeed about 33% (tested with 1 replica on CPU with NNPDF40_nnlo_as_01180_1000.yml, but it should be the same for e.g. 100 replicas on GPU, as it just skips about 33% of the work).

Outcomes on a small runcard I've tested are identical training chi2s, identical validation chi2s but shifted 1 step, and identical final chi2's (train/val/exp) after reverting the weights one step, which was trivial to do (only at the cost of storing 2 copies of the weights, but they are tiny).

The structure now is that we have one ModelTrainer.observables_model that computes all observables, and a dictionary ModelTrainer.losses with keys "training", "validation", "experimental", and values for each are again dictionaries of per experiment losses, appropriately masked and filtered for each split.

Note that all losses can be given all observables as input; Keras will just select the ones each of them needs.
Also, we can include the experimental observables in the same model without performance cost, as tensorflow will realise during training that those outputs aren't being used in the loss and avoid computing them. I verified this using the script below, where the small and smallest model have identical training times, while the big one takes 20 times longer.

It needs some minor tweaks and more testing, but before going into that I would like to know if you agree now with this approach broadly speaking @scarlehoff.
I've also cleaned up the stopping module a bit more, where it was needed for the main changes here, though certainly more can be done. (FitState and FitHistory could at this point be easily removed entirely, but this PR is already getting too big)

timing script
import time
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
import numpy as np

i_small = Input(shape=(1,))
i_big = Input(shape=(1000,))

d_small = Dense(1)(i_small)          # cheap head on the small input
d_large = Dense(300_000)(i_big)      # expensive head (not used in any model below)
d2_large = Dense(300_000)(i_big)     # expensive head feeding d3_large
d3_large = Dense(1)(d2_large)

# "model" outputs both heads, "model_small" declares the big input but only outputs
# the cheap head, "model_smallest" drops the unused input altogether
model = Model(inputs=[i_small, i_big], outputs=[d_small, d3_large])
model_small = Model(inputs=[i_small, i_big], outputs=[d_small])
model_smallest = Model(inputs=[i_small], outputs=[d_small])

model.compile(optimizer="adam", loss="mse")
model_small.compile(optimizer="adam", loss="mse")
model_smallest.compile(optimizer="adam", loss="mse")

x_small = np.random.rand(1000, 1)
x_big = np.random.rand(1000, 1000)
y_small = np.random.rand(1000, 1)
y_big = np.random.rand(1000, 1)

def timefit(model, xs, ys):
    start = time.time()
    model.fit(xs, ys)
    end = time.time()
    print(f"Time for fit: {end-start:.5} s")

timefit(model_smallest, x_small, y_small)
timefit(model_small, [x_small, x_big], [y_small, y_big])
timefit(model, [x_small, x_big], [y_small, y_big])  # only this one pays for the expensive head

@scarlehoff (Member)

It does look better than I expected:

[screenshot, 2024-01-19 14:34]

I cannot commit to having a look before late next week though. What is the chance of separating this from trvl-mask-layers and having it branch off master?

If that would be too much, can we focus on finalizing / merging #1788 in the next few weeks and then go back here?

@APJansen (Collaborator, Author)

Rebasing to master would be difficult, but waiting for trvl-mask-layers to be merged is fine, that should have priority anyway.

@scarlehoff (Member)

Then let's do that. Next week I'll have a look at trvl-mask-layers (is it finished?) and rebase, squash or whatever to update the 2-year-old commits. I think you have been keeping it up to date with master so it should be trivial.

And then we go back to this one.

@APJansen (Collaborator, Author)

Perfect! Indeed trvl-mask-layers is tested and ready to review, and I just merged master into it today.

@goord force-pushed the trvl-mask-layers branch 5 times, most recently from de3b55c to f794be6 on February 15, 2024 19:22
@APJansen force-pushed the trvl-mask-layers branch 3 times, most recently from 6bd0fb6 to c2e4935 on February 20, 2024 12:17
@APJansen force-pushed the merge-observable-splits branch 2 times, most recently from c5d97c9 to 374a0a0 on February 21, 2024 14:29
Base automatically changed from trvl-mask-layers to master on February 22, 2024 12:32
@APJansen force-pushed the merge-observable-splits branch 2 times, most recently from f696bfd to 0c8e095 on February 24, 2024 15:21
@APJansen force-pushed the merge-observable-splits branch from f98efb0 to 54f5e9a on March 5, 2024 17:32