
Multi dense logistics #1818

Merged: 55 commits from the multi-dense-logistics branch into master on Dec 4, 2023

Conversation

@APJansen (Collaborator) commented Oct 16, 2023

The aim of this PR is to make all the remaining changes necessary to enable the implementation of the MultiDense layers, without changing the numerics at all.
This boils down to stacking the different replicas as soon as possible (i.e. here) and making the changes that follow from that; a toy illustration of the stacking is sketched below.
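As a toy illustration of what "stacking the replicas" means (shapes and variable names are purely illustrative, not n3fit code):

```python
import numpy as np

# per-replica outputs of the NN block: one array of shape (batch, n_x, n_flavours) per replica
n_replicas, n_x, n_flavours = 4, 10, 8
per_replica_outputs = [np.random.rand(1, n_x, n_flavours) for _ in range(n_replicas)]

# stack along a new replica axis; everything downstream (preprocessing, MSR, observables)
# then acts on all replicas at once instead of once per replica
stacked = np.stack(per_replica_outputs, axis=1)
assert stacked.shape == (1, n_replicas, n_x, n_flavours)
```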

TODO:

  • photons: see below for a detailed discussion.
  • Fix loading of weights in this block. I'm not sure how that's working at the moment; all replicas read from the same file. Is this something that is actually used, @scarlehoff?
  • verify that hyperopt is working
  • figure out what to do with the line n3pdfs.append(N3PDF(pdf_models, name=f"fold_{k}")) [here]; see below for a detailed discussion

Comments

The biggest change is the addition of three methods to MetaModel: get_replica_weights, set_replica_weights and split_replicas. For now these rely on the different replicas being separate models. Once the MultiDense layer is implemented that won't be the case, and the code in these functions will need to change to extract the right entry along the replica axis of all weights, but the required changes should be limited to these functions.

The reason for split_replicas is that keeping everything as a single model would require many more changes, also in validphys; since the performance difference is negligible, I thought it cleanest to split into separate replicas after training (a minimal sketch of these helpers follows below).
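For illustration, a minimal self-contained sketch of what these three helpers could look like while each replica is still a separate sub-model; the class name and internals are assumptions, not the actual MetaModel code:

```python
import copy


class MultiReplicaModelSketch:
    """Illustrative stand-in for a multi-replica model built from per-replica sub-models."""

    def __init__(self, replica_models):
        # one Keras-style model per replica; a later PR would replace this with a single
        # MultiDense model whose weights carry an explicit replica axis
        self._replica_models = list(replica_models)

    def get_replica_weights(self, i_replica):
        # return a copy of the weights of one replica
        return copy.deepcopy(self._replica_models[i_replica].get_weights())

    def set_replica_weights(self, weights, i_replica):
        # overwrite the weights of one replica
        self._replica_models[i_replica].set_weights(weights)

    def split_replicas(self):
        # hand back the trained single-replica models, e.g. for validphys use after training
        return list(self._replica_models)
```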

Status

Work in Progress.

Currently I'm a bit stuck on this: it runs (at least if I avoid the points above by just running the basic runcard), and during training the results are identical to master. The weights of the single-replica models are also identical to the original model, but the best chi2 reported per replica after training is very different.

Later

Once the above is done, the joining of replicas can be pushed even earlier, to directly after the NN layers, before actually introducing the MultiDense layers. Whether I do that in this PR or in the final one depends on whether it changes the numerics (I think it may).

@APJansen force-pushed the multi-dense-logistics branch from 8fef1b0 to 0c0ee66 on October 16, 2023 at 13:38
@Radonirinaunimi added the Refactoring and enhancement (New feature or request) labels on Oct 16, 2023
@Radonirinaunimi linked an issue on Oct 16, 2023 that may be closed by this pull request
@Radonirinaunimi (Member)

Thanks @APJansen! Could you also please update #1803 accordingly, to indicate which PR addresses which point (for example, this one goes into point (2))? That way it would be easy for people to situate the status of each PR.

@APJansen self-assigned this on Oct 16, 2023
@APJansen (Collaborator, Author)

Yes @Radonirinaunimi, I've updated #1782, for which this is the last preparatory PR. That one is then step 6 in #1803.

@APJansen (Collaborator, Author)

Photons

My understanding of what currently happens in master:

  • An AddPhoton layer is initialized once here (so shared between replicas), with an instance of the photon.compute.Photon class passed as argument (this in turn is initialized with a theory_id, lux_params and a list of replica ids).
  • The Photon object contains integrals, which are used in the MSR layer with a replica index to extract the right one. This is trivial to "parallelize".
  • For each replica the AddPhoton layer is called on the pdf here, with the replica index as an additional argument (at the last stage, after MSR normalization).
  • Since the AddPhoton.register_photon method hasn't been called yet, AddPhoton._pdf_ph is still None, so this does nothing.
  • In model_trainer.py, first the PDF model is created, including the steps above. Then for each replica the AddPhoton layer is extracted (this is the same layer every time!) and register_photon is called on it, here. Each call does some computation for every replica that appears to be the same every time, so it seems this could have been called just once?

Questions to @niclaurenti

  • Is this correct?
  • It seems to me AddPhoton.call does nothing when the model is built. I thought this would mean it doesn't get included in the computational graph, but probably that is wrong? Is it intended to do nothing at first (on the symbolic input), but do something on every subsequent call?
  • AddPhoton.call also seems trivial to parallelize: rather than taking the replica at the specified index of its internal photon values self._pdf_ph and replacing the single-replica PDF with it, just take the whole thing and replace the joint multi-replica PDF with all of them (see the sketch after these questions). Is that right?
  • Is it ok to remove the loop here and call register_photon just once?
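A rough, numpy-only illustration of the parallelization point above (this is not the actual AddPhoton layer; the shapes, the photon flavour slot, and the way the photon is combined with the PDF are all illustrative assumptions):

```python
import numpy as np

n_replicas, n_x, n_flavours = 3, 6, 14
pdf = np.zeros((n_replicas, n_x, n_flavours))   # stacked multi-replica PDF output
photon_values = np.ones((n_replicas, n_x))      # analogue of the precomputed self._pdf_ph
PHOTON_INDEX = 0                                # flavour slot of the photon (purely illustrative)

# per-replica version: pdf[i_replica, :, PHOTON_INDEX] = photon_values[i_replica]
# parallel version: write the photon of every replica in one indexing operation
pdf[:, :, PHOTON_INDEX] = photon_values
```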

@APJansen (Collaborator, Author)

The N3PDF line here

The list of single-replica PDFs is wrapped in an N3PDF object; a list of those, one for each fold, is made and passed to a hyper loss at every trial.

This has the potential to touch a lot of validphys code that I'm not familiar with.
Unless anyone else finds this easy to adapt, perhaps it's best to:

  • pass the multi-replica pdf itself directly to the hyper loss
  • inside the hyper loss, convert this to a list of single replica pdfs wrapped in an N3PDF
  • leave the rest unchanged

This comes at the cost of having to split into individual replicas at every trial, and not having the benefit of parallelization in the computations done there. But this should be a very minor part of the overall runtime.
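A rough sketch of that flow, under the assumption that the hyper loss receives the multi-replica model and splits it per trial; N3PDF is stubbed here so the snippet is self-contained, and the function name n3pdf_for_fold is illustrative rather than existing code:

```python
class N3PDF:
    """Stand-in for the validphys-facing wrapper mentioned above (stubbed for self-containment)."""

    def __init__(self, pdf_models, name="pdf"):
        self.pdf_models = pdf_models
        self.name = name


def n3pdf_for_fold(multi_replica_pdf, k):
    # split into single replicas only here, once per trial; the cost is negligible
    # compared with a full training cycle, and the validphys side stays unchanged
    pdf_models = multi_replica_pdf.split_replicas()
    return N3PDF(pdf_models, name=f"fold_{k}")
```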

Details: what happens to it in the hyper loss (depending on which one is chosen)

  • xplotting_grid(n3pdf, ...) (on one in the list)
  • distance_grids(n3pdfs, ...)
  • n3pdf(input_x) (on one in the list)

Those are all validphys functions, in turn they call:

  • basis.grid_values(n3pdf, ...)
  • n3pdf.stats_class(...)
  • n3pdf.get_members()
  • n3pdf.stats_class(...)

@APJansen force-pushed the multi-dense-logistics branch from b42e815 to 3cbc9e2 on October 18, 2023 at 13:16
@APJansen (Collaborator, Author)

I remember now there was a better reason for having a list of all the single replicas rather than a single one: with a single one you have to make deep copies to avoid overwriting the weights of the previous one. The copying was giving me loads of issues, because of all the extras that MetaModel has over a normal Keras model.

Your solution @scarlehoff of using a generator avoids that, and it was a much simpler change than I thought. It does come at the cost of redoing the work of generating a model for every replica at every fold, and because of the above we cannot just generate one and overwrite its weights. I did a quick test and it takes about 0.1 seconds per replica (sketched below).
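For reference, a hedged sketch of the generator approach; build_replica_model and trained_weights are illustrative placeholders rather than existing functions:

```python
def replica_model_generator(n_replicas, build_replica_model, trained_weights):
    """Yield a freshly built single-replica model per replica, loaded with its trained weights."""
    for i_replica in range(n_replicas):
        model = build_replica_model()  # rebuilding took ~0.1 s per replica in a quick test
        model.set_weights(trained_weights[i_replica])
        yield model
```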

I don't have a strong preference for either approach; the performance cost is negligible compared to a full training cycle. Do you still prefer the generator?

@scarlehoff (Member)

Yes, I think ~2 minutes per fold in a worst case (1000 replicas) is more than acceptable.

APJansen and others added 3 commits November 28, 2023 13:11
@APJansen (Collaborator, Author)

The two added attributes in the photon were necessary after all, to create the single-replica model from the multi-replica one. They are used outside of the class itself, so I didn't add a leading underscore.

@scarlehoff (Member)

I tried a 25-replica fit (asking for 20 replicas in postfit; could actually have asked for all 25) running in parallel on a GPU with feature scaling, in master and in this branch, and it seems to work fine.

https://vp.nnpdf.science/_tH_NYiGRXmEoxTeIZCkPg==

The different numpy versions are because the two GPUs were in different computers.

@scarlehoff (Member)

Regarding the QED fit, since we are running just one replica, the regression test should be enough.

@niclaurenti could you confirm that this runcard still represents the way you are running QED fits? (Otherwise, let me know the differences and I'll try to run an example of an actual fit.)

@scarlehoff added and removed the run-fit-bot (Starts fit bot from a PR.) label on Nov 29, 2023
@APJansen (Collaborator, Author)

I merged master; the conflict was just a formatting thing.

@scarlehoff added and removed the run-fit-bot (Starts fit bot from a PR.) label on Nov 29, 2023

Greetings from your nice fit 🤖 !
I have good news for you, I just finished my tasks:

Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!

@scarlehoff (Member)

Oh, so the automatic fit was sent but hidden... interesting. Well, there will be another post in a few hours then.

I've done a few more tests that I wanted to do and everything looks ok:

  • Nothing changes time-wise when you run a single replica for NNPDF4.0, master vs this branch
  • The diagonal covmat fit stays exactly the same when doing a 1-replica fit with DIS_diagonal_l2reg_example.yml (master vs this)
  • DIS_diagonal_l2reg_example.yml can be run in parallel without crashing (and the results are equal to master to the last digit after 4000 epochs)

I think that's all we mentioned this morning, right?

It would be nice to have a fitbot that also runs in parallel (maybe 5 or 10 replicas) as soon as the per-replica training/validation (trvl) split is also included (right now it's not so interesting I'd say, but the moment the masks change, the ways in which things can break down the line greatly increase :P)

@scarlehoff (Member) left a comment


This looks good to me. I added a few final comments, but they are all at the level of "change NN to NN_PREFIX", so after those are done I'm fine with merging once @Radonirinaunimi and @RoyStegeman are ok as well.

@scarlehoff (Member)

@RoyStegeman @Radonirinaunimi do you still want to have a look at this?

(we are not in a rush, but just to know whether to wait or not)

@RoyStegeman (Member) left a comment


I already did a while ago; now that the feature scaling problem is understood, I think I would only have some cosmetic comments, which are not really worth spending time on.

@Radonirinaunimi (Member) left a comment


I looked a bit into this before the feature scaling (FS) issue was resolved, and even back then everything but the FS was "ok". I don't think I will have enough time to look into this again in the next 2 days, so please feel free to merge.

@APJansen merged commit d6acb11 into master on Dec 4, 2023
@APJansen deleted the multi-dense-logistics branch on December 4, 2023 at 15:17
@APJansen mentioned this pull request on Dec 4, 2023
Labels: enhancement (New feature or request), escience, Refactoring, run-fit-bot (Starts fit bot from a PR.), urgent

Successfully merging this pull request may close these issues: Realising a factor 20-30 speedup on GPU

5 participants