Multi dense logistics #1818
Conversation
Yes @Radonirinaunimi, I've updated #1782, for which this is the last preparatory PR. That one is then step 6 in #1803.
Photons

My understanding of what currently happens in master:
Questions to @niclaurenti
The list of single replica PDFs is wrapped in an `N3PDF` here. There it has the potential to touch a lot of validphys code that I'm not familiar with.
This comes at the cost of having to split into individual replicas at every trial, and not having the benefit of parallelization in the computations done there. But this should be a very minor part of the overall runtime.

Details: what happens to it in the hyper loss (depending on which one is chosen)
Those are all validphys functions; in turn they call:
I remember now there was a better reason for having a list of all the single replicas rather than a single one: with a single one you have to make deep copies to avoid overwriting the weights of the previous one. The copying was giving me loads of issues, because of all the extras that MetaModel has over a normal Keras model. Your solution @scarlehoff of using a generator avoids that, and it was a much simpler change than I thought. It does come at the cost of having to redo the work of generating a model for every replica at every fold, since, because of the above, we cannot just generate one and overwrite its weights. I did a quick test and it takes about 0.1 seconds per replica. I don't have a strong preference for either approach; the performance cost is negligible compared to a full training cycle. Do you still prefer the generator?
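The list-vs-generator trade-off discussed here can be sketched roughly as follows. All names are illustrative stand-ins, not the actual n3fit API; the dicts stand in for Keras models.

```python
def make_replica_model(seed):
    """Hypothetical builder: constructs a fresh single-replica model.

    The dict is a stand-in for a real Keras model; building one takes
    ~0.1 s in the real code.
    """
    return {"weights": [seed * 0.1, seed * 0.2]}


# Approach 1: materialise all models up front. No copying needed, but
# all replicas live in memory at once.
replica_models_list = [make_replica_model(s) for s in range(3)]


# Approach 2: a generator that builds each model lazily. This avoids the
# problematic deep copies (no single model is reused and overwritten),
# at the cost of rebuilding a model per replica at every fold.
def replica_models_gen(n_replicas):
    for s in range(n_replicas):
        yield make_replica_model(s)


for model in replica_models_gen(3):
    pass  # train/evaluate this replica for the current fold
```

With the generator, each iteration owns a fresh model, so overwriting its weights cannot clobber a previously trained replica.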
Yes, I think ~2 minutes per fold in the worst case (1000 replicas × ~0.1 s ≈ 100 s) is more than acceptable.
The two added attributes in the photon were necessary after all, to create the single replica model from the multi-replica one. Since they are used outside of the class itself, I didn't add the leading underscore.
I tried a 25-replica fit (asking for 20 replicas in postfit, though I could actually have asked for all 25) running in parallel on a GPU with feature scaling, in both master and this branch, and it seems to work fine: https://vp.nnpdf.science/_tH_NYiGRXmEoxTeIZCkPg== The different numpy versions are because the two GPUs were in different computers.
Regarding the QED fit, since we are running just one replica, the regression test should be enough. @niclaurenti could you confirm that this runcard
# Conflicts:
#   n3fit/src/n3fit/model_trainer.py
I merged master; the conflict was just a formatting thing.
Greetings from your nice fit 🤖 !
Check the report carefully, and please buy me a ☕, or better, a GPU 😉!
Oh, so the automatic fit was sent but hidden... interesting. Well, there will be another post in a few hours then. I've done a few more tests that I wanted to do and everything looks ok.

I think that's all we mentioned this morning, right? It would be nice to have a fitbot that also runs in parallel (maybe 5 or 10 replicas) as soon as the per-replica trvl split is included as well (right now it's not so interesting I'd say, but the moment the masks change, the ways in which things can break down the line greatly increase :P)
This looks good to me. I added a few final comments, but they are all at the level of "change NN to NN_PREFIX", so after those are done I'm fine with merging once @Radonirinaunimi and @RoyStegeman are ok as well.

@RoyStegeman @Radonirinaunimi do you still want to have a look at this? (We are not in a rush, but just to know whether to wait or not.)
I already did a while ago, now that the feature scaling problem is understood I think I would only have some cosmetic comments which are not really worth wasting time on.
I looked into this a bit before the FS issue was resolved, and even back then everything but the FS was "ok". I don't think I will have enough time to look into this again in the next 2 days, so please feel free to merge.
The aim of this PR is to make all the remaining changes necessary to enable the implementation of the `MultiDense` layers, without changing the numerics at all. This boils down to stacking the different replicas as soon as possible (i.e. here), and making any changes that result from that.
TODO:
- [here], see below for a detailed discussion (`n3fit/src/n3fit/model_trainer.py`, line 952 in 0c0ee66):

```python
n3pdfs.append(N3PDF(pdf_models, name=f"fold_{k}"))
```
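For context, the line above sits in the hyperopt fold loop: after each fold, the list of single-replica PDF models is wrapped in an `N3PDF` so validphys can consume it. A rough sketch, with a stand-in `N3PDF` class rather than the real validphys one:

```python
# Stand-in for validphys' N3PDF wrapper: it holds a list of
# single-replica PDF models under a common name. Illustrative only.
class N3PDF:
    def __init__(self, pdf_models, name):
        self.pdf_models = pdf_models
        self.name = name


def run_folds(models_per_fold):
    """Sketch of the fold loop: models_per_fold[k] is the list of
    single-replica models produced for fold k (e.g. by splitting the
    trained multi-replica model)."""
    n3pdfs = []
    for k, pdf_models in enumerate(models_per_fold):
        n3pdfs.append(N3PDF(pdf_models, name=f"fold_{k}"))
    return n3pdfs
```

This is why the splitting into single replicas has to happen before the validphys-facing code runs, as discussed below.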
Comments
The biggest changes are the addition of 3 methods to `MetaModel`: `get_replica_weights`, `set_replica_weights` and `split_replicas`. For now these rely on the different replicas being different models. Once the `MultiDense` layer is implemented that won't be the case, and the code in these functions will need to change to extract the right entry in the replica axis of all weights, but the changes required should be limited to these functions.

The reason for `split_replicas` is that if I don't do this and keep everything as a single model, it will require a lot more changes, also in validphys, and since performance-wise the difference is negligible, I thought it was cleanest to split into separate replicas after training.

Status
Work in Progress.
Currently I'm a bit stuck on this: it runs (at least if I avoid the points above by just running the basic runcard), and during training the results are identical to master. The weights of the single replica models are also identical to the original model, but the best chi2 reported per replica after training is very different.
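As context for the interface described under Comments, a minimal sketch of what the three per-replica methods could look like while each replica is still a separate model. Names match the PR; the internals (plain dicts as model stand-ins) are illustrative, not the actual MetaModel implementation.

```python
class MultiReplicaModel:
    """Toy stand-in for MetaModel's multi-replica bookkeeping."""

    def __init__(self, replica_models):
        # each entry is a stand-in for a single-replica Keras model
        self.replica_models = replica_models

    def get_replica_weights(self, i_replica):
        # Once MultiDense lands, this would instead slice the replica
        # axis of the stacked weight tensors.
        return self.replica_models[i_replica]["weights"]

    def set_replica_weights(self, weights, i_replica):
        self.replica_models[i_replica]["weights"] = list(weights)

    def split_replicas(self):
        # Return one single-replica model per replica, so downstream
        # (validphys) code can keep treating replicas independently.
        return [{"weights": m["weights"]} for m in self.replica_models]
```

The point of the design is that only these three methods know how replicas are stored, so switching to a stacked replica axis should not leak into the rest of the code.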
Later
Once the above is done, the joining of replicas can be pushed even further back, before actually including the `MultiDense` layers, namely to directly after the NN layers. Whether I'll do it in this PR or the final one will depend on whether it changes the numerics (I think it may).