
Multi dense logistics #1818

Merged: 55 commits from the multi-dense-logistics branch into master on Dec 4, 2023

Conversation

@APJansen (Collaborator) commented Oct 16, 2023

The aim of this PR is to make all the remaining changes necessary to enable the implementation of the MultiDense layers, without changing the numerics at all.
This boils down to stacking the different replicas as soon as possible (i.e. here) and making the changes that follow from that; a toy illustration of the stacking is sketched below.
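As a toy illustration of what "stacking the replicas" means (shapes and variable names are purely illustrative, not n3fit code):

```python
import numpy as np

# per-replica outputs of the NN block: one array of shape (batch, n_x, n_flavours) per replica
n_replicas, n_x, n_flavours = 4, 10, 8
per_replica_outputs = [np.random.rand(1, n_x, n_flavours) for _ in range(n_replicas)]

# stack along a new replica axis; everything downstream (preprocessing, MSR, observables)
# then acts on all replicas at once instead of once per replica
stacked = np.stack(per_replica_outputs, axis=1)
assert stacked.shape == (1, n_replicas, n_x, n_flavours)
```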

TODO:

  • photons: see below for a detailed discussion.
  • Fix loading of weights in this block. I'm not sure how that's working at the moment; all replicas read from the same file. Is this something that is actually used, @scarlehoff?
  • verify that hyperopt is working
  • figure out what to do with the line n3pdfs.append(N3PDF(pdf_models, name=f"fold_{k}")) [here]; see below for a detailed discussion

Comments

The biggest change is the addition of three methods to MetaModel: get_replica_weights, set_replica_weights and split_replicas. For now these rely on the different replicas being separate models. Once the MultiDense layer is implemented that won't be the case, and the code in these functions will need to change to extract the right entry along the replica axis of all weights, but the required changes should be limited to these functions.

The reason for split_replicas is that keeping everything as a single model would require many more changes, also in validphys; since the performance difference is negligible, I thought it cleanest to split into separate replicas after training (a minimal sketch of these helpers follows below).
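For illustration, a minimal self-contained sketch of what these three helpers could look like while each replica is still a separate sub-model; the class name and internals are assumptions, not the actual MetaModel code:

```python
import copy


class MultiReplicaModelSketch:
    """Illustrative stand-in for a multi-replica model built from per-replica sub-models."""

    def __init__(self, replica_models):
        # one Keras-style model per replica; a later PR would replace this with a single
        # MultiDense model whose weights carry an explicit replica axis
        self._replica_models = list(replica_models)

    def get_replica_weights(self, i_replica):
        # return a copy of the weights of one replica
        return copy.deepcopy(self._replica_models[i_replica].get_weights())

    def set_replica_weights(self, weights, i_replica):
        # overwrite the weights of one replica
        self._replica_models[i_replica].set_weights(weights)

    def split_replicas(self):
        # hand back the trained single-replica models, e.g. for validphys use after training
        return list(self._replica_models)
```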

Status

Work in Progress.

Currently I'm a bit stuck on this: it runs (at least if I avoid the points above by just running the basic runcard), and during training the results are identical to master. The weights of the single-replica models are also identical to the original model, but the best chi2 reported per replica after training is very different.

Later

Once the above is done, the joining of replicas can be pushed even earlier, to directly after the NN layers, before actually introducing the MultiDense layers. Whether I do that in this PR or in the final one depends on whether it changes the numerics (I think it may).

@APJansen force-pushed the multi-dense-logistics branch from 8fef1b0 to 0c0ee66 on October 16, 2023 at 13:38
@Radonirinaunimi added the Refactoring and enhancement (New feature or request) labels on Oct 16, 2023
@Radonirinaunimi linked an issue on Oct 16, 2023 that may be closed by this pull request
@Radonirinaunimi (Member)

Thanks @APJansen! Could you also please update #1803 accordingly, to indicate which PR addresses which point (for example, this one goes into point (2))? That way it would be easy for people to situate the status of each PR.

@APJansen self-assigned this on Oct 16, 2023
@APJansen (Collaborator, Author)

Yes @Radonirinaunimi, I've updated #1782, for which this is the last preparatory PR. That one is then step 6 in #1803.

@APJansen (Collaborator, Author)

Photons

My understanding of what currently happens in master:

  • An AddPhoton layer is initialized once here (so shared between replicas), with an instance of the photon.compute.Photon class passed as argument (this in turn is initialized with a theory_id, lux_params and a list of replica ids).
  • The Photon object contains integrals, which are used in the MSR layer with a replica index to extract the right one. This is trivial to "parallelize".
  • For each replica the AddPhoton layer is called on the pdf here, with the replica index as an additional argument (at the last stage, after MSR normalization).
  • Since the AddPhoton.register_photon method hasn't been called yet, AddPhoton._pdf_ph is still None, so this does nothing.
  • In model_trainer.py, first the PDF model is created, including the steps above. Then for each replica the AddPhoton layer is extracted (this is the same layer every time!) and register_photon is called on it, here. Each call does some computation for every replica that appears to be the same every time, so it seems this could have been called just once?

Questions to @niclaurenti

  • Is this correct?
  • It seems to me AddPhoton.call does nothing when the model is built. I thought this would mean it doesn't get included in the computational graph, but probably that is wrong? Is it intended to do nothing at first (on the symbolic input), but do something on every subsequent call?
  • AddPhoton.call also seems trivial to parallelize: rather than taking the replica at the specified index of its internal photon values self._pdf_ph and replacing the single-replica PDF with it, just take the whole thing and replace the joint multi-replica PDF with all of them (see the sketch after these questions). Is that right?
  • Is it ok to remove the loop here and call register_photon just once?
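A rough, numpy-only illustration of the parallelization point above (this is not the actual AddPhoton layer; the shapes, the photon flavour slot, and the way the photon is combined with the PDF are all illustrative assumptions):

```python
import numpy as np

n_replicas, n_x, n_flavours = 3, 6, 14
pdf = np.zeros((n_replicas, n_x, n_flavours))   # stacked multi-replica PDF output
photon_values = np.ones((n_replicas, n_x))      # analogue of the precomputed self._pdf_ph
PHOTON_INDEX = 0                                # flavour slot of the photon (purely illustrative)

# per-replica version: pdf[i_replica, :, PHOTON_INDEX] = photon_values[i_replica]
# parallel version: write the photon of every replica in one indexing operation
pdf[:, :, PHOTON_INDEX] = photon_values
```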

@APJansen (Collaborator, Author)

The N3PDF line here

The list of single-replica PDFs is wrapped in an N3PDF object; a list of those, one for each fold, is made and passed to a hyper loss at every trial.

This has the potential to touch a lot of validphys code that I'm not familiar with.
Unless anyone else finds this easy to adapt, perhaps it's best to:

  • pass the multi-replica pdf itself directly to the hyper loss
  • inside the hyper loss, convert this to a list of single replica pdfs wrapped in an N3PDF
  • leave the rest unchanged

This comes at the cost of having to split into individual replicas at every trial, and not having the benefit of parallelization in the computations done there. But this should be a very minor part of the overall runtime.
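A rough sketch of that flow, under the assumption that the hyper loss receives the multi-replica model and splits it per trial; N3PDF is stubbed here so the snippet is self-contained, and the function name n3pdf_for_fold is illustrative rather than existing code:

```python
class N3PDF:
    """Stand-in for the validphys-facing wrapper mentioned above (stubbed for self-containment)."""

    def __init__(self, pdf_models, name="pdf"):
        self.pdf_models = pdf_models
        self.name = name


def n3pdf_for_fold(multi_replica_pdf, k):
    # split into single replicas only here, once per trial; the cost is negligible
    # compared with a full training cycle, and the validphys side stays unchanged
    pdf_models = multi_replica_pdf.split_replicas()
    return N3PDF(pdf_models, name=f"fold_{k}")
```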

Details: what happens to it in the hyper loss (depending on which one is chosen)

  • xplotting_grid(n3pdf, ...) (on one in the list)
  • distance_grids(n3pdfs, ...)
  • n3pdf(input_x) (on one in the list)

Those are all validphys functions, in turn they call:

  • basis.grid_values(n3pdf, ...)
  • n3pdf.stats_class(...)
  • n3pdf.get_members()
  • n3pdf.stats_class(...)

@APJansen force-pushed the multi-dense-logistics branch from b42e815 to 3cbc9e2 on October 18, 2023 at 13:16
@APJansen (Collaborator, Author)

I remember now there was a better reason for having a list of all the single replicas rather than a single one: with a single one you have to make deep copies to avoid overwriting the weights of the previous one. The copying was giving me loads of issues, because of all the extras that MetaModel has over a normal Keras model.

Your solution @scarlehoff of using a generator avoids that, and it was a much simpler change than I thought. It does come at the cost of redoing the work of generating a model for every replica at every fold, and because of the above we cannot just generate one and overwrite its weights. I did a quick test and it takes about 0.1 seconds per replica (sketched below).
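For reference, a hedged sketch of the generator approach; build_replica_model and trained_weights are illustrative placeholders rather than existing functions:

```python
def replica_model_generator(n_replicas, build_replica_model, trained_weights):
    """Yield a freshly built single-replica model per replica, loaded with its trained weights."""
    for i_replica in range(n_replicas):
        model = build_replica_model()  # rebuilding took ~0.1 s per replica in a quick test
        model.set_weights(trained_weights[i_replica])
        yield model
```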

I don't have a strong preference for either approach; the performance cost is negligible compared to a full training cycle. Do you still prefer the generator?

@scarlehoff (Member)

Yes, I think ~2 minutes per fold in a worst case (1000 replicas) is more than acceptable.

APJansen and others added 3 commits November 28, 2023 13:11
@APJansen (Collaborator, Author)

The two added attributes in the photon were necessary after all, to create the single-replica model from the multi-replica one. They are used outside of the class itself, so I didn't add a leading underscore.

@scarlehoff (Member)

I tried a 25-replica fit (asking for 20 replicas in postfit; could actually have asked for all 25) running in parallel on a GPU with feature scaling, in master and in this branch, and it seems to work fine.

https://vp.nnpdf.science/_tH_NYiGRXmEoxTeIZCkPg==

The different numpy versions are because the two GPUs were in different computers.

@scarlehoff (Member)

Regarding the QED fit, since we are running just one replica, the regression test should be enough.

@niclaurenti could you confirm that this runcard still represents the way you are running QED fits? (Otherwise, let me know the differences and I'll try to run an example of an actual fit.)

@scarlehoff added and removed the run-fit-bot (Starts fit bot from a PR.) label on Nov 29, 2023
@APJansen (Collaborator, Author)

I merged master; the conflict was just a formatting thing.

@scarlehoff added and removed the run-fit-bot (Starts fit bot from a PR.) label on Nov 29, 2023

Greetings from your nice fit 🤖 !
I have good news for you, I just finished my tasks:

Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!

@scarlehoff (Member)

Oh, so the automatic fit was sent but hidden... interesting. Well, there will be another post in a few hours then.

I've done a few more tests that I wanted to do and everything looks ok:

  • Nothing changes time-wise when you run a single replica for NNPDF4.0, master vs this branch
  • The diagonal covmat fit stays exactly the same when doing a 1-replica fit with DIS_diagonal_l2reg_example.yml (master vs this)
  • DIS_diagonal_l2reg_example.yml can be run in parallel without crashing (and the results are equal to master to the last digit after 4000 epochs)

I think that's all we mentioned this morning, right?

It would be nice to have a fitbot that also runs in parallel (maybe 5 or 10 replicas) as soon as the per-replica training/validation (trvl) split is also included (right now it's not so interesting I'd say, but the moment the masks change, the ways in which things can break down the line greatly increase :P)

@scarlehoff (Member) left a comment


This looks good to me. I added a few final comments, but they are all at the level of "change NN to NN_PREFIX", so after those are done I'm fine with merging once @Radonirinaunimi and @RoyStegeman are ok as well.

@scarlehoff (Member)

@RoyStegeman @Radonirinaunimi do you still want to have a look at this?

(we are not in a rush, but just to know whether to wait or not)

@RoyStegeman (Member) left a comment


I already did a while ago; now that the feature scaling problem is understood, I think I would only have some cosmetic comments, which are not really worth spending time on.

@Radonirinaunimi (Member) left a comment


I looked a bit into this before the feature scaling (FS) issue was resolved, and even back then everything but the FS was "ok". I don't think I will have enough time to look into this again in the next 2 days, so please feel free to merge.

@APJansen merged commit d6acb11 into master on Dec 4, 2023
@APJansen deleted the multi-dense-logistics branch on December 4, 2023 at 15:17
@APJansen mentioned this pull request on Dec 4, 2023
Labels: enhancement (New feature or request), escience, Refactoring, run-fit-bot (Starts fit bot from a PR.), urgent

Successfully merging this pull request may close these issues: Realising a factor 20-30 speedup on GPU

5 participants