Multi Replica PDF #1782
Conversation
I'm guessing you already talked with @goord and are aware of some of the issues he found in #1661, most notably the ambiguity in how to treat the training/validation split for datasets with a single point, when you merged it with your PRs. I would also ask that you finish the tests for the small PRs you've been doing, to ensure that the changes are incrementally merged (and so are not broken by other changes that might be made to the code in parallel). (I see this PR is a draft, so maybe this was already your plan.) Now, the answers to your questions:
Yes to both. But note that this is not due to interactions with validphys (which we could modify at will) but rather because each replica is independent of all the others, i.e. training (data, trvlsplit, stopping, Lagrange multipliers, etc.) should be independent. This is in practice the main point: as long as every replica at the end is independent of all the others, I'd say there is freedom in how to get there. Edit: in other words, if the little interaction with vp at the end of the fit (to compute the arclength, and little more) is an issue, we can easily fix that as long as the interpolation grids at the end are correct. Regarding the photon or the hyperopt penalties (again, I guess this was your plan already, but writing it here to make sure we are all on the same page), I'd suggest leaving those for after the standard multi-replica fit on GPU is well tested and merged. The photon might not even be suitable for GPU parallelization, since a non-negligible amount of time is spent calculating the photon with fiatlux, so the best thing would be to make sure that the QED fit is not broken when running it in the "normal 1-replica way".
And I imagine that there are users that want to be able to evaluate a single PDF without having to evaluate all replicas. I actually know nothing about what happens with these PDFs once trained, can you say something about that or link to something?
Yes, this whole branch only makes sense to merge after the trvl-mask-layers branch is merged. The hyperopt penalties can trivially be parallelized across replicas; for the photon I'm not sure, and if not, there just needs to be an interface for extracting single replicas from the joined model.
In particle physics we collide protons, i.e. bound states of quarks and gluons, with another proton or a lepton. However, in perturbative QCD we can only calculate Feynman diagrams with individual incoming quarks/gluons, not with incoming hadrons. To connect the pQCD calculation to what can be measured in experiments, each Feynman diagram essentially needs to be weighted by the probability of finding the corresponding incoming states inside the proton, in order to make the connection between the proton-lepton collision and the quark/gluon-electron Feynman diagram. These weights are what the PDFs provide, if you will. There are different "factorization" arguments for different processes that provide the theoretical underpinning of this factorization of the quark from the proton (there are no formal proofs for all processes, though). For a general introduction to QCD/collider physics any set of lecture notes on the topic will do. For a more specific discussion of what NNPDF does you could have a look here: https://arxiv.org/pdf/2008.12305.pdf (see equation 1 in these notes for the factorization equation I explained above). Perhaps you don't want to read the entire thing, but up to section 2.2 might be useful.
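For reference, the collinear factorization formula being described has the schematic form below. This is a generic sketch of the standard expression; the exact notation of equation 1 in the linked notes may differ.

```latex
% Schematic collinear factorization for a lepton-proton collision:
% the hadronic cross section is a sum over partons a of the PDF f_a,
% convolved with the partonic (Feynman-diagram-level) cross section.
\sigma_{\ell p \to X}
  \;=\; \sum_{a \in \{q,\bar q, g\}}
        \int_0^1 \mathrm{d}x \; f_a\!\left(x, \mu_F^2\right)\,
        \hat{\sigma}_{\ell a \to X}\!\left(x, \mu_F^2\right)
```

Here $x$ is the fraction of the proton's momentum carried by parton $a$, $f_a$ is the PDF playing the role of the "weight" described above, $\hat{\sigma}$ is the perturbatively calculable partonic cross section, and $\mu_F$ is the factorization scale.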
Dear @APJansen and @scarlehoff,
@RoyStegeman Thanks, this I knew, sorry for not being clear. (My background is in theoretical physics as well, though mostly on black hole physics, but I did take master courses on QFT and particle physics so I know the basics) |
Ah I see, I am aware of your background (without some basic knowledge of QFT/SM I don't think my explanation would be very helpful anyway) but indeed I understood that you were asking about some collider physics notes, my bad! The code is public but not really used outside our collaboration. Some parts of the code that produce the FKtables (these are in different repositories) are being used by others, and we hope to convince more people to use our codes, since doing theory predictions serves a more general purpose than a PDF fit using the NNPDF methodology as implemented in this repo. The bottom line is thus: it's open source because we invite people to check our work and we hope it can be of use to some others as well, though in practice it will of course mainly be us who use the code, while others just use the PDF grids we produce with it.
@scarlehoff Can you comment on this? Looking at this again while trying to rewrite everything: later in the same class, replicas are set to non-trainable, but this only takes effect when the model is recompiled, which as far as I can see is not happening here. (And I think this won't be possible any longer once all replicas are a single model.)
Hi @APJansen, feel free to modify the strategy as you wish. In any case, for your situation one possible strategy might be this (talking without actually having put my hands in the code to test what the problems/issues might be):
And then continue training until all replicas have triggered the stopping condition. At that point you re-add all the weights, since you saved them at their best, and then you have a network in which each replica has been trained independently.
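A minimal NumPy sketch of this per-replica stopping loop. This is illustrative only: the function names, the toy update rule, and the patience-based stopping criterion are assumptions for the sketch, not the actual n3fit implementation.

```python
import numpy as np

def train_with_per_replica_stopping(step, val_loss, weights, max_epochs, patience):
    """Train a joint multi-replica model while monitoring each replica separately.

    step:     callable(weights) -> updated weights (one joint training step)
    val_loss: callable(weights) -> per-replica validation losses, shape (replicas,)
    weights:  array with a leading replica axis
    """
    n_rep = weights.shape[0]
    best_loss = np.full(n_rep, np.inf)
    best_weights = weights.copy()
    bad_epochs = np.zeros(n_rep, dtype=int)

    for _ in range(max_epochs):
        weights = step(weights)
        losses = val_loss(weights)
        improved = losses < best_loss
        # snapshot the best weights replica by replica
        best_loss[improved] = losses[improved]
        best_weights[improved] = weights[improved]
        bad_epochs = np.where(improved, 0, bad_epochs + 1)
        # stop only once every replica has triggered its stopping condition
        if np.all(bad_epochs >= patience):
            break

    # restore each replica to its own best weights
    return best_weights, best_loss
```

The key point of the strategy is visible in the masked assignments: replicas keep training jointly, but each one's best state is recorded independently and restored at the end.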
Yes that was my plan, except that I hadn't thought of step 3, thanks! |
So I've thought about it, and actually it shouldn't matter for the coupling between the replicas whether individual replicas are set to non-trainable. I've also verified that commenting out this line doesn't change anything. To test, I added a log message when the function is being called, to make sure that with the runcard I'm using the stopping conditions are being met. I tested with 5 replicas, and the results are identical. So to conclude: step 3 is not an issue, and while setting one replica to non-trainable won't be possible after this refactor, it wasn't being done in the first place, and the speedup from the refactor should outweigh the benefit of a proper implementation of setting individual replicas to non-trainable.
Closing this, all of this has been done. |
Question
This will be some work, so before continuing past this I'd like to confirm that you agree that once finished this will be a beneficial change.
Idea
The idea of this PR is to refactor the tensorflow model from taking a list of single-replica pdfs to taking a single multi-replica pdf: one pdf whose output has an extra axis representing the replica. This is much faster on the GPU, see the tests below.
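The extra replica axis can be sketched as a stacked dense layer in plain NumPy. This is illustrative only: the function name, shapes, and einsum contractions are assumptions for the sketch, not the actual implementation in this PR.

```python
import numpy as np

def multi_dense(inputs, weights, first_layer):
    """Forward pass of a multi-replica dense layer (illustrative sketch).

    weights: shape (replicas, n_in, n_out) -- a stack of per-replica kernels.
    For the first layer the input has no replica axis yet, shape (batch, n_in),
    so every replica sees the same x grid. Deeper layers receive an input that
    already carries a replica axis, shape (batch, replicas, n_in), and each
    replica's slice of the input is contracted with its own kernel.
    """
    if first_layer:
        # (batch, n_in) x (replicas, n_in, n_out) -> (batch, replicas, n_out)
        return np.einsum('bi,rio->bro', inputs, weights)
    # (batch, replicas, n_in) x (replicas, n_in, n_out) -> (batch, replicas, n_out)
    return np.einsum('bri,rio->bro', inputs, weights)
```

Because the contraction over the replica axis is diagonal, the result is identical to running each replica's dense layers separately, but a single batched operation keeps the GPU busy.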
The main ingredient to make this possible is a MultiDense layer (see here), which is essentially a dense layer whose weights have one extra dimension, of size equal to the number of replicas. For the first layer, which takes the x's as input, that is all there is to it. For deeper layers the input already carries a replica axis, so each replica slice of the input has to be contracted with the corresponding slice of the weights.

Development Strategy
To integrate this into the code, many small changes are necessary.
To make it as simple as possible to review and test, I aim to make small, independent changes that ideally are beneficial, or at least not detrimental, on their own. Wherever it's sensible I'll first create a unit test that covers the changes I want to make, and make sure it still passes after, and wherever possible I'll try to have the outputs be identical up to numerical errors. I'll put all of these on their own branch and with their own PR (maybe I should create a special label for those PRs?).
Once those small changes are merged, the actual implementation should be easily manageable to review.
This PR itself is for now a placeholder; I just added the commit so that I can create a draft PR and so you can check out the MultiDense layer.

I expect that as a final result you'll still want single-replica pdfs. I will add code that, once all computations are done, splits the multi-replica pdf into single ones, so the saving and any interaction with validphys will remain unchanged.
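That final splitting step can be sketched as below, assuming the joint weights are stored as arrays with a leading replica axis; `split_replicas` is a hypothetical name for illustration, not the function this PR will add.

```python
import numpy as np

def split_replicas(multi_weights):
    """Split stacked multi-replica kernels into per-replica weight lists.

    multi_weights: list of arrays of shape (replicas, n_in, n_out), one per layer.
    Returns one list of ordinary (n_in, n_out) kernels per replica, i.e. the
    weights of a single-replica network, so downstream code that saves and
    inspects individual replicas can stay unchanged.
    """
    n_rep = multi_weights[0].shape[0]
    return [[w[r] for w in multi_weights] for r in range(n_rep)]
```

Since the replicas are trained independently (only stacked for speed), slicing the replica axis loses no information.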
Performance
Timing
These are the timing tests I did on a 1/4 node on Snellius, with one GPU. I'm reporting the average seconds per epoch as printed in debug mode.
Memory
Memory also appears to be significantly reduced.
I checked the peak cpu memory usage using libmemprofile, on the basic runcard with 200 replicas, and found 3.5 GB versus 16.5 GB for the trvl-mask-layers branch.
Status
I have a test branch where this is working up to the end of the model training, which is what I used to obtain the timings above.