
Merge prefactors into single layer #1881

Merged
merged 1 commit into from
Jan 26, 2024

Conversation


@APJansen APJansen commented Dec 4, 2023

Join the preprocessing layers into one.

This also required changing the way weights are extracted and set, as a replica is now only a single slice of a weight tensor. Locally, the way I implemented it gives errors that seem specific to Mac when running on the GPU.
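
As a toy illustration of that statement (made-up shapes and helper names; this is not the actual n3fit implementation), extracting or setting a replica's weights becomes slicing along the replica axis of one stacked tensor:

```python
import numpy as np

# Hypothetical merged preprocessing weights: one tensor holding all replicas,
# with the replica axis first (shape: replicas x flavours).
num_replicas, num_flavours = 3, 8
stacked = np.arange(num_replicas * num_flavours, dtype=float).reshape(
    num_replicas, num_flavours
)

def get_replica_weights(weights, i_replica):
    # a single replica is just a slice of the stacked weight tensor
    return weights[i_replica]

def set_replica_weights(weights, i_replica, values):
    # writing a replica back means assigning into that slice
    weights[i_replica] = values

one = get_replica_weights(stacked, 1)
assert one.shape == (num_flavours,)
set_replica_weights(stacked, 1, np.zeros(num_flavours))
assert stacked[1].sum() == 0.0
```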

Status: Done, ready for review

  • I started changing stuff in vpinterface, but I think what I should do instead is similar to what I did before with the PDF as a whole: use the replica_axis flag I made there to keep everything as-is in the single-replica models (i.e. not have an extra replica axis of length 1 there), so nothing there has to change.
  • It is producing very different results, because the initialization is different. Keeping it the same would be very ugly, but I can do it temporarily just to verify that that's the only source of differences.
  • Fix test_fit

Edit: this PR changed the layout of the weights, so weights saved by earlier versions of the code need to be restructured to remain equivalent. The following script (comment: #1881 (comment)) achieves exactly that:

test script
import numpy as np
import h5py

def reformat_h5(i_replica):
    old_file = h5py.File(f"weights_{i_replica}_master.h5", 'r')
    new_file = h5py.File(f"weights_{i_replica}.h5", 'r+')

    def replace_numbers(name, obj):
        name_new_to_old_rules = {
                "NNs": f"NN_{i_replica - 1}",
                "preprocessing_factor": f"preprocessing_factor_{i_replica - 1}",
                "dense_3": "dense",
                "dense_4": "dense_1",
                "dense_5": "dense_2",
            }
        if isinstance(obj, h5py.Dataset):
            old_name = name
            for new, old in name_new_to_old_rules.items():
                old_name = old_name.replace(new, old)

            print(f"Replacing {name} with {old_name}")
            data_to_put_in = old_file[old_name][...]
            if "preprocessing" in name:
                data_to_put_in = np.expand_dims(data_to_put_in, axis=0)

            obj[...] = data_to_put_in

    new_file.visititems(replace_numbers)

reformat_h5(1)
reformat_h5(2)

@APJansen APJansen added Refactoring n3fit Issues and PRs related to n3fit escience labels Dec 4, 2023
@APJansen APJansen mentioned this pull request Dec 4, 2023
Base automatically changed from replica-axis-first to master December 8, 2023 16:59
@APJansen (Collaborator Author)

@scarlehoff @RoyStegeman @Radonirinaunimi Just so you know, I rebased this and updated the status above, in case you want to have a look. But it's still WIP so no need to review yet.

@scarlehoff scarlehoff (Member) left a comment

I've left a suggestion which I think might simplify the code and completely bypass the problem you have on Mac (and probably future problems with other TensorFlow versions as well, because setting all weights will always work, while breaking a layer down into smaller variables is trickier).

Review threads (outdated, resolved): n3fit/src/n3fit/model_gen.py, n3fit/src/n3fit/backends/keras_backend/MetaModel.py
@APJansen (Collaborator Author)

I have done a manual test: first creating a prefactor layer in this branch and getting the output on some test input, then creating the same layers in master and setting their weights to those from this branch. Comparing outputs, they are identical, both for 1 and for 3 replicas.

This is also the reason the tests are still failing: only some regression tests fail, so they should be updated.
Can you remind me how this is done?

Also, I changed the name of the layer from preprocessing_{i_replica} to just preprocessing.
Are you ok with that, for both the multiple-replica model and the single-replica ones, with the latter also keeping a replica axis of length 1? Contrary to the comment in the initial post, I don't think it's worth differentiating these. For the output of the PDF it was worth it because it saved updating validphys, but I don't think validphys looks into the internals of the model, so here it's not necessary.

The same will apply of course to the NN_{i_replica} layers, and in both cases saved models will need to be updated because of this change in structure.

test script
import numpy as np
import pickle

from n3fit.layers import Preprocessing


flav_info = [
    {"fl": i, "largex": [0, 1], "smallx": [1, 2]}
    for i in ["u", "ubar", "d", "dbar", "c", "g", "s", "sbar"]
]

np.random.seed(42)
x_in = np.random.rand(1, 10, 1)

branch = "master"

if branch == "parallel-prefactor":
    print("Running parallel-prefactor")
    weights_parallel = {}
    outs_parallel = {}
    for large_x in [True, False]:
       for num_replicas in [1, 3]:
           preproc = Preprocessing(
                   flav_info=flav_info,
                   input_shape=(1,),
                   name="preprocessing_factor",
                   seed=42,
                   large_x=large_x,
                   num_replicas=num_replicas,
               )
           preproc.build((None, 1,))
           weights_parallel[(large_x, num_replicas)] = preproc.get_weights()
           outs_parallel[(large_x, num_replicas)] = preproc(x_in)
    with open("tmp/preproc_weights.pkl", "wb") as f:
       pickle.dump(weights_parallel, f)
    with open("tmp/preproc_outs.pkl", "wb") as f:
       pickle.dump(outs_parallel, f)

if branch == "master":
    print("Running master")
    with open("tmp/preproc_weights.pkl", "rb") as f:
        weights_parallel = pickle.load(f)
    with open("tmp/preproc_outs.pkl", "rb") as f:
        outs_parallel = pickle.load(f)
    for large_x in [True, False]:
        for num_replicas in [1, 3]:
            print(f"large_x: {large_x}, num_replicas: {num_replicas}")
            preprocs = []
            for i_replica in range(num_replicas):
                preproc = Preprocessing(
                    flav_info=flav_info,
                    input_shape=(1,),
                    name=f"preprocessing_factor_{i_replica}",
                    seed=42,
                    large_x=large_x,
                )
                preproc.build((None, 1))

                weights_this_replica = []
                for w in weights_parallel[(large_x, num_replicas)]:
                    weights_this_replica.append(w[i_replica])

                preproc.set_weights(weights_this_replica)
                preprocs.append(preproc)
            outs = np.stack(
                [preprocs[i_replica](x_in) for i_replica in range(num_replicas)], axis=1
            )
            np.testing.assert_equal(outs, outs_parallel[(large_x, num_replicas)])

@scarlehoff (Member)

The tests for 3.11 are pointing to something else though:

        TypeError: 'NoneType' object cannot be interpreted as an integer

RE the label of the replicas. No I don't think that matters, it's completely internal after all.

For the regressions instead, just use the quickcard_1.json (and similarly named files) to update the ones that failed.
It needs to be done manually because they don't fail often enough for us to have automated the process... (obviously, for you they fail more because you're touching the fitting code directly)

@APJansen (Collaborator Author)

APJansen commented Jan 8, 2024

Trying to fix the regression tests, I am running n3fit quickcard.yml 1 after editing the quickcard, changing load: "weights.h5" to save: "weights.h5" first. Then I copy the resulting quickcard/nn/replica_1/weights.h5 into tests/regressions/weights_1.h5, and quickcard/nn/replica_1/quickcard.json into tests/regressions/quickcard.json. And similarly with 2 instead of 1 (ignoring the qed runcard for now).

That should fix it, right? But locally at least, this still gives an assertion error, with a max relative difference of 15%.

@APJansen APJansen force-pushed the parallel-prefactor branch from e2c20b2 to c650cf3 Compare January 9, 2024 11:37
@APJansen APJansen mentioned this pull request Jan 9, 2024
@APJansen APJansen force-pushed the parallel-prefactor branch from c650cf3 to 5453531 Compare January 9, 2024 14:42
@APJansen (Collaborator Author)

I just repeated the procedure above on Snellius, updating the weights_1.h5 and quickcard_1.json from a run on Snellius and running the corresponding test. There as well it still fails.

@RoyStegeman (Member)

RoyStegeman commented Jan 10, 2024

That should fix it right?

Looking at what exactly the test does, this should indeed not work. Namely, it uses the hdf5 file to initialise the NN and subsequently still performs the fit. So when doing the steps as you describe, the json files you saved correspond to the starting point, while the json files they're compared to correspond to the end of the fit. One possible solution is of course to run the fit twice: once to produce the hdf5 file, and then again loading the weights from that file to produce the json files needed for the comparison.
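
The mismatch described above can be reduced to a toy sketch (plain Python; `fit` is a stand-in for a deterministic training run, none of this is n3fit code): the json reference records post-fit numbers, so saving the json together with the h5 captures the wrong stage.

```python
def fit(weights):
    # stand-in for an actual (deterministic) training run
    return [w + 1.0 for w in weights]

start = [0.0, 0.0]

# Attempted shortcut: save h5 and json together at the start of the fit.
saved_h5 = list(start)
shortcut_json = list(start)

# What the regression test actually compares against: the result of fitting
# from the saved starting point.
reference_json = fit(saved_h5)
assert shortcut_json != reference_json

# Two-pass fix: first produce the h5, then rerun loading it and save the json
# at the *end* of that second fit.
second_pass_json = fit(saved_h5)
assert second_pass_json == reference_json
```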

It's not so clear to me why the weights are loaded from the file as opposed to just initialised using the rng in this regression test (maybe to test the loading functionality?). And of course I don't know how the current hdf5 files were generated (in what kind of fit setup), though in principle for this regression test that doesn't matter. @scarlehoff can probably clarify these points.

@scarlehoff (Member)

The setting of the weights is to give a bit of stability to the regression test; I wanted to have one that works and one that fails (to fit, I mean). It would be interesting to have a replica_3 that does not start from a fixed point; however, that would introduce instabilities and the regression tests would fail randomly much more often.

In any case, I don't think the starting point should be changed. You should be able to just use the same weights; the number of parameters in the network should not change in this PR.

@scarlehoff (Member)

scarlehoff commented Jan 10, 2024

Just to be precise, "fixing the regression test" here should mean:

  1. Converting the current weights into the new format (and if you can please leave the script here in the PR for future reference, since it might be necessary to use it in the future; it's not a widely used feature so I'm ok with this compromise, but I don't think we should merge something that breaks all previous weights)
  2. Seeing how the numbers change due to these changes (hopefully not much when the starting point is the same).
  3. Updating the numbers in the .json reference file only once (2.) is properly checked.

@APJansen (Collaborator Author)

Ah I see, that makes sense, I'll look into it, thanks!

@APJansen (Collaborator Author)

That fixed it! The script I used is below; it uses saved weights from this branch to get the structure, and then looks in the old file to copy over the data. Locally it didn't pass the test (though with only a small difference), but as you can see it does pass in the CI; I didn't change the outputs.

Now I just need to fix this issue with 3.11.

reformatting weights script
import numpy as np
import h5py

def reformat_h5(i_replica):
    old_file = h5py.File(f"weights_{i_replica}_master.h5", 'r')
    new_file = h5py.File(f"weights_{i_replica}.h5", 'r+')

    def replace_numbers(name, obj):
        name_new_to_old_rules = {
                "NNs": f"NN_{i_replica - 1}",
                "preprocessing_factor": f"preprocessing_factor_{i_replica - 1}",
                "dense_3": "dense",
                "dense_4": "dense_1",
                "dense_5": "dense_2",
            }
        if isinstance(obj, h5py.Dataset):
            old_name = name
            for new, old in name_new_to_old_rules.items():
                old_name = old_name.replace(new, old)

            print(f"Replacing {name} with {old_name}")
            data_to_put_in = old_file[old_name][...]
            if "preprocessing" in name:
                data_to_put_in = np.expand_dims(data_to_put_in, axis=0)

            obj[...] = data_to_put_in

    new_file.visititems(replace_numbers)

reformat_h5(1)
reformat_h5(2)

@APJansen APJansen force-pushed the parallel-prefactor branch 2 times, most recently from d77af4d to 5867fec Compare January 10, 2024 16:34
@APJansen APJansen marked this pull request as ready for review January 10, 2024 17:18
@APJansen (Collaborator Author)

Fixed! And ready for review.

I don't know if you think it's necessary to do more tests on top of the CI, but if you do I would propose to postpone that and just do it in #1905, which branches off of this.

@APJansen APJansen added the run-fit-bot Starts fit bot from a PR. label Jan 10, 2024
@scarlehoff (Member)

For merging, I'd say the same as with:
#1888 (comment)

Btw, one point that I wanted to raise during the meeting this morning (which I could not attend): please either squash or rebase these PRs (given the amount of commits, and intermediate merges, you probably need both) so that their history is not mixed with master.

@APJansen (Collaborator Author)

Btw, one point that I wanted to raise during the meeting this morning (which I could not attend): please either squash or rebase these PRs (given the amount of commits, and intermediate merges, you probably need both) so that their history is not mixed with master.

Yes, I tried that yesterday (on nn-layers-refactor), but I got some error I didn't understand, so I merged instead; I can take another look.

@scarlehoff scarlehoff changed the base branch from master to develop_merge_20240119 January 19, 2024 17:06
Review threads (outdated, resolved) on n3fit/src/n3fit/backends/keras_backend/MetaModel.py
compute_preprocessing_factor = Preprocessing(
flav_info=flav_info,
input_shape=(1,),
name="preprocessing_factor",
Member:

Please, import this from MetaModel to make sure that it is consistent.

Collaborator Author:

I have done this everywhere you suggested, but it doesn't seem very logical to me to import this from n3fit.backends, and it is maybe a bit overkill for such simple names; what do you think?

Member:

Well, you can add a constants.py or something if you don't like the import from n3fit.backends. If they are supposed to be the same they should be the same. But in model_gen you need to import from n3fit.backends anyway so it doesn't create any circular problems.

Collaborator Author:

A separate constants.py makes sense, but then lots of other stuff should move there too, I'd say let's leave that for later ;P

flav_info=flav_info,
input_shape=(1,),
name="preprocessing_factor",
seed=seed[0] + number_of_layers,
Member:

This means that training replicas 1 to 10 and 2 to 10 won't give me the same initial results, right?

However, would 1 to 10 and 1 to 9 produce the exact same result from 1 to 9? I would like to know that.

Collaborator Author:

You mean n3fit 1 -r 10 vs n3fit 2 -r 10, and n3fit 1 -r 10 versus n3fit 1 -r 9? I'll test, I'm actually not sure how those arguments get fed into pdfNN_layer_generator.

Member:

Running in parallel I mean. You normally start at replica 1 but it is not necessarily always like that.

The reason why I'm asking is that I think right now if you instantiate, say, 10 replicas (to fit them in parallel for instance) each of them will get a separate seed for the generation of the pseudo-data and, until now, a separate seed for each of the Neural Networks.
As long as the seeds were separated that meant the initial point could be reproducible.

Instead, now all replicas are being instantiated with the first seed, so obviously if you fit 1 to 10 you will get a result different from 2 to 10 (because the corresponding first seed is different).

But I am wondering whether 1 to 10 would be equivalent to 1 to 9 for the first 9 replicas.

Then, one thing to be done when the parallel fit becomes the standard is to recover that reproducibility. For now, reproducibility is limited to sequential CPU replicas. Reproducibility cannot be achieved in any easy way on GPU, but at least the initial state can be made the same.
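
The two seeding schemes under discussion can be contrasted in a toy numpy sketch (the seed arithmetic here is invented for illustration; it is not n3fit's actual initializer):

```python
import numpy as np

def init_per_replica_seeds(first_replica, n, base_seed=100):
    # Old scheme (toy): every replica gets its own seed, so overlapping
    # replica ranges reproduce identical initial weights.
    return [
        np.random.default_rng(base_seed + first_replica + i).normal(size=4)
        for i in range(n)
    ]

def init_shared_seed(first_replica, n, base_seed=100):
    # New scheme (toy): a single generator, seeded from the first replica,
    # draws the weights of all replicas in sequence.
    rng = np.random.default_rng(base_seed + first_replica)
    return [rng.normal(size=4) for _ in range(n)]

# Per-replica seeds: replicas 1-10 and 2-10 agree on their overlap.
a = init_per_replica_seeds(1, 10)
b = init_per_replica_seeds(2, 9)
assert all(np.array_equal(x, y) for x, y in zip(a[1:], b))

# Shared seed: 1-10 vs 2-10 differ everywhere, because the base seed changed.
c = init_shared_seed(1, 10)
e = init_shared_seed(2, 9)
assert not np.array_equal(c[1], e[0])

# In this toy, 1-10 and 1-9 do share the first 9 replicas, because the draws
# are strictly sequential; whether the real layer initializers behave that
# way was exactly the open question here.
d = init_shared_seed(1, 9)
assert all(np.array_equal(x, y) for x, y in zip(c[:9], d))
```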

Collaborator Author:

I checked n3fit 1 -r 4, n3fit 2 -r 4, and n3fit 1 -r 5, and there were no similar results anywhere. For the MultiDense layers I did make sure that initialization is identical to master, here I did not.

Perhaps it's easiest to use the solution from there: MultiInitializer?

Member:

If it works out of the box for the preprocessing, yes indeed.
But if it doesn't we can leave that for later.

Btw, since you're at it, check that with seed[0] + number_of_layers we are not being silly and using the same seed twice. That worked well when every layer was using seed += 1, but I'm not sure whether that's the case anymore.
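
The suggested check can be sketched as a tiny sanity test (the seed values and the per-layer seed layout are assumptions for illustration, not the real runcard seeds):

```python
# Toy check that the preprocessing seed derived as seed[0] + number_of_layers
# does not collide with any per-layer NN seed of the form seed[i] + layer.
replica_seeds = [1000, 2000, 3000]  # made-up per-replica base seeds
number_of_layers = 3

nn_seeds = {s + layer for s in replica_seeds for layer in range(number_of_layers)}
preprocessing_seed = replica_seeds[0] + number_of_layers

# For this layout the preprocessing seed is one past the last layer seed of
# the first replica, so no reuse occurs.
assert preprocessing_seed not in nn_seeds
```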

Collaborator Author:

If it's ok to leave for later, I'll try it in the next PR #1905, to save some hassle of moving the code between branches.

Collaborator Author:

The seeds should still be ok, I haven't changed the seeding of the NN layers, neither in this PR nor in the next.

Review threads (outdated, resolved): n3fit/src/n3fit/model_trainer.py, n3fit/src/n3fit/tests/test_modelgen.py, n3fit/src/n3fit/vpinterface.py
@scarlehoff scarlehoff (Member) left a comment

Just to confirm, is this finished now? (can it join the bulk of PRs to be merged to master?)

Does anybody else want to look at it? (not sure whether @Radonirinaunimi or @RoyStegeman already had a look)

@Radonirinaunimi (Member)

Just to confirm, is this finished now? (can it join the bulk of PRs to be merged to master?)

Does anybody else want to look at it? (not sure whether @Radonirinaunimi or @RoyStegeman already had a look)

Yes, thanks! I'd like to have a look at this and #1788 today (before the CM).

@scarlehoff (Member)

scarlehoff commented Jan 23, 2024

#1788 needs to be first squashed and rebased on top of what-will-be-master before being ready to be merged though

@Radonirinaunimi (Member)

#1788 needs to be first squashed and rebased on top of what-will-be-master before being ready to be merged though

Yes, this I understood. But we can nonetheless start reviewing it, no? Or do you expect to have many merged changes into master that'll later affect the review of that PR?

@scarlehoff (Member)

do you expect to have many merged changes into master that'll later affect the review of that PR?

I don't know. But wanted to warn you just in case!

@APJansen (Collaborator Author)

Just to confirm, is this finished now? (can it join the bulk of PRs to be merged to master?)

Yes, if you are happy with it, this is ready to be merged!

Or, do you expect to have many merged changes into master that'll later affect the review of that PR?

I don't think so, just did a quick test merging this branch into #1788 (without pushing) and there were no merge conflicts at all.

@scarlehoff scarlehoff mentioned this pull request Jan 23, 2024
@scarlehoff (Member)

scarlehoff commented Jan 23, 2024

btw @APJansen did you already rebase / merge on top of / with #1914?
(to launch the fitbot here again)

@APJansen (Collaborator Author)

I haven't; I just tried to merge it in here (after rebasing didn't work). Merging went through without conflicts, but a quick test run fails with ModuleNotFoundError: No module named 'validphys._version'

@scarlehoff (Member)

scarlehoff commented Jan 23, 2024

but a quick test run fails with ModuleNotFoundError: No module named 'validphys._version'

You need to reinstall. pip install -e . would work, but so would cmake if you want that.

@scarlehoff scarlehoff force-pushed the develop_merge_20240119 branch from 595e3a9 to 8cb04b3 Compare January 24, 2024 08:50
@Radonirinaunimi Radonirinaunimi (Member) left a comment

Apart from the pending issue regarding the full reproducibility, this also LGTM.

Below are just some tiny fixes regarding docstrings.

Review threads (outdated, resolved) on n3fit/src/n3fit/backends/keras_backend/MetaModel.py
Base automatically changed from develop_merge_20240119 to master January 24, 2024 09:32
Squashed commit messages:

Simplify handling of dropout

Factor out layer_generator in generate_dense_network

Refactor dense_per_flavor_network

Move setting of last nodes to generate_nn

Add constant arguments

Add constant arguments

Move dropout to generate_nn

Move concatenation of per_flavor layers into generate_nn

Make the two layer generators almost equal

remove separate dense and dense_per_flavor functions

Add documentation.

Simplify per_flavor layer concatenation

Reverse order of loops over replicas and layers

Fixes for dropout

Fixes for per_flavour

Fix issue with copying over nodes for per_flavour layer

Fix seeds in per_flavour layer

Add error for combination of dropout with per_flavour layers

Add basis_size argument to per_flavour layer

Fix model_gen tests to use new generate_nn in favor of now removed generate_dense and generate_dense_per_flavour

Allow for nodes to be a tuple

Move dropout, per_flavour check to checks

Clarify layer type check

Co-authored-by: Juan M. Cruz-Martinez <juacrumar@lairen.eu>

Clarify naming in nn_generator

Remove initializer_name argument

clarify comment

Co-authored-by: Juan M. Cruz-Martinez <juacrumar@lairen.eu>

Add comment on shared layers

Rewrite comprehension over replica seeds

Add check on layer type

Merge prefactors into single layer

Add replica dimension to preprocessing factor in test

Update preprocessing layer in vpinterface

Remove assigning of weight slices

Simplify loading weights from file

Update regression data

Always return a single NNs model for all replicas, adjust weight getting and setting accordingly

Revert "Update regression data"

This reverts commit 6f79368.

Change structure of regression weights

Remove now unused postfix

Update regression weights

Give explicit shape to scatter_to_one

Update developing weights structure

fix prefix typo

add double ticks

rename layer name constants

use constants defined in metamodel.py for layer names

Explain need for is_stacked_single_replicas

shorten line

fix constant loading

Simplify get_replica_weights

NNs -> all_NNs

Clarify get_layer_replica_weights

Co-authored-by: Juan M. Cruz-Martinez <juacrumar@lairen.eu>

Clarify set_layer_replica_weights

Remove comment about python 3.11

Co-authored-by: Juan M. Cruz-Martinez <juacrumar@lairen.eu>

Fix typo in comment

Co-authored-by: Tanjona Rabemananjara <rrabeman@nikhef.nl>

Fix formatting in docstring

Co-authored-by: Tanjona Rabemananjara <rrabeman@nikhef.nl>

Rewording docstring

Co-authored-by: Tanjona Rabemananjara <rrabeman@nikhef.nl>
@APJansen (Collaborator Author)

The clean install worked straight away! :)

And I've rebased this onto the latest master and squashed it into a single commit.

@scarlehoff scarlehoff merged commit c3f896a into master Jan 26, 2024
8 checks passed
@scarlehoff scarlehoff deleted the parallel-prefactor branch January 26, 2024 09:59
@scarlehoff scarlehoff restored the parallel-prefactor branch January 26, 2024 19:46
@scarlehoff scarlehoff added run-fit-bot Starts fit bot from a PR. and removed run-fit-bot Starts fit bot from a PR. labels Jan 26, 2024

Greetings from your nice fit 🤖 !
I have good news for you, I just finished my tasks:

Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!

@scarlehoff scarlehoff deleted the parallel-prefactor branch January 27, 2024 17:51
scarlehoff added a commit that referenced this pull request Jan 28, 2024