
Merge prefactors into single layer #1881

Merged
merged 1 commit into from
Jan 26, 2024

Conversation


@APJansen APJansen commented Dec 4, 2023

Join the preprocessing layers into one.

This also required changing the way weights are extracted and set, as a replica is now only a single slice of a weight tensor. Locally, the way I implemented it gives errors that seem specific to Mac when running on the GPU.
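
As a toy illustration of that statement (made-up shapes and helper names; this is not the actual n3fit implementation), extracting or setting a replica's weights becomes slicing along the replica axis of one stacked tensor:

```python
import numpy as np

# Hypothetical merged preprocessing weights: one tensor holding all replicas,
# with the replica axis first (shape: replicas x flavours).
num_replicas, num_flavours = 3, 8
stacked = np.arange(num_replicas * num_flavours, dtype=float).reshape(
    num_replicas, num_flavours
)

def get_replica_weights(weights, i_replica):
    # a single replica is just a slice of the stacked weight tensor
    return weights[i_replica]

def set_replica_weights(weights, i_replica, values):
    # writing a replica back means assigning into that slice
    weights[i_replica] = values

one = get_replica_weights(stacked, 1)
assert one.shape == (num_flavours,)
set_replica_weights(stacked, 1, np.zeros(num_flavours))
assert stacked[1].sum() == 0.0
```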

Status: Done, ready for review

  • I started changing stuff in vpinterface, but I think what I should do instead is similar to what I did before with the PDF as a whole: use the replica_axis flag I made there to keep everything as-is in the single-replica models (i.e. not have an extra replica axis of length 1 there), so nothing there has to change.
  • It is producing very different results, because the initialization is different. Keeping it the same would be very ugly, but I can do it temporarily just to verify that that's the only source of differences.
  • Fix test_fit

Edit: this PR changed the layout of the weights, so weights saved by earlier versions of the code need to be restructured to remain equivalent. The following script (comment: #1881 (comment)) achieves exactly that:

test script
import numpy as np
import h5py

def reformat_h5(i_replica):
    old_file = h5py.File(f"weights_{i_replica}_master.h5", 'r')
    new_file = h5py.File(f"weights_{i_replica}.h5", 'r+')

    def replace_numbers(name, obj):
        name_new_to_old_rules = {
                "NNs": f"NN_{i_replica - 1}",
                "preprocessing_factor": f"preprocessing_factor_{i_replica - 1}",
                "dense_3": "dense",
                "dense_4": "dense_1",
                "dense_5": "dense_2",
            }
        if isinstance(obj, h5py.Dataset):
            old_name = name
            for new, old in name_new_to_old_rules.items():
                old_name = old_name.replace(new, old)

            print(f"Replacing {name} with {old_name}")
            data_to_put_in = old_file[old_name][...]
            if "preprocessing" in name:
                data_to_put_in = np.expand_dims(data_to_put_in, axis=0)

            obj[...] = data_to_put_in

    new_file.visititems(replace_numbers)

reformat_h5(1)
reformat_h5(2)

@APJansen APJansen added Refactoring n3fit Issues and PRs related to n3fit escience labels Dec 4, 2023
@APJansen APJansen mentioned this pull request Dec 4, 2023
Base automatically changed from replica-axis-first to master December 8, 2023 16:59
@APJansen (Collaborator Author)

@scarlehoff @RoyStegeman @Radonirinaunimi Just so you know, I rebased this and updated the status above, in case you want to have a look. But it's still WIP so no need to review yet.

@scarlehoff scarlehoff (Member) left a comment

I've left a suggestion which I think might simplify the code and completely bypass the problem you have on Mac (and probably future problems with other TensorFlow versions as well, because setting all weights will always work, while breaking a layer down into smaller variables is trickier).

Review threads (outdated, resolved): n3fit/src/n3fit/model_gen.py, n3fit/src/n3fit/backends/keras_backend/MetaModel.py
@APJansen (Collaborator Author)

I have done a manual test: first creating a prefactor layer in this branch and getting the output on some test input, then creating the same layers in master and setting their weights to those from this branch. Comparing outputs, they are identical, both for 1 and for 3 replicas.

This is also the reason the tests are still failing: only some regression tests fail, so they should be updated.
Can you remind me how this is done?

Also, I changed the name of the layer from preprocessing_{i_replica} to just preprocessing.
Are you ok with that, for both the multiple-replica model and the single-replica ones, with the latter also keeping a replica axis of length 1? Contrary to the comment in the initial post, I don't think it's worth differentiating these. For the output of the PDF it was worth it because it saved updating validphys, but I don't think validphys looks into the internals of the model, so here it's not necessary.

The same will apply of course to the NN_{i_replica} layers, and in both cases saved models will need to be updated because of this change in structure.

test script
import numpy as np
import pickle

from n3fit.layers import Preprocessing


flav_info = [
    {"fl": i, "largex": [0, 1], "smallx": [1, 2]}
    for i in ["u", "ubar", "d", "dbar", "c", "g", "s", "sbar"]
]

np.random.seed(42)
x_in = np.random.rand(1, 10, 1)

branch = "master"

if branch == "parallel-prefactor":
    print("Running parallel-prefactor")
    weights_parallel = {}
    outs_parallel = {}
    for large_x in [True, False]:
       for num_replicas in [1, 3]:
           preproc = Preprocessing(
                   flav_info=flav_info,
                   input_shape=(1,),
                   name="preprocessing_factor",
                   seed=42,
                   large_x=large_x,
                   num_replicas=num_replicas,
               )
           preproc.build((None, 1,))
           weights_parallel[(large_x, num_replicas)] = preproc.get_weights()
           outs_parallel[(large_x, num_replicas)] = preproc(x_in)
    with open("tmp/preproc_weights.pkl", "wb") as f:
       pickle.dump(weights_parallel, f)
    with open("tmp/preproc_outs.pkl", "wb") as f:
       pickle.dump(outs_parallel, f)

if branch == "master":
    print("Running master")
    with open("tmp/preproc_weights.pkl", "rb") as f:
        weights_parallel = pickle.load(f)
    with open("tmp/preproc_outs.pkl", "rb") as f:
        outs_parallel = pickle.load(f)
    for large_x in [True, False]:
        for num_replicas in [1, 3]:
            print(f"large_x: {large_x}, num_replicas: {num_replicas}")
            preprocs = []
            for i_replica in range(num_replicas):
                preproc = Preprocessing(
                    flav_info=flav_info,
                    input_shape=(1,),
                    name=f"preprocessing_factor_{i_replica}",
                    seed=42,
                    large_x=large_x,
                )
                preproc.build((None, 1))

                weights_this_replica = []
                for w in weights_parallel[(large_x, num_replicas)]:
                    weights_this_replica.append(w[i_replica])

                preproc.set_weights(weights_this_replica)
                preprocs.append(preproc)
            outs = np.stack(
                [preprocs[i_replica](x_in) for i_replica in range(num_replicas)], axis=1
            )
            np.testing.assert_equal(outs, outs_parallel[(large_x, num_replicas)])

@scarlehoff (Member)

The tests for 3.11 are pointing to something else though:

        TypeError: 'NoneType' object cannot be interpreted as an integer

RE the label of the replicas. No I don't think that matters, it's completely internal after all.

For the regressions instead, just use the quickcard_1.json (and similarly named files) to update the ones that failed.
It needs to be done manually because they don't fail often enough for us to have automated the process... (obviously, for you they fail more because you're touching the fitting code directly)

@APJansen (Collaborator Author)

APJansen commented Jan 8, 2024

Trying to fix the regression tests, I am running n3fit quickcard.yml 1 after editing the quickcard, changing load: "weights.h5" to save: "weights.h5" first. Then I copy the resulting quickcard/nn/replica_1/weights.h5 into tests/regressions/weights_1.h5, and quickcard/nn/replica_1/quickcard.json into tests/regressions/quickcard.json. And similarly with 2 instead of 1 (ignoring the qed runcard for now).

That should fix it, right? But locally at least, this still gives an assertion error, with a max relative difference of 15%.

@APJansen APJansen force-pushed the parallel-prefactor branch from e2c20b2 to c650cf3 Compare January 9, 2024 11:37
@APJansen APJansen mentioned this pull request Jan 9, 2024
@APJansen APJansen force-pushed the parallel-prefactor branch from c650cf3 to 5453531 Compare January 9, 2024 14:42
@APJansen (Collaborator Author)

I just repeated the procedure above on Snellius, updating the weights_1.h5 and quickcard_1.json from a run on Snellius and running the corresponding test. There as well it still fails.

@RoyStegeman (Member)

RoyStegeman commented Jan 10, 2024

That should fix it right?

Looking at what exactly the test does, this should indeed not work. Namely, it uses the hdf5 file to initialise the NN and subsequently still performs the fit. So when doing the steps as you describe, the json files you saved correspond to the starting point, while the json files they're compared to correspond to the end of the fit. One possible solution is of course to run the fit twice: once to produce the hdf5 file, and then again loading the weights from that file to produce the json files needed for the comparison.
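
The mismatch described above can be reduced to a toy sketch (plain Python; `fit` is a stand-in for a deterministic training run, none of this is n3fit code): the json reference records post-fit numbers, so saving the json together with the h5 captures the wrong stage.

```python
def fit(weights):
    # stand-in for an actual (deterministic) training run
    return [w + 1.0 for w in weights]

start = [0.0, 0.0]

# Attempted shortcut: save h5 and json together at the start of the fit.
saved_h5 = list(start)
shortcut_json = list(start)

# What the regression test actually compares against: the result of fitting
# from the saved starting point.
reference_json = fit(saved_h5)
assert shortcut_json != reference_json

# Two-pass fix: first produce the h5, then rerun loading it and save the json
# at the *end* of that second fit.
second_pass_json = fit(saved_h5)
assert second_pass_json == reference_json
```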

It's not so clear to me why the weights are loaded from the file as opposed to just initialised using the rng in this regression test (maybe to test the loading functionality?). And of course I don't know how the current hdf5 files were generated (in what kind of fit setup), though in principle for this regression test that doesn't matter. @scarlehoff can probably clarify these points.

@scarlehoff (Member)

The setting of the weights is to give a bit of stability to the regression test; I wanted to have one that works and one that fails (to fit, I mean). It would be interesting to have a replica_3 that does not start from a fixed point; however, that would introduce instabilities and the regression tests would fail randomly much more often.

In any case, I don't think the starting point should be changed. You should be able to just use the same weights; the number of parameters in the network should not change in this PR.

@scarlehoff (Member)

scarlehoff commented Jan 10, 2024

Just to be precise, "fixing the regression test" here should mean:

  1. Converting the current weights into the new format (and if you can please leave the script here in the PR for future reference, since it might be necessary to use it in the future; it's not a widely used feature so I'm ok with this compromise, but I don't think we should merge something that breaks all previous weights)
  2. Seeing how the numbers change due to these changes (hopefully not much when the starting point is the same).
  3. Updating the numbers in the .json reference file only once (2.) is properly checked.

@APJansen (Collaborator Author)

Ah I see, that makes sense, I'll look into it, thanks!

@APJansen (Collaborator Author)

That fixed it! The script I used is below; it uses saved weights from this branch to get the structure, and then looks in the old file to copy over the data. Locally it didn't pass the test (though with only a small difference), but as you can see it does pass in the CI; I didn't change the outputs.

Now I just need to fix this issue with 3.11.

reformatting weights script
import numpy as np
import h5py

def reformat_h5(i_replica):
    old_file = h5py.File(f"weights_{i_replica}_master.h5", 'r')
    new_file = h5py.File(f"weights_{i_replica}.h5", 'r+')

    def replace_numbers(name, obj):
        name_new_to_old_rules = {
                "NNs": f"NN_{i_replica - 1}",
                "preprocessing_factor": f"preprocessing_factor_{i_replica - 1}",
                "dense_3": "dense",
                "dense_4": "dense_1",
                "dense_5": "dense_2",
            }
        if isinstance(obj, h5py.Dataset):
            old_name = name
            for new, old in name_new_to_old_rules.items():
                old_name = old_name.replace(new, old)

            print(f"Replacing {name} with {old_name}")
            data_to_put_in = old_file[old_name][...]
            if "preprocessing" in name:
                data_to_put_in = np.expand_dims(data_to_put_in, axis=0)

            obj[...] = data_to_put_in

    new_file.visititems(replace_numbers)

reformat_h5(1)
reformat_h5(2)

@APJansen APJansen force-pushed the parallel-prefactor branch 2 times, most recently from d77af4d to 5867fec Compare January 10, 2024 16:34
@APJansen APJansen marked this pull request as ready for review January 10, 2024 17:18
@APJansen (Collaborator Author)

Fixed! And ready for review.

I don't know if you think it's necessary to do more tests on top of the CI, but if you do I would propose to postpone that and just do it in #1905, which branches off of this.

@APJansen APJansen added the run-fit-bot Starts fit bot from a PR. label Jan 10, 2024
@scarlehoff (Member)

For merging, I'd say the same as with:
#1888 (comment)

Btw, one point that I wanted to raise during the meeting this morning (which I could not attend): please either squash or rebase these PRs (given the amount of commits, and intermediate merges, you probably need both) so that their history is not mixed with master.

@APJansen (Collaborator Author)

Btw, one point that I wanted to raise during the meeting this morning (which I could not attend): please either squash or rebase these PRs (given the amount of commits, and intermediate merges, you probably need both) so that their history is not mixed with master.

Yes, I tried that yesterday (on nn-layers-refactor), but I got some error I didn't understand, so I merged instead; I can take another look.

@scarlehoff scarlehoff changed the base branch from master to develop_merge_20240119 January 19, 2024 17:06
Review threads (outdated, resolved) on n3fit/src/n3fit/backends/keras_backend/MetaModel.py
compute_preprocessing_factor = Preprocessing(
flav_info=flav_info,
input_shape=(1,),
name="preprocessing_factor",
Member:

Please, import this from MetaModel to make sure that it is consistent.

Collaborator Author:

I have done this everywhere you suggested, but it doesn't seem very logical to me to import this from n3fit.backends, and it is maybe a bit overkill for such simple names; what do you think?

Member:

Well, you can add a constants.py or something if you don't like the import from n3fit.backends. If they are supposed to be the same they should be the same. But in model_gen you need to import from n3fit.backends anyway so it doesn't create any circular problems.

Collaborator Author:

A separate constants.py makes sense, but then lots of other stuff should move there too, I'd say let's leave that for later ;P

flav_info=flav_info,
input_shape=(1,),
name="preprocessing_factor",
seed=seed[0] + number_of_layers,
Member:

This means that training replicas 1 to 10 and 2 to 10 won't give me the same initial results, right?

However, would 1 to 10 and 1 to 9 produce the exact same result from 1 to 9? I would like to know that.

Collaborator Author:

You mean n3fit 1 -r 10 vs n3fit 2 -r 10, and n3fit 1 -r 10 versus n3fit 1 -r 9? I'll test, I'm actually not sure how those arguments get fed into pdfNN_layer_generator.

Member:

Running in parallel I mean. You normally start at replica 1 but it is not necessarily always like that.

The reason why I'm asking is that I think right now if you instantiate, say, 10 replicas (to fit them in parallel for instance) each of them will get a separate seed for the generation of the pseudo-data and, until now, a separate seed for each of the Neural Networks.
As long as the seeds were separated that meant the initial point could be reproducible.

Instead, now all replicas are being instantiated with the first seed, so obviously if you fit 1 to 10 you will get a result different from 2 to 10 (because the corresponding first seed is different).

But I am wondering whether 1 to 10 would be equivalent to 1 to 9 for the first 9 replicas.

Then, one thing to be done when the parallel fit becomes the standard is to recover that reproducibility. For now, reproducibility is limited to sequential CPU replicas. Reproducibility cannot be achieved in any easy way on GPU, but at least the initial state can be made the same.
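
The two seeding schemes under discussion can be contrasted in a toy numpy sketch (the seed arithmetic here is invented for illustration; it is not n3fit's actual initializer):

```python
import numpy as np

def init_per_replica_seeds(first_replica, n, base_seed=100):
    # Old scheme (toy): every replica gets its own seed, so overlapping
    # replica ranges reproduce identical initial weights.
    return [
        np.random.default_rng(base_seed + first_replica + i).normal(size=4)
        for i in range(n)
    ]

def init_shared_seed(first_replica, n, base_seed=100):
    # New scheme (toy): a single generator, seeded from the first replica,
    # draws the weights of all replicas in sequence.
    rng = np.random.default_rng(base_seed + first_replica)
    return [rng.normal(size=4) for _ in range(n)]

# Per-replica seeds: replicas 1-10 and 2-10 agree on their overlap.
a = init_per_replica_seeds(1, 10)
b = init_per_replica_seeds(2, 9)
assert all(np.array_equal(x, y) for x, y in zip(a[1:], b))

# Shared seed: 1-10 vs 2-10 differ everywhere, because the base seed changed.
c = init_shared_seed(1, 10)
e = init_shared_seed(2, 9)
assert not np.array_equal(c[1], e[0])

# In this toy, 1-10 and 1-9 do share the first 9 replicas, because the draws
# are strictly sequential; whether the real layer initializers behave that
# way was exactly the open question here.
d = init_shared_seed(1, 9)
assert all(np.array_equal(x, y) for x, y in zip(c[:9], d))
```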

Collaborator Author:

I checked n3fit 1 -r 4, n3fit 2 -r 4, and n3fit 1 -r 5, and there were no similar results anywhere. For the MultiDense layers I did make sure that initialization is identical to master, here I did not.

Perhaps it's easiest to use the solution from there: MultiInitializer?

Member:

If it works out of the box for the preprocessing, yes indeed.
But if it doesn't we can leave that for later.

Btw, since you're at it, check that with seed[0] + number_of_layers we are not being silly and using the same seed twice. That worked well when every layer was using seed += 1, but I'm not sure whether that's the case anymore.
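
The suggested check can be sketched as a tiny sanity test (the seed values and the per-layer seed layout are assumptions for illustration, not the real runcard seeds):

```python
# Toy check that the preprocessing seed derived as seed[0] + number_of_layers
# does not collide with any per-layer NN seed of the form seed[i] + layer.
replica_seeds = [1000, 2000, 3000]  # made-up per-replica base seeds
number_of_layers = 3

nn_seeds = {s + layer for s in replica_seeds for layer in range(number_of_layers)}
preprocessing_seed = replica_seeds[0] + number_of_layers

# For this layout the preprocessing seed is one past the last layer seed of
# the first replica, so no reuse occurs.
assert preprocessing_seed not in nn_seeds
```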

Collaborator Author:

If it's ok to leave for later, I'll try it in the next PR #1905, to save some hassle of moving the code between branches.

Collaborator Author:

The seeds should still be ok, I haven't changed the seeding of the NN layers, neither in this PR nor in the next.

Review threads (outdated, resolved): n3fit/src/n3fit/model_trainer.py, n3fit/src/n3fit/tests/test_modelgen.py, n3fit/src/n3fit/vpinterface.py
@scarlehoff scarlehoff (Member) left a comment

Just to confirm, is this finished now? (can it join the bulk of PRs to be merged to master?)

Does anybody else want to look at it? (not sure whether @Radonirinaunimi or @RoyStegeman already had a look)

@Radonirinaunimi (Member)

Just to confirm, is this finished now? (can it join the bulk of PRs to be merged to master?)

Does anybody else want to look at it? (not sure whether @Radonirinaunimi or @RoyStegeman already had a look)

Yes, thanks! I'd like to have a look at this and #1788 today (before the CM).

@scarlehoff (Member)

scarlehoff commented Jan 23, 2024

#1788 needs to be first squashed and rebased on top of what-will-be-master before being ready to be merged though

@Radonirinaunimi (Member)

#1788 needs to be first squashed and rebased on top of what-will-be-master before being ready to be merged though

Yes, this I understood. But we can nonetheless start reviewing it, no? Or do you expect to have many merged changes into master that'll later affect the review of that PR?

@scarlehoff (Member)

do you expect to have many merged changes into master that'll later affect the review of that PR?

I don't know. But wanted to warn you just in case!

@APJansen (Collaborator Author)

Just to confirm, is this finished now? (can it join the bulk of PRs to be merged to master?)

Yes, if you are happy with it, this is ready to be merged!

Or, do you expect to have many merged changes into master that'll later affect the review of that PR?

I don't think so, just did a quick test merging this branch into #1788 (without pushing) and there were no merge conflicts at all.

@scarlehoff scarlehoff mentioned this pull request Jan 23, 2024
@scarlehoff (Member)

scarlehoff commented Jan 23, 2024

btw @APJansen did you already rebase / merge on top of / with #1914?
(to launch the fitbot here again)

@APJansen (Collaborator Author)

I haven't; I just tried to merge it in here (after rebasing didn't work). Merging went through without conflicts, but a quick test run fails with ModuleNotFoundError: No module named 'validphys._version'

@scarlehoff (Member)

scarlehoff commented Jan 23, 2024

but a quick test run fails with ModuleNotFoundError: No module named 'validphys._version'

You need to reinstall. pip install -e . would work, but so would cmake if you want that.

@scarlehoff scarlehoff force-pushed the develop_merge_20240119 branch from 595e3a9 to 8cb04b3 Compare January 24, 2024 08:50
@Radonirinaunimi Radonirinaunimi (Member) left a comment

Apart from the pending issue regarding the full reproducibility, this also LGTM.

Below are just some tiny fixes regarding docstrings.

Review threads (outdated, resolved) on n3fit/src/n3fit/backends/keras_backend/MetaModel.py
Base automatically changed from develop_merge_20240119 to master January 24, 2024 09:32
Squashed commit messages:

Simplify handling of dropout

Factor out layer_generator in generate_dense_network

Refactor dense_per_flavor_network

Move setting of last nodes to generate_nn

Add constant arguments

Add constant arguments

Move dropout to generate_nn

Move concatenation of per_flavor layers into generate_nn

Make the two layer generators almost equal

remove separate dense and dense_per_flavor functions

Add documentation.

Simplify per_flavor layer concatenation

Reverse order of loops over replicas and layers

Fixes for dropout

Fixes for per_flavour

Fix issue with copying over nodes for per_flavour layer

Fix seeds in per_flavour layer

Add error for combination of dropout with per_flavour layers

Add basis_size argument to per_flavour layer

Fix model_gen tests to use new generate_nn in favor of now removed generate_dense and generate_dense_per_flavour

Allow for nodes to be a tuple

Move dropout, per_flavour check to checks

Clarify layer type check

Co-authored-by: Juan M. Cruz-Martinez <juacrumar@lairen.eu>

Clarify naming in nn_generator

Remove initializer_name argument

clarify comment

Co-authored-by: Juan M. Cruz-Martinez <juacrumar@lairen.eu>

Add comment on shared layers

Rewrite comprehension over replica seeds

Add check on layer type

Merge prefactors into single layer

Add replica dimension to preprocessing factor in test

Update preprocessing layer in vpinterface

Remove assigning of weight slices

Simplify loading weights from file

Update regression data

Always return a single NNs model for all replicas, adjust weight getting and setting accordingly

Revert "Update regression data"

This reverts commit 6f79368.

Change structure of regression weights

Remove now unused postfix

Update regression weights

Give explicit shape to scatter_to_one

Update developing weights structure

fix prefix typo

add double ticks

rename layer name constants

use constants defined in metamodel.py for layer names

Explain need for is_stacked_single_replicas

shorten line

fix constant loading

Simplify get_replica_weights

NNs -> all_NNs

Clarify get_layer_replica_weights

Co-authored-by: Juan M. Cruz-Martinez <juacrumar@lairen.eu>

Clarify set_layer_replica_weights

Remove comment about python 3.11

Co-authored-by: Juan M. Cruz-Martinez <juacrumar@lairen.eu>

Fix typo in comment

Co-authored-by: Tanjona Rabemananjara <rrabeman@nikhef.nl>

Fix formatting in docstring

Co-authored-by: Tanjona Rabemananjara <rrabeman@nikhef.nl>

Rewording docstring

Co-authored-by: Tanjona Rabemananjara <rrabeman@nikhef.nl>
@APJansen (Collaborator Author)

The clean install worked straight away! :)

And I've rebased this onto the latest master and squashed it into a single commit.

@scarlehoff scarlehoff merged commit c3f896a into master Jan 26, 2024
8 checks passed
@scarlehoff scarlehoff deleted the parallel-prefactor branch January 26, 2024 09:59
@scarlehoff scarlehoff restored the parallel-prefactor branch January 26, 2024 19:46
@scarlehoff scarlehoff added run-fit-bot Starts fit bot from a PR. and removed run-fit-bot Starts fit bot from a PR. labels Jan 26, 2024

Greetings from your nice fit 🤖 !
I have good news for you, I just finished my tasks:

Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!

@scarlehoff scarlehoff deleted the parallel-prefactor branch January 27, 2024 17:51
scarlehoff added a commit that referenced this pull request Jan 28, 2024