Merge prefactors into single layer #1881
Conversation
Force-pushed from 453a98f to d5e14f9
Force-pushed from 1cb492c to 5a85502
@scarlehoff @RoyStegeman @Radonirinaunimi Just so you know, I rebased this and updated the status above, in case you want to have a look. But it's still WIP so no need to review yet.
I've left a suggestion which I think might simplify the code and completely bypass the problem you have on Mac (and probably bypass future problems with other TensorFlow versions, because setting all weights will always work, whereas breaking a layer down into smaller variables is more tricky).
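For context, a minimal sketch (generic Keras code, not from this PR) of the difference between the two approaches, i.e. why replacing all of a layer's weights at once tends to be more robust than assigning to its internal variables:

```python
import numpy as np
from tensorflow import keras

layer = keras.layers.Dense(4)
layer.build((None, 3))

# Option A: swap out every weight of the layer in one call.
# This only uses the public get_weights/set_weights API, so it does not
# depend on how the layer stores its variables internally.
new_weights = [np.ones_like(w) for w in layer.get_weights()]
layer.set_weights(new_weights)

# Option B: assign to individual variables (e.g. one replica's slice).
# This relies on the internal variable layout, which is what tends to
# differ between platforms and TensorFlow versions.
layer.kernel.assign(np.zeros_like(layer.get_weights()[0]))
```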
I have done a manual test: first creating a prefactor layer in this branch and getting the output on some test input, then in master creating the same layers and setting the weights to those in this branch. Comparing outputs, they're identical, both for 1 and for 3 replicas. This is also the reason the tests are still failing: it's only some regression tests that fail, so they should be updated. Also, I changed the name of the layer; the same will apply of course to the test script:

```python
import numpy as np
import pickle

from n3fit.layers import Preprocessing

flav_info = [
    {"fl": i, "largex": [0, 1], "smallx": [1, 2]}
    for i in ["u", "ubar", "d", "dbar", "c", "g", "s", "sbar"]
]

np.random.seed(42)
x_in = np.random.rand(1, 10, 1)

# run once per branch, switching this flag
branch = "master"

if branch == "parallel-prefactor":
    print("Running parallel-prefactor")
    weights_parallel = {}
    outs_parallel = {}
    for large_x in [True, False]:
        for num_replicas in [1, 3]:
            preproc = Preprocessing(
                flav_info=flav_info,
                input_shape=(1,),
                name="preprocessing_factor",
                seed=42,
                large_x=large_x,
                num_replicas=num_replicas,
            )
            preproc.build((None, 1))
            weights_parallel[(large_x, num_replicas)] = preproc.get_weights()
            outs_parallel[(large_x, num_replicas)] = preproc(x_in)
    with open("tmp/preproc_weights.pkl", "wb") as f:
        pickle.dump(weights_parallel, f)
    with open("tmp/preproc_outs.pkl", "wb") as f:
        pickle.dump(outs_parallel, f)

if branch == "master":
    print("Running master")
    with open("tmp/preproc_weights.pkl", "rb") as f:
        weights_parallel = pickle.load(f)
    with open("tmp/preproc_outs.pkl", "rb") as f:
        outs_parallel = pickle.load(f)
    for large_x in [True, False]:
        for num_replicas in [1, 3]:
            print(f"large_x: {large_x}, num_replicas: {num_replicas}")
            preprocs = []
            for i_replica in range(num_replicas):
                preproc = Preprocessing(
                    flav_info=flav_info,
                    input_shape=(1,),
                    name=f"preprocessing_factor_{i_replica}",
                    seed=42,
                    large_x=large_x,
                )
                preproc.build((None, 1))
                weights_this_replica = []
                for w in weights_parallel[(large_x, num_replicas)]:
                    weights_this_replica.append(w[i_replica])
                preproc.set_weights(weights_this_replica)
                preprocs.append(preproc)
            outs = np.stack(
                [preprocs[i_replica](x_in) for i_replica in range(num_replicas)], axis=1
            )
            np.testing.assert_equal(outs, outs_parallel[(large_x, num_replicas)])
```
The tests for 3.11 are pointing to something else though:
Re the label of the replicas: no, I don't think that matters, it's completely internal after all. For the regression instead, just used the
Trying to fix the regression tests, I am running

That should fix it, right? But locally at least, this still gives an assertion error, with a max relative difference of 15%.
Force-pushed from e2c20b2 to c650cf3
Force-pushed from c650cf3 to 5453531
I just repeated the procedure above on Snellius, updating the weights_1.h5 and quickcard_1.json from a run on Snellius and running the corresponding test. There as well it still fails.
Looking at what exactly the test does, this should indeed not work. Namely, it uses the

It's not so clear to me why the weights are loaded from the file, as opposed to just initialised using the rng, in this regression test (maybe to test the loading functionality?). And of course I don't know how the current
The setting of the weights is to give a bit of stability to the regression test; I wanted to have one that works and one that fails (to fit, I mean). It would be interesting to have a

In any case, I don't think the starting point should be changed? You should be able to use the same weights; the number of parameters in the network should not change in this PR.
Just to be precise, "fixing the regression test" here should mean:
Ah I see, that makes sense, I'll look into it, thanks!
That fixed it! The script I used is below; it uses saved weights from this branch to get the structure, and then looks in the old file to copy over the data. Locally it didn't pass the test, though with a small difference, but as you can see in the CI it does, and I didn't change the outputs. Now I just need to fix this issue with 3.11.

reformatting weights script

```python
import h5py
import numpy as np


def reformat_h5(i_replica):
    old_file = h5py.File(f"weights_{i_replica}_master.h5", 'r')
    new_file = h5py.File(f"weights_{i_replica}.h5", 'r+')

    def replace_numbers(name, obj):
        # map the new (merged-layer) dataset names back to the old per-replica names
        name_new_to_old_rules = {
            "NNs": f"NN_{i_replica - 1}",
            "preprocessing_factor": f"preprocessing_factor_{i_replica - 1}",
            "dense_3": "dense",
            "dense_4": "dense_1",
            "dense_5": "dense_2",
        }
        if isinstance(obj, h5py.Dataset):
            old_name = name
            for new, old in name_new_to_old_rules.items():
                old_name = old_name.replace(new, old)
            print(f"Replacing {name} with {old_name}")
            data_to_put_in = old_file[old_name][...]
            if "preprocessing" in name:
                # the preprocessing weights gained a replica axis in this branch
                data_to_put_in = np.expand_dims(data_to_put_in, axis=0)
            obj[...] = data_to_put_in

    new_file.visititems(replace_numbers)


reformat_h5(1)
reformat_h5(2)
```
Force-pushed from d77af4d to 5867fec
Fixed! And ready for review. I don't know if you think it's necessary to do more tests on top of the CI, but if you do I would propose to postpone that and just do it in #1905, which branches off of this.
For merging, I'd say the same as with:

Btw, one point that I wanted to raise during the meeting this morning (which I could not attend): please either squash or rebase these PRs (given the number of commits, and the intermediate merges, you probably need both) such that their history is not mixed with master.
Yes, I tried that yesterday (on nn-layers-refactor), but I got some error I didn't understand, so I merged instead; I can take another look.
n3fit/src/n3fit/model_gen.py (Outdated)

```python
compute_preprocessing_factor = Preprocessing(
    flav_info=flav_info,
    input_shape=(1,),
    name="preprocessing_factor",
```
Please import this from MetaModel to make sure that it is consistent.
I have done this everywhere you suggest, but it doesn't seem very logical to me to import this from n3fit.backends, and maybe a bit overkill for such simple names. What do you think?
Well, you can add a constants.py or something if you don't like the import from n3fit.backends. If they are supposed to be the same, they should be the same. But in model_gen you need to import from n3fit.backends anyway, so it doesn't create any circular problems.
A separate constants.py makes sense, but then lots of other stuff should move there too; I'd say let's leave that for later ;P
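For illustration, a minimal self-contained sketch of the shared-constant idea being discussed; the names here are hypothetical (the real constants ended up in n3fit's metamodel.py):

```python
from tensorflow import keras

# constants.py (or exported from n3fit.backends): defined once, imported everywhere
PREPROCESSING_LAYER_NAME = "preprocessing_factor"  # hypothetical constant name

# model construction uses the shared constant...
layer = keras.layers.Dense(1, name=PREPROCESSING_LAYER_NAME)
model = keras.Sequential([keras.Input(shape=(1,)), layer])

# ...and so does any code that later needs to find the layer by name,
# so the two can never silently drift apart.
assert model.get_layer(PREPROCESSING_LAYER_NAME) is layer
```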
```python
    flav_info=flav_info,
    input_shape=(1,),
    name="preprocessing_factor",
    seed=seed[0] + number_of_layers,
```
This means that training replicas 1 to 10 and 2 to 10 won't give me the same initial results, right?
However, would 1 to 10 and 1 to 9 produce the exact same result from 1 to 9? I would like to know that.
You mean n3fit 1 -r 10 vs n3fit 2 -r 10, and n3fit 1 -r 10 versus n3fit 1 -r 9? I'll test; I'm actually not sure how those arguments get fed into pdfNN_layer_generator.
Running in parallel I mean. You normally start at replica 1 but it is not necessarily always like that.
The reason why I'm asking is that I think right now if you instantiate, say, 10 replicas (to fit them in parallel for instance) each of them will get a separate seed for the generation of the pseudo-data and, until now, a separate seed for each of the Neural Networks.
As long as the seeds were separated that meant the initial point could be reproducible.
Instead, now all replicas are being instantiated with the first seed, so obviously if you fit 1 to 10 you will get a result different from 2 to 10 (because the corresponding first seed is different).
But I am wondering whether 1 to 10 would be equivalent to 1 to 9 for the first 9 replicas.
Then, one thing to be done for when the parallel fit becomes the standard is to recover that reproducibility. For now reproducibility is limited to sequential CPU replicas. Reproducibility cannot be achieved in any easy way on GPU, but at least the initial state can be made to be the same.
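To make the reproducibility point concrete, here is a rough, hypothetical sketch (plain NumPy, not the n3fit code) of the two seeding schemes being compared:

```python
import numpy as np

def init_replicas_separate_seeds(replica_ids, width=8):
    # one generator per replica: replica k gets the same initial point
    # whether you fit replicas 1..10 or 2..10
    return {r: np.random.default_rng(1000 + r).normal(size=width) for r in replica_ids}

def init_replicas_first_seed_only(replica_ids, width=8):
    # one generator seeded from the first requested replica: the whole batch
    # changes as soon as the starting replica changes
    rng = np.random.default_rng(1000 + replica_ids[0])
    return {r: rng.normal(size=width) for r in replica_ids}

a = init_replicas_separate_seeds(range(1, 11))
b = init_replicas_separate_seeds(range(2, 11))
assert all(np.array_equal(a[r], b[r]) for r in range(2, 11))  # overlap is reproducible

c = init_replicas_first_seed_only(list(range(1, 11)))
d = init_replicas_first_seed_only(list(range(2, 11)))
assert not np.array_equal(c[2], d[2])  # same replica, different initial point
```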
I checked n3fit 1 -r 4, n3fit 2 -r 4, and n3fit 1 -r 5, and there were no similar results anywhere. For the MultiDense layers I did make sure that initialization is identical to master; here I did not.
Perhaps it's easiest to use the solution from there: MultiInitializer?
If it works out of the box for the preprocessing, yes indeed.
But if it doesn't we can leave that for later.
Btw, since you're at it, check that with seed[0] + number_of_layers we are not being silly and using the same seed twice. That was something that worked well when every layer was using seed += 1, but I'm not sure whether that's the case anymore.
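A quick sanity check along those lines could look like this (a sketch with hypothetical seed bookkeeping, not the actual model_gen code):

```python
def assert_unique_seeds(layer_seeds):
    # fail loudly if any layer would be initialised with a seed already in use
    duplicates = sorted({s for s in layer_seeds if layer_seeds.count(s) > 1})
    if duplicates:
        raise ValueError(f"Seed(s) used more than once: {duplicates}")

# e.g. three dense layers seeded seed, seed+1, seed+2, plus the preprocessing
# layer seeded with seed + number_of_layers
seed, number_of_layers = 42, 3
layer_seeds = [seed + i for i in range(number_of_layers)] + [seed + number_of_layers]
assert_unique_seeds(layer_seeds)  # passes: 42, 43, 44, 45 are all distinct
```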
If it's ok to leave for later, I'll try it in the next PR #1905, to save some hassle of moving the code between branches.
The seeds should still be ok, I haven't changed the seeding of the NN layers, neither in this PR nor in the next.
Just to confirm, is this finished now? (Can it join the bulk of PRs to be merged to master?)
Does anybody else want to look at it? (Not sure whether @Radonirinaunimi or @RoyStegeman already had a look.)
Yes, thanks! I'd like to have a look at this and #1788 today (before the CM).
#1788 needs to be first squashed and rebased on top of what-will-be-master before being ready to be merged though.
Yes, this I understood. But we can nonetheless start reviewing that, no? Or do you expect many changes merged into master that'll later affect the review of that PR?
I don't know. But I wanted to warn you just in case!
Yes, if you are happy with it this is ready to be merged!
I don't think so, I just did a quick test merging this branch into #1788 (without pushing) and there were no merge conflicts at all.
I haven't; I just tried to merge it in here (after rebasing didn't work). Merging went through without conflicts, but a quick test run fails with
You need to reinstall.
Force-pushed from 595e3a9 to 8cb04b3
Apart from the pending issue regarding the full reproducibility, this also LGTM.
Below are just some tiny fixes regarding docstrings.
Simplify handling of dropout
Factor out layer_generator in generate_dense_network
Refactor dense_per_flavor_network
Move setting of last nodes to generate_nn
Add constant arguments
Add constant arguments
Move dropout to generate_nn
Move concatenation of per_flavor layers into generate_nn
Make the two layer generators almost equal
remove separate dense and dense_per_flavor functions
Add documentation.
Simplify per_flavor layer concatenation
Reverse order of loops over replicas and layers
Fixes for dropout
Fixes for per_flavour
Fix issue with copying over nodes for per_flavour layer
Fix seeds in per_flavour layer
Add error for combination of dropout with per_flavour layers
Add basis_size argument to per_flavour layer
Fix model_gen tests to use new generate_nn in favor of now removed generate_dense and generate_dense_per_flavour
Allow for nodes to be a tuple
Move dropout, per_flavour check to checks
Clarify layer type check
Co-authored-by: Juan M. Cruz-Martinez <juacrumar@lairen.eu>
Clarify naming in nn_generator
Remove initializer_name argument
clarify comment
Co-authored-by: Juan M. Cruz-Martinez <juacrumar@lairen.eu>
Add comment on shared layers
Rewrite comprehension over replica seeds
Add check on layer type
Merge prefactors into single layer
Add replica dimension to preprocessing factor in test
Update preprocessing layer in vpinterface
Remove assigning of weight slices
Simplify loading weights from file
Update regression data
Always return a single NNs model for all replicas, adjust weight getting and setting accordingly
Revert "Update regression data" (This reverts commit 6f79368.)
Change structure of regression weights
Remove now unused postfix
Update regression weights
Give explicit shape to scatter_to_one
Update developing weights structure
fix prefix typo
add double ticks
rename layer name constants
use constants defined in metamodel.py for layer names
Explain need for is_stacked_single_replicas
shorten line
fix constant loading
Simplify get_replica_weights
NNs -> all_NNs
Clarify get_layer_replica_weights
Co-authored-by: Juan M. Cruz-Martinez <juacrumar@lairen.eu>
Clarify set_layer_replica_weights
Remove comment about python 3.11
Co-authored-by: Juan M. Cruz-Martinez <juacrumar@lairen.eu>
Fix typo in comment
Co-authored-by: Tanjona Rabemananjara <rrabeman@nikhef.nl>
Fix formatting in docstring
Co-authored-by: Tanjona Rabemananjara <rrabeman@nikhef.nl>
Rewording docstring
Co-authored-by: Tanjona Rabemananjara <rrabeman@nikhef.nl>
Force-pushed from 9bcb73f to 8810e11
The clean install worked straight away! :) And I've rebased this onto the latest master and squashed it into a single commit.
Greetings from your nice fit 🤖!
Check the report carefully, and please buy me a ☕, or better, a GPU 😉!
Join the preprocessing layers into one.
This also required changing the way weights are extracted and set, as a replica is now only a single slice of a weight tensor. Locally, the way I implemented it gives errors that seem specific to Mac when running on the GPU.
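As a hedged illustration of what "a replica is a single slice of a weight tensor" means for extracting and setting weights (generic Keras code, not the actual n3fit helpers, which differ in detail):

```python
import numpy as np
from tensorflow import keras

num_replicas, num_params = 3, 8
# stand-in for the merged preprocessing layer: a single weight of shape (replicas, params)
layer = keras.layers.Embedding(num_replicas, num_params)
layer.build((None,))

def get_replica_slice(layer, i_replica):
    # a replica's parameters are just an index into the replica axis
    return [w[i_replica] for w in layer.get_weights()]

def set_replica_slice(layer, i_replica, replica_weights):
    # overwrite only that slice, then push the full tensors back into the layer
    full = layer.get_weights()
    for w, new in zip(full, replica_weights):
        w[i_replica] = new
    layer.set_weights(full)

set_replica_slice(layer, 1, [np.zeros(num_params)])
assert np.all(get_replica_slice(layer, 1)[0] == 0)
```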
Status: Done, ready for review
The replica_axis flag I made there keeps everything as is in the single-replica models (i.e. there is no extra replica axis of length 1 there), so nothing there has to change.

Edit: the changes of this PR changed the layout of the weights, and thus weights saved for earlier versions of the code need to be restructured so they remain equivalent. The following script (comment: #1881 (comment)) achieves exactly that:

test script