Refactor stopping #1792

APJansen · 2023-08-15T12:06:38Z

This simplifies the stopping.py module, reducing the number of classes and the coupling to the outside. It also rewrites the WriterWrapper.

I started looking at this because #1782 requires changes here, but the changes in this PR are purely a refactor that should do exactly the same.

Tests

I test this by running on this branch, renaming the output folder, running on master, and comparing the two. We only expect changes in the timing and versions.
I've tested with:

n3fit Basic_runcard_modified.yml 1 -r 5
n3fit Basic_runcard_modified.yml 4 -r 5
n3fit Basic_runcard_modified.yml 3

In all cases the only differences are the versions and the timings, the rest is identical.
Note that I've modified the basic runcard to speed it up, increasing the learning rate so that it stops quickly. I think it is still general enough for these tests: in the case with 5 replicas, the best epochs are 77, 72, 77, 72, 75, so it is not a degenerate case where they all happen to be equal.
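
The comparison described above could be automated along these lines. This is only a sketch, not the procedure actually used in this PR: it assumes the outputs of interest are `.json` files, and the key names in `IGNORED_KEYS` are hypothetical placeholders for wherever the timings and versions are stored.

```python
import json
from pathlib import Path

# Hypothetical key names: adjust to whatever the fit's .json files actually use.
IGNORED_KEYS = {"timing", "version"}


def strip_ignored(obj):
    """Recursively drop the ignored keys so that only the remaining content is compared."""
    if isinstance(obj, dict):
        return {k: strip_ignored(v) for k, v in obj.items() if k not in IGNORED_KEYS}
    if isinstance(obj, list):
        return [strip_ignored(v) for v in obj]
    return obj


def differing_json_files(dir_a, dir_b):
    """Return the relative paths of .json files that differ, or exist in only one folder."""
    dir_a, dir_b = Path(dir_a), Path(dir_b)
    rel_paths = {p.relative_to(dir_a) for p in dir_a.rglob("*.json")}
    rel_paths |= {p.relative_to(dir_b) for p in dir_b.rglob("*.json")}
    different = []
    for rel in sorted(rel_paths):
        file_a, file_b = dir_a / rel, dir_b / rel
        if not (file_a.exists() and file_b.exists()):
            different.append(rel)
            continue
        content_a = strip_ignored(json.loads(file_a.read_text()))
        content_b = strip_ignored(json.loads(file_b.read_text()))
        if content_a != content_b:
            different.append(rel)
    return different
```

An empty list from `differing_json_files("output_branch", "output_master")` (folder names are placeholders) would mean the two runs agree on everything except the ignored keys. Other output formats, such as the .exportgrid files, would need their own comparison.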

Comments

WriterWrapper is a class with one method, that gets initialized and immediately after the method gets called, and that's all that happens with it. So it could just as well have been a function. I left it as a class though because it was easier and I'm not sure if everyone will agree.
Similarly in the stopping module, there were 4 classes and it was quite hard to figure out which calls which etc. I removed one, ReplicaState, and moved a lot of the functionality from FitHistory to the main Stopping class. Now FitHistory is essentially nothing more than a list of the 4th class FitState, so perhaps I should remove that entirely, keeping only Stopping and FitState.
Some of the timings are stored for each replica, but if multiple replicas are run in parallel these will be the overall timings, whereas they will of course differ when the replicas are run one at a time. There is nothing better to report on a per-replica basis for a parallel run, so I just added a comment about it.
In WriterWrapper I needed both the pdf_model itself and the pdf_instance, which is the model wrapped by N3PDF. I got the former from the latter by accessing its attribute, but that is supposed to be private, so I need to clean that up.
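
To illustrate the first point, here is a minimal sketch of why a class that is initialized and then immediately has its single method called is interchangeable with a plain function. The names `ReportWriter` and `write_report` and their arguments are hypothetical and do not correspond to the actual WriterWrapper interface.

```python
class ReportWriter:
    """Hypothetical stand-in for a class that is built once and used once."""

    def __init__(self, replica_number, chi2):
        self.replica_number = replica_number
        self.chi2 = chi2

    def write_data(self, out):
        # Append one formatted line to the given list (standing in for a file).
        out.append(f"replica {self.replica_number}: chi2={self.chi2:.3f}")


def write_report(replica_number, chi2, out):
    """The same behaviour as a function: no state outlives the call."""
    out.append(f"replica {replica_number}: chi2={chi2:.3f}")


# Both spellings produce the same output.
from_class, from_function = [], []
ReportWriter(1, 1.2345).write_data(from_class)
write_report(1, 1.2345, from_function)
```

The class form only pays off if the object is kept around or has more than one method, which is the trade-off raised above.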

scarlehoff · 2023-08-23T08:11:14Z

Hi @APJansen, some comments about the stopping refactoring.

First of all, as you say, the stopping module is unnecessarily complicated. There were some plans for it and it was used for some side projects, but as things stand it might as well be completely changed.
So feel free to rewrite it completely if you wish. This might be important for running the replicas in parallel on a GPU, so don't constrain yourself to keeping all the different histories and whatnot.
(let me cc @RoyStegeman here, since the last thing that might have used this module is the script to generate animations)

The only important thing is that the final results are exactly identical, and by that I mean the files written at the end by the fit are the same (so the .json, the .exportgrid etc). Since the stopping deals only with how/when the fit stops, it should not have any effect on the numerics, so the best test (as you are doing) is just to check that running the same runcard with master and with this branch gives identical results.

In this context there are a few situations that should be tested:

The normal fit
A fit where all the trvl fractions are set to 1.0 (in this special case chi2_tr = chi2_vl, so the fit will output the model at the last point at which positivity passes)
A fit where the stopping patience is equal to 1.0 (in this case the fit will output the model at the best overall value of chi2_vl)
A fit without positivity
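
The situations listed above can be pictured with a deliberately simplified patience-based stopper. This is a sketch under stated assumptions, not the logic in stopping.py: the class `PatienceStopper` and the function `run_fit` are hypothetical, the patience is assumed to be a fraction of the total number of epochs, and an epoch is assumed to count as a new best only if positivity passes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PatienceStopper:
    """Simplified early stopping: track the best validation loss among epochs passing positivity."""

    patience_epochs: int
    best_loss: float = float("inf")
    best_epoch: Optional[int] = None
    epochs_without_improvement: int = 0

    def update(self, epoch, val_loss, positivity_ok):
        """Record one epoch and return True if training should stop."""
        if positivity_ok and val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience_epochs


def run_fit(val_losses, positivity, patience_fraction):
    """Return (best_epoch, stop_epoch) for a precomputed history of losses and positivity flags."""
    total_epochs = len(val_losses)
    # Assumption: patience is given as a fraction of the total number of epochs.
    stopper = PatienceStopper(patience_epochs=max(1, int(patience_fraction * total_epochs)))
    stop_epoch = None
    for epoch, (loss, ok) in enumerate(zip(val_losses, positivity)):
        stop_epoch = epoch
        if stopper.update(epoch, loss, ok):
            break
    return stopper.best_epoch, stop_epoch


losses = [5.0, 4.0, 3.0, 3.5, 3.6, 2.0]
always_ok = [True] * len(losses)
normal = run_fit(losses, always_ok, patience_fraction=0.4)
full_patience = run_fit(losses, always_ok, patience_fraction=1.0)
# Monotonically decreasing loss, as when chi2_vl equals chi2_tr.
no_validation = run_fit([5.0, 4.0, 3.0, 2.0, 1.0], [True, True, False, True, False], 1.0)
```

In this toy model, `normal` stops early at epoch 4 with best epoch 2, `full_patience` never stops early and picks the overall best (epoch 5), and `no_validation` returns the last epoch at which positivity passes (epoch 3). A fit without positivity corresponds to passing `True` for every epoch.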

If you don't have runcards for these let me know and I can prepare them for you

Just run these tests with n3fit <runcard> <replica_number>, don't bother with the multireplica in this case. If it doesn't break even better ofc.

(RE the writer module: same, as long as the output is the same you can do what you want with it. Btw, the reason the module feels a bit out of place is this line here, which incidentally should be removed because we dropped compatibility some time ago and all that logic is now removed!)

APJansen · 2023-08-28T12:04:34Z

Hi @scarlehoff, great, thanks for the extra info! The chi2's are saved as well though in chi2exps.log, so the history needs to stay in some form, but I don't think it's a problem.
I think in principle it could be written entirely in terms of Keras's early stopping callback, was there a reason you didn't use this?
For now though I'll just review what I've done and maybe simplify it some more if possible, I think it's good enough and want to get on with the actual parallel replica stuff.

About the runcards to test, can I just take the basic runcard as the normal fit, and modify it in those 3 ways separately for the others? The first modification being frac: 1.0 in the datasets_inputs, the second stopping_patience: 1.0, and the last I'm not sure, just removing the two entries under posdatasets?
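
As a sketch of what these three modifications amount to, the snippet below derives the variants from a runcard already loaded into a Python dict. The function `make_test_runcards` is hypothetical, and so is the layout: the dataset list is spelled `dataset_inputs` in the NNPDF runcards I am aware of, and placing `stopping_patience` under `parameters` is an assumption, so both should be checked against the real runcard.

```python
import copy

# Assumed key names and nesting; adjust to match the actual runcard.
DATASETS_KEY = "dataset_inputs"
PARAMETERS_KEY = "parameters"


def make_test_runcards(base):
    """Return four runcard dicts: the unmodified fit plus the three edge cases."""
    variants = {"normal": copy.deepcopy(base)}

    # All training/validation fractions set to 1.0, so there is no validation set.
    no_validation = copy.deepcopy(base)
    for dataset in no_validation[DATASETS_KEY]:
        dataset["frac"] = 1.0
    variants["frac_one"] = no_validation

    # Patience of 1.0: the fit never stops early.
    full_patience = copy.deepcopy(base)
    full_patience[PARAMETERS_KEY]["stopping_patience"] = 1.0
    variants["patience_one"] = full_patience

    # No positivity: keep the key, but with an empty list.
    no_positivity = copy.deepcopy(base)
    no_positivity["posdatasets"] = []
    variants["no_positivity"] = no_positivity

    return variants
```

Each variant is a deep copy, so the base runcard is left untouched and every edge case differs from it in exactly one respect.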

RoyStegeman · 2023-08-30T09:39:33Z

> I think in principle it could be written entirely in terms of Keras's early stopping callback, was there a reason you didn't use this?

That's a good question. Would this also work with the boolean positivity requirement?

> The first modification being frac: 1.0 in the datasets_inputs, the second stopping_patience: 1.0

Indeed

> and the last I'm not sure, just removing the two entries under posdatasets

If I'm not mistaken the posdatasets key cannot be empty since it's `collect`ed over, so instead of removing it, you can probably put an empty list.

Since I'm somewhat out of the loop, what is the status of the different PRs?

Refactor stopping #1792 (this one) is in draft but from the comments it seems ready for review?
Trvl mask layers #1661 can be closed in favour of Parallel replicas with varying tr-vl masks #1788?
Parallel replicas with varying tr-vl masks #1788 is WIP?
Multi Replica PDF #1782 is also WIP?
Doing hyperopt (Hyperopt loss #1726?) is the final target, but is waiting on the above?

APJansen · 2023-08-30T12:42:45Z

@RoyStegeman An empty list being just a -?

I'm just writing an issue describing recent developments and plans. I think indeed #1661 can be closed, #1788 is WIP and #1782 is even more WIP. This one I'll have another look today since it's been a while but yes it's done or nearly done, apart from some further testing.
We'll try to have a small hyperopt example before the meeting at the end of September, probably on the current version of trvl-mask-layers, but no guarantees. The issue I'm writing is about a significant speedup, but that won't be fully integrated before that.

RoyStegeman · 2023-08-30T12:56:45Z

> @RoyStegeman An empty list being just a -?

No, a `[]`, i.e. `posdatasets: []`. A bare `-` would, I believe, return `[None]`, which `collect` will still try to loop over, and I don't expect all functions to deal with that successfully (though perhaps they will).
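
The difference matters because in YAML a bare `-` is a sequence holding a single empty (null) entry, which Python YAML loaders generally return as `[None]`. The sketch below shows how a typical loop over the entries behaves in the two cases. The function `positivity_names` is hypothetical, and it assumes that each entry is a mapping with a `dataset` key.

```python
def positivity_names(posdatasets):
    """Collect the dataset names, assuming each entry is a mapping with a 'dataset' key."""
    return [entry["dataset"] for entry in posdatasets]


# An empty list is looped over zero times, so nothing can go wrong.
names_from_empty = positivity_names([])

# A list holding a single None still enters the loop, and indexing None fails.
try:
    positivity_names([None])
    failed_on_none = False
except TypeError:
    failed_on_none = True
```

Whether the real functions fail in this way depends on how they treat each entry, but an empty list avoids the question altogether.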

Regarding the rest - thanks, it would indeed be helpful to have an issue as a single place to keep track of the planning and various PRs related to this project.

APJansen · 2023-08-30T13:52:31Z

Ok thanks, I've got the 4 versions running on master, but this PR needs some work that I won't get to this week, I'll hopefully finish this next week and mark it as ready for review.

RoyStegeman · 2023-08-30T14:02:07Z

Thanks for creating the issue, it's very clear.

Sure, you can do everything at your own pace, it's your project :). I'll just wait till it's ready to review.

github-actions · 2023-09-18T16:37:40Z

Greetings from your nice fit 🤖 !
I have good news for you, I just finished my tasks:

Fit Name: NNBOT-387ba9693-2023-09-18
Fit Report: https://vp.nnpdf.science/xMBeudlKQyKJx27-IynE-g==
Fit Data: https://data.nnpdf.science/fits/NNBOT-387ba9693-2023-09-18.tar.gz

Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!

scarlehoff · 2023-09-23T12:33:23Z

> I think in principle it could be written entirely in terms of Keras's early stopping callback, was there a reason you didn't use this?
> That's a good question. Would this also work with the boolean positivity requirement?

There's also the secondary problem of using frameworks other than tensorflow for the PDFs (for instance, the quantum PDFs using qibo).

I'd avoid adding the tf dependency where it is not explicitly needed, unless there's a clear advantage in terms of performance (at the end of the day, 99% of the fits and papers are done with tensorflow); in that case it's probably fine to add some inconveniences for the side projects.

APJansen · 2023-09-25T06:41:13Z

Sorry I forgot to add reviewers, is this ok to merge?

Radonirinaunimi · 2023-09-25T07:52:32Z

> Sorry I forgot to add reviewers, is this ok to merge?

I have started having a look at this but I still need a couple of hours, or a day.

scarlehoff · 2023-09-25T07:53:59Z

I had a quick look and it seems ok. I'll have to look in detail next week.

Did you test some edge cases like no patience / no positivity / infinite patience ?

Also, since you reshuffled the stopping module a lot, please remove any "dangling methods" or properties (maybe you did already ofc)

Radonirinaunimi

I did check the corner cases and in all instances the fits ran fine. I have requested very minor modifications which I think are worth addressing.

n3fit/src/n3fit/io/writer.py

n3fit/src/n3fit/stopping.py

APJansen · 2023-10-04T11:22:38Z

@Radonirinaunimi I've addressed all your comments, and reran the comparison. @scarlehoff the tests I did are as we discussed, I've documented them here. They all pass also after the last changes.

scarlehoff

Thanks for this. I've left a few comments but they don't change anything fundamental.

n3fit/src/n3fit/io/writer.py

n3fit/src/n3fit/stopping.py

n3fit/src/n3fit/io/writer.py

Co-authored-by: Juan M. Cruz-Martinez <juacrumar@lairen.eu>

APJansen · 2023-10-11T07:31:04Z

Thanks @scarlehoff, I implemented your comments, and checked that results are still identical. Is it ok to merge now?

APJansen added 14 commits August 14, 2023 15:44

Add best_epochs and positivity_statusses to Stopping (and run black)

f2a38af

Clean up Writer

a514bf1

Undo assumed changes to model structure

df0cc8f

various fixes

0c11088

Update best_epochs and positivity_statusses

5db1c67

Remove now unused iterators over replicas

1a7de34

Remove now unused positivity_pass and positivity_status from ReplicaState

74bc818

Move all best and stop_epoch to Stopping, remove trainable=False

e04d336

Remove all_positivity_status

984d4fd

Completely remove ReplicaState, move remaining functionality to FitHistory

57b3be1

Uniformize notation

85e0061

Improve documentation

6c1229a

Move all but the losses from FitHistory to Stopping itself

2c48869

Update documentation

df0dae0

APJansen added Refactoring n3fit Issues and PRs related to n3fit labels Aug 15, 2023

APJansen added 2 commits August 16, 2023 13:56

Indicate private attributes

635e6a0

Precompute preprocessing, arclengths, integrability_numbers

f541e1b

APJansen mentioned this pull request Aug 30, 2023

Realising a factor 20-30 speedup on GPU #1803

Closed

APJansen added 3 commits September 4, 2023 11:22

Add if statement before restoring best weights

9f07416

Add default best_epoch to last_epoch

78a6d30

Merge branch 'master' into refactor_stopping

66d32ac

APJansen marked this pull request as ready for review September 4, 2023 13:32

APJansen requested review from scarlehoff and RoyStegeman September 25, 2023 06:40

Radonirinaunimi requested changes Oct 3, 2023

n3fit/src/n3fit/io/writer.py (outdated, resolved)

n3fit/src/n3fit/io/writer.py (outdated, resolved)

n3fit/src/n3fit/io/writer.py (outdated, resolved)

n3fit/src/n3fit/stopping.py (outdated, resolved)

APJansen added 5 commits October 4, 2023 13:01

statusses -> statuses

f8b985f

combine 3 list comprehensions into one loop

e2f65ec

Clarify comment

6b364cf

Using PurePath instead of str

5c3c2e8

Merge branch 'master' into refactor_stopping

9918791

Radonirinaunimi approved these changes Oct 4, 2023


scarlehoff reviewed Oct 5, 2023


APJansen and others added 6 commits October 10, 2023 13:34

improve documentation

905218c

Co-authored-by: Juan M. Cruz-Martinez <juacrumar@lairen.eu>

Use path object syntax

17ff602

Update n3fit/src/n3fit/stopping.py

d82c826

Co-authored-by: Juan M. Cruz-Martinez <juacrumar@lairen.eu>

Update n3fit/src/n3fit/io/writer.py

93afd75

Co-authored-by: Juan M. Cruz-Martinez <juacrumar@lairen.eu>

Remove fix for tensorflow 2.2 path bug

5a6f743

Remove best_epoch method

e0b9cc8

remove PurePath

2396d44

scarlehoff merged commit db8b790 into master Oct 11, 2023
4 checks passed

scarlehoff deleted the refactor_stopping branch October 11, 2023 13:04

This was referenced Oct 16, 2023

Multi Replica PDF #1782

Closed

bugfix with stopping #1820

Merged

Radonirinaunimi mentioned this pull request Nov 8, 2023

trvl-mask-layers parallel NNPDF4.0 fit broken #1838

Closed

APJansen mentioned this pull request Dec 4, 2023

Multi Replica PDF #1880

Closed
