Refactor stopping #1792

Merged
merged 32 commits into from
Oct 11, 2023

APJansen
Collaborator

@APJansen APJansen commented Aug 15, 2023

This simplifies the stopping.py module, reducing the number of classes and the coupling to the outside. It also rewrites the WriterWrapper.

I started looking at this because #1782 requires changes here, but the changes in this PR are purely a refactor that should do exactly the same.

Tests

I test this by running on this branch, renaming the output folder, running on master, and comparing the two. We only expect changes in the timing and versions.
I've tested with:

  • n3fit Basic_runcard_modified.yml 1 -r 5
  • n3fit Basic_runcard_modified.yml 4 -r 5
  • n3fit Basic_runcard_modified.yml 3

In all cases the only differences are the versions and the timings, the rest is identical.
Note that I've modified the basic runcard to speed it up, increasing the learning rate so it stops quickly. I think it is still general enough for these tests: in the case with 5 replicas the best epochs are 77, 72, 77, 72, 75, so it is not a degenerate case where they all happen to be equal.
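
The folder comparison described above could be scripted roughly as follows. This is a minimal sketch, not the procedure actually used in this PR: the function names are hypothetical, and the keys "timing" and "version" are assumed names for the JSON fields that are expected to differ between runs.

```python
import filecmp
import json
from pathlib import Path


def differing_files(dir_a, dir_b):
    """Return sorted relative paths of files that differ or exist on one side only."""
    dir_a, dir_b = Path(dir_a), Path(dir_b)
    files_a = {p.relative_to(dir_a) for p in dir_a.rglob("*") if p.is_file()}
    files_b = {p.relative_to(dir_b) for p in dir_b.rglob("*") if p.is_file()}
    # Files present in only one of the two folders always count as differences.
    different = files_a ^ files_b
    for rel in files_a & files_b:
        # shallow=False compares file contents rather than os.stat signatures.
        if not filecmp.cmp(dir_a / rel, dir_b / rel, shallow=False):
            different.add(rel)
    return sorted(str(p) for p in different)


def json_equal_ignoring(path_a, path_b, ignored_keys=("timing", "version")):
    """Compare two JSON files after dropping top-level keys expected to differ.

    The default key names are an assumption, not taken from the n3fit output.
    """
    with open(path_a) as file_a, open(path_b) as file_b:
        data_a, data_b = json.load(file_a), json.load(file_b)
    for key in ignored_keys:
        data_a.pop(key, None)
        data_b.pop(key, None)
    return data_a == data_b
```

Any file reported by differing_files can then be inspected by hand, or passed to json_equal_ignoring if it is a JSON file whose only expected differences are the timing and version entries.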

Comments

  • WriterWrapper is a class with one method, that gets initialized and immediately after the method gets called, and that's all that happens with it. So it could just as well have been a function. I left it as a class though because it was easier and I'm not sure if everyone will agree.
  • Similarly in the stopping module, there were 4 classes and it was quite hard to figure out which calls which etc. I removed one, ReplicaState, and moved a lot of the functionality from FitHistory to the main Stopping class. Now FitHistory is essentially nothing more than a list of the 4th class FitState, so perhaps I should remove that entirely, keeping only Stopping and FitState.
  • Some of the timings are stored for each replica, but if multiple replicas are run these will be the overall timings. When you run them one at a time they will of course be different. There is nothing better to report on a per-replica basis when running in parallel, so I just added a comment on it.
  • In WriterWrapper I needed both the pdf_model itself and the pdf_instance which is the model wrapped by N3PDF. I got the former from the latter by accessing its attribute, but that is supposed to be private so I need to clean that up.
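
To illustrate the first comment, here is a toy sketch of a one-method class next to an equivalent function. The names ReportWriter and write_data and their arguments are hypothetical and do not reflect the real WriterWrapper interface.

```python
class ReportWriter:
    """Hypothetical one-method class: constructed once, then called once."""

    def __init__(self, replica_number, chi2):
        self.replica_number = replica_number
        self.chi2 = chi2

    def write_data(self, out_dir):
        # Build the line that would be written for this replica.
        return f"{out_dir}/replica_{self.replica_number}: chi2={self.chi2:.3f}"


def write_data(replica_number, chi2, out_dir):
    """Equivalent plain function: same inputs, same result, no stored state."""
    return f"{out_dir}/replica_{replica_number}: chi2={chi2:.3f}"
```

When the object is never reused after its single method call, both forms produce the same result, so the choice is mostly a matter of taste and of how much the call sites would need to change.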

@APJansen APJansen added Refactoring n3fit Issues and PRs related to n3fit labels Aug 15, 2023
@scarlehoff
Member

Hi @APJansen, some comments about the stopping refactoring.

First of all, as you say, the stopping module is unnecessarily complicated. There were some plans for it and it was used for some side projects, but as things stand it might as well be completely changed.
So feel free to rewrite it completely if you wish. This might be important for running the replicas in parallel on a GPU, so don't constrain yourself to keeping all the different histories and whatnot.
(let me cc @RoyStegeman here, since the last things that might have used this module are the scripts to generate animations)

The only important thing is that the final results are exactly identical, and by that I mean the files written at the end by the fit are the same (so the .json, the .exportgrid etc). Since the stopping deals only with how/when the fit stops it should not have any effect on numerics, so the best test (as you are doing) is just to check that running the same runcard with master and with this branch gives identical results.

In this context there are a few situations that should be tested:

  • The normal fit
  • A fit where all the trvl fractions are set to 1.0 (in this special case chi2_tr = chi2_vl, so the fit will output the model at the last point at which positivity passes)
  • A fit where the stopping patience is equal to 1.0 (in this case the fit will output the model at the best overall value of chi2_vl)
  • A fit without positivity

If you don't have runcards for these let me know and I can prepare them for you

Just run these tests with n3fit <runcard> <replica_number>, don't bother with the multireplica in this case. If it doesn't break even better ofc.

(RE the writer module: same, as long as the output is the same you can do what you want with it. Btw, the reason the module feels a bit out of place is this line here, which incidentally should be removed because we dropped compatibility some time ago and all that logic is now removed!)

@APJansen
Collaborator Author

Hi @scarlehoff, great, thanks for the extra info! The chi2's are saved as well though in chi2exps.log, so the history needs to stay in some form, but I don't think it's a problem.
I think in principle it could be written entirely in terms of Keras's early stopping callback, was there a reason you didn't use this?
For now though I'll just review what I've done and maybe simplify it some more if possible, I think it's good enough and want to get on with the actual parallel replica stuff.

About the runcards to test, can I just take the basic runcard as the normal fit, and modify it in those 3 ways separately for the others? The first modification being frac: 1.0 in the datasets_inputs, the second stopping_patience: 1.0, and the last I'm not sure, just removing the two entries under posdatasets?

@RoyStegeman
Member

RoyStegeman commented Aug 30, 2023

I think in principle it could be written entirely in terms of Keras's early stopping callback, was there a reason you didn't use this?

That's a good question. Would this also work with the boolean positivity requirement?

The first modification being frac: 1.0 in the datasets_inputs, the second stopping_patience: 1.0

Indeed

and the last I'm not sure, just removing the two entries under posdatasets

If I'm not mistaken the posdatasets key cannot be empty since it's collected over, so instead of removing it, you can probably put an empty list.

Since I'm somewhat out of the loop, what is the status of the different PRs?

@APJansen
Collaborator Author

@RoyStegeman An empty list being just a -?

I'm just writing an issue describing recent developments and plans. I think indeed #1661 can be closed, #1788 is WIP and #1782 is even more WIP. This one I'll have another look today since it's been a while but yes it's done or nearly done, apart from some further testing.
We'll try to have a small hyperopt example before the meeting at the end of September, probably on the current version of trvl-mask-layers, but no guarantees. The issue I'm writing is about a significant speedup, but that won't be fully integrated before that.

@RoyStegeman
Member

RoyStegeman commented Aug 30, 2023

@RoyStegeman An empty list being just a -?

No, a [], i.e. posdatasets: []. A bare - would, I believe, return [None], which collect will still try to loop over, and I don't expect all functions to deal with that successfully (though perhaps they will).
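
The three runcard modifications discussed in this thread can be sketched on a plain dictionary standing in for a loaded runcard. This is only an illustration: the nesting of the keys, the dataset names and the starting values are assumptions, and only frac, stopping_patience and posdatasets are taken from the conversation.

```python
import copy

# Toy stand-in for a loaded runcard; layout and values are assumed.
base = {
    "dataset_inputs": [
        {"dataset": "DATASET_A", "frac": 0.75},
        {"dataset": "DATASET_B", "frac": 0.75},
    ],
    "parameters": {"stopping_patience": 0.1},
    "posdatasets": [{"dataset": "POS_A"}, {"dataset": "POS_B"}],
}


def make_variants(runcard):
    """Return the normal runcard plus the three modified test variants."""
    variants = {"normal": copy.deepcopy(runcard)}

    # Variant 1: every dataset uses a training fraction of 1.0.
    all_training = copy.deepcopy(runcard)
    for entry in all_training["dataset_inputs"]:
        entry["frac"] = 1.0
    variants["all_training"] = all_training

    # Variant 2: stopping patience equal to 1.0.
    full_patience = copy.deepcopy(runcard)
    full_patience["parameters"]["stopping_patience"] = 1.0
    variants["full_patience"] = full_patience

    # Variant 3: no positivity; an empty list, i.e. posdatasets: [] in YAML.
    no_positivity = copy.deepcopy(runcard)
    no_positivity["posdatasets"] = []
    variants["no_positivity"] = no_positivity

    return variants


variants = make_variants(base)
```

Deep copies keep the base runcard untouched, so each variant differs from the normal fit in exactly one setting.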

Regarding the rest - thanks, it would indeed be helpful to have an issue as a single place to keep track of the planning and various PRs related to this project.

@APJansen
Collaborator Author

Ok thanks, I've got the 4 versions running on master, but this PR needs some work that I won't get to this week, I'll hopefully finish this next week and mark it as ready for review.

@RoyStegeman
Member

Thanks for creating the issue, it's very clear.

Sure, you can do everything at your own pace, it's your project :). I'll just wait till it's ready to review.

@APJansen APJansen marked this pull request as ready for review September 4, 2023 13:32
@github-actions

Greetings from your nice fit 🤖 !
I have good news for you, I just finished my tasks:

Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!

@scarlehoff
Member

I think in principle it could be written entirely in terms of Keras's early stopping callback, was there a reason you didn't use this?
That's a good question. Would this also work with the boolean positivity requirement?

There's also the secondary problem of using frameworks other than tensorflow for the PDFs (for instance, the quantum PDFs using qibo).

I'd avoid adding the tf dependency where it is not explicitly needed, unless there's a clear advantage in terms of performance (at the end of the day, 99% of the fits and papers are done with tensorflow); in that case it's probably fine to add some inconveniences to the side projects.
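
For reference, a patience-based stopping check with a boolean positivity requirement needs nothing from any particular framework. The sketch below is a toy illustration and not the Stopping class from this PR: the class name, its fields, and the interpretation of the patience as a fraction of the total number of epochs are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PatienceStopper:
    """Toy early-stopping monitor; not the n3fit Stopping class."""

    total_epochs: int
    patience_fraction: float  # assumed to be a fraction of total_epochs
    best_chi2: float = float("inf")
    best_epoch: Optional[int] = None
    epochs_without_improvement: int = 0

    def update(self, epoch, chi2_vl, positivity_ok):
        """Record one epoch and return True when the fit should stop."""
        # Only epochs that pass positivity can become the best model.
        if positivity_ok and chi2_vl < self.best_chi2:
            self.best_chi2 = chi2_vl
            self.best_epoch = epoch
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        max_wait = int(self.patience_fraction * self.total_epochs)
        return self.epochs_without_improvement > max_wait
```

In this toy version a patience fraction of 1.0 can never trigger an early stop within total_epochs, so the recorded best epoch is simply the best validation chi2 among all epochs that pass positivity.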

@APJansen
Collaborator Author

Sorry I forgot to add reviewers, is this ok to merge?

@Radonirinaunimi
Member

Sorry I forgot to add reviewers, is this ok to merge?

I have started having a look at this but I still need a couple of hours, or a day.

@scarlehoff
Member

scarlehoff commented Sep 25, 2023

I had a quick look and it seems ok. I'll have to look in detail next week.

Did you test some edge cases like no patience / no positivity / infinite patience ?

Also, since you reshuffled the stopping module a lot, please remove any "dangling methods" or properties (maybe you did already ofc).

Member

@Radonirinaunimi Radonirinaunimi left a comment

I did check the corner cases and in all instances the fits ran fine. I have requested very minor modifications which I think are worth addressing.

Review comments (outdated, resolved): n3fit/src/n3fit/io/writer.py (3), n3fit/src/n3fit/stopping.py (1)

@APJansen
Collaborator Author

APJansen commented Oct 4, 2023

@Radonirinaunimi I've addressed all your comments, and reran the comparison. @scarlehoff the tests I did are as we discussed, I've documented them here. They all pass also after the last changes.

Member

@scarlehoff scarlehoff left a comment

Thanks for this. I've left a few comments but they don't change anything fundamental.

Review comments (outdated, resolved): n3fit/src/n3fit/io/writer.py (7), n3fit/src/n3fit/stopping.py (2)

APJansen and others added 6 commits October 10, 2023 13:34
Co-authored-by: Juan M. Cruz-Martinez <juacrumar@lairen.eu>
Co-authored-by: Juan M. Cruz-Martinez <juacrumar@lairen.eu>
Co-authored-by: Juan M. Cruz-Martinez <juacrumar@lairen.eu>
@APJansen
Collaborator Author

Thanks @scarlehoff, I implemented your comments, and checked that results are still identical. Is it ok to merge now?

@scarlehoff scarlehoff merged commit db8b790 into master Oct 11, 2023
4 checks passed
@scarlehoff scarlehoff deleted the refactor_stopping branch October 11, 2023 13:04
This was referenced Oct 16, 2023
@APJansen APJansen mentioned this pull request Dec 4, 2023