Hyperopt loss #1726
Conversation
Description of what happens with hyperoptimization

For the benefit of my future self and my colleagues @goord and @Cmurilochem I describe here exactly what happens during hyperoptimization. It mainly describes how it works in master; so far this branch only changes it on two points, as indicated.

Single Trial

A "trial" is the entire evaluation of one set of hyperparameters. One trial in the code is one execution of the function …

Folds

Trials are scored by their … For each fold, the datasets in that fold are left out completely, and the model is trained as normal on the remaining k-1 folds (masked with the training mask).

Early exit 1

Then the …

Early exit 2

…

Loss

We get …

Managing trials

Results are saved in … The trials start from the function … Inside …

Possible issue

On a test with 2 trials varying only the nodes per layer, it tested (27, 12, 8) and then (30, 36, 8), but the best parameters printed were the default (25, 20, 8). I'm not sure what exactly happens in this final call to …

Some thoughts on improvements

Parallelizing trials, checkpointing

Trials can be run in parallel using MongoDB, as described here.

K-folding

When using K-folding, we now have 3 classes of data at every fold:
1. the training data within the k-1 folds;
2. the validation data within those k-1 folds;
3. the held-out fold itself.

Both 2 and 3 are not seen during training. The difference in how they are used is that 2 is used for the "early exit 1" above, while 3 is used to compute the hyper loss. If that were all, I'd say 2 is wasted, as this condition is very unlikely to trigger even with one replica.

We also discussed a long time ago that perhaps K-folding won't be necessary anymore. The point of it is to have a more accurate validation loss, but if we run with, say, 100 replicas, it is already more accurate. We would still need 3 groups of data, to avoid the bias of joining 2 and 3; however, they could be split randomly per dataset (per replica). The advantage would be that it'd require only a single run per trial (slightly bigger, but still nearly a 4x speedup), and it would simplify the code. Not sure though what the effect on the learning would be.
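To tie the above together, here is a minimal sketch of the per-trial flow; all names are illustrative, not the actual n3fit API:

```python
from statistics import mean
from typing import Callable, Sequence

def run_trial(
    hyperparams: dict,
    folds: Sequence,
    train_model: Callable,  # trains on the k-1 remaining folds
    hyper_loss: Callable,   # scores a model on the held-out fold
) -> float:
    """One trial = one full k-fold evaluation of one hyperparameter set."""
    fold_losses = []
    for k, held_out in enumerate(folds):
        # The datasets in fold k are left out completely; the model is
        # trained as normal on the remaining k-1 folds.
        remaining = [f for i, f in enumerate(folds) if i != k]
        model = train_model(hyperparams, remaining)
        # The held-out fold only enters through the hyper loss.
        fold_losses.append(hyper_loss(model, held_out))
    # Aggregate the per-fold losses into the number hyperopt minimises.
    return mean(fold_losses)
```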
Code improvements
Implementation of a new metric

As a continuation of the work by @APJansen, I intend to add a new hyperoptimization metric to the … where the first term represents our usual averaged-over-replicas hyper loss, …
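For reference, the quantity behind this proposal is the usual φ estimator; with the validphys computation described further down in this thread, for a fold $k$ with $N_{\mathrm{data}}$ points it reads:

$$\varphi^2_{k} = \frac{\langle \chi^2 \rangle_{\mathrm{rep}} - \chi^2_{\mathrm{central}}}{N_{\mathrm{data}}},$$

where $\langle \chi^2 \rangle_{\mathrm{rep}}$ is the per-replica χ² averaged over replicas and $\chi^2_{\mathrm{central}}$ is computed from the average of the replica predictions.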
Refactoring
We talked again about refactoring this, and we decided to keep the changes minimal for now, just to allow implementation of the phi2 loss. We can refactor the code around hyperoptimization inside model_trainer later, once we have things running. We made some assumptions, listed below; are these correct @scarlehoff?

Assumptions

Previously … So now we assume instead that the losses are first computed per fold, so there is already some kind of aggregation over the replicas. This is done in the training loop, since training is quit early if the loss is too high.

And then there are also penalties, which currently are all computed per replica. We assumed for now that this is general enough, and that we always want to take an average over the per-replica penalties and add this to the loss (see the sketch below).
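A small sketch of the assumed combination (hypothetical names, not the model_trainer code):

```python
import numpy as np

def fold_total_loss(fold_loss: float, replica_penalties: np.ndarray) -> float:
    """Assumed scheme: the loss is already aggregated per fold, while the
    penalties arrive per replica and are averaged before being added."""
    return fold_loss + float(np.mean(replica_penalties))
```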
Ah, since there were no questions I thought that the example I put there was enough. Not sure what to add (i.e., not sure what the showstoppers are?)
The thing is that, while …

Regarding the assumptions:
No. But at the end of the fold (when you have an ensemble of models) you can easily create another model which is the average of all previous models (ideally discarding outliers) over the axis of the replicas. With that you can compute the phi (see the sketch below).
Yes. Let's not care about the penalties for the time being. I don't think they are needed.
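At the level of predictions, the averaged model suggested above could look like this (a numpy sketch under the assumption that replica predictions come as an array; the trimming rule is only an example):

```python
import numpy as np

def replica_average(preds: np.ndarray, discard: int = 0) -> np.ndarray:
    """preds has shape (n_replicas, n_data). Optionally discard the
    `discard` most outlying replicas before averaging over the replica
    axis, giving one central prediction per data point."""
    if discard:
        distance = np.linalg.norm(preds - preds.mean(axis=0), axis=1)
        preds = preds[np.argsort(distance)[: len(preds) - discard]]
    return preds.mean(axis=0)
```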
Start implementation of …

1. `results` function

```python
res = results(groupdataset, n3pdfs, covmat, sqrt_covmat(covmat))
```

The `results` function will output a tuple composed of:

- an instance of `DataResult`, which holds the relevant information for a given group exp dataset
- an instance of `ThPredictionsResult`, which holds the corresponding theory predictions from the multi-replica PDF. The attribute `ThPredictionsResult.rawdata` stores these theory predictions for each replica.
2. `abs_chi2_data` function

This function uses the output of `results` (the above tuple) to calculate:

- `all_chi2`: the usual chi2, calculated as the sum of squared differences between the predicted `ThPredictionsResult.rawdata` and the central experimental observables, for each replica and group exp dataset. `all_chi2` is then used to instantiate `ThPredictionsResult.stats_class`, which in our case is always `N3Stats`; see also here.
- `central_chi2`: differently from `all_chi2`, this corresponds to the chi2 calculated as the sum of squared differences between `ThPredictionsResult.central_value`, i.e., the average of the predictions of each replica PDF, and the central experimental observables for each group exp dataset.

These calculated quantities are returned, together with the number of exp data points, in the form of a namedtuple named `Chi2Data`.
3. `phi_data` function

Finally, `phi_data` just uses the output data of `abs_chi2_data`. It calculates phi by:

- averaging `all_chi2` over replicas using `N3Stats.error_members().mean()`;
- subtracting `central_chi2` from `<all_chi2>` and dividing the result by the number of exp data points within the group exp dataset;
- taking the sqrt of [(`<all_chi2>` - `central_chi2`)/number of exp data points] (see the sketch below).
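Numerically, the three steps boil down to something like the following plain-numpy sketch (not the actual validphys code):

```python
import numpy as np

def phi_from_chi2(all_chi2: np.ndarray, central_chi2: float, ndata: int) -> float:
    """all_chi2 holds one chi2 per replica for the group exp dataset."""
    mean_chi2 = all_chi2.mean()          # step 1: <all_chi2> over replicas
    diff = mean_chi2 - central_chi2      # step 2: subtract central_chi2
    return float(np.sqrt(diff / ndata))  # step 3: sqrt of the ratio
```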
Question regarding added tests

@scarlehoff, I added new tests into test_hyperopt.py. To test the calculation of …
Make sure the seeds are also fixed for numpy. Note that when the fit is set to …
However, while the tests should be robust on linux (they empirically seem to be), I'm not so sure about mac m1 (i.e., if you are locally running on a mac and then trying to test in the CI... not sure what will happen).
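For example, a test can fix the numpy seed and avoid bit-exact comparisons (a generic sketch, not the tests added in this PR):

```python
import numpy as np

def test_loss_is_reproducible():
    def compute():
        rng = np.random.default_rng(42)  # numpy seed fixed explicitly
        chi2 = rng.normal(loc=2.0, scale=0.1, size=(4, 100))  # folds x replicas
        return chi2.mean(axis=1)

    # Compare with tolerances rather than exact equality, since bit-level
    # results may differ between platforms (linux CI vs a local mac m1).
    np.testing.assert_allclose(compute(), compute(), rtol=1e-10)
```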
Thanks @scarlehoff. I have added a custom …
Can you add documentation for the different losses/statistics that can be optimised and how the runcard can be changed to use them?
Done. Just added docs in f88fbb5
Thanks, the docs look good
Improving hyperoptimization, experimenting with different hyperoptimization loss functions
Tasks done in this PR
- `HyperLoss` class with built-in methods that can automatically perform statistics over replicas and then folds. The user can select statistics via `replica_statistic` and `fold_statistic` in the runcard `kfold` section.
- `loss_type` option in the runcard `kfold` section.
- Saving into the `Hyperopt` trials file `tries.json` (specifically within the `kfold_meta` entry) a matrix (folds x replicas) of calculated `hyper_losses_chi2` and a vector (folds) of `hyper_losses_phi2`, as illustrated below.
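Schematically, the stored entry could look like this (illustrative values for 4 folds and 3 replicas; not the exact file layout):

```python
# Shape of the quantities saved under the kfold_meta entry of tries.json.
kfold_meta = {
    "hyper_losses_chi2": [   # matrix: folds x replicas
        [2.1, 2.3, 2.2],
        [2.0, 2.4, 2.1],
        [2.2, 2.2, 2.3],
        [2.1, 2.5, 2.0],
    ],
    "hyper_losses_phi2": [0.31, 0.28, 0.35, 0.30],  # vector: one per fold
}
```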
Description
The implemented `HyperLoss` is instantiated within `ModelTrainer` and later on used in `ModelTrainer.hyperparametrizable`. The user must pass three parameters that are set in the runcard (a sketch of how they combine is given below):

- `loss_type`: the type of loss to be used. Options are `chi2` or `phi2`.
- `replica_statistic`: the statistic over replicas to be used within each fold. For `loss_type = chi2`, it can assume the usual statistics: `average`, `best_worst` and `std`. Note: `replica_statistic` is inactive if `loss_type = phi2`, as $\varphi^2$ is already an aggregate over replicas.
- `fold_statistic`: the statistic over folds. Options are: `average`, `best_worst` and `std`.

For `loss_type: phi2` and `fold_statistic: average`, the figure of merit to be minimised is actually $L_{\mathrm{hyperopt}} = \left( \frac{1}{n_{\mathrm{folds}}} \sum_{k=1}^{n_{\mathrm{folds}}} \varphi^2_{k} \right)^{-1}$.

The current implementation of $\varphi^2_{k}$ is based on `validphys` functions. It is evaluated using only experimental data within the held-out fold (as expected).
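As a rough standalone sketch of how these three parameters could combine (hypothetical code, not the actual `HyperLoss` implementation; in particular the `best_worst` mapping is a guess):

```python
import numpy as np

STATS = {"average": np.mean, "best_worst": np.max, "std": np.std}

def hyper_loss(chi2: np.ndarray, phi2: np.ndarray, loss_type: str,
               replica_statistic: str, fold_statistic: str) -> float:
    """chi2 has shape (n_folds, n_replicas); phi2 has shape (n_folds,)."""
    if loss_type == "chi2":
        per_fold = STATS[replica_statistic](chi2, axis=1)  # reduce replicas
        return float(STATS[fold_statistic](per_fold))      # reduce folds
    # phi2 already aggregates over replicas; larger phi2 is better, so the
    # minimised figure of merit is the reciprocal of the fold statistic.
    return float(1.0 / STATS[fold_statistic](phi2))
```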
Runcard examples

- `hyper_loss` is set as the …
- `hyper_loss` as the inverse of the max value of …

Notes
It must be merged after #1788, as the current `hyperopt_loss` branch has been created from `trvl-mask-layers`.