
n3fit - Refactor of hyperopt #516

Merged: 41 commits from n3fit-refactor-hyperopt into master, Apr 1, 2020

Conversation

@scarlehoff (Member) commented Jul 23, 2019

This is basically a refactoring of the hyperopt code so that it includes more documentation and is easier to follow.

How to

The first step is to prepare a runcard with a hyperscan dictionary in it. As an example, see the one in the runcards folder of n3fit: Basic_runcard.yml.

In its current form the hyperoptimization is activated with the --hyperopt flag at run time. It accepts a number as an argument, which is the number of trials to run. So, once you have your runcard prepared, all you have to do is:

n3fit my_hyper_runcard.yml [replica_number] --hyperopt [number of trials]

Each trial consists of a different configuration of the parameters, following the scan defined in hyperscan.

Once the hyperopt scan has finished (it will take slightly less than time_of_a_single_fit * number_of_trials) you will find a file in the nnfit directory:

my_hyper_runcard/nnfit/replica_1/trials.json
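Since it is a plain json file, the trials can also be inspected by hand (a minimal sketch; the exact structure of each trial entry is not specified here):

import json

# assumption: the file contains the list of trials produced by the scan
with open("my_hyper_runcard/nnfit/replica_1/trials.json") as f:
    trials = json.load(f)
print(f"loaded {len(trials)} trials")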

With this you can use the n3Hyperplot script to generate the plots:

n3Hyperplot my_hyper_runcard

This will generate a plot per replica. If you want to join all replicas as if they were one single longer run, you can do

n3Hyperplot my_hyper_runcard --combine

If you want to play a bit with the results, for example removing anything that uses Adadelta as the optimizer and has fewer than 500 epochs:

n3Hyperplot my_hyper_runcard --combine --filter "optimizer=Adadelta" "epochs>500"

The filter option accepts the operators !=, =, < and >.
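For illustration, a hypothetical sketch of how such a filter string could be parsed (this is not the actual n3Hyperplot implementation; all names are made up):

import re

OPERATORS = {
    "!=": lambda a, b: a != b,
    "=": lambda a, b: a == b,
    "<": lambda a, b: a < b,
    ">": lambda a, b: a > b,
}

def passes_filter(trial, filter_string):
    # split "epochs>500" into key, operator and value
    key, op, value = re.match(r"(\w+)(!=|=|<|>)(.+)", filter_string).groups()
    trial_value = trial[key]
    try:
        # compare numerically when both sides parse as numbers
        trial_value, value = float(trial_value), float(value)
    except ValueError:
        pass  # fall back to string comparison
    return OPERATORS[op](trial_value, value)

print(passes_filter({"epochs": 900}, "epochs>500"))  # True
print(passes_filter({"optimizer": "Adadelta"}, "optimizer=Adadelta"))  # True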

Note: I generally do the hyperscan with the replicas turned off, but send several runs with different replica numbers; this way I get a few .json files with statistically different runs (all the seeding in n3fit depends on the replica number).

@scarlehoff (Member Author):

Right now the part regarding the generation of the trials should be (is) complete. The missing part is making the script that generates the plots and plays with the trials a bit better (right now it is a functional mess).

@wilsonmr in order to implement new things in hyperopt, this PR (at this point, plus changes regarding the code structure but not the functionality) should be the way to go.

As I say in one of the TODOs, eventually it might be good to migrate both this and the reading of the n3fit runcard to Report Engine, but until that happens I think this is how the hyperopt implementation will look for a while...

@wilsonmr (Contributor):

I really need to get out of the habit of writing such long comments

ok this is also relevant for #514

at the moment it seems to me that the fit function has several stages, which could possibly be separated into separate providers:

  • initialising seeds
  • loading and instancing of data <- relevant to the closure part
  • hyper optimizer
  • fit:
    1. run fit
    2. store fit

I think it would be good if things were split up a bit like this:

def initialised_seeds(replica, fitting):
    # uses fitting["trvlseed"], fitting["nnseed"], fitting["mcseed"], fitting["genrep"]
    ...  # -> seeds and genrep bool

def loaded_data(experiments, t0set):
    ...  # -> all_exp_infos

def loaded_positivity_data(posdatasets):
    ...  # -> pos_info

# might need to make this for a single replica
def performfit(
    initialised_seeds,
    loaded_data,
    loaded_positivity_data,
    replica,
    replica_path,
    fitting,  # uses "basis", "load", "loadfile", "parameters"
):
    ...  # -> result or results

# and this for a single replica
def storefit(performfit):
    ...  # -> stored fit or fits, and compute arc lengths etc.

# and then do this
storefits = collect(storefit, replicas)

def hyperopt_function(*args):
    ...

It makes it clear that the experiments key, which is parsed from the runcard, is used in just one small section of the code, so if I want to make some change, like adding the possibility to use pseudodata instead, it's way easier to know where to do this and see exactly what it impacts. In the case of closure data, I know that regardless of using real or closure data, loaded_data must return loaded data/pseudodata.

I was also wondering if fitting is a bit redundant and just there for historical reasons; perhaps these keys could just go in the top level of the runcard. I don't see any reason why the nnfit runcard should influence the n3fit runcard, but perhaps I'm missing something.

Right now this isn't a major concern, but the only issue with splitting this up would be working out what to do about the backend imports within fit: whether to duplicate the imports in the hyperopt/fit functions, or perhaps have another provider which somehow wrapped the imports within a class, schematically:

class BackendImports:
    def __init__(self, *args):
        # some logic based on args:
        #     import whatever
        #     self.ModelTrainer = whatever
        ...

def import_provider(backend):
    return BackendImports(backend)

def fit(import_provider, *args):
    ...

AFAICT though there really is no point in having these imports inside the function anyway, unless you have another backend in development? In practice I wonder if, at the level of fit.py, you could just put the imports at the top and leave a comment saying that if and when another backend appears this needs to be changed. Just a thought...
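For concreteness, a minimal sketch of that alternative; the import path is an assumption here, not the actual n3fit layout:

# imports at module level instead of inside fit()
from n3fit.backends import MetaModel  # NOTE: revisit if another backend ever appears

def fit(*args):
    # use MetaModel directly instead of importing it inside the function
    ...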

@scarlehoff (Member Author) commented Jul 24, 2019

def initialised_seeds(replica, fitting):
    # uses fitting["trvlseed"], fitting["nnseed"], fitting["mcseed"], fitting["genrep"]
    ...  # -> seeds and genrep bool

I agree with this, although I don't think genrep needs to exist here.

def loaded_data(experiments, t0set):
    ...  # -> all_exp_infos

def loaded_positivity_data(posdatasets):
    ...  # -> pos_info

Also agreed with this. The only thing to remember is that the mcseed and trvlseed need to enter this provider.

# and this for a single replica
def storefit(performfit):
    ...  # -> stored fit or fits, and compute arc lengths etc.

Well, I would say storefit should get the results from performfit and not return anything.
But in this part I would prefer to have a clear image in mind of how to do parallel replicas efficiently (and I still don't have it) before really saying anything.

def hyperopt_function(*args): ...

Here I disagree. hyperopt_function should be part of performfit, as a flag that will force the fit to run many times.

I was also wondering if fitting is a bit redundant and just there for historical reasons; perhaps these keys could just go in the top level of the runcard. I don't see any reason why the nnfit runcard should influence the n3fit runcard, but perhaps I'm missing something.

100% historical reasons.
Or, more than historical reasons, because we wanted the runcard we were using with n3fit to be able to run with nnfit as well. But there is no reason to keep it once we are past that "debugging" stage, to be fair.

Right now this isn't a major concern, but the only issue with splitting this up would be working out what to do about the backend imports within fit: whether to duplicate the imports in the hyperopt/fit functions, or perhaps have another provider which somehow wrapped the imports within a class, schematically:

The imports in fit.py are only two:

  • set_initial_state: only used for debugging. It can even go up to n3fit.py.
  • MetaModel: used to generate the PDF. This will only be used by storefit and it is not even necessary; performfit can return an already formed model.

Now, I would say all of this should be made into an issue, because I believe it is relevant and should be done (sooner rather than later even if it is not a top priority, since it would help with the modularization of the code), but it should go in a separate PR. All of this should be 100% independent of the hyperoptimization procedure.

@wilsonmr (Contributor):

Well, I would say storefit should get the results from performfit and not return anything.

yeah sorry, I didn't mean for it to return anything, I was just saying what the function was doing

def hyperopt_function(*args): ...

Here I disagree. hyperopt_function should be part of performfit, as a flag that will force the fit to run many times.

But the point is that in n3fit at the moment you add actions_: performfit to the runcard behind the scenes. Now if hyperopt and performfit were two different actions, as I am suggesting here, then the runcard would explicitly be either actions_: performfit or actions_: hyperopt, and so the flag in the runcard would instead just turn into which action to run. I think your own comment even supports this:

https://github.com/NNPDF/nnpdf/blob/e68c1bedd3285091a7462cf0fd92b2327ba32baa/n3fit/src/n3fit/fit.py#L224

Everything inside the if hyperopt basically defines what I'd put in the hyperopt function. If we split up the other things, then all that would be left in the fit function, if we kept the hyperopt in there, would be

def fit(*args):
    if hyperopt:
        ...  # do something which scans parameters
    else:
        ...  # produce a replica or set of replicas and save them

which I would say goes against the point of actions.

Now, all of this I would say it should be made into an issue because I believe it is relevant and should be done (and sooner rather than later even if it is not a top priority, but it would help with the modularization of the code) but it should go in a separate PR. All of this should be 100% independent of the hyperoptimization procedure.

Yeah, I will try to summarise it in an issue. The only thing I would say is that my view of how hyperoptimization should be a separate action could happen here.

@scarlehoff (Member Author) commented Jul 24, 2019

Everything inside the if hyperopt basically defines what I'd put in the hyperopt function. If we split up the other things, then all that would be left in the fit function, if we kept the hyperopt in there, would be

def fit(*args):
    if hyperopt:
        ...  # do something which scans parameters
    else:
        ...  # produce a replica or set of replicas and save them

which I would say goes against the point of actions.

You can have a performfit action which takes as input a ModelTrainer instance, and a hyperopt action that also takes as input a ModelTrainer instance. At that point the difference between performfit and hyperopt is simply whether storefit is called at the end or not.

If you want to break hyperopt out of performfit, it should be done before, i.e., hyperopt provides a parameters dictionary, ModelTrainer is provided by somebody else, and then performfit is called by hyperopt's fitting function, receiving ModelTrainer and the parameters as input.*
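Schematically, and with every name hypothetical (a toy stand-in, not the real n3fit code), that split could look like:

def performfit(model_trainer, parameters):
    # stand-in for the real fit: return a loss-like number
    return sum(v for v in parameters.values() if isinstance(v, (int, float)))

def hyperopt_fit(model_trainer, hyperscan, n_trials=3):
    # hyperopt owns the scan and calls performfit once per sampled parameter set
    best = None
    for trial in range(n_trials):
        parameters = {key: choices[trial % len(choices)]
                      for key, choices in hyperscan.items()}
        loss = performfit(model_trainer, parameters)
        if best is None or loss < best[0]:
            best = (loss, parameters)
    return best

print(hyperopt_fit(model_trainer=None, hyperscan={"epochs": [100, 500, 900]}))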

Yeah I will try to summarise in an issue - the only thing I would say is my view of how hyperoptimization should be a separate action could happen here

I would rather have a second PR for "make hyperoptimization into an action", because there are several things that should be carefully thought about in order not to lose generality, and because that should not affect the functionality.

*Edit: it is more complicated than this if you want to avoid rerunning things as much as possible. As with the parallel replicas, I would need to actually sit down for a while in order to have a clear idea of what would make me happy.

@wilsonmr (Contributor) commented Aug 1, 2019

Hmm, I think it wouldn't be so difficult to add a validphys argcheck that makes sure that if hyperopt is not None then hyperscan isn't either.

Approximately (the error raised here looks weird when I tested it, but at least it stops pretty early with a nicer error message, rather than getting all the way to fit.py line 193 and then hyper_scan.py L126 and finding that it has a NoneType):

from reportengine.checks import make_argcheck, CheckError

@make_argcheck
def check_consistent_hyperscan_options(hyperopt, hyperscan):
    if hyperopt is not None and hyperscan is None:
        raise CheckError("hyperscan needs to be defined if performing hyperopt")

@check_consistent_hyperscan_options
def fit(*args):
    ...

It just means that if the runcard doesn't have hyperscan and it should, then it'll raise an error much sooner. Quality of life.

@wilsonmr (Contributor) commented Aug 1, 2019

Now, having said that, I think you could add to the ConfigParser of n3fit something like:

def parse_hyperscan(self, hyperscan_dict: (dict, type(None)), *, hyperopt=None):
    if hyperopt is None:
        return None
    if hyperscan_dict is None:
        raise ConfigError(...)
    # here check for compulsory entries in the dictionary (are there any?)
    return hyperscan_dict

I think that hyper_scan.py L126-140 could basically go in a parser, and if it did you'd catch issues earlier, including trivial things like the dictionary not existing, and let the user know it's an issue with their runcard.

@wilsonmr (Contributor) commented Aug 1, 2019

Should we add seaborn and hyperopt to the conda package meta with this PR, so it can be used 'out of the box'?

dictionaries = filter(filter_function, dictionaries)

# Now fill a pandas dataframe with the survivors
dataframe_raw = pd.DataFrame(dictionaries)
Contributor:

For the version of pandas I tested this on, dictionaries cannot be an iterator but needs to be explicitly cast:
pd.DataFrame(list(dictionaries))

Member Author:

But the output of filter should always be an iterator, which is an iterable... which version of pandas were you using?
Maybe the error was coming from somewhere else and that broke the filter?

Contributor:

I am getting something like:


In [13]: pd.DataFrame(iter(l))                                                                                                                                                                                                                
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-13-b81493853882> in <module>
----> 1 pd.DataFrame(iter(l))

~/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    405                 mgr = self._init_dict({}, index, columns, dtype=dtype)
    406         elif isinstance(data, collections.Iterator):
--> 407             raise TypeError("data argument can't be an iterator")
    408         else:
    409             try:

TypeError: data argument can't be an iterator

In [14]: pd.__version__                                                                                                                                                                                                                       
Out[14]: '0.23.4'

Member Author:

Interesting. They fixed that for 0.24

pandas-dev/pandas#21987

But it doesn't hurt to cast the filter to a list. I just wanted to make sure the error was not coming from something else.
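A minimal reproduction of the workaround (the toy trial dictionaries are made up):

import pandas as pd

# pandas < 0.24 rejects iterators, so the filter output is cast to a list
# first; this is harmless on newer versions too
trials = [{"optimizer": "Adam", "epochs": 900},
          {"optimizer": "Adadelta", "epochs": 400}]
survivors = filter(lambda d: d["epochs"] > 500, trials)
dataframe_raw = pd.DataFrame(list(survivors))
print(dataframe_raw)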

Contributor:

Pandas is in my bad books for breaking its interface way too often with things like this. But yeah, I see that it works again in 0.25.

@scarlehoff (Member Author):

Should we add seaborn and hyperopt to the conda package meta with this PR, so it can be used 'out of the box'?

It was already added in the previous PR. They should both be installed with the nnpdf package.

Hmm, I think it wouldn't be so difficult to add a validphys argcheck that makes sure that if hyperopt is not None then hyperscan isn't either.

Approximately (the error raised here looks weird when I tested it, but at least it stops pretty early with a nicer error message, rather than getting all the way to fit.py line 193 and then hyper_scan.py L126 and finding that it has a NoneType):

from reportengine.checks import make_argcheck, CheckError

@make_argcheck
def check_consistent_hyperscan_options(hyperopt, hyperscan):
    if hyperopt is not None and hyperscan is None:
        raise CheckError("hyperscan needs to be defined if performing hyperopt")

@check_consistent_hyperscan_options
def fit(*args):
    ...

I've added this error. I think it looks ok.


# Now filter out the ones we don't want
for filter_function in filter_functions:
    dictionaries = filter(filter_function, dictionaries)
Contributor:

This maybe works, but it is potentially a headache (I had to run a couple of tests to make sure I understood what it's actually doing). Note that filter is a lazy iterator that doesn't consume its argument, so it = filter(func, it) saves a reference to the old iterable inside the return value of filter and reassigns it to the filtered iterator. This is a bit inefficient in that you have to save all the intermediate iterators inside the recursive filter, but more importantly it is also more subtle than it looks, in that if you change something so that next(dictionaries) gets called it will cause hard-to-understand bugs. Instead, you should do something like [item for item in dictionaries if all(f(item) for f in filter_functions)] (filter with the corresponding lambda would be fine, but I bet it is slower).

Member Author:

But it looked so nice :(

But I see the problem.

@scarlehoff (Member Author) commented Aug 12, 2019

Does casting it to a list (per iteration) solve the problem, i.e., like it is in the latest version?
Or would you rather have something along the lines of:

for filter_function in filter_functions:
    new_dicts = []
    for i in dictionaries:
        if filter_function(i):
            new_dicts.append(i)
    dictionaries = new_dicts

?

(or the line you wrote; I am a bit dense today, I think I lost the ability to read past the 6th line of any comment)

Contributor:

I like the line I wrote better, because it makes it very clear that it requires all filters. Instead, I was going to write that what you did is wrong, because it looked like any at first sight (but then I realized you reassign dictionaries). It is a lot of intermediate lists again, though.
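For reference, the suggested single pass in runnable form (toy data, hypothetical filters):

# keep a trial only if it satisfies every filter
dictionaries = [{"epochs": 900, "optimizer": "Adam"},
                {"epochs": 400, "optimizer": "Adadelta"}]
filter_functions = [lambda d: d["epochs"] > 500,
                    lambda d: d["optimizer"] != "Adadelta"]
dictionaries = [item for item in dictionaries
                if all(f(item) for f in filter_functions)]
print(dictionaries)  # only the Adam trial survives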

@scarlehoff (Member Author) commented Aug 16, 2019

These last few commits (adding the --autofilter option) deal with the last* of the to-dos I wrote for this PR, motivated by #527. This trimming algorithm is mostly empirical (I wanted something that would automatically do the same job I would do by looking at the hyperopt plot and selecting which options to filter). I called it --autofilter.

I'm happy with it in the sense that it does what I want. I am unhappy in the sense that the parameters are set 100% manually (that's why I was planning not to port it to this repository, but it was actually useful when choosing the best models, so it can stay...)

*last in numerical order

@Zaharid (Contributor) commented Aug 17, 2019

Overall I think that this indeed needs to be a set of validphys actions (so it's good that we all agree). This is currently inventing an incompatible pipeline that would preclude us from doing fancy things (e.g. plots that correlate the hyperparameters with various estimators computed by vp, such as the closure test ones), or even standard things like getting these plots into the comparefits reports. That said, this code looks to be in a form such that it shouldn't be too complicated to do just that.

Then we could have a small wrapper like comparefits or setupfit that just calls validphys under the hood with various command line options.

@Zaharid (Contributor) commented Aug 17, 2019

Also, a lot of the processing code feels very similar to fitdata.py and looks like it could go there.

"loss": "loss",
}

NODES = (keywords["nodes"], 4)
# 0 = normal scatter plot, 1 = violin, 2 = log
Contributor:

Could this be 'normal', 'scatter', 'violin' without the extra indirection? Or even an enum.Enum?
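A minimal sketch of the enum.Enum alternative (the class and member names are made up):

from enum import Enum

class PlotMode(Enum):
    NORMAL = 0  # normal scatter plot
    VIOLIN = 1
    LOG = 2

mode = PlotMode.VIOLIN
print(mode, mode.value)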


operator = regex_op.findall(filter_string)[0]

if operator == "=":
Contributor:

Do we really require this? Can we not have only ==?

Member Author:

I don't know; to me it is clear that when the user writes "optimizer=adam" they want to check for equality, so there is no need to make their life harder.

raise NotImplementedError("Filter string not valid, operator not recognized {0}".format(filter_string))

# This I know it is not ok:
if isinstance(val_check, str) and isinstance(filter_val, str):
Contributor:

Maybe we could use sympy.parse_expr for these things? @siranipour has been meaning to use it for the filters for a while...

Member Author:

I looked at sympy's sympify but I saw it was using eval under the hood anyway, so I didn't look further.
I also saw pyparsing as an option; having our own expression-parser module might be useful.
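For what it's worth, a minimal sketch of what the sympy.parse_expr route could look like (with the caveat, noted above, that it relies on eval internally):

from sympy import parse_expr

# parse a filter-like expression and evaluate it against a trial value
expr = parse_expr("epochs > 500")
print(bool(expr.subs({"epochs": 900})))  # True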

"""
Receives a data_dict (a parsed trial) and a filter string, returns True if the trial passes the filter

filter string must have the format: key[]string
Contributor:

I find this would be clearer if it said:

filter string must have the format: key <operator> string

and also the regex knew how to discard whitespace (i.e. adding \s* to the patterns).

Member Author:

A space means an extra argument on the command line, and these expressions are passed from the command line.

@@ -0,0 +1,332 @@
"""
This module contains functions dedicated to process the json dictionaries
Contributor:

This kind of algorithmic code could probably use a few tests. Of course, we might minimize the amount of it by using pandas wherever possible.

"""
# If we don't have enough keys to produce n combinations, return empty
if len(key_info) < ncomb:
return []
Contributor:

Should this be an error instead?

Member Author:

No (or not in its current form), since we might want to loop over the number of combinations. If we only have 2 keys but ask for more than 2 to combine, it simply does nothing, and that's fine.

Eventually, when there is a robust form of this algorithm, we might want to be restrictive about what it does, of course.
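A small illustration of why the empty return is harmless (using itertools directly; the key names are made up):

from itertools import combinations

# asking for 3-key combinations out of only 2 keys simply yields nothing
key_info = ["optimizer", "epochs"]
print(list(combinations(key_info, 3)))  # -> []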

@Zaharid (Contributor) commented Aug 17, 2019 via email

@Zaharid (Contributor) commented Aug 17, 2019 via email

@scarlehoff (Member Author):

Overall I think that this indeed needs to be a set of validphys actions (so it's good that we all agree). This is currently inventing an incompatible pipeline that would preclude us from doing fancy things (e.g. plots that correlate the hyperparameters with various estimators computed by vp, such as the closure test ones), or even standard things like getting these plots into the comparefits reports. That said, this code looks to be in a form such that it shouldn't be too complicated to do just that.

Then we could have a small wrapper like comparefits or setupfit that just calls validphys under the hood with various command line options.

Yeah, we can have a go at this in Milan in two weeks.

Not sympify, parse_expr. That does what it should.

That also uses eval. But I have nothing against sympy, I just didn't spend more time with it because my goal was just to see whether I could avoid eval. If we go the sympy route we can then change whatever needs to be changed.

@Zaharid (Contributor) commented Aug 17, 2019 via email

@scarlehoff (Member Author):

Ok, so there are two things missing from this PR that I believe should be done before merging:

  • Justify the algorithm a bit better, maybe with a full discussion in Varenna about it
  • Instead of just plotting things, create a validphys hyperopt-report kind of thing

For the second point I would welcome a bit of help, maybe at the code meeting here in Milano in two weeks, because I have very little experience with plotting things with validphys, and I am sure building a few plots is just a five-minute job for some of you.

Besides that, the hyperoptimization should be a different action as per #519. But I think that would be outside the scope of this PR.

@Zaharid (Contributor) commented Aug 17, 2019

I had a look at the hyperopt library itself and I have to say I am not a huge fan. Not of the documentation, not of the way the code looks, not of the deeply nested dictionaries, not of the dependency on mongodb, not of the number of open issues on github. In the end it seems to me that we are getting a fancy for loop and a bunch of things we have to work around or that we could do better ourselves.

With that in mind, maybe it would make sense to try to isolate it as much as possible in one module, say hyperoptio, and make it easy to switch to a different library, e.g. the keras thing when it becomes a bit more tested.

@ldd69 (Contributor) commented Aug 18, 2019

Do we have a document with the details of the hyper-opt algorithm that is currently implemented in the code?

@scarlehoff (Member Author) commented Aug 18, 2019

Do we have a document with the details of the hyper-opt algorithm that is currently implemented in the code?

@ldd69
The hyperoptimization algorithm implemented in hyperopt is described here: https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization

edit: to a first approximation it is just a random search over thousands of combinations.

Then there is the code for removing the unstable configurations. This code is WIP (I called it hyper_algorithm.py), and even what an "unstable configuration" is needs discussion before it becomes a true algorithm. As I said, right now it basically does what I do by eye, only automatically.

With that in mind, maybe it would make sense to try to isolate it as much as possible in one module, say hyperoptio, and make it easy to switch to a different library, e.g. the keras thing when it becomes a bit more tested.

@Zaharid I agree with your feelings about hyperopt. This is the reasoning behind hyper_scan.py: most of the code is not hyperopt-dependent, and just changing a few lines would render it compatible with other libraries.
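For reference, a minimal, self-contained example of the hyperopt loop that hyper_scan.py wraps (a toy loss, not the n3fit one):

from hyperopt import fmin, hp, tpe

space = {"learning_rate": hp.loguniform("learning_rate", -10, 0)}
best = fmin(
    fn=lambda params: params["learning_rate"] ** 2,  # toy loss to minimize
    space=space,
    algo=tpe.suggest,  # the TPE algorithm from the paper linked above
    max_evals=50,
)
print(best)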

@ldd69 (Contributor) commented Aug 18, 2019

Excellent, thank you! I'll have a look. I'm trying to summarise my thinking in a set of notes.

@scarlehoff (Member Author):

Excellent, thank you! I'll have a look. I'm trying to summarise my thinking in a set of notes.

I'll try to summarize how the code does the whole hyperoptimization thing (from a more pragmatic point of view) in my presentation for Varenna. I'll link it here as well once it's done.

@scarlehoff force-pushed the n3fit-refactor-hyperopt branch from a448fd2 to dc851c5 on April 1, 2020
@scarlehoff merged commit b8aabcd into master on April 1, 2020
@scarrazza deleted the n3fit-refactor-hyperopt branch on April 22, 2020
Labels: n3fit (Issues and PRs related to n3fit)