n3fit - Refactor of hyperopt #516
Conversation
Right now the part regarding the generation of the trials should be (is) complete. The part that is missing is to make the script that generates the plots and plays along with the trials a bit better (right now it is a functional mess). @wilsonmr in order to implement new things to hyperopt, this PR (at this point, plus changes regarding the code structure but not functionality) should be the way to go. As I say in one of the TODOs, eventually it might be good to migrate both this and the reading of the n3fit runcard to Report Engine, but until that happens I think this is how the hyperopt implementation will look for a while...
I really need to get out of the habit of writing such long comments. OK, this is also relevant for #514. At the moment it seems to me that the
I think it would be good if things were split up a bit like this:

def initialised_seeds(
    replica, fitting["trvlseed", "nnseed", "mcseed", "genrep"]
):  # -> seeds and gen_rep bool

def loaded_data(experiments, t0set):  # -> all_exp_infos

def loaded_positivity_data(posdatasets):  # -> pos_info

# might need to make this for single replica
def performfit(
    initialised_seeds,
    loaded_data,
    loaded_positivity_data,
    replica, replica_path,
    fitting["basis"],
    fitting["load", "loadfile"],
    fitting["parameters"],
):  # -> result or results

# and this for single replica
def storefit(performfit):  # -> stored fit or fits, and compute arc lengths etc.

# and then do this
storefits = collect(storefit, replicas)

hyperopt_function(....): ...

It makes it clear that the experiments which are parsed from the runcard are used in just one small section of the code, and so if I want to make some change (like adding the possibility to use pseudodata instead) then it's way easier to know where to do this and see exactly what it impacts. In the case of closure data I know that, regardless of using real or closure data, loaded_data must return the loaded data/pseudodata. I was also wondering if... Right now this isn't a major concern, but the only thing with splitting this up would be working out what to do about the backend imports within:

class backendimports:
    def __init__(self, *args):
        # some logic based on args:
        import whatever
        self.ModelTrainer = whatever

def import_provider(backend):
    return backendimports(backend)

def fit(import_provider, *args): ...

AFAICT though there really is no point having these imports in the function anyway - unless you have another backend in development? In practice I wonder if at the level of
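For what it's worth, a minimal sketch of such a lazy backend-import provider (the module path and class names below are placeholders, not the actual n3fit layout):

import importlib

class BackendImports:
    # Resolve backend-specific objects once, based on the requested backend
    def __init__(self, backend="tensorflow"):
        # hypothetical module path; the real n3fit package layout may differ
        module = importlib.import_module(f"n3fit.backends.{backend}")
        self.ModelTrainer = module.ModelTrainer

def import_provider(backend):
    return BackendImports(backend)

def fit(import_provider, *args):
    # the fit no longer needs to do any backend imports itself
    trainer_class = import_provider.ModelTrainer
    ...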
agree with this, although I don't think
Also with this. The only thing to remember is that the
well, I would say storefits should get the
Here I disagree. hyperopt_function should be part of performfit, as a flag that will force the fit to run many times.
100% historical reasons.
The imports in fit.py are only two:
Now, all of this I would say should be made into an issue, because I believe it is relevant and should be done (sooner rather than later even if it is not a top priority, as it would help with the modularization of the code), but it should go in a separate PR. All of this should be 100% independent of the hyperoptimization procedure.
Yeah, sorry, I didn't mean for it to return anything; I was just saying what the function was doing.
But the point is in everything inside the
which I would say goes against the point of actions.
Yeah, I will try to summarise in an issue - the only thing I would say is that my view of how hyperoptimization should be a separate action could happen here
You can have a If you want to break out hyperopt from performfit it should be done before, i.e.,
I would rather have a second PR with "make hyperoptimization into an action" because there are several things that should be carefully thought about in order not to lose generality, and because that should not affect the functionality. *Edit: it is more complicated than this if you want to avoid rerunning things as much as possible. As with the parallel replicas, I would need to actually sit down for a while in order to have a clear idea of what would make me happy.
Hmm, I think it wouldn't be so difficult to add a validphys argcheck that makes sure of that. Approximately (the error raised here looks weird when I tested it, but at least it stops pretty early with a nicer error message rather than getting all the way to
Just means that if the runcard doesn't have hyperscan and it should, then it'll raise an error much sooner - quality of life.
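A minimal sketch of the kind of early check meant here (the function and argument names are illustrative, not the actual validphys config API):

def check_hyperscan_present(hyperscan=None, hyperopt=None):
    # Illustrative early check: a hyperopt run needs a hyperscan dictionary in the runcard
    if hyperopt and hyperscan is None:
        raise ValueError(
            "--hyperopt was requested but the runcard has no 'hyperscan' section"
        )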
Now having said that I think you could add to the
I think that
Should we add
dictionaries = filter(filter_function, dictionaries)

# Now fill a pandas dataframe with the survivors
dataframe_raw = pd.DataFrame(dictionaries)
For the version of pandas I tested this on, dictionaries cannot be an iterator but needs to be explicitly cast:
pd.DataFrame(list(dictionaries))
But the output of filter should always be an iterator, which is an iterable... which version of pandas were you using?
Maybe the error was coming from somewhere else and that broke the filter?
I am getting something like:
In [13]: pd.DataFrame(iter(l))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-13-b81493853882> in <module>
----> 1 pd.DataFrame(iter(l))
~/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
405 mgr = self._init_dict({}, index, columns, dtype=dtype)
406 elif isinstance(data, collections.Iterator):
--> 407 raise TypeError("data argument can't be an iterator")
408 else:
409 try:
TypeError: data argument can't be an iterator
In [14]: pd.__version__
Out[14]: '0.23.4'
Interesting. They fixed that for 0.24
But it doesn't hurt to cast the filter to a list. I just wanted to make sure the error was not coming from something else.
Pandas is in my bad books for breaking interface way too often with things like this. But yeah, I see that it works again in 0.25.
It was already added in the previous PR. They both should be installed with the nnpdf package.
I've added this error. I think it looks ok.
# Now filter out the ones we don't want
for filter_function in filter_functions:
    dictionaries = filter(filter_function, dictionaries)
This maybe works, but it is potentially a headache (I had to run a couple of tests to make sure I understood what it's actually doing). Note that filter is a lazy iterator that doesn't consume the argument, so it = filter(func, it) saves a reference to the old iterable inside the return value of filter and reassigns it to be the filtered iterator. This is a bit inefficient in that you have to save all the intermediate iterators inside the recursive filter, but, more importantly, it is also more subtle than it looks: if you change something so that next(dictionaries) gets called somewhere else, it will cause hard-to-understand bugs. Instead, you should do something like [d for d in dictionaries if all(f(d) for f in filter_functions)] (filter with the corresponding lambda would be fine, but I bet that it is slower).
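As a small self-contained illustration of the eager version (the trial dictionaries and filters below are made up for the example):

dictionaries = [
    {"optimizer": "Adam", "epochs": 900},
    {"optimizer": "Adadelta", "epochs": 400},
]
filter_functions = [
    lambda d: d["optimizer"] != "Adadelta",
    lambda d: d["epochs"] > 500,
]

# One eager pass keeping only the trials that satisfy every filter;
# no chain of intermediate iterators is left behind.
surviving = [d for d in dictionaries if all(f(d) for f in filter_functions)]
# surviving == [{"optimizer": "Adam", "epochs": 900}]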
But it looked so nice :(
But I see the problem.
Casting it to a list (per iteration) solves the problem? (i.e., like it is in the latest version)
Or would you rather have something along the lines of:

for filter_func in filters:
    new_dicts = []
    for i in dictionaries:
        if filter_func(i):
            new_dicts.append(i)
    dictionaries = new_dicts

?
(or the line you wrote, I am a bit dense today, I think I lost the ability to read past the 6th line of any comment)
I like the line I wrote better, because it is very clear that it requires all filters. Instead I was going to write that what you did is wrong because it looked like any at first sight (but then I realized you reassign dictionaries). It is a lot of intermediate lists again though.
These last few commits (adding the --autofilter option) deal with the last* of the to-dos I wrote for this PR. Motivated by #527. This trimming algorithm is mostly empirical (I wanted something that would automatically do the same job I would do by looking at the hyperopt plot and selecting which options to filter). I called it --autofilter. I'm happy with it in the sense that it does what I want. I am unhappy in the sense that the parameters are set 100% manually (that's why I was planning not to port it to this repository, but it was actually useful when choosing the best models so it can stay...)
*last in numerical order
Overall I think that this indeed needs to be a set of validphys actions (so it's good that we all agree). This is currently inventing an incompatible pipeline that would preclude us from doing fancy things (e.g. plots that correlate the hyperparameters with various estimators computed by vp, such as the closure test ones), or even standard things like getting these plots into the comparefits reports. That said, this code looks to be in a form where it shouldn't be too complicated to do just that. Then we could have a small wrapper like comparefits or setupfit that just calls validphys under the hood with various command line options.
Also a lot of the processing code feels very similar to

    "loss": "loss",
}

NODES = (keywords["nodes"], 4)
# 0 = normal scatter plot, 1 = violin, 2 = log
Could this be 'normal', 'scatter', 'violin' without the extra indirection? Or even an enum.Enum?
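A quick sketch of what the Enum version could look like (the names are illustrative, not code from this PR):

from enum import Enum

class PlotMode(Enum):
    # the three modes currently encoded as the magic numbers 0, 1, 2
    NORMAL = "normal"
    VIOLIN = "violin"
    LOG = "log"

# selecting a mode by name rather than by number
mode = PlotMode.VIOLIN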
operator = regex_op.findall(filter_string)[0]

if operator == "=":
Do we really require this? Can we not have only ==?
Don't know; to me it is clear that when the user writes "optimizer=adam" they will want to check for equality, so no need to make their life harder.
raise NotImplementedError("Filter string not valid, operator not recognized {0}".format(filter_string))

# This I know it is not ok:
if isinstance(val_check, str) and isinstance(filter_val, str):
Maybe we could use sympy.parse_expr for these things? @siranipour has been meaning to use it for the filters for a while...
I looked at sympy's sympify but I saw it was using eval under the hood anyway, so I didn't look further.
I also saw pyparsing as an option; having our own expression-parser module might be useful.
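For reference, a minimal eval-free sketch of this kind of comparison using the standard operator module (the helper name and structure are illustrative, not the code in this PR):

import operator

OPERATORS = {"==": operator.eq, "!=": operator.ne, ">": operator.gt, "<": operator.lt}

def compare(value, op_token, target):
    # Treat a lone '=' as equality, as the filter strings do
    if op_token == "=":
        op_token = "=="
    if op_token not in OPERATORS:
        raise NotImplementedError("operator not recognized: {0}".format(op_token))
    return OPERATORS[op_token](value, target)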
""" | ||
Receives a data_dict (a parsed trial) and a filter string, returns True if the trial passes the filter

filter string must have the format: key[]string
I find this would be clearer if it said:
filter string must have the format: key <operator> string
and also if the regex knew how to discard whitespace (i.e. adding \s* to the patterns).
A space means an extra argument in the command line, and these expressions are passed from the command line.
"""
This module contains functions dedicated to process the json dictionaries
This kind of algorithmic code probably could use a few tests. Of course we might minimize the amount of it by using pandas wherever possible.

"""
# If we don't have enough keys to produce n combinations, return empty
if len(key_info) < ncomb:
    return []
Should this be an error instead?
No (or not in its current form), since we might want to loop over the number of combinations. If we only have 2 keys but ask for more than 2 to combine, it simply does nothing, and that's fine.
Eventually, when there is a robust form of this algorithm, we might want to be restrictive about what it does, of course.
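For illustration (a sketch, not the code in this module): itertools.combinations behaves the same way, yielding nothing when asked for more elements than are available, which is what makes returning an empty result natural here.

from itertools import combinations

key_info = ["optimizer", "epochs"]  # only two keys available
print(list(combinations(key_info, 3)))  # -> [] : no 3-key combinations exist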
Not sympify, parse_expr. That does what it should.
See how that went for bash! Besides, it would mean more issues if we ever go the sympy route.
Yeah, we can have a go at this in Milan in two weeks.
Also uses eval. But I have nothing against sympy, I just didn't spend more time with it because my goal was just to see whether I could avoid eval. If we go the sympy route we can then change whatever needs to be changed.
Sure, but it cleans the expression sufficiently before.
Ok, so there are two things missing from this PR that I believe should be done before merging:
For the second point I would welcome a bit of help, maybe at the code meeting here in Milano in two weeks, because I have very little experience with plotting things with validphys and I am sure building a few plots is just a five-minute job for some of you. Besides that, the hyperoptimization should be a different action as per #519. But I think that would be outside the scope of this PR.
I had a look at the hyperopt library itself and I have to say I am not a huge fan. Not of the documentation, not of the way the code looks, not of the deeply nested dictionaries, not of the dependency on mongodb, not of the number of open issues on github. In the end it seems to me that we are getting a fancy for loop and a bunch of things we have to work around or that we could do better ourselves. With that in mind, maybe it would make sense to try to isolate it as much as possible in one module, say
Do we have a document with the details of the hyper-opt algorithm that is currently implemented in the code?
@ldd69 edit: to first approximation it is just a random search over thousands of combinations. Then there is the code for removing the unstable configurations. This code is WIP (I called it hyper_algorithm.py) and even what an "unstable configuration" is needs discussion before it becomes a true algorithm. As I said, right now it basically does what I do by eye, only automatically.
@Zaharid I agree with your feelings about hyperopt. This is the reasoning behind
Excellent, thank you! I'll have a look. I'm trying to summarise my thinking in a set of notes.
I'll try to summarize how the code does the whole hyperoptimization thing (from a more pragmatic point of view) for Varenna in my presentation. I'll link it here as well once it's done.
It is basically a refactoring of the hyperopt code so that it includes more documentation and is easier to follow.
How to
The first step is to prepare a runcard with a hyperscan dictionary in it. As an example you can see the one in the runcards folder of n3fit: Basic_runcard.yml.
In its current form the hyperoptimization is activated by the flag --hyperopt at run time. It accepts as an argument a number, which is basically the number of trials to run. So, once you have your runcard prepared, all you have to do is:
n3fit my_hyper_runcard.yml [replica_number] --hyperopt [number of trials]
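For instance, a run of 1000 trials for replica 1 of the (hypothetical) runcard above would be:
n3fit my_hyper_runcard.yml 1 --hyperopt 1000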
Each trial consists of a different configuration of the parameters, following the scan defined in hyperscan.
Once the hyperopt scan has finished (it will take slightly less than time_of_a_single_fit * number_of_trials) you will find a file in the nnfit directory:
my_hyper_runcard/nnfit/replica_1/trials.json
With this you can use the n3Hyperplot script to generate the plots:
n3Hyperplot my_hyper_runcard
This will generate a plot per replica. If you want to join all replicas as if they were one single longer run you can do:
n3Hyperplot my_hyper_runcard --combine
If you want to play a bit with the results, maybe remove anything that uses Adadelta as the optimizer and has fewer than 500 epochs:
n3Hyperplot my_hyper_runcard --combine --filter "optimizer=Adadelta" "epochs>500"
The filter option will accept the operators !=, =, <, >.
Note: I generally do the hyperscan with the replicas turned off, but send several runs with a different replica number; this way I get a few .json files with statistically different runs (all the seeding in n3fit depends on the replica number).