ENH extensible parameter search results #1787
Conversation
Could you rebase please? It would be great if @GaelVaroquaux, @ogrisel and @larsmans could comment, as this is core API stuff :)
This looks great! If the tests pass, I think this is good to go (plus the rebase, obviously).
There are probably some examples that need updating and possibly also the docs.
Rebased. Any pointers to examples and docs needing updates?
Well, the grid-search narrative documentation and the examples using grid-search, probably.
As an aside (perhaps subject to a separate PR), I wonder whether we should return the parameters as a structured array (rather than dicts). So, rather than

    array([({'foo': 5, 'bar': 'a'}, 1.0), ({'foo': 3, 'bar': 'a'}, 0.5)],
          dtype=[('parameters', '|O4'), ('test_score', '<f4')])

it would be:

    array([((5, 'a'), 1.0), ((3, 'a'), 0.5)],
          dtype=[('parameters', [('foo', '<i4'), ('bar', '|S1')]), ('test_score', '<f4')])

This allows us to easily query the data by parameter value:

    >>> grid_results_['parameters']['foo'] > 4
    array([ True, False], dtype=bool)

Note this would also apply to randomised searches, helping towards #1020 where a solution like #1034 could not. This approach, however, doesn't handle grid searches with multiple grids (i.e. passing an array of dicts to …). WDYT?
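A minimal runnable sketch of the nested-dtype idea above, assuming the field names shown (`parameters`, `test_score`) and toy parameter values; this is only an illustration, not the PR's implementation:

```python
import numpy as np

# Parameters stored as a nested structured dtype instead of Python dicts
# (hypothetical layout following the comment above).
grid_results_ = np.array(
    [((5, 'a'), 1.0), ((3, 'a'), 0.5)],
    dtype=[('parameters', [('foo', '<i4'), ('bar', 'S1')]),
           ('test_score', '<f4')])

# Vectorised querying by parameter value:
mask = grid_results_['parameters']['foo'] > 4
print(mask)                               # [ True False]
print(grid_results_['test_score'][mask])  # [ 1.]
```

The trade-off, as noted above, is that a single fixed dtype cannot represent multiple grids with different parameter names.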
This last commit (b18a278) moves cross-validation evaluation code into a single place. It provides a consistent and enhanced structured-array result style for non-search CV evaluation; so far this is only regression-tested.
From your many pull requests, I think this is the one that I really want to merge first. Did you think about my proposal to rename …?
    Contains scores for all parameter combinations in param_grid.
    Each entry corresponds to one parameter setting.
    Each named tuple has the attributes:
    `grid_results_` : structured array of shape [# param combinations]
This shows that `grid_results_` is not a good name, as it is not a grid here.
Refactoring the … What made you add the …? What I don't like about this is that suddenly the concept of parameters appears in …
I just realised I didn't answer the question "What made you add the …?" Clearly it encapsulates the parallel fitting and the formatting of results. I also thought users of … But a re-entrant setup was most important for the context of custom search algorithms (not in this PR; see https://github.com/jnothman/scikit-learn/tree/param_search_callback), where the …
I also considered making …
Thanks a lot for the feedback :) Sounds sensible to me. I'll do a fine-grained review ASAP ;)
    def fit_fold(estimator, X, y, train, test, scorer,
                 fit_params):
        """Inner loop for cross validation"""
        n_samples = X.shape[0] if sp.issparse(X) else len(X)
If this function is public, it should be in the references (doc/modules/classes.rst).
Sure, I can add things to …
also, add iid parameter to cross_val_score
I think we should always add stuff to … Just …
I think the structure of … @larsmans, @GaelVaroquaux: I'd really like you to have a look if you find the time.
I wrote about this to the ML in order to weigh up alternatives and potentially get wider consensus. I don't think structured arrays are used elsewhere in scikit-learn, and I worry that while appropriate, they are a little too constraining and unfamiliar.
It's not relevant to the rest of the proposal, but I've decided …
Sorry for the late feedback. I will try to have a look at this PR soon, as I am currently working with …
I assigned this PR to Milestone 0.14, as the new …
@ogrisel: With regard to your comments on the ML, would we like to see the default storage / presentation of results as: …?
I would prefer a list of dicts with: …

And later we would let the user compute additional attributes using a callback API, for instance to collect complementary scores such as per-class precision, recall and F1 score, or full confusion matrices. Then make the search result interface compute the grouped statistics and rank models by mean validation error by grouping on the `parameters_id` field.
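For concreteness, a rough sketch of what such per-fold log records might look like; the exact field names (`parameters_id`, `fold`, `train_size`, `test_size`, `test_score`) are assumptions based on this discussion, not a final schema:

```python
# Hypothetical raw log: one plain dict per (candidate, fold) evaluation.
# Extra fields collected via a callback (e.g. per-class scores) would
# simply be additional keys on each record.
fold_log = [
    {'parameters_id': 0, 'parameters': {'C': 1.0}, 'fold': 0,
     'train_size': 120, 'test_size': 60, 'test_score': 0.80},
    {'parameters_id': 0, 'parameters': {'C': 1.0}, 'fold': 1,
     'train_size': 120, 'test_size': 60, 'test_score': 0.84},
    {'parameters_id': 1, 'parameters': {'C': 10.0}, 'fold': 0,
     'train_size': 120, 'test_size': 60, 'test_score': 0.90},
]

# Selecting the records for one candidate is a simple filter:
candidate_0 = [r for r in fold_log if r['parameters_id'] == 0]
```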
That structure makes a lot of sense in terms of asynchronous parallelisation… I'm still not entirely convinced it's worthwhile having each fold available to the user as a separate record (which is providing the output of map, before reduce). I also don't think train and test fold size necessarily need to be in there if we are using the same folds for every candidate. I guess what you're trying to say is that this is the nature of our raw data: a series of fold records. And maybe we need to make a distinction between: …
My suggestion of structured arrays was intended to provide compact in-memory storage with easy, flexible and efficient access, but it still required per-fold intermediate records. Let's say that we could pass some kind of …
Each of these needs to: …
I don't really think that first point should be necessary. If we have an asynchronous processing queue, we will still expect folds for each candidate to be evaluated roughly at the same time, so grouping can happen more efficiently by handling it straight off the queue (storing all the fold results temporarily in memory) rather than in each …
In short: I can't think of a use-case where a user wants per-fold data to be in a list. In an iterable coming off a queue, yes. In a relational DB, perhaps. (Grouped by candidate, certainly.)
It is for failover if some parameter sets generate ill-conditioned optimization problems that are not numerically stable across all CV folds. That can apparently happen with SGDClassifier and GBRT models. Dealing with missing evaluations is very useful, even without async parallelization.
This statement is false if we would like to implement the "warm start with growing number of CV folds" use case.
Implementing fault-tolerant grid search is one, iteratively growable CV folds is another (warm restarts with a higher number of CV iterations). I wasted a couple of grid search runs (lasting 10 minutes each) yesterday precisely because of those two missing use cases. So they are not made-up use cases: as a regular user of the lib I really feel the need for them. Also implementing learning curves with a variable …

In short: the dumb fold-log-records data structure is so much simpler and more flexible, allowing the implementation of additional use cases in the future (e.g. learning curves and warm restarts in any dimension), that I think it should be the base underlying data structure we collect internally, even if we expect the user to rarely need to access it directly but rather through the results_ object. For instance we could have: …
The results log can be kept if we implement warm restarts. The results_summary_ will have to be reset and recomputed from the updated log. The end-user API can still be made simple by providing a results object that can do the aggregation, and even output the structured-array data structure you propose if it proves really useful from an end-user API standpoint.
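A hedged sketch of that recompute-from-log idea; the function name and record keys are hypothetical, and it only shows that the summary can be rebuilt after appending fold records (warm restart) or with some records missing (failed fits):

```python
import numpy as np

def summarize(fold_log):
    """Rebuild per-candidate summary statistics from the raw fold log."""
    # 'parameters_id', 'parameters' and 'test_score' are hypothetical keys,
    # following the record sketch earlier in this thread.
    grouped = {}
    for rec in fold_log:
        entry = grouped.setdefault(rec['parameters_id'],
                                   {'parameters': rec['parameters'],
                                    'scores': []})
        entry['scores'].append(rec['test_score'])
    summary = [{'parameters': e['parameters'],
                'mean_test_score': float(np.mean(e['scores'])),
                'n_folds': len(e['scores'])}
               for e in grouped.values()]
    best_index = int(np.argmax([e['mean_test_score'] for e in summary]))
    return summary, best_index

# After a warm restart, new fold records are appended to the log and
# summarize() is simply run again; nothing else needs to change.
```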
Also, I don't think memory efficiency will ever be an issue: even with millions of evaluations, the overhead of Python dicts and Python object references is pretty manageable in 2013 :)
Assuming you're not collecting other data; but in that case you're right, the dict overhead will make little difference, and I'm going on about nothing. For fault tolerance there's still sense in storing some data on disk, though. I'll think about how best to transform this PR into something like that.
So from master, the things that IMO should happen are: …
And again, I should point out that one difficulty with dicts is that our names for fields in them cannot have deprecation warnings, so it's a bit dangerous making them a public API...
That's a valid point I had not thought of.
So we could make them custom objects, but they're less portable. I can't yet think of a nice solution there, except to make the … (And not being concerned by the memory consumption of dicts, your comment on the memory efficiency of namedtuples in the context of …)
Sounds good. Also +1 for using … I would like to have other people's opinions on our discussion, though. Apparently people are pretty busy at the moment. Let's see: ping @larsmans @mblondel @amueller @pprett @glouppe @arjoly @vene. I know @GaelVaroquaux is currently traveling at conferences. We might have a look at this during the SciPy sprint next week with him and @jakevdp.
Indeed, it's just that I added a …
I think the discussion is a bit hard to navigate, and it would be more sensible to present a cut-back PR: #2079. I'll close this one, as it seems we're unlikely to go with its solution.
`GridSearch` and friends need to be able to return more fields in their results (e.g. #1742, composite score). More generally, the conceivable results from a parameter search can be classified into: …

(… `best_params_`, `best_score_`, `best_estimator_`; however `best_params_` and `best_score_` are redundantly available in `grid_scores_` as long as the index of the best parameters is known.)

Hence this patch changes the output of a parameter search to be (attribute names are open for debate!):

- `grid_results_` (1.): a structured array (a numpy array with named fields) with one record per set of parameters
- `fold_results_` (2.): a structured array with one record per fold per set of parameters
- `best_index_` (3.)
- `best_estimator_` if `refit == True` (3.)

The structured arrays can be indexed by field name to produce an array of values; alternatively they can be indexed as an array to produce a single record, akin to the `namedtuple`s introduced in 0c94b55 (not in 0.13.1). In any case this allows numpy vectorised operations, as used here when calculating the mean score for each parameter setting (in `_aggregate_scores`).

Given this data, the legacy `grid_scores_` (already deprecated), `best_params_` and `best_score_` are calculated as properties.

This approach is extensible to new fields, in particular new fields within `fold_results_` records, which are compiled from dicts returned from `fit_fold` (formerly `fit_grid_point`).

This PR is cut back from #1768; there you can see this extensibility exemplified to store training scores, training and test times, and precision and recall together with F-score.
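To make the proposed access pattern concrete, here is a small sketch using a hand-built `fold_results_`-style array; the field names and values are illustrative only, not the merged API:

```python
import numpy as np

# Hypothetical fold_results_: one record per fold per parameter setting
# (two candidates x two folds).
fold_results_ = np.array(
    [({'C': 1.0}, 0, 0.80), ({'C': 1.0}, 1, 0.84),
     ({'C': 10.0}, 0, 0.90), ({'C': 10.0}, 1, 0.88)],
    dtype=[('parameters', object), ('fold', '<i4'), ('test_score', '<f4')])

# Indexing by field name gives an ordinary array, so aggregation over
# folds is a vectorised operation:
scores = fold_results_['test_score'].reshape(2, 2)   # candidates x folds
mean_scores = scores.mean(axis=1)                    # ~[0.82, 0.89]
best_index_ = int(mean_scores.argmax())              # 1

# Indexing by position gives a single record, much like a namedtuple:
print(fold_results_[0])  # e.g. ({'C': 1.0}, 0, 0.8)
```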