
Custom generator for models in exhaustive feature selector #833

Open
jonathan-taylor opened this issue Jun 22, 2021 · 4 comments

@jonathan-taylor

Describe the workflow you want to enable

I'd like to make it easier to do best subsets with categorical features. For simplicity, let's start by assuming an additive model, so each feature corresponds to a set of columns in the design matrix. When all features are continuous, each feature is associated with a single column; otherwise, there is a feature grouping that can be described as a sequence of length X.shape[1] assigning each column to a particular feature. More generally, this sequence could also assign columns to interactions of continuous and categorical variables.
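
To make that concrete, here is a small illustrative example (the variable names are made up, not part of any existing API):

    import numpy as np

    # Hypothetical design matrix with 6 columns:
    #   columns 0-1: dummy coding of a 3-level categorical feature "color"
    #   column  2  : a continuous feature "age"
    #   columns 3-5: dummy coding of a 4-level categorical feature "region"
    X = np.random.default_rng(0).normal(size=(100, 6))

    # The grouping: a sequence of length X.shape[1] assigning each column
    # to a feature index.
    feature_groups = [0, 0, 1, 2, 2, 2]

    # Columns belonging to feature 2 ("region"):
    region_cols = [j for j, g in enumerate(feature_groups) if g == 2]  # [3, 4, 5]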

Describe your proposed solution

It is (at least in some corners) common practice to include either all columns associated with a categorical feature or none of them. This could be encoded in the candidates list. If interactions were permitted, some conventions include an interaction only if both main effects are also included. While the logic of which candidates to generate may be user-specific, it seems that if we could supply a custom iterator for candidates, then most of the code would not need to be modified. Instead of custom_names, each candidate could carry its own identifier, so one could specify whether the iterator yields plain index tuples or (indices, identifier) pairs.

This would remove the need for the min_features/max_features arguments, as these would be encoded in the iterator itself. Perhaps a few helper functions that produce common candidate iterators could be included: specifically, one that reproduces the default "all continuous" iterator, and one that handles an additive model with possibly some categorical variables. A minimal sketch of such helpers is included below.
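
Here is a minimal sketch of what such helpers might look like (the names all_subsets and grouped_subsets are hypothetical, not part of mlxtend):

    from itertools import chain, combinations

    def all_subsets(n_columns, min_features=1, max_features=None):
        """Reproduce the current 'all continuous' behavior: every combination
        of individual columns between min_features and max_features."""
        if max_features is None:
            max_features = n_columns
        for k in range(min_features, max_features + 1):
            for idx in combinations(range(n_columns), k):
                yield idx, idx  # (indices, identifier) pairs

    def grouped_subsets(feature_groups, names=None):
        """Additive model with grouped (e.g. categorical) features: each
        candidate includes all columns of a group or none of them."""
        groups = sorted(set(feature_groups))
        cols = {g: tuple(j for j, gj in enumerate(feature_groups) if gj == g)
                for g in groups}
        for k in range(1, len(groups) + 1):
            for chosen in combinations(groups, k):
                idx = tuple(chain.from_iterable(cols[g] for g in chosen))
                label = chosen if names is None else tuple(names[g] for g in chosen)
                yield idx, label

    # Feature 0 spans columns 0-1, feature 1 is column 2, feature 2 spans 3-5:
    for idx, label in grouped_subsets([0, 0, 1, 2, 2, 2],
                                      names=['color', 'age', 'region']):
        print(label, idx)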

Describe alternatives you've considered, if relevant

I've considered simply wrapping R functions like regsubsets, which handle categorical variables easily. However, I would prefer an sklearn-aware version that could do this as well.

Additional context

@jonathan-taylor
Author

Implemented a simple version here: #834

It might also be nice for the sequential feature selector to support custom logic as well. Again, when adding and deleting categorical variables or interactions, one would want to add or delete groups of features at a time; see the sketch below.
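
A hedged sketch of what a grouped forward step could look like (the helper name and signature are invented here for illustration):

    def forward_group_candidates(current, feature_groups):
        """From the currently selected columns, yield candidates obtained by
        adding one whole, not-yet-selected group of columns at a time."""
        current = set(current)
        for g in sorted(set(feature_groups)):
            cols = {j for j, gj in enumerate(feature_groups) if gj == g}
            if not cols & current:  # group not selected yet
                yield tuple(sorted(current | cols)), g

    # With columns 3-5 forming one categorical feature, a single forward step
    # either adds all three columns or none of them:
    print(list(forward_group_candidates(current=(2,),
                                        feature_groups=[0, 0, 1, 2, 2, 2])))
    # [((0, 1, 2), 0), ((2, 3, 4, 5), 2)]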

@rasbt
Owner

rasbt commented Jun 23, 2021

Overall, this sounds like a great idea, and I would be in favor of such a solution for both the exhaustive and sequential feature selectors. Refactoring this into custom iterators seems like a very elegant solution. We would then have a helper function that reproduces the current behavior, along with helpers that generate iterators for datasets with categorical variables.

With regard to identifying categorical features, there are many options, but what do you think of adopting the approach used in scikit-learn's HistGradientBoostingClassifier (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html)?
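
For reference, that estimator marks categorical columns with either a boolean mask or a list of column indices, roughly like this (an illustrative snippet with made-up data, assuming scikit-learn >= 1.0):

    import numpy as np
    from sklearn.ensemble import HistGradientBoostingClassifier

    rng = np.random.default_rng(0)
    X = np.column_stack([
        rng.integers(0, 3, size=200),  # column 0: ordinal-encoded categorical
        rng.normal(size=200),          # column 1: continuous
        rng.normal(size=200),          # column 2: continuous
    ])
    y = rng.integers(0, 2, size=200)

    # Boolean mask (a list of column indices such as [0] also works):
    clf = HistGradientBoostingClassifier(categorical_features=[True, False, False])
    clf.fit(X, y)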

Sorry, I am currently moving and may not be super responsive over the next 1-2 weeks, but I just wanted to say that your proposal would be a very nice feature.

@jonathan-taylor
Author

Let me take a look at the scikit-learn example. On a little further reflection, it seems possible to do both sequential and exhaustive selection with almost identical code that generates candidates from a current "state". For exhaustive, the candidates would not depend on the state but would just continue along a generator, while for sequential the set of candidates would depend on an updated state. State updates could be applied by applying a function to the returned scores from the previous set of candidates; i.e., sequential's next state would be the best-scoring candidate from the previous set.
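
A rough sketch of that shared structure (all names here are invented for illustration; this is not the PR's API):

    def search(initial_state, candidate_fn, score_fn, update_fn, max_steps):
        """Shared driver: generate candidates from the current state, score
        them, and apply a user-supplied rule to produce the next state."""
        state = initial_state
        for _ in range(max_steps):
            candidates = list(candidate_fn(state))
            if not candidates:  # generator exhausted -> stop
                break
            scores = [score_fn(c) for c in candidates]
            state = update_fn(state, candidates, scores)
        return state

    # Sequential (forward) flavor: the next state is the best-scoring candidate
    # from the previous batch; in the exhaustive flavor, candidate_fn ignores
    # `state` and simply keeps drawing batches from a pre-built generator.
    def best_candidate(state, candidates, scores):
        return max(zip(scores, candidates))[1]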

Getting both done this way may be too ambitious to start. I will try to flesh out the exhaustive one first...

@rasbt
Owner

rasbt commented Oct 6, 2021

Thanks a lot for the PR, this is very exciting!

Big picture-wise, there are a few thoughts.

  1. What do we do with the existing ExhaustiveFeatureSelector and SequentialFeatureSelector? We could deprecate them, that is, remove them from the documentation but leave them in the code for a few versions / years.

  2. If we do deprecate the existing SFS, two missing features would be floating-forward and floating-backward. I think right now, via

    for direction in ['forward', 'backward', 'both']:
        strategy = step(X,
                        direction=direction,
                        max_features=p,
                        fixed_features=[2, 3],
                        categorical_features=categorical_features)

it only supports the standard forward and backward directions. I assume that 'both' means it runs forward first and finds the best set, then runs backward (independently) to find its best set, and the overall best set is determined by comparing the forward and backward results? This is actually a neat addition.

  3. If we deprecate the existing selectors and add the floating variants, I think the only thing we need to ensure is that the new implementation remains compatible with scikit-learn pipelines and maybe GridSearchCV.

Amazing work, though. What you put together here is really exciting and impressive!
