Custom generator for models in exhaustive feature selector #833
Comments
Implemented a simple version here: #834. It might also be nice to allow the sequential feature selector to use custom logic as well. Again, when adding and deleting categorical variables or interactions, one would want to add or delete groups of features at a time.
Overall, this sounds like a great idea, and I would be in favor of such a solution for both the exhaustive and sequential feature selectors. Refactoring this into custom iterators seems like a very elegant solution. We could then have a helper function that reproduces the current behavior, along with helpers that generate iterators for datasets with categorical variables. With regard to identifying categorical features, there are many options, but what do you think of adopting the approach used in scikit-learn's HistGradientBoostingClassifier (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html)? Sorry, I am currently moving and may not be super responsive in the next 1-2 weeks, but I just wanted to say that your proposal would be a very nice feature.
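For reference, the scikit-learn estimator mentioned above flags categorical columns through a `categorical_features` constructor argument, which the selectors here could mimic. A minimal illustration (the three-column layout is just an example):

```python
from sklearn.ensemble import HistGradientBoostingClassifier

# Mark columns 0 and 2 of X as categorical, by column index...
clf = HistGradientBoostingClassifier(categorical_features=[0, 2])

# ...or equivalently with a boolean mask over all columns of X.
clf = HistGradientBoostingClassifier(categorical_features=[True, False, True])
```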
Let me take a look at the scikit-learn example. On a little further reflection, it seems possible to do both sequential and exhaustive using almost identical code that generates candidates from a current "state". For exhaustive, the candidates would not depend on the state but would just continue along a generator, while for sequential the set of candidates would depend on an updated state. State updates could be applied by applying a function to the returned scores from the previous set of candidates, i.e. sequential's next state would be the maximizer from the previous set of candidates. Getting both done this way may be too ambitious to start, so I will try to flesh out the exhaustive one first...
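A rough sketch of that idea (all names here are hypothetical, and `score` stands in for whatever cross-validation scoring the selectors already do):

```python
from itertools import combinations

# Exhaustive: the candidate stream ignores the state and simply
# enumerates every subset within a fixed size range.
def exhaustive_candidates(state, n_features, min_size=1, max_size=None):
    max_size = n_features if max_size is None else max_size
    for k in range(min_size, max_size + 1):
        yield from combinations(range(n_features), k)

# Sequential (forward): candidates are the current state plus exactly
# one not-yet-included feature.
def forward_candidates(state, n_features):
    for j in range(n_features):
        if j not in state:
            yield tuple(sorted(state + (j,)))

# Shared driver: score a batch of candidates, then update the state to
# the maximizer, exactly as described above.
def run(candidate_fn, score, n_features, n_rounds):
    state = ()
    for _ in range(n_rounds):
        scored = {c: score(c) for c in candidate_fn(state, n_features)}
        state = max(scored, key=scored.get)
    return state
```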
Thanks a lot for the PR, this is very exciting! Big picture-wise, there are a few thoughts.
Currently it only supports the standard forward and backward modes. I assume that 'both' means it first runs forward and finds the best set, then runs backward (independently) to find its best set, and the overall best set is determined by comparing the results from forward and backward? This is actually a neat addition.
Amazing work, though. What you put together here is really exciting and impressive!
Describe the workflow you want to enable
I'd like to make it easier to do best-subsets selection with categorical features. For simplicity, let's start by assuming an additive model, so for each feature there is a set of columns in the design matrix associated with that feature. When all features are continuous, each feature is associated with a single column; otherwise there is a feature grouping that can be described as a sequence of length `X.shape[1]` assigning columns to a particular feature (see the sketch below). More generally, this sequence assigning columns to features could also include interactions of both continuous and categorical variables.
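To make the grouping concrete, here is a hypothetical layout: a continuous feature, a one-hot-encoded three-level categorical feature, and a second continuous feature, giving five design-matrix columns:

```python
import numpy as np

# groups has length X.shape[1]; entry i names the feature that column i
# belongs to.  Feature 1 is the one-hot-encoded categorical, so its three
# columns share the same label and must be added or dropped together.
groups = np.array([0, 1, 1, 1, 2])

cols_of_feature_1 = np.flatnonzero(groups == 1)  # array([1, 2, 3])
```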
Describe your proposed solution
It is (at least in some corners) common practice to include either all columns associated with a categorical feature or none of them. This could be encoded in the `candidates` list. If interactions were permitted, some conventions only include an interaction if both main effects are also included. While the logic of which candidates to generate may be user-specific, it seems that if we could supply a custom iterator for `candidates`, then most of the code should not need to be modified. Instead of `custom_names`, each particular candidate may have its own identifier, so one could specify whether the iterator produces simply indices or (indices, identifier) pairs.
This would remove the need for the `min_features`/`max_features` arguments, as these would be encoded into the iterator itself. So perhaps helper functions to produce at least a few common candidate iterators could be included: specifically, one which produces the default "all continuous" iterator, and one which could easily handle an additive model with possibly some categorical variables, as sketched below.
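A sketch of what such a helper could look like for the additive case. The name `additive_candidates` and the `(indices, identifier)` convention are assumptions for illustration, not existing API:

```python
from itertools import combinations

def additive_candidates(groups, min_features=1, max_features=None):
    """Yield (column_indices, identifier) pairs in which whole feature
    groups enter or leave together; the min/max bounds live inside the
    iterator rather than in the selector."""
    feature_ids = sorted(set(groups))
    if max_features is None:
        max_features = len(feature_ids)
    for k in range(min_features, max_features + 1):
        for subset in combinations(feature_ids, k):
            cols = tuple(i for i, g in enumerate(groups) if g in subset)
            yield cols, "+".join(str(f) for f in subset)

# With groups = [0, 1, 1, 1, 2], the candidate covering features 0 and 1
# yields the column tuple (0, 1, 2, 3) and the identifier "0+1".
```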
Describe alternatives you've considered, if relevant
I've considered simply wrapping `R` functions like `regsubsets` that easily handle categorical variables. I would prefer an sklearn-aware version that could do this as well.
Additional context