How to give sample weights? #288
Could you please describe your use case? If the data is imbalanced, auto-sklearn configures whether to activate the class_weight feature.
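The distinction matters for this thread: `class_weight` assigns one weight per class, while `sample_weight` assigns one weight per individual row. A minimal scikit-learn sketch of the two (the data here is made up purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [0.5], [1.0], [4.0], [4.5], [5.0]])
y = np.array([0, 0, 0, 0, 1, 1])  # imbalanced: four 0s, two 1s

# class_weight: one weight per class, applied to every sample of that class
clf_cw = LogisticRegression(class_weight="balanced").fit(X, y)

# sample_weight: one weight per row, independent of the class
w = np.array([1.0, 1.0, 1.0, 1.0, 5.0, 5.0])
clf_sw = LogisticRegression().fit(X, y, sample_weight=w)
```

Auto-sklearn's balancing covers the first case; this issue asks for the second.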
I have some samples that I really need to get right, so I want to assign a larger penalty for getting those ones wrong. Many of the sklearn classifiers take a `sample_weight` argument.
I see your point. Sample weights would then have to be passed to every method in auto-sklearn, especially the final ensemble-building procedure, right? However, I think this would contradict the principle of auto-sklearn, which tries to optimize a user-given metric. Do you think you could create a custom metric that penalizes a solution for missing your important data points?
That is sort of what I'm asking how to do. I tried making a custom metric (an F3 score) that takes sample weights. However, when I fit with that, I get errors about incompatible sizes, because the metric doesn't know which samples are in the holdout set: the sample-weight array has the shape of the full training data, while the metric only sees the holdout predictions.
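A pure-numpy sketch of what such a weighted metric could look like (the function name and defaults here are illustrative, not the poster's actual code; a score function like this would then have to be wrapped for auto-sklearn, e.g. via its custom-metric mechanism):

```python
import numpy as np

def weighted_fbeta(y_true, y_pred, sample_weight=None, beta=3.0):
    """Weighted F-beta score for binary 0/1 labels (illustrative helper)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if sample_weight is None:
        w = np.ones(y_true.shape[0], dtype=float)
    else:
        w = np.asarray(sample_weight, dtype=float)

    # weighted confusion-matrix entries
    tp = np.sum(w * ((y_true == 1) & (y_pred == 1)))
    fp = np.sum(w * ((y_true == 0) & (y_pred == 1)))
    fn = np.sum(w * ((y_true == 1) & (y_pred == 0)))

    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    if precision + recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

The size error in the comment above comes from exactly this interface: auto-sklearn only hands the metric the holdout predictions, so a weight vector sized for the full training set cannot line up with them.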
Thanks for pointing that out. It would be great to have auto-sklearn accept sample weights for the scoring functions. However, I will not be able to implement this feature in the coming weeks. If you want to contribute it, I would be happy to assist.
Sure, I could give it a shot if you could point me in the direction of where I would need to make the changes. |
Great. Here's a brief tour:

- auto-sklearn stores the data in the class `AbstractDataManager`, from which `XYDataManager` is derived. The data managers are used to persist the data on disk; weights of the data points would have to be persisted, too.
- The data managers are loaded by the evaluation module, which takes care of restricting the runtime and memory usage of the target algorithm.
- The weights would then be used in the evaluator class, where the optimization loss is calculated. These are the pieces of code that need to change in order to influence the optimization procedure.
- Furthermore, you would need to change the calls to the scoring function in the ensemble builder and ensemble selection. Those two will be a bit trickier, as they rely on the correct ordering of the data (the ordering changes with the resampling strategy). You can have a look at how the targets are built there to accommodate the change of order.

I hope this is not too complicated and gives a good overview of where the code needs to be changed. In general, a search for calls of the scoring function should reveal any remaining places.

One more note: I will probably have no time to reply tomorrow and will be out of office for a few days afterwards, so I might not reply until next Wednesday.
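The sorting issue mentioned for the ensemble builder can be illustrated with a small sketch (the variable names below are illustrative, not actual auto-sklearn internals): whatever indices the resampling strategy used to build the holdout set must also be used to slice the weight vector before the metric is called.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10
sample_weight = rng.random(n_samples)   # weights for the full dataset

# a resampling strategy shuffles the rows before splitting
indices = rng.permutation(n_samples)
train_idx, test_idx = indices[:7], indices[7:]

# the metric only sees holdout predictions, so the weights must be
# sliced with the same indices to stay aligned with those predictions
w_holdout = sample_weight[test_idx]
```

Passing the full `sample_weight` to the metric instead of `w_holdout` reproduces the size mismatch reported earlier in the thread.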
Hey, has the sample_weight problem been solved now? Thanks
It would be great if we could add a sample_weight that represents the confidence of each data point.
@mfeurer Hi, regarding passing the right sample weights, I'm thinking we could leverage the index of the input data. I do know that auto-sklearn converts X and y into numpy arrays, which drops that index.
We convert pandas dataframes to numpy arrays because we never found the time to update our packages to accept pandas. I'd be happy about pull requests overcoming this issue. |
Just throwing in another use case: I'm doing multioutput regression and there are some missing values in y.
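If sample weights were supported, one way to handle this use case would be to zero out rows with missing targets; a sketch, assuming a row-level weight is acceptable (per-output weighting would need a 2-D mask instead):

```python
import numpy as np

# multioutput regression targets with a partially missing row
y = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, 5.0]])

# give zero weight to any row with a missing target so it cannot
# influence the fitted model or the metric
row_missing = np.isnan(y).any(axis=1)
sample_weight = np.where(row_missing, 0.0, 1.0)
print(sample_weight)  # -> [1. 0. 1.]
```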
Has anyone had a chance to create the PR adding sample_weight to the search for the best configuration? Cheers.
@mfeurer One of your commits is relevant here: 6de26d7, where you introduce "Feature: weighting for imbalanced classes". Is it possible to use sample weights with that feature? I am perplexed, especially since the commit is old and the discussion on this thread is much more recent. Cheers.
Hey @simonprovost, no, unfortunately, there has not yet been any progress on this. We'd be happy about a contribution, otherwise, we'll discuss in our next offline meeting whether we can increase the priority on this one. |
@mfeurer Great, thanks for the prompt answer. I could take a look at it; is the contribution description you gave at the beginning of this thread still accurate for the new version of auto-sklearn? Cheers
Mostly; off the top of my head, these are the modules to be changed:
@eddiebergman can you think of any other modules that need to be updated for this to be supported? |
Not off the top of my head. The main difficulty is that sample weights need to be passed through the entire chain of objects, which is not entirely transparent, hence the need to update quite a few modules. I would be happy to regularly review a PR and give guidance along the way if you would like to contribute these changes :) Best,
Yeah, apologies for saying I could do this and then disappearing. I did start taking a look, but got pretty lost in all the code that would need to be changed and then got pulled into other projects. |
This would be a very good feature to implement. |
Any updates on this? It seems like a useful thing to do. Actually, I think it should be relatively easy, in the sense that sample_weights should be propagated to all the fit() methods of a given pipeline, shouldn't they? So no need for a custom metric; just propagate the importances down to every fit call.
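For a plain scikit-learn pipeline, that propagation already works via fit-parameter routing with the `<step>__<param>` naming convention; the open question in this thread is threading the same argument through auto-sklearn's own layers. A sketch of the scikit-learn side (toy data for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
w = np.array([1.0, 1.0, 10.0, 10.0])  # upweight the class-1 rows

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
# route sample_weight to the final estimator's fit() only
pipe.fit(X, y, clf__sample_weight=w)
print(pipe.predict([[2.5]]))  # -> [1]
```

Note that only the `clf` step receives the weights here; steps whose `fit` lacks a `sample_weight` parameter would raise if you tried to route weights to them, which is part of why the auto-sklearn change is not trivial.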
@mrektor Exactly. As I have just begun my Ph.D. in AutoML, I unfortunately do not have the time to contribute this (in theory short) PR. Otherwise, as the authors indicated, feel free to give it a whirl, as they would be delighted to review such a PR. Cheers.
@mrektor Sorry for ghosting. I'm half-back to maintaining auto-sklearn, and my first priorities are to update scikit-learn, SMAC, pynisher, and ConfigSpace. After that I will add this to the stack. In theory, yes, it's quite simple; in practice it's complicated by obscurities in multi-processing and the fact that sample weights are not supported by all components.
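The last point, that not all components accept sample weights, can be checked per component with scikit-learn's `has_fit_parameter` utility:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils.validation import has_fit_parameter

# RandomForestClassifier.fit accepts sample_weight ...
print(has_fit_parameter(RandomForestClassifier(), "sample_weight"))  # -> True
# ... but KNeighborsClassifier.fit does not
print(has_fit_parameter(KNeighborsClassifier(), "sample_weight"))    # -> False
```

Any general propagation mechanism would have to either skip such components or exclude them from the search space.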
I see, nice! So are you planning to integrate with scikit-learn 1.x? It was quite a pain having to downgrade, as many packages now depend on 1.x...
Is there a way to give sample weights to the fit method? I see that the metrics can take them as an argument, but the fit method doesn't.