
How to give sample weights? #288

Open
kgullikson88 opened this issue May 24, 2017 · 24 comments

@kgullikson88

Is there a way to give sample weights to the fit method? I see that the metrics can take them as an argument, but the fit method doesn't.

@mfeurer
Contributor

mfeurer commented May 25, 2017

Could you please describe your use case? If the data is unbalanced, auto-sklearn automatically configures whether to activate the class_weight feature.
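For context, here is a minimal illustration of the scikit-learn class_weight option that auto-sklearn can toggle; the synthetic dataset and all parameter values are just for demonstration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 90/10 imbalanced binary classification problem.
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

# class_weight="balanced" reweights classes inversely proportional to
# their frequency, so the minority class is not ignored during fitting.
clf = LogisticRegression(class_weight="balanced").fit(X, y)
```

Note that this weights entire classes, not individual samples, which is the distinction the rest of this thread is about.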

@kgullikson88
Author

I have some samples that I really need to get right, so I want to assign a larger penalty for getting those ones wrong. Many of the sklearn classifiers take a sample_weight keyword argument in the fit method for this purpose (logistic regression does, for example).
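A minimal sketch of the scikit-learn behaviour being requested, with toy data and illustrative weights:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
# Up-weight the samples we "really need to get right": misclassifying
# them now costs 10x as much during fitting.
w = np.array([1.0, 1.0, 10.0, 10.0])

clf = LogisticRegression().fit(X, y, sample_weight=w)
```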

@mfeurer
Contributor

mfeurer commented May 26, 2017

I see your point. Sample weights would then have to be passed to each method in auto-sklearn, especially the final ensemble building procedure, right? However, I think this would contradict the principle of auto-sklearn, which tries to optimize a user-given metric. Do you think you could create a custom metric which penalizes a solution for missing your important data points?

@kgullikson88
Author

That is sort of what I'm asking how to do. I tried making a custom metric (f3-score) that takes sample weights:

from sklearn.metrics import precision_score, recall_score
from autosklearn.metrics import make_scorer

def score_func(y_true, y_pred, beta=3, sample_weight=None):
    if sample_weight is not None:
        prec = precision_score(y_true=y_true, y_pred=y_pred, sample_weight=sample_weight)
        rec = recall_score(y_true=y_true, y_pred=y_pred, sample_weight=sample_weight)
    else:
        prec = precision_score(y_true=y_true, y_pred=y_pred)
        rec = recall_score(y_true=y_true, y_pred=y_pred)
    if prec == 0 and rec == 0:
        return 0.0
    return (1 + beta**2) * prec * rec / (beta**2 * prec + rec)

scorer = make_scorer('f3_score', score_func, sample_weight=weights)

However, when I fit with that, I get errors about incompatible sizes because the metric doesn't know which samples are in the holdout set: the sample weights have shape (full_sample_size,), while y_true and y_pred are smaller due to cross-validation.

@mfeurer
Contributor

mfeurer commented May 29, 2017

Thanks for pointing that out. It would be great to have auto-sklearn accept sample weights for the scoring functions. However, I will not be able to implement this feature in the next few weeks. If you want to contribute it, I would be happy to assist.

@kgullikson88
Author

Sure, I could give it a shot if you could point me in the direction of where I would need to make the changes.

@mfeurer
Contributor

mfeurer commented May 30, 2017

Great. Here's a brief tour:

auto-sklearn stores the data in a class AbstractDataManager, from which XYDataManager is derived. The data managers are used to persist the data on disk and are loaded by the evaluation module, which takes care of restricting the runtime and memory usage of the target algorithm. Weights of the data points would have to be persisted, too.

The data is then used in the evaluator class, where the optimization loss is calculated. I think these are the code pieces where changes need to be made in order to influence the optimization procedure.

Furthermore, you would need to change the call to the scoring function in the ensemble builder and ensemble selection. Those two will be a bit trickier, as they rely on the correct sorting of the data (the sorting will change due to the resampling strategy). You can have a look at how the targets are built there in order to accommodate the change of order.

I hope this is not too complicated and gives a good overview of where the code needs to be changed. In general, a search for calls to calculate_score would be a good idea in case I missed one.

One more note: I will probably have no time to reply tomorrow and will be out of office for a few days afterwards. Therefore, I might not reply immediately until next Wednesday.

@xiangning-chen

Hey, has the sample_weight problem been solved yet? Thanks

@forest-jiang

It would be great if we could add a sample_weight argument representing the confidence of each data point.

@kaiweiang

@mfeurer Hi, regarding passing the right sample weights, I'm thinking we could leverage the index from a pandas.DataFrame or pandas.Series. This can already be done with sklearn's GridSearchCV (see https://stackoverflow.com/questions/49581104/sklearn-gridsearchcv-not-using-sample-weight-in-score-function?answertab=active#tab-top).

I do know that autosklearn converts X and y into numpy arrays by applying sklearn.utils.check_array, even though a pandas data frame is passed to the fit method. Is there a specific reason you enforce this?
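The index-based workaround from that Stack Overflow answer could be sketched roughly like this (the weights, the data, and the weighted_accuracy helper are all hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

# Full-dataset weights, stored in a Series that shares y's index.
weights = pd.Series([1.0, 1.0, 5.0, 5.0], index=[10, 11, 12, 13])

def weighted_accuracy(y_true, y_pred):
    # Because y_true keeps its pandas index through the CV split, we can
    # align the full weight vector with just the holdout samples.
    w = weights.loc[y_true.index]
    return accuracy_score(y_true, y_pred, sample_weight=w)

# A "holdout" subset: sample 11 was left out by the split.
y_true = pd.Series([0, 1, 1], index=[10, 12, 13])
y_pred = np.array([0, 1, 0])
score = weighted_accuracy(y_true, y_pred)  # (1 + 5) / (1 + 5 + 5)
```

This sidesteps the shape mismatch described earlier, but only works as long as the pandas index survives the pipeline.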

@mfeurer
Contributor

mfeurer commented May 18, 2020

We convert pandas dataframes to numpy arrays because we never found the time to update our packages to accept pandas. I'd be happy about pull requests overcoming this issue.
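A quick demonstration of the conversion being discussed, which is why the index-based trick does not currently survive inside auto-sklearn:

```python
import numpy as np
import pandas as pd
from sklearn.utils import check_array

df = pd.DataFrame({"a": [1.0, 2.0]}, index=[10, 11])

# check_array returns a plain ndarray: the pandas index that a
# weight-lookup scorer would need to align samples is discarded.
arr = check_array(df)
```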

@krey

krey commented Aug 21, 2021

Just throwing in another use case: I'm doing multioutput regression and there are some missing values in y.

@simonprovost

Has anyone had any luck creating the PR for the sample_weight addition during the search for the best configuration? Cheers.

@simonprovost

@mfeurer One of your commits, 6de26d7, introduces "Feature: weighting for imbalanced classes". Is it possible to use sample weights with that feature? I am puzzled, especially given how old the commit is compared to the more recent discussion in this thread.

Cheers.

@mfeurer
Contributor

mfeurer commented Jan 17, 2022

Hey @simonprovost, no, unfortunately, there has not yet been any progress on this. We'd be happy about a contribution, otherwise, we'll discuss in our next offline meeting whether we can increase the priority on this one.

@simonprovost

@mfeurer Great, thanks for the prompt answer. I could take a look at it. Is the description for contributing you gave at the beginning of this thread still accurate for the new version of auto-sklearn?

Cheers

@mfeurer
Contributor

mfeurer commented Jan 17, 2022

Mostly. Off the top of my head, these are the modules to be changed:

  • autosklearn.data.abstract_data_manager
  • autosklearn.data.xy_data_manager
  • autosklearn.data.feature_validator
  • autosklearn.evaluation (probably all files in there)
  • autosklearn.ensembles.ensemble_selection
  • autosklearn.ensemble_builder
  • pipeline.components.data_preprocessing.balancing

@eddiebergman can you think of any other modules that need to be updated for this to be supported?

@eddiebergman
Contributor

Not off the top of my head. The main difficulty is that sample weights need to be passed through the entire chain of objects, which is not entirely transparent, hence the need to update quite a few modules.

I would be happy to regularly review a PR and give guidance during it if you would like to contribute these changes :)

Best,
Eddie

@kgullikson88
Author

Yeah, apologies for saying I could do this and then disappearing. I did start taking a look, but I got pretty lost in all the code that would need to change, and then got pulled into other projects.

@eddiebergman added the "enhancement" (a new improvement or feature) label on Jun 10, 2022
@dmenig

dmenig commented Jun 12, 2022

This would be a very good feature to implement.

@mrektor

mrektor commented Nov 9, 2022

Any updates on this? It seems like a useful thing to do. Actually, I think it should be relatively easy, in the sense that sample_weights just need to be propagated to all the fit() methods of a given pipeline... shouldn't they?

So no need for a custom metric, just propagate the importances down to every fit call
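In plain scikit-learn, this kind of propagation already exists for Pipelines via the step-prefixed fit-parameter syntax; a sketch of the idea (toy data, and of course only components that accept sample_weight could receive it):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
w = np.array([1.0, 1.0, 5.0, 5.0])

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
# The 'clf__' prefix routes sample_weight to the LogisticRegression step
# only; StandardScaler never sees it.
pipe.fit(X, y, clf__sample_weight=w)
```

auto-sklearn would need to do something analogous across its own evaluation and ensemble-building layers, which is where the complexity lies.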

@simonprovost

@mrektor Exactly. As I have just begun my Ph.D. in AutoML, I unfortunately do not have the time to contribute this (in theory short) PR. Otherwise, as the authors indicated, feel free to give it a whirl; they would be delighted to review such a PR.

Cheers.

@eddiebergman
Contributor

@mrektor Sorry for ghosting, I'm half-back on maintaining auto-sklearn and my first priorities are to update scikit-learn, SMAC, pynisher and ConfigSpace. After that, I will add it to the stack.

In theory yes, quite simple, in practice it's complicated by obscurities in multi-processing and the fact sample-weights are not supported by all components.

@mrektor

mrektor commented Nov 15, 2022

I see, nice! So are you planning to integrate with scikit-learn 1.x? It was quite a pain having to downgrade, as many packages now depend on 1.x...
Good to know! Keep up the good work.
