KeyError: 'weight' with sklearn.feature_selection.SelectFromModel #5653

Closed
GuidoBartoli opened this issue May 11, 2020 · 6 comments

GuidoBartoli commented May 11, 2020

Hi,
I'm using scikit-learn's automatic feature selection (SelectFromModel) together with a trained XGBoost model.
The loop stops reducing features once accuracy falls below a threshold I set.
The loop itself looks fine to me, but calling SelectFromModel.transform() raises the following error:

Traceback (most recent call last):
  File "boost.py", line 581, in <module>
    s_train_x = selection.transform(train_x)
  File "/home/guido/.virtualenvs/ml/lib/python3.6/site-packages/sklearn/feature_selection/_base.py", line 77, in transform
    mask = self.get_support()
  File "/home/guido/.virtualenvs/ml/lib/python3.6/site-packages/sklearn/feature_selection/_base.py", line 46, in get_support
    mask = self._get_support_mask()
  File "/home/guido/.virtualenvs/ml/lib/python3.6/site-packages/sklearn/feature_selection/_from_model.py", line 178, in _get_support_mask
    scores = _get_feature_importances(estimator, self.norm_order)
  File "/home/guido/.virtualenvs/ml/lib/python3.6/site-packages/sklearn/feature_selection/_from_model.py", line 18, in _get_feature_importances
    coef_ = getattr(estimator, "coef_", None)
  File "/home/guido/.virtualenvs/ml/lib/python3.6/site-packages/xgboost/sklearn.py", line 716, in coef_
    coef = np.array(json.loads(b.get_dump(dump_format='json')[0])['weight'])
KeyError: 'weight'
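
The last two frames of the traceback show why the error escapes from scikit-learn: `getattr(estimator, "coef_", None)` only falls back to its default when the attribute lookup raises `AttributeError`, so a `KeyError` raised inside a `coef_` property propagates to the caller. The following is a minimal sketch of that Python behaviour. `StubEstimator`, `NoCoef` and `probe` are hypothetical names used for illustration only; this is not xgboost's or scikit-learn's actual code.

```python
class NoCoef:
    """Hypothetical estimator with no coef_ attribute at all."""


class StubEstimator:
    """Hypothetical stand-in for an estimator whose coef_ property fails."""

    @property
    def coef_(self):
        dump = {"nodeid": 0}      # a parsed dump that has no 'weight' key
        return dump["weight"]     # raises KeyError, as in the last frame above


def probe(estimator):
    """Look up coef_ the way the traceback shows, reporting what happens."""
    try:
        return getattr(estimator, "coef_", None)
    except KeyError as exc:
        return "KeyError: {}".format(exc)


result_missing = probe(NoCoef())        # attribute absent: default is returned
result_broken = probe(StubEstimator())  # property raises: default is not used

print(result_missing)
print(result_broken)
```

In other words, the default in `getattr` guards against a missing attribute, not against an attribute whose getter fails for another reason.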

I'm using the latest xgboost 1.0.2 with scikit-learn 0.22, and the code I wrote is below. It's part of a bigger script, so some variables are defined earlier, but the KeyError should not depend on that.

report = []
prev_t = -1
scores = np.sort(model.feature_importances_)
indices = np.argsort(model.feature_importances_)
misc.msg('Feature selection (threshold = {})...'.format(autosel))
iterator = tqdm(scores)
for i, t in enumerate(iterator):
    if -1 < prev_t == t:
        continue
    prev_t = t
    selection = SelectFromModel(model, threshold=t, prefit=True)
    try:
        s_train_x = selection.transform(train_x)
    except ValueError:
        misc.msg('Incompatible number of features!', 'err')
        sys.exit(1)
    kwargs = {'tree_method': 'hist' if not gpu else 'gpu_hist',
              'grow_policy': 'lossguide' if useloss else 'depthwise'} \
        if not exact else {}
    s_model = xgb.XGBClassifier(objective=model.objective, n_jobs=-1, n_estimators=model.n_estimators,
                                max_depth=model.max_depth, learning_rate=model.learning_rate,
                                subsample=model.subsample, colsample_bytree=model.colsample_bytree,
                                min_child_weight=model.min_child_weight, gamma=model.gamma,
                                reg_alpha=model.reg_alpha, reg_lambda=model.reg_lambda,
                                max_delta_step=model.max_delta_step, random_state=model.random_state,
                                scale_pos_weight=model.scale_pos_weight, **kwargs)
    try:
        s_model.fit(s_train_x, train_y)
    except KeyboardInterrupt:
        misc.msg('Feature selection interrupted', 'warn')
        sys.exit(0)
    s_test_x = selection.transform(test_x)
    s_pred_y = s_model.predict(s_test_x)
    s_accuracy = accuracy_score(test_y, s_pred_y)
    subset = str(list(reversed(indices[i:]))).replace(',', ';')
    report.append([t, s_train_x.shape[1], s_accuracy, subset])
    if s_accuracy < args.autosel:
        iterator.close()
        misc.msg('Accuracy below threshold ({:.6f})'.format(s_accuracy), 'warn')
        misc.msg('Feature subset: {}'.format(conv.values2ranges(indices[i:])))
        break
    gc.collect()

Can anyone reproduce this behaviour?
Many thanks in advance!
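
Stripped of the XGBoost and reporting details, the loop above sweeps the distinct feature importances as thresholds, keeps the features at or above each threshold (the rule SelectFromModel applies), and stops once accuracy drops below the limit. A rough sketch of that control flow in plain Python follows. `sweep_thresholds` and the `evaluate` callback are hypothetical names, and the toy `evaluate` used here merely stands in for retraining and scoring a model.

```python
def sweep_thresholds(importances, evaluate, min_accuracy):
    """Try each distinct importance as a threshold, lowest first.

    evaluate(kept_indices) is assumed to retrain on the kept features
    and return an accuracy; it is a placeholder for the fit/predict step.
    """
    report = []
    for t in sorted(set(importances)):          # skip duplicate thresholds
        kept = [i for i, imp in enumerate(importances) if imp >= t]
        accuracy = evaluate(kept)
        report.append((t, len(kept), accuracy))
        if accuracy < min_accuracy:             # stop once accuracy is too low
            break
    return report


# Toy run: accuracy is faked as the fraction of features kept.
toy_importances = [0.1, 0.4, 0.1, 0.4]
toy_report = sweep_thresholds(toy_importances, lambda kept: len(kept) / 4, 0.8)
print(toy_report)
```

As in the original loop, the entry that falls below the limit is still recorded before the sweep stops.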

@trivialfis
Member

Hi, could you please post a more complete script that I can run?

@GuidoBartoli
Author

Sure, I will post it here this afternoon, so you can take a look at it.

Thanks!

@GuidoBartoli
Author

GuidoBartoli commented May 13, 2020

This is a minimal test.py:

from h5py import File
from joblib import load
from sklearn.feature_selection import SelectFromModel

if __name__ == '__main__':
    h5 = File('dataset.h5', 'r')
    data = h5['data'][:]
    model = load('model.mdl')
    selection = SelectFromModel(model, threshold=0.95, prefit=True).transform(data)

This is the corresponding requirements.txt:

h5py==2.10.0
joblib==0.14.1
numpy==1.18.4
scikit-learn==0.23.0
scipy==1.4.1
six==1.14.0
threadpoolctl==2.0.0
xgboost==1.0.2

Here are the dataset and model, to be unzipped in the same folder as the script. The model is an xgb.XGBClassifier previously trained on the same data with the standard fit() function.

You can reproduce the reported problem with python test.py.

@arthurnage

@GuidoBartoli Hi dude. I had the same problem with xgboost==1.0.0. Upgrading to the recent 1.1.0 release fixed it.

@hcho3
Collaborator

hcho3 commented Jun 17, 2020

The issue is fixed in #5505 and the example script runs fine on XGBoost 1.1.0.
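
For scripts that must run in environments where the installed version is unknown, a small guard can fail early with a clearer message than the KeyError. The sketch below assumes 1.1.0 is the earliest release that works, which this thread suggests (1.0.0 and 1.0.2 fail, 1.1.0 works) but does not state outright. `version_tuple` and `has_coef_fix` are hypothetical helper names.

```python
import re

FIXED_IN = (1, 1, 0)  # assumption: first release reported as working here


def version_tuple(version):
    """Turn '1.0.2' or '1.1.0rc1' into a 3-tuple of the leading integers."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    while len(parts) < 3:        # pad '1.1' to (1, 1, 0)
        parts.append(0)
    return tuple(parts)


def has_coef_fix(version):
    """True if the given xgboost version string is at least FIXED_IN."""
    return version_tuple(version) >= FIXED_IN


print(has_coef_fix("1.0.2"))
print(has_coef_fix("1.1.0"))
```

In a real script the argument would be `xgboost.__version__`, for example `if not has_coef_fix(xgboost.__version__): raise RuntimeError("upgrade xgboost")`.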

hcho3 closed this as completed Jun 17, 2020

@GuidoBartoli
Author

The issue is fixed in #5505 and the example script runs fine on XGBoost 1.1.0.

Perfect, many thanks!
