[Source] [Bug Report] [Documentation] [API Reference]
Compute feature importances on the training set and validate the performance on the validation set:
- Feature Selection for Binary Classification with RandomForestClassifier
- Feature Selection for Binary Classification with XGBClassifier
Null Importances is normalized feature importance measures that can correct the feature importance bias. The method is based on repeated permutations of the outcome vector for estimating the distribution of measured importance for each variable in a non-informative setting.
Overall, this package
- Fit the given model class
$M$ times with different model'srandom_state
to get$M$ actual feature importances of feature f:$A_f = [a_{f_1},a_{f_2}...a_{f_M}]$ . - Fit the given model class with different model's
random_state
and shuffled targets for$N$ times to get$N$ feature random importances:$R_f = [r_{f_1},r_{f_2}...r_{f_N}]$ . - Compute the final importances of a feature
$f$ by various methods, such as:$I_f = Avg(A_f) - Avg(R_f)$ $I_f = Avg(A_f) / (Avg(R_f) + 1)$ $I_f = Wasserstein Distance(A_f, R_f)$
We want random_state
(Which is also fine).
Not to be confused with sklearn.inspection.permutation_importance, this sklearn method is about feature permutation instead of target permutation.
This method were originally proposed/implemented by:
- [Paper] Permutation importance: a corrected feature importance measure
- [Kaggle Notebook] Feature Selection with Null Importances
- Compute null importances with only one function call.
- Support models that provide information about feature importance (e.g. coef_, feature_importances_), such as
RandomForestClassifier
,RandomForestRegressor
,XGBClassifier
.XGBRegressor
,LGBMClassifier
,LGBMRegressor
,CatBoostClassifier
,CatBoostRegressor
,Lasso
,LinearSVC
, etc. - Support
sklearn
'sMultiOutputClassifier
orMultiOutputRegressor
interface. - Support data in
pandas.DataFrame
andnumpy.ndarray
- Highly customizable with both the exposed
compute
andgeneric_compute
functions. - Proven effectiveness in Kaggle competitions and in
Our Benchmarks Results
.
Here are some examples of Top Kaggle solutions using this method:
Year | Competition | Medal | Link |
---|---|---|---|
2023 | Predict Student Performance from Game Play | Gold | 3rd place solution |
2019 | Elo Merchant Category Recommendation | Gold | 16th place solution |
2018 | Home Credit Default Risk | Gold | 10th place solution |
Below show the benchmark results of running null-importances with feature selection on multiple models and datasets. "better" means it is better than running feature selection with the model's built-in feature importances. Details can be seen here
.
model | n_dataset | n_better | better % |
---|---|---|---|
RandomForestClassifier | 10 | 10 | 100.0 |
RandomForestRegressor | 12 | 8 | 66.67 |
XGBClassifier | 10 | 7 | 70.0 |
XGBRegressor | 12 | 7 | 58.33 |
LGBMClassifier | 10 | 8 | 80.0 |
LGBMRegressor | 12 | 8 | 66.67 |
CatBoostClassifier | 10 | 6 | 60.0 |
CatBoostRegressor | 12 | 8 | 66.67 |
pip install target-permutation-importances
or with poetry:
poetry add target-permutation-importances
If with Python 3.8.x & Pandas 1.x.x:
pip install target-permutation-importances==1.19.0
or with poetry:
poetry add target-permutation-importances==1.19.0
Although this package is tested on models from sklearn
, xgboost
, catboost
, lightgbm
, they are not
a hard requirement for the installation, you can use this package for any model if it implements the sklearn
interface.
For models that don't follow sklearn
interface, you can use the exposed generic_compute
method as discussed in the
Advance Usage / Customization section.
Dependencies:
[tool.poetry.dependencies]
python = "^3.9"
numpy = "^1.23.5"
pandas = "^1.5.3"
tqdm = "^4.48.2"
beartype = "^0.14.1"
scipy = "^1.9"
# import the package
import target_permutation_importances as tpi
# Prepare a dataset
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
# Models
from sklearn.ensemble import RandomForestClassifier
data = load_breast_cancer()
# Convert to a pandas dataframe
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
# Train test split, only compute importances in the train set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=2023)
# Convert to a pandas dataframe
Xpd = pd.DataFrame(data.data, columns=data.feature_names)
# Compute permutation importances with default settings
result_df = tpi.compute(
model_cls=RandomForestClassifier, # The constructor/class of the model.
model_cls_params={ # The parameters to pass to the model constructor. Update this based on your needs.
"n_jobs": -1,
},
model_fit_params={}, # The parameters to pass to the model fit method. Update this based on your needs.
X=X_train, # pd.DataFrame
y=y_train, # pd.Series, np.ndarray
num_actual_runs=2,
num_random_runs=10,
# Options: {compute_permutation_importance_by_subtraction, compute_permutation_importance_by_division}
# Or use your own function to calculate.
permutation_importance_calculator=tpi.compute_permutation_importance_by_subtraction,
)
print(result_df[["feature", "importance"]].sort_values("importance", ascending=False).head())
Fork above code from Kaggle.
Outputs:
Running 2 actual runs and 10 random runs
100%|██████████| 2/2 [00:01<00:00, 1.62it/s]
100%|██████████| 10/10 [00:06<00:00, 1.46it/s]
feature importance
25 worst perimeter 0.117495
22 worst concave points 0.089949
26 worst radius 0.084632
7 mean concave points 0.064289
20 worst area 0.062485
8 mean concavity 0.047122
10 mean perimeter 0.029270
5 mean area 0.014566
11 mean radius 0.014346
0 area error 0.000693
# import the package
import target_permutation_importances as tpi
# Prepare a dataset
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
# Models
from sklearn.svm import LinearSVC
data = load_breast_cancer()
# Convert to a pandas dataframe
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
# Train test split, only compute importances in the train set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=2023)
# Compute permutation importances with default settings
result_df = tpi.compute(
model_cls=LinearSVC, # The constructor/class of the model.
model_cls_params={"max_iter": 1000}, # The parameters to pass to the model constructor. Update this based on your needs.
model_fit_params={}, # The parameters to pass to the model fit method. Update this based on your needs.
X=X_train, # pd.DataFrame
y=y_train, # pd.Series, np.ndarray
num_actual_runs=1,
num_random_runs=10,
# Options: {compute_permutation_importance_by_subtraction, compute_permutation_importance_by_division}
# Or use your own function to calculate.
permutation_importance_calculator=tpi.compute_permutation_importance_by_subtraction,
)
print(result_df[["feature", "importance"]].sort_values("importance", ascending=False).head(10))
Fork above code from Kaggle.
Outputs:
feature importance
10 mean perimeter 0.067352
29 worst texture 0.029602
11 mean radius 0.029509
26 worst radius 0.026499
21 worst compactness 0.010139
23 worst concavity 0.009149
25 worst perimeter 0.008779
14 mean texture 0.007845
0 area error 0.007540
20 worst area 0.004508
# import the package
import target_permutation_importances as tpi
# Prepare a dataset
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
# Models
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputClassifier, MultiOutputRegressor
# Multi target regression
X, y = make_regression(
n_samples=100,
n_features=20,
n_targets=5,
)
# Train test split, only compute importances in the train set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=2023)
# Compute permutation importances with default settings
result_df = tpi.compute(
model_cls=MultiOutputRegressor,
model_cls_params={ # The constructor/class of MultiOutputRegressor
"estimator": RandomForestRegressor(n_estimators=2),
},
model_fit_params={}, # The parameters to pass to the model fit method. Update this based on your needs.
X=X_train, # pd.DataFrame, np.ndarray
y=y_train, # pd.Series, np.ndarray
num_actual_runs=2,
num_random_runs=10,
# Options: {compute_permutation_importance_by_subtraction, compute_permutation_importance_by_division}
# Or use your own function to calculate.
permutation_importance_calculator=tpi.compute_permutation_importance_by_subtraction,
)
print(result_df[["feature", "importance"]].sort_values("importance", ascending=False).head())
Fork above code from Kaggle.
Outputs:
Running 2 actual runs and 10 random runs
100%|██████████| 2/2 [00:00<00:00, 4.97it/s]
100%|██████████| 10/10 [00:01<00:00, 5.10it/s]
feature importance
19 19 0.233211
10 10 0.058238
0 0 0.049782
17 17 0.031321
2 2 0.028209
You can find more detailed examples in the "Feature Selection Examples" section.
You can pick your own model by changing
model_cls
, model_cls_params
and model_fit_params
, for example, using with LGBMClassifier
with a importance_type=gain
and colsample_bytree=0.5
:
result_df = tpi.compute(
model_cls=LGBMClassifier, # The constructor/class of the model.
model_cls_params={ # The parameters to pass to the model constructor. Update this based on your needs.
"n_jobs": -1,
"importance_type": "gain",
"colsample_bytree": 0.5,
},
model_fit_params={}, # The parameters to pass to the model fit method. Update this based on your needs.
X=Xpd, # pd.DataFrame, np.ndarray
y=data.target, # pd.Series, np.ndarray
num_actual_runs=2,
num_random_runs=10,
# Options: {compute_permutation_importance_by_subtraction, compute_permutation_importance_by_division}
# Or use your own function to calculate.
permutation_importance_calculator=tpi.compute_permutation_importance_by_subtraction,
)
Note: Tree models are greedy. Usually it is a good idea to introduce some randomness into the tree model by setting colsample_*
params.
It forces the model to explore the importances of different features. In other words, setting these
parameters avoid a feature from being under-representative in the importance calculation because of having another highly correlated feature.
You can pick your own calculation method by changing permutation_importance_calculator
.
There are 3 provided calculations:
tpi.compute_permutation_importance_by_subtraction
tpi.compute_permutation_importance_by_division
tpi.compute_permutation_importance_by_wasserstein_distance
You can also implement you own calculation function and pass it in. The function needs to follow
PermutationImportanceCalculatorType
specification, you can find it in
API Reference
This package exposes tpi.generic_compute
to allow advance customization.
Read the followings for details:
TargetPermutationImportancesWrapper
follows scikit-learn interfaces and support scikit-learn feature selection method such as SelectFromModel
:
# Import the package
import target_permutation_importances as tpi
# Prepare a dataset
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
# Models
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
data = load_breast_cancer()
# Convert to a pandas dataframe
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
# Train test split, only compute importances in the train set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=2023)
# Compute permutation importances with default settings
wrapped_model = tpi.TargetPermutationImportancesWrapper(
model_cls=RandomForestClassifier, # The constructor/class of the model.
model_cls_params={ # The parameters to pass to the model constructor. Update this based on your needs.
"n_jobs": -1,
},
num_actual_runs=2,
num_random_runs=10,
# Options: {compute_permutation_importance_by_subtraction, compute_permutation_importance_by_division}
# Or use your own function to calculate.
permutation_importance_calculator=tpi.compute_permutation_importance_by_subtraction,
)
wrapped_model.fit(
X=X_train, # pd.DataFrame, np.ndarray
y=y_train, # pd.Series, np.ndarray
# And other fit parameters for the model.
)
# Get the feature importances as a pandas dataframe
result_df = wrapped_model.feature_importances_df
print(result_df[["feature", "importance"]].sort_values("importance", ascending=False).head())
# Select top-5 features with sklearn `SelectFromModel`
selector = SelectFromModel(
estimator=wrapped_model, prefit=True, max_features=5, threshold=-np.inf
).fit(Xpd, data.target)
selected_x = selector.transform(Xpd)
print(selected_x.shape)
print(selector.get_feature_names_out())
Fork above code from Kaggle.
Outputs:
Running 2 actual runs and 10 random runs
100%|██████████| 2/2 [00:01<00:00, 1.80it/s]
100%|██████████| 10/10 [00:06<00:00, 1.55it/s]
feature importance
22 worst perimeter 0.151953
27 worst concave points 0.124407
20 worst radius 0.119090
7 mean concave points 0.098747
23 worst area 0.096943
(569, 5)
['mean concave points' 'worst radius' 'worst perimeter' 'worst area'
'worst concave points']
Python Version
You can find the suggested development Python version in .python-version
.
You might consider setting up Pyenv
if you want to have multiple Python versions on your machine.
Python packages
This repository is setup with Poetry
. If you are not familiar with Poetry, you can find package requirements listed in pyproject.toml
.
Otherwise, you can just set it up with poetry install
Run Benchmarks
To run the benchmark locally on your machine, run make run_tabular_benchmark
or python -m benchmarks.run_tabular_benchmark
Make Changes
Following the Make Changes Guide from Github
Before committing or merging, please run the linters defined in make lint
and the tests defined in make test