Visualize constructed features and get best pipeline found. #459

Closed
axelroy opened this issue May 22, 2017 · 8 comments

axelroy commented May 22, 2017

Greetings,

First of all, thank you for the amazing job you've done on this project. I'm trying to use TPOT in a research context, and after a few tests I have some questions about how to use it:

  • I've seen in issue #337 (Workflow to visualize Tpot results) that we can retrieve the explored pipelines via the tpot._evaluated_individuals attribute. Is this a good way to use it, or could it change between versions? I want to be able to retrieve the best model, its features, and its parameters to store them in a DB.

  • Is there any way to retrieve the best features, as shown on page 12 of this paper, and to know which original features the constructed ones are based on?

Thank you for your help,

Kind regards,
Axel.

weixuanfu (Contributor) commented:
For the first question, I think that is the right way to use it in the current version and the next version (0.8). You can also find an example in the unit test test_evaluated_individuals, in case the usage changes in future versions.
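
Here is a minimal sketch of dumping every evaluated pipeline after a run (assuming the 0.8-era interface, where tpot._evaluated_individuals maps each pipeline string to its internal score record; inspect one entry before relying on the exact layout):

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(*load_digits(return_X_y=True))

tpot = TPOTClassifier(generations=2, population_size=10, verbosity=2)
tpot.fit(X_train, y_train)

# Each key is a pipeline string; each value is that pipeline's score record.
# These key/value pairs are what you would persist to a DB.
for pipeline_string, score_record in tpot._evaluated_individuals.items():
    print(pipeline_string, score_record)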

For the second one: for now, TPOT cannot provide a ranking of feature importances like Figure 5 in the paper. The feature importances on page 12 were estimated with a Random Forest.

rhiever (Contributor) commented May 22, 2017

Hi @axelroy,

If you want to access the best model from the TPOT run, you can access it via the tpot._fitted_pipeline property at the end of the run. If you run TPOT at the highest verbosity (3), you can also access the entire Pareto front of best pipelines via the tpot._pareto_front_fitted_pipelines property. Note that both of these properties are only assigned at the end of a TPOT run.
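
For example, a minimal sketch (assuming _pareto_front_fitted_pipelines is a dict keyed by the pipeline string; the exact layout may vary by version):

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=3)
tpot.fit(X_train, y_train)

print(tpot._fitted_pipeline)  # the single best sklearn Pipeline

# every pipeline on the accuracy-vs-complexity Pareto front
for pipeline_string, fitted_pipeline in tpot._pareto_front_fitted_pipelines.items():
    print(pipeline_string)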

In terms of presenting feature importances, those are limited to specific models. In the case of the paper you linked, those were decision trees and random forests, so I was displaying tree-based feature importances. If TPOT discovers a pipeline for you that uses a decision tree or other tree-based method as the final classifier, for example, then you could access those feature importances with the following code:

# Indexing with -1 gets the last step in the pipeline
# Indexing with 1 gets the estimator object out of the (name, estimator) tuple
tpot._fitted_pipeline.steps[-1][1].feature_importances_

which gives you an array of feature importances that you can then match with your feature names. The same applies to linear models, except you'd access the coef_ attribute instead.
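
For instance, a hedged sketch (the final step could be any estimator, so it is worth guarding with hasattr):

final_estimator = tpot._fitted_pipeline.steps[-1][1]
if hasattr(final_estimator, 'feature_importances_'):  # tree-based models
    importances = final_estimator.feature_importances_
elif hasattr(final_estimator, 'coef_'):  # linear models
    importances = final_estimator.coef_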

axelroy (Author) commented May 23, 2017

Thank you very much for the responses; I'll test this as soon as possible.

rhiever (Contributor) commented Jul 18, 2017

Closing this issue for now. Please feel free to re-open if you have any more questions or comments.

rhiever closed this as completed Jul 18, 2017
don-lab-dc commented Dec 19, 2018

I'm using TPOT and loving it, but I'm struggling to join the names of the features I provide to TPOT with the list of feature importances I extract using tpot._fitted_pipeline.steps[-1][1].feature_importances_. I understand this is because TPOT builds and evaluates new synthetic features. Do you have a recommended method for either or both of the following: (1) disabling synthetic feature generation so I can zip my feature names to the feature importances; or (2) appending the names of the generated features to my list of feature names so I can zip them with the feature importances? Ideally, I'd like to be able to do something like this:

for feature_name, feature_score in zip(df.drop('class', axis=1).columns, tpot._fitted_pipeline.steps[-1][1].feature_importances_):
    print(feature_name, '\t', feature_score)

Here's the configuration of an example TPOT run to which I would like to apply such a method:

{'config_dict': {'sklearn.ensemble.RandomForestClassifier': {'n_estimators': [100], 'criterion': ['gini', 'entropy'], 'max_features': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,
       0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  ]), 'min_samples_split': range(2, 21), 'min_samples_leaf': range(1, 21), 'bootstrap': [True, False]}, 'sklearn.tree.DecisionTreeClassifier': {'criterion': ['gini', 'entropy'], 'max_depth': range(1, 11), 'min_samples_split': range(2, 21), 'min_samples_leaf': range(1, 21)}, 'sklearn.ensemble.ExtraTreesClassifier': {'n_estimators': [100], 'criterion': ['gini', 'entropy'], 'max_features': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,
       0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  ]), 'min_samples_split': range(2, 21), 'min_samples_leaf': range(1, 21), 'bootstrap': [True, False]}, 'sklearn.preprocessing.Binarizer': {'threshold': array([0.  , 0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 ,
       0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  ])}, 'sklearn.cluster.FeatureAgglomeration': {'linkage': ['ward', 'complete', 'average'], 'affinity': ['euclidean', 'l1', 'l2', 'manhattan', 'cosine']}, 'sklearn.preprocessing.MaxAbsScaler': {}, 'sklearn.preprocessing.MinMaxScaler': {}, 'sklearn.preprocessing.Normalizer': {'norm': ['l1', 'l2', 'max']}, 'sklearn.decomposition.PCA': {'svd_solver': ['randomized'], 'iterated_power': range(1, 11)}, 'sklearn.kernel_approximation.RBFSampler': {'gamma': array([0.  , 0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 ,
       0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  ])}, 'sklearn.preprocessing.RobustScaler': {}, 'sklearn.preprocessing.StandardScaler': {}, 'tpot.builtins.ZeroCount': {}, 'sklearn.feature_selection.SelectFwe': {'alpha': array([0.   , 0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.007, 0.008,
       0.009, 0.01 , 0.011, 0.012, 0.013, 0.014, 0.015, 0.016, 0.017,
       0.018, 0.019, 0.02 , 0.021, 0.022, 0.023, 0.024, 0.025, 0.026,
       0.027, 0.028, 0.029, 0.03 , 0.031, 0.032, 0.033, 0.034, 0.035,
       0.036, 0.037, 0.038, 0.039, 0.04 , 0.041, 0.042, 0.043, 0.044,
       0.045, 0.046, 0.047, 0.048, 0.049]), 'score_func': {'sklearn.feature_selection.f_classif': None}}, 'sklearn.feature_selection.SelectPercentile': {'percentile': range(1, 100), 'score_func': {'sklearn.feature_selection.f_classif': None}}, 'sklearn.feature_selection.VarianceThreshold': {'threshold': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2]}}, 'crossover_rate': 0.1, 'cv': 5, 'disable_update_check': False, 'early_stop': None, 'generations': 10, 'max_eval_time_mins': 5, 'max_time_mins': None, 'memory': None, 'mutation_rate': 0.9, 'n_jobs': 7, 'offspring_size': 10, 'periodic_checkpoint_folder': None, 'population_size': 10, 'random_state': None, 'scoring': None, 'subsample': 1.0, 'verbosity': 2, 'warm_start': False}

Apologies if I missed this being addressed previously.

weixuanfu (Contributor) commented:
Hmm, I think those synthetic features should be in the first (left) columns, and they usually get very high importance scores from the last operator in the pipeline.
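
As an illustrative sketch (assuming the synthetic columns really are prepended and that no step drops or replaces the original columns; the synthetic_%d names are just placeholders), you could pad the name list on the left:

feature_names = list(df.drop('class', axis=1).columns)
importances = tpot._fitted_pipeline.steps[-1][1].feature_importances_
n_synthetic = len(importances) - len(feature_names)  # columns the pipeline added
all_names = ['synthetic_%d' % i for i in range(n_synthetic)] + feature_names
for feature_name, feature_score in zip(all_names, importances):
    print(feature_name, '\t', feature_score)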

For now, TPOT does not provide an option for disabling synthetic feature generation. But:

One of my dev branches of TPOT, called noCDF_noStacking, has an option named simple_pipeline that disables both StackingEstimator and CombineDFs when simple_pipeline=True (e.g. TPOTClassifier(simple_pipeline=True)). Note that this dev branch is not fully tested yet. If you want to try TPOT without StackingEstimator and FeatureUnion, you can install this branch in a test environment via the command below:

pip install --upgrade --no-deps --force-reinstall git+https://github.com/weixuanfu/tpot.git@noCDF_noStacking

Please check #152 for more details. We are working on a more advanced pipeline configuration option.

don-lab-dc commented Dec 21, 2018

Thanks @weixuanfu! For purposes of transparency, explainability, and trust, it would be lovely to be able to connect TPOT to something like eli5 for feature importance inspection and exploration. This may not be so important for biological work (I don't really know), but for public safety work it's quite important to be able to explain -- if only very roughly -- how a model works.

Muhammad-Hassan1000 commented:
@weixuanfu I'm using TPOT and I want to extract the feature importance for every evaluated individual, not just the best pipeline. I can access all the pipelines using tpot.evaluated_individuals_, but then I want to retrieve feature importances through .feature_importances_, .coef_, or permutation_importance. Is there a way to re-evaluate each pipeline to retrieve this information?
I know some models may not have a feature_importances_ attribute, so I have wrapped the lookup in an if/else block, roughly as sketched below.
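
A rough sketch of that guard (illustrative only: importance_for is a hypothetical helper name, and the fallback uses sklearn.inspection.permutation_importance on held-out data):

from sklearn.inspection import permutation_importance

def importance_for(pipeline, X_val, y_val):
    # Prefer the estimator's native importances; fall back to model-agnostic
    # permutation importance computed on held-out data.
    final_estimator = pipeline.steps[-1][1]
    if hasattr(final_estimator, 'feature_importances_'):
        return final_estimator.feature_importances_
    elif hasattr(final_estimator, 'coef_'):
        return final_estimator.coef_
    else:
        result = permutation_importance(pipeline, X_val, y_val, n_repeats=10)
        return result.importances_mean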
