Visualize constructed features and get best pipeline found. #459

Closed
axelroy opened this issue May 22, 2017 · 8 comments

axelroy commented May 22, 2017

Greetings,

First of all, thank you for the amazing job you've done on this project. I'm trying to use TPOT in a research context, and after a few tests I have some questions about how to use it:

  • I've seen in issue #337 (Workflow to visualize Tpot results) that we can retrieve the explored pipelines via the tpot._evaluated_individuals attribute. Is this a good way to use it, or could it change between versions? I want to be able to retrieve the best model, its features, and its parameters to store them in a DB.

  • Is there any way to retrieve the best features, as shown on page 12 of this paper, and to know which original features the constructed ones are based on?

Thank you for your help,

Kind regards,
Axel.

weixuanfu (Contributor) commented:
For the first question, I think that is the right way to use it in the current version and the next version (0.8). You can also find an example in the unit test test_evaluated_individuals, in case the usage changes in future versions.
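
Here is a minimal sketch of dumping every evaluated pipeline after a run (assuming the 0.8-era interface, where tpot._evaluated_individuals maps each pipeline string to its internal score record; inspect one entry before relying on the exact layout):

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(*load_digits(return_X_y=True))

tpot = TPOTClassifier(generations=2, population_size=10, verbosity=2)
tpot.fit(X_train, y_train)

# Each key is a pipeline string; each value is that pipeline's score record.
# These key/value pairs are what you would persist to a DB.
for pipeline_string, score_record in tpot._evaluated_individuals.items():
    print(pipeline_string, score_record)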

For the second one: for now, TPOT cannot provide a ranking of feature importances like Figure 5 in the paper. The feature importances on page 12 were estimated with a Random Forest.

rhiever (Contributor) commented May 22, 2017

Hi @axelroy,

If you want to access the best model from the TPOT run, you can access it via the tpot._fitted_pipeline property at the end of the run. If you run TPOT at the highest verbosity (3), you can also access the entire Pareto front of best pipelines via the tpot._pareto_front_fitted_pipelines property. Note that both of these properties are only assigned at the end of a TPOT run.
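
For example, a minimal sketch (assuming _pareto_front_fitted_pipelines is a dict keyed by the pipeline string; the exact layout may vary by version):

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=3)
tpot.fit(X_train, y_train)

print(tpot._fitted_pipeline)  # the single best sklearn Pipeline

# every pipeline on the accuracy-vs-complexity Pareto front
for pipeline_string, fitted_pipeline in tpot._pareto_front_fitted_pipelines.items():
    print(pipeline_string)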

In terms of presenting feature importances, those are limited to specific models. In the case of the paper you linked, those were decision trees and random forests, so I was displaying tree-based feature importances. If TPOT discovers a pipeline for you that uses a decision tree or other tree-based method as the final classifier, for example, then you could access those feature importances with the following code:

# Indexing with -1 gets the last step in the pipeline
# Indexing with 1 gets the estimator object out of the (name, estimator) tuple
tpot._fitted_pipeline.steps[-1][1].feature_importances_

which gives you an array of feature importances that you can then match with your feature names. The same applies to linear models, except you'd access the coef_ attribute instead.
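
For instance, a hedged sketch (the final step could be any estimator, so it is worth guarding with hasattr):

final_estimator = tpot._fitted_pipeline.steps[-1][1]
if hasattr(final_estimator, 'feature_importances_'):  # tree-based models
    importances = final_estimator.feature_importances_
elif hasattr(final_estimator, 'coef_'):  # linear models
    importances = final_estimator.coef_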

axelroy (Author) commented May 23, 2017

Thank you very much for the responses; I'll test this as soon as possible.

rhiever (Contributor) commented Jul 18, 2017

Closing this issue for now. Please feel free to re-open if you have any more questions or comments.

rhiever closed this as completed Jul 18, 2017
don-lab-dc commented Dec 19, 2018

I'm using TPOT and loving it, but I'm struggling to join the names of the features I provide to TPOT with the list of feature importances I extract using tpot._fitted_pipeline.steps[-1][1].feature_importances_. I understand this is because TPOT builds and evaluates new synthetic features. Do you have a recommended method for either or both of the following: (1) disabling synthetic feature generation so I can zip my feature names to the feature importances; or (2) appending the names of the generated features to my list of feature names so I can zip them with the feature importances? Ideally, I'd like to be able to do something like this:

for feature_name, feature_score in zip(df.drop('class', axis=1).columns, tpot._fitted_pipeline.steps[-1][1].feature_importances_):
    print(feature_name, '\t', feature_score)

Here's the configuration of an example TPOT run to which I would like to apply such a method:

{'config_dict': {'sklearn.ensemble.RandomForestClassifier': {'n_estimators': [100], 'criterion': ['gini', 'entropy'], 'max_features': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,
       0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  ]), 'min_samples_split': range(2, 21), 'min_samples_leaf': range(1, 21), 'bootstrap': [True, False]}, 'sklearn.tree.DecisionTreeClassifier': {'criterion': ['gini', 'entropy'], 'max_depth': range(1, 11), 'min_samples_split': range(2, 21), 'min_samples_leaf': range(1, 21)}, 'sklearn.ensemble.ExtraTreesClassifier': {'n_estimators': [100], 'criterion': ['gini', 'entropy'], 'max_features': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,
       0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  ]), 'min_samples_split': range(2, 21), 'min_samples_leaf': range(1, 21), 'bootstrap': [True, False]}, 'sklearn.preprocessing.Binarizer': {'threshold': array([0.  , 0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 ,
       0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  ])}, 'sklearn.cluster.FeatureAgglomeration': {'linkage': ['ward', 'complete', 'average'], 'affinity': ['euclidean', 'l1', 'l2', 'manhattan', 'cosine']}, 'sklearn.preprocessing.MaxAbsScaler': {}, 'sklearn.preprocessing.MinMaxScaler': {}, 'sklearn.preprocessing.Normalizer': {'norm': ['l1', 'l2', 'max']}, 'sklearn.decomposition.PCA': {'svd_solver': ['randomized'], 'iterated_power': range(1, 11)}, 'sklearn.kernel_approximation.RBFSampler': {'gamma': array([0.  , 0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 ,
       0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  ])}, 'sklearn.preprocessing.RobustScaler': {}, 'sklearn.preprocessing.StandardScaler': {}, 'tpot.builtins.ZeroCount': {}, 'sklearn.feature_selection.SelectFwe': {'alpha': array([0.   , 0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.007, 0.008,
       0.009, 0.01 , 0.011, 0.012, 0.013, 0.014, 0.015, 0.016, 0.017,
       0.018, 0.019, 0.02 , 0.021, 0.022, 0.023, 0.024, 0.025, 0.026,
       0.027, 0.028, 0.029, 0.03 , 0.031, 0.032, 0.033, 0.034, 0.035,
       0.036, 0.037, 0.038, 0.039, 0.04 , 0.041, 0.042, 0.043, 0.044,
       0.045, 0.046, 0.047, 0.048, 0.049]), 'score_func': {'sklearn.feature_selection.f_classif': None}}, 'sklearn.feature_selection.SelectPercentile': {'percentile': range(1, 100), 'score_func': {'sklearn.feature_selection.f_classif': None}}, 'sklearn.feature_selection.VarianceThreshold': {'threshold': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2]}}, 'crossover_rate': 0.1, 'cv': 5, 'disable_update_check': False, 'early_stop': None, 'generations': 10, 'max_eval_time_mins': 5, 'max_time_mins': None, 'memory': None, 'mutation_rate': 0.9, 'n_jobs': 7, 'offspring_size': 10, 'periodic_checkpoint_folder': None, 'population_size': 10, 'random_state': None, 'scoring': None, 'subsample': 1.0, 'verbosity': 2, 'warm_start': False}

Apologies if I missed this being addressed previously.

weixuanfu (Contributor) commented:
Hmm, I think those synthetic features should be in the first (left) columns, and they usually get very high importance scores from the last operator in the pipeline.
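
As an illustrative sketch (assuming the synthetic columns really are prepended and that no step drops or replaces the original columns; the synthetic_%d names are just placeholders), you could pad the name list on the left:

feature_names = list(df.drop('class', axis=1).columns)
importances = tpot._fitted_pipeline.steps[-1][1].feature_importances_
n_synthetic = len(importances) - len(feature_names)  # columns the pipeline added
all_names = ['synthetic_%d' % i for i in range(n_synthetic)] + feature_names
for feature_name, feature_score in zip(all_names, importances):
    print(feature_name, '\t', feature_score)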

For now, TPOT does not provide an option for disabling synthetic feature generation. But:

One of my dev branches of TPOT, called noCDF_noStacking, has an option named simple_pipeline that disables both StackingEstimator and CombineDFs when simple_pipeline=True (e.g. TPOTClassifier(simple_pipeline=True)). Note that this dev branch is not fully tested yet. If you want to try TPOT without StackingEstimator and FeatureUnion, you can install this branch in a test environment via the command below:

pip install --upgrade --no-deps --force-reinstall git+https://github.com/weixuanfu/tpot.git@noCDF_noStacking

Please check #152 for more details. We are working on a more advanced pipeline configuration option.

don-lab-dc commented Dec 21, 2018

Thanks @weixuanfu! For purposes of transparency, explainability, and trust, it would be lovely to be able to connect TPOT to something like eli5 for feature importance inspection and exploration. This may not be so important for biological work (I don't really know), but for public safety work it's quite important to be able to explain -- if only very roughly -- how a model works.

Muhammad-Hassan1000 commented:
@weixuanfu I'm using TPOT and I want to extract the feature importance for every evaluated individual, not just the best pipeline. I can access all the pipelines using tpot.evaluated_individuals_, but then I want to retrieve feature importances through .feature_importances_, .coef_, or permutation_importance. Is there a way to re-evaluate each pipeline to retrieve this information?
I know some models may not have a feature_importances_ attribute, so I have wrapped the lookup in an if/else block, roughly as sketched below.
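
A rough sketch of that guard (illustrative only: importance_for is a hypothetical helper name, and the fallback uses sklearn.inspection.permutation_importance on held-out data):

from sklearn.inspection import permutation_importance

def importance_for(pipeline, X_val, y_val):
    # Prefer the estimator's native importances; fall back to model-agnostic
    # permutation importance computed on held-out data.
    final_estimator = pipeline.steps[-1][1]
    if hasattr(final_estimator, 'feature_importances_'):
        return final_estimator.feature_importances_
    elif hasattr(final_estimator, 'coef_'):
        return final_estimator.coef_
    else:
        result = permutation_importance(pipeline, X_val, y_val, n_repeats=10)
        return result.importances_mean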
