diff --git a/examples/comparing-feature-selectors/analyze-results.ipynb b/examples/comparing-feature-selectors/analyze-results.ipynb new file mode 100644 index 0000000..a6a0e09 --- /dev/null +++ b/examples/comparing-feature-selectors/analyze-results.ipynb @@ -0,0 +1,2343 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Comparing Feature Selectors\n", + "Hi! Do you want to compare the performance of multiple feature selectors? This example notebook shows you how to do such an analysis." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "First of all, to recap:\n", + "\n", + "1. You just ran something similar to:\n", + "\n", + " `python benchmark.py --multirun ranker=\"glob(*)\" +callbacks.to_sql.url=\"sqlite:////tmp/results.sqlite\"`\n", + "2. There should now be a `.sqlite` file at `/tmp/results.sqlite`:\n", + "\n", + " ```\n", + " $ ls -al /tmp/results.sqlite\n", + " -rw-r--r-- 1 vscode vscode 20480 Sep 21 08:16 /tmp/results.sqlite\n", + " ```\n", + "\n", + "Let's now analyze the results! 📈" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We will install `plotly-express`, so we can make nice plots later."
+ ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip available: \u001b[0m\u001b[31;49m22.2.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m22.3\u001b[0m\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n", + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], + "source": [ + "%pip install plotly-express nbconvert --quiet" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, let's point to where our results are stored. In this case, the benchmark wrote them to a local SQLite database, located at `/tmp/results.sqlite`." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'sqlite:////tmp/results.sqlite'" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import os\n", + "\n", + "con: str = \"sqlite:////tmp/results.sqlite\"\n", + "con" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, we can read the `experiments` table." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<div>\n", + "<table border=\"1\" class=\"dataframe\">\n", + "  <thead>\n", + "    <tr style=\"text-align: right;\">\n", + "      <th></th>\n", + "      <th>dataset</th>\n", + "      <th>dataset/n</th>\n", + "      <th>dataset/p</th>\n", + "      <th>dataset/task</th>\n", + "      <th>dataset/group</th>\n", + "      <th>dataset/domain</th>\n", + "      <th>ranker</th>\n", + "      <th>validator</th>\n", + "      <th>local_dir</th>\n", + "      <th>date_created</th>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>id</th>\n", + "      <th></th>\n", + "      <th></th>\n", + "      <th></th>\n", + "      <th></th>\n", + "      <th></th>\n", + "      <th></th>\n", + "      <th></th>\n", + "      <th></th>\n", + "      <th></th>\n", + "      <th></th>\n", + "    </tr>\n", + "  </thead>\n", + "  <tbody>\n",
+ "    <tr>\n", + "      <th>3lllxl48</th>\n", + "      <td>My synthetic dataset</td>\n", + "      <td>10000</td>\n", + "      <td>20</td>\n", + "      <td>classification</td>\n", + "      <td>None</td>\n", + "      <td>None</td>\n", + "      <td>ANOVA F-value</td>\n", + "      <td>k-NN</td>\n", + "      <td>/workspaces/fseval/examples/comparing-feature-...</td>\n", + "      <td>2022-10-22 14:28:27.506838</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>1944ropg</th>\n", + "      <td>My synthetic dataset</td>\n", + "      <td>10000</td>\n", + "      <td>20</td>\n", + "      <td>classification</td>\n", + "      <td>None</td>\n", + "      <td>None</td>\n", + "      <td>Boruta</td>\n", + "      <td>k-NN</td>\n", + "      <td>/workspaces/fseval/examples/comparing-feature-...</td>\n", + "      <td>2022-10-22 14:28:31.230633</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>31gd56gf</th>\n", + "      <td>My synthetic dataset</td>\n", + "      <td>10000</td>\n", + "      <td>20</td>\n", + "      <td>classification</td>\n", + "      <td>None</td>\n", + "      <td>None</td>\n", + "      <td>Chi-Squared</td>\n", + "      <td>k-NN</td>\n", + "      <td>/workspaces/fseval/examples/comparing-feature-...</td>\n", + "      <td>2022-10-22 14:29:19.633012</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>a8washm5</th>\n", + "      <td>My synthetic dataset</td>\n", + "      <td>10000</td>\n", + "      <td>20</td>\n", + "      <td>classification</td>\n", + "      <td>None</td>\n", + "      <td>None</td>\n", + "      <td>Decision Tree</td>\n", + "      <td>k-NN</td>\n", + "      <td>/workspaces/fseval/examples/comparing-feature-...</td>\n", + "      <td>2022-10-22 14:29:23.459190</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>27i7uwg4</th>\n", + "      <td>My synthetic dataset</td>\n", + "      <td>10000</td>\n", + "      <td>20</td>\n", + "      <td>classification</td>\n", + "      <td>None</td>\n", + "      <td>None</td>\n", + "      <td>Infinite Selection</td>\n", + "      <td>k-NN</td>\n", + "      <td>/workspaces/fseval/examples/comparing-feature-...</td>\n", + "      <td>2022-10-22 14:29:27.506974</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>3velt3b9</th>\n", + "      <td>My synthetic dataset</td>\n", + "      <td>10000</td>\n", + "      <td>20</td>\n", + "      <td>classification</td>\n", + "      <td>None</td>\n", + "      <td>None</td>\n", + "      <td>MultiSURF</td>\n", + "      <td>k-NN</td>\n", + "      <td>/workspaces/fseval/examples/comparing-feature-...</td>\n", + "      <td>2022-10-22 14:29:31.758090</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>3fdrxlt6</th>\n", + "      <td>My synthetic dataset</td>\n", + "      <td>10000</td>\n", + "      <td>20</td>\n", + "      <td>classification</td>\n", + "      <td>None</td>\n", + "      <td>None</td>\n", + "      <td>Mutual Info</td>\n", + "      <td>k-NN</td>\n", + "      <td>/workspaces/fseval/examples/comparing-feature-...</td>\n", + "      <td>2022-10-22 14:35:04.289361</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>14lecx0g</th>\n", + "      <td>My synthetic dataset</td>\n", + "      <td>10000</td>\n", + "      <td>20</td>\n", + "      <td>classification</td>\n", + "      <td>None</td>\n", + "      <td>None</td>\n", + "      <td>ReliefF</td>\n", + "      <td>k-NN</td>\n", + "      <td>/workspaces/fseval/examples/comparing-feature-...</td>\n", + "      <td>2022-10-22 14:35:08.614262</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>3sggjvu3</th>\n", + "      <td>My synthetic dataset</td>\n", + "      <td>10000</td>\n", + "      <td>20</td>\n", + "      <td>classification</td>\n", + "      <td>None</td>\n", + "      <td>None</td>\n", + "      <td>Stability Selection</td>\n", + "      <td>k-NN</td>\n", + "      <td>/workspaces/fseval/examples/comparing-feature-...</td>\n", + "      <td>2022-10-22 14:35:59.121416</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>dtt8bvo5</th>\n", + "      <td>My synthetic dataset</td>\n", + "      <td>10000</td>\n", + "      <td>20</td>\n", + "      <td>classification</td>\n", + "      <td>None</td>\n", + "      <td>None</td>\n", + "      <td>XGBoost</td>\n", + "      <td>k-NN</td>\n", + "      <td>/workspaces/fseval/examples/comparing-feature-...</td>\n", + "      <td>2022-10-22 14:36:23.385401</td>\n", + "    </tr>\n",
+ "  </tbody>\n", + "</table>\n", + "</div>
" + ], + "text/plain": [ + " dataset dataset/n dataset/p dataset/task \\\n", + "id \n", + "3lllxl48 My synthetic dataset 10000 20 classification \n", + "1944ropg My synthetic dataset 10000 20 classification \n", + "31gd56gf My synthetic dataset 10000 20 classification \n", + "a8washm5 My synthetic dataset 10000 20 classification \n", + "27i7uwg4 My synthetic dataset 10000 20 classification \n", + "3velt3b9 My synthetic dataset 10000 20 classification \n", + "3fdrxlt6 My synthetic dataset 10000 20 classification \n", + "14lecx0g My synthetic dataset 10000 20 classification \n", + "3sggjvu3 My synthetic dataset 10000 20 classification \n", + "dtt8bvo5 My synthetic dataset 10000 20 classification \n", + "\n", + " dataset/group dataset/domain ranker validator \\\n", + "id \n", + "3lllxl48 None None ANOVA F-value k-NN \n", + "1944ropg None None Boruta k-NN \n", + "31gd56gf None None Chi-Squared k-NN \n", + "a8washm5 None None Decision Tree k-NN \n", + "27i7uwg4 None None Infinite Selection k-NN \n", + "3velt3b9 None None MultiSURF k-NN \n", + "3fdrxlt6 None None Mutual Info k-NN \n", + "14lecx0g None None ReliefF k-NN \n", + "3sggjvu3 None None Stability Selection k-NN \n", + "dtt8bvo5 None None XGBoost k-NN \n", + "\n", + " local_dir \\\n", + "id \n", + "3lllxl48 /workspaces/fseval/examples/comparing-feature-... \n", + "1944ropg /workspaces/fseval/examples/comparing-feature-... \n", + "31gd56gf /workspaces/fseval/examples/comparing-feature-... \n", + "a8washm5 /workspaces/fseval/examples/comparing-feature-... \n", + "27i7uwg4 /workspaces/fseval/examples/comparing-feature-... \n", + "3velt3b9 /workspaces/fseval/examples/comparing-feature-... \n", + "3fdrxlt6 /workspaces/fseval/examples/comparing-feature-... \n", + "14lecx0g /workspaces/fseval/examples/comparing-feature-... \n", + "3sggjvu3 /workspaces/fseval/examples/comparing-feature-... \n", + "dtt8bvo5 /workspaces/fseval/examples/comparing-feature-... 
\n", + "\n", + " date_created \n", + "id \n", + "3lllxl48 2022-10-22 14:28:27.506838 \n", + "1944ropg 2022-10-22 14:28:31.230633 \n", + "31gd56gf 2022-10-22 14:29:19.633012 \n", + "a8washm5 2022-10-22 14:29:23.459190 \n", + "27i7uwg4 2022-10-22 14:29:27.506974 \n", + "3velt3b9 2022-10-22 14:29:31.758090 \n", + "3fdrxlt6 2022-10-22 14:35:04.289361 \n", + "14lecx0g 2022-10-22 14:35:08.614262 \n", + "3sggjvu3 2022-10-22 14:35:59.121416 \n", + "dtt8bvo5 2022-10-22 14:36:23.385401 " + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import pandas as pd\n", + "\n", + "experiments: pd.DataFrame = pd.read_sql_table(\"experiments\", con=con, index_col=\"id\")\n", + "experiments" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's also read in the `validation_scores`." + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<div>\n", + "<table border=\"1\" class=\"dataframe\">\n", + "  <thead>\n", + "    <tr style=\"text-align: right;\">\n", + "      <th></th>\n", + "      <th>index</th>\n", + "      <th>n_features_to_select</th>\n", + "      <th>fit_time</th>\n", + "      <th>score</th>\n", + "      <th>bootstrap_state</th>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>id</th>\n", + "      <th></th>\n", + "      <th></th>\n", + "      <th></th>\n", + "      <th></th>\n", + "      <th></th>\n", + "    </tr>\n", + "  </thead>\n", + "  <tbody>\n",
+ "    <tr>\n", + "      <th>3lllxl48</th>\n", + "      <td>0</td>\n", + "      <td>1</td>\n", + "      <td>0.004433</td>\n", + "      <td>0.7955</td>\n", + "      <td>1</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>3lllxl48</th>\n", + "      <td>0</td>\n", + "      <td>2</td>\n", + "      <td>0.004227</td>\n", + "      <td>0.7910</td>\n", + "      <td>1</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>3lllxl48</th>\n", + "      <td>0</td>\n", + "      <td>3</td>\n", + "      <td>0.005183</td>\n", + "      <td>0.7950</td>\n", + "      <td>1</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>3lllxl48</th>\n", + "      <td>0</td>\n", + "      <td>4</td>\n", + "      <td>0.003865</td>\n", + "      <td>0.7965</td>\n", + "      <td>1</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>3lllxl48</th>\n", + "      <td>0</td>\n", + "      <td>5</td>\n", + "      <td>0.002902</td>\n", + "      <td>0.7950</td>\n", + "      <td>1</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>...</th>\n", + "      <td>...</td>\n", + "      <td>...</td>\n", + "      <td>...</td>\n", + "      <td>...</td>\n", + "      <td>...</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>dtt8bvo5</th>\n", + "      <td>0</td>\n", + "      <td>16</td>\n", + "      <td>0.000670</td>\n", + "      <td>0.7805</td>\n", + "      <td>1</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>dtt8bvo5</th>\n", + "      <td>0</td>\n", + "      <td>17</td>\n", + "      <td>0.000480</td>\n", + "      <td>0.7725</td>\n", + "      <td>1</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>dtt8bvo5</th>\n", + "      <td>0</td>\n", + "      <td>18</td>\n", + "      <td>0.003159</td>\n", + "      <td>0.7760</td>\n", + "      <td>1</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>dtt8bvo5</th>\n", + "      <td>0</td>\n", + "      <td>19</td>\n", + "      <td>0.000848</td>\n", + "      <td>0.7650</td>\n", + "      <td>1</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>dtt8bvo5</th>\n", + "      <td>0</td>\n", + "      <td>20</td>\n", + "      <td>0.000565</td>\n", + "      <td>0.7590</td>\n", + "      <td>1</td>\n", + "    </tr>\n",
+ "  </tbody>\n", + "</table>\n", + "<p>160 rows × 5 columns</p>\n", + "</div>" + ], + "text/plain": [ + " index n_features_to_select fit_time score bootstrap_state\n", + "id \n", + "3lllxl48 0 1 0.004433 0.7955 1\n", + "3lllxl48 0 2 0.004227 0.7910 1\n", + "3lllxl48 0 3 0.005183 0.7950 1\n", + "3lllxl48 0 4 0.003865 0.7965 1\n", + "3lllxl48 0 5 0.002902 0.7950 1\n", + "... ... ... ... ... ...\n", + "dtt8bvo5 0 16 0.000670 0.7805 1\n", + "dtt8bvo5 0 17 0.000480 0.7725 1\n", + "dtt8bvo5 0 18 0.003159 0.7760 1\n", + "dtt8bvo5 0 19 0.000848 0.7650 1\n", + "dtt8bvo5 0 20 0.000565 0.7590 1\n", + "\n", + "[160 rows x 5 columns]" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "validation_scores: pd.DataFrame = pd.read_sql_table(\"validation_scores\", con=con, index_col=\"id\")\n", + "validation_scores" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can now merge them. Notice that we set the experiment ID as the _index_ of both tables, so we can use `pd.DataFrame.join` to do this." + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
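<div>\n", + "<table border=\"1\" class=\"dataframe\">\n", + "  <thead>\n", + "    <tr style=\"text-align: right;\">\n", + "      <th></th>\n", + "      <th>dataset</th>\n", + "      <th>dataset/n</th>\n", + "      <th>dataset/p</th>\n", + "      <th>dataset/task</th>\n", + "      <th>dataset/group</th>\n", + "      <th>dataset/domain</th>\n", + "      <th>ranker</th>\n", + "      <th>validator</th>\n", + "      <th>local_dir</th>\n", + "      <th>date_created</th>\n", + "      <th>index</th>\n", + "      <th>n_features_to_select</th>\n", + "      <th>fit_time</th>\n", + "      <th>score</th>\n", + "      <th>bootstrap_state</th>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>id</th>\n", + "      <th></th>\n", + "      <th></th>\n", + "      <th></th>\n", + "      <th></th>\n", + "      <th></th>\n", + "      <th></th>\n", + "      <th></th>\n", + "      <th></th>\n", + "      <th></th>\n", + "      <th></th>\n", + "      <th></th>\n", + "      <th></th>\n", + "      <th></th>\n", + "      <th></th>\n", + "      <th></th>\n", + "    </tr>\n", + "  </thead>\n", + "  <tbody>\n",
+ "    <tr>\n", + "      <th>14lecx0g</th>\n", + "      <td>My synthetic dataset</td>\n", + "      <td>10000</td>\n", + "      <td>20</td>\n", + "      <td>classification</td>\n", + "      <td>None</td>\n", + "      <td>None</td>\n", + "      <td>ReliefF</td>\n", + "      <td>k-NN</td>\n", + "      <td>/workspaces/fseval/examples/comparing-feature-...</td>\n", + "      <td>2022-10-22 14:35:08.614262</td>\n", + "      <td>NaN</td>\n", + "      <td>NaN</td>\n", + "      <td>NaN</td>\n", + "      <td>NaN</td>\n", + "      <td>NaN</td>\n", + "    </tr>\n",
+ "  </tbody>\n", + "</table>\n", + "</div>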
" + ], + "text/plain": [ + " dataset dataset/n dataset/p dataset/task \\\n", + "id \n", + "14lecx0g My synthetic dataset 10000 20 classification \n", + "\n", + " dataset/group dataset/domain ranker validator \\\n", + "id \n", + "14lecx0g None None ReliefF k-NN \n", + "\n", + " local_dir \\\n", + "id \n", + "14lecx0g /workspaces/fseval/examples/comparing-feature-... \n", + "\n", + " date_created index n_features_to_select fit_time \\\n", + "id \n", + "14lecx0g 2022-10-22 14:35:08.614262 NaN NaN NaN \n", + "\n", + " score bootstrap_state \n", + "id \n", + "14lecx0g NaN NaN " + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "validation_scores_with_experiment_info = experiments.join(\n", + " validation_scores\n", + ")\n", + "validation_scores_with_experiment_info.head(1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Cool! That will be all the information that we need. Let's first create an overview for all the rankers we benchmarked." + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<div>\n", + "<table border=\"1\" class=\"dataframe\">\n", + "  <thead>\n", + "    <tr style=\"text-align: right;\">\n", + "      <th></th>\n", + "      <th>dataset/n</th>\n", + "      <th>dataset/p</th>\n", + "      <th>index</th>\n", + "      <th>n_features_to_select</th>\n", + "      <th>fit_time</th>\n", + "      <th>score</th>\n", + "      <th>bootstrap_state</th>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>ranker</th>\n", + "      <th></th>\n", + "      <th></th>\n", + "      <th></th>\n", + "      <th></th>\n", + "      <th></th>\n", + "      <th></th>\n", + "      <th></th>\n", + "    </tr>\n", + "  </thead>\n", + "  <tbody>\n",
+ "    <tr>\n", + "      <th>Infinite Selection</th>\n", + "      <td>10000.0</td>\n", + "      <td>20.0</td>\n", + "      <td>0.0</td>\n", + "      <td>10.5</td>\n", + "      <td>0.004600</td>\n", + "      <td>0.818925</td>\n", + "      <td>1.0</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>XGBoost</th>\n", + "      <td>10000.0</td>\n", + "      <td>20.0</td>\n", + "      <td>0.0</td>\n", + "      <td>10.5</td>\n", + "      <td>0.002998</td>\n", + "      <td>0.818575</td>\n", + "      <td>1.0</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>Decision Tree</th>\n", + "      <td>10000.0</td>\n", + "      <td>20.0</td>\n", + "      <td>0.0</td>\n", + "      <td>10.5</td>\n", + "      <td>0.002810</td>\n", + "      <td>0.817675</td>\n", + "      <td>1.0</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>Stability Selection</th>\n", + "      <td>10000.0</td>\n", + "      <td>20.0</td>\n", + "      <td>0.0</td>\n", + "      <td>10.5</td>\n", + "      <td>0.002406</td>\n", + "      <td>0.803325</td>\n", + "      <td>1.0</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>Chi-Squared</th>\n", + "      <td>10000.0</td>\n", + "      <td>20.0</td>\n", + "      <td>0.0</td>\n", + "      <td>10.5</td>\n", + "      <td>0.002548</td>\n", + "      <td>0.795975</td>\n", + "      <td>1.0</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>ANOVA F-value</th>\n", + "      <td>10000.0</td>\n", + "      <td>20.0</td>\n", + "      <td>0.0</td>\n", + "      <td>10.5</td>\n", + "      <td>0.003745</td>\n", + "      <td>0.789275</td>\n", + "      <td>1.0</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>Mutual Info</th>\n", + "      <td>10000.0</td>\n", + "      <td>20.0</td>\n", + "      <td>0.0</td>\n", + "      <td>10.5</td>\n", + "      <td>0.002314</td>\n", + "      <td>0.786475</td>\n", + "      <td>1.0</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>Boruta</th>\n", + "      <td>10000.0</td>\n", + "      <td>20.0</td>\n", + "      <td>0.0</td>\n", + "      <td>10.5</td>\n", + "      <td>0.002366</td>\n", + "      <td>0.518075</td>\n", + "      <td>1.0</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>MultiSURF</th>\n", + "      <td>10000.0</td>\n", + "      <td>20.0</td>\n", + "      <td>NaN</td>\n", + "      <td>NaN</td>\n", + "      <td>NaN</td>\n", + "      <td>NaN</td>\n", + "      <td>NaN</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>ReliefF</th>\n", + "      <td>10000.0</td>\n", + "      <td>20.0</td>\n", + "      <td>NaN</td>\n", + "      <td>NaN</td>\n", + "      <td>NaN</td>\n", + "      <td>NaN</td>\n", + "      <td>NaN</td>\n", + "    </tr>\n",
+ "  </tbody>\n", + "</table>\n", + "</div>" + ], + "text/plain": [ + " dataset/n dataset/p index n_features_to_select \\\n", + "ranker \n", + "Infinite Selection 10000.0 20.0 0.0 10.5 \n", + "XGBoost 10000.0 20.0 0.0 10.5 \n", + "Decision Tree 10000.0 20.0 0.0 10.5 \n", + "Stability Selection 10000.0 20.0 0.0 10.5 \n", + "Chi-Squared 10000.0 20.0 0.0 10.5 \n", + "ANOVA F-value 10000.0 20.0 0.0 10.5 \n", + "Mutual Info 10000.0 20.0 0.0 10.5 \n", + "Boruta 10000.0 20.0 0.0 10.5 \n", + "MultiSURF 10000.0 20.0 NaN NaN \n", + "ReliefF 10000.0 20.0 NaN NaN \n", + "\n", + " fit_time score bootstrap_state \n", + "ranker \n", + "Infinite Selection 0.004600 0.818925 1.0 \n", + "XGBoost 0.002998 0.818575 1.0 \n", + "Decision Tree 0.002810 0.817675 1.0 \n", + "Stability Selection 0.002406 0.803325 1.0 \n", + "Chi-Squared 0.002548 0.795975 1.0 \n", + "ANOVA F-value 0.003745 0.789275 1.0 \n", + "Mutual Info 0.002314 0.786475 1.0 \n", + "Boruta 0.002366 0.518075 1.0 \n", + "MultiSURF NaN NaN NaN \n", + "ReliefF NaN NaN NaN " + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "validation_scores_with_experiment_info \\\n", + " .groupby(\"ranker\") \\\n", + " .mean(numeric_only=True) \\\n", + " .sort_values(\"score\", ascending=False)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Already, we notice that MultiSURF and ReliefF have no scores. This is because their experiments failed. That can happen in a big benchmark! We will ignore this for now and continue with the other feature selectors.\n", + "\n", + "👀 We can already observe that the _average_ classification accuracy is highest for Infinite Selection. Although it would be premature to say it is the best, this is an indication that it did well on this dataset.\n", + "\n", + "Let's plot the results _per_ `n_features_to_select`. Note that `n_features_to_select` means a validation step was run using a feature subset of size `n_features_to_select`."
+ ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.plotly.v1+json": { + "config": { + "plotlyServerURL": "https://plot.ly" + }, + "data": [ + { + "hovertemplate": "ranker=ReliefF
<br>n_features_to_select=%{x}<br>
score=%{y}", + "legendgroup": "ReliefF", + "line": { + "color": "#636efa", + "dash": "solid" + }, + "marker": { + "symbol": "circle" + }, + "mode": "lines", + "name": "ReliefF", + "orientation": "v", + "showlegend": true, + "type": "scatter", + "x": [ + null + ], + "xaxis": "x", + "y": [ + null + ], + "yaxis": "y" + }, + { + "hovertemplate": "ranker=Boruta
<br>n_features_to_select=%{x}<br>
score=%{y}", + "legendgroup": "Boruta", + "line": { + "color": "#EF553B", + "dash": "solid" + }, + "marker": { + "symbol": "circle" + }, + "mode": "lines", + "name": "Boruta", + "orientation": "v", + "showlegend": true, + "type": "scatter", + "x": [ + 1, + 2, + 3, + 4, + 5, + 6, + 7, + 8, + 9, + 10, + 11, + 12, + 13, + 14, + 15, + 16, + 17, + 18, + 19, + 20 + ], + "xaxis": "x", + "y": [ + 0.4975, + 0.505, + 0.508, + 0.4845, + 0.504, + 0.503, + 0.5155, + 0.5035, + 0.4955, + 0.4985, + 0.5155, + 0.497, + 0.5085, + 0.5055, + 0.5265, + 0.511, + 0.5085, + 0.511, + 0.504, + 0.759 + ], + "yaxis": "y" + }, + { + "hovertemplate": "ranker=Infinite Selection
<br>n_features_to_select=%{x}<br>
score=%{y}", + "legendgroup": "Infinite Selection", + "line": { + "color": "#00cc96", + "dash": "solid" + }, + "marker": { + "symbol": "circle" + }, + "mode": "lines", + "name": "Infinite Selection", + "orientation": "v", + "showlegend": true, + "type": "scatter", + "x": [ + 1, + 2, + 3, + 4, + 5, + 6, + 7, + 8, + 9, + 10, + 11, + 12, + 13, + 14, + 15, + 16, + 17, + 18, + 19, + 20 + ], + "xaxis": "x", + "y": [ + 0.7955, + 0.9075, + 0.892, + 0.882, + 0.876, + 0.8605, + 0.8365, + 0.8345, + 0.826, + 0.8165, + 0.8105, + 0.802, + 0.797, + 0.7955, + 0.795, + 0.773, + 0.776, + 0.776, + 0.7675, + 0.759 + ], + "yaxis": "y" + }, + { + "hovertemplate": "ranker=Chi-Squared
<br>n_features_to_select=%{x}<br>
score=%{y}", + "legendgroup": "Chi-Squared", + "line": { + "color": "#ab63fa", + "dash": "solid" + }, + "marker": { + "symbol": "circle" + }, + "mode": "lines", + "name": "Chi-Squared", + "orientation": "v", + "showlegend": true, + "type": "scatter", + "x": [ + 1, + 2, + 3, + 4, + 5, + 6, + 7, + 8, + 9, + 10, + 11, + 12, + 13, + 14, + 15, + 16, + 17, + 18, + 19, + 20 + ], + "xaxis": "x", + "y": [ + 0.7955, + 0.791, + 0.795, + 0.7965, + 0.795, + 0.78, + 0.8425, + 0.829, + 0.8255, + 0.814, + 0.806, + 0.804, + 0.799, + 0.794, + 0.7875, + 0.7765, + 0.783, + 0.778, + 0.7685, + 0.759 + ], + "yaxis": "y" + }, + { + "hovertemplate": "ranker=Mutual Info
<br>n_features_to_select=%{x}<br>
score=%{y}", + "legendgroup": "Mutual Info", + "line": { + "color": "#FFA15A", + "dash": "solid" + }, + "marker": { + "symbol": "circle" + }, + "mode": "lines", + "name": "Mutual Info", + "orientation": "v", + "showlegend": true, + "type": "scatter", + "x": [ + 1, + 2, + 3, + 4, + 5, + 6, + 7, + 8, + 9, + 10, + 11, + 12, + 13, + 14, + 15, + 16, + 17, + 18, + 19, + 20 + ], + "xaxis": "x", + "y": [ + 0.7955, + 0.798, + 0.7955, + 0.791, + 0.7895, + 0.798, + 0.786, + 0.7905, + 0.791, + 0.7925, + 0.782, + 0.782, + 0.7815, + 0.7815, + 0.8025, + 0.7905, + 0.7815, + 0.7755, + 0.766, + 0.759 + ], + "yaxis": "y" + }, + { + "hovertemplate": "ranker=ANOVA F-value
<br>n_features_to_select=%{x}<br>
score=%{y}", + "legendgroup": "ANOVA F-value", + "line": { + "color": "#19d3f3", + "dash": "solid" + }, + "marker": { + "symbol": "circle" + }, + "mode": "lines", + "name": "ANOVA F-value", + "orientation": "v", + "showlegend": true, + "type": "scatter", + "x": [ + 1, + 2, + 3, + 4, + 5, + 6, + 7, + 8, + 9, + 10, + 11, + 12, + 13, + 14, + 15, + 16, + 17, + 18, + 19, + 20 + ], + "xaxis": "x", + "y": [ + 0.7955, + 0.791, + 0.795, + 0.7965, + 0.795, + 0.78, + 0.7885, + 0.787, + 0.78, + 0.814, + 0.811, + 0.804, + 0.8015, + 0.794, + 0.7875, + 0.7765, + 0.783, + 0.778, + 0.7685, + 0.759 + ], + "yaxis": "y" + }, + { + "hovertemplate": "ranker=Stability Selection
<br>n_features_to_select=%{x}<br>
score=%{y}", + "legendgroup": "Stability Selection", + "line": { + "color": "#FF6692", + "dash": "solid" + }, + "marker": { + "symbol": "circle" + }, + "mode": "lines", + "name": "Stability Selection", + "orientation": "v", + "showlegend": true, + "type": "scatter", + "x": [ + 1, + 2, + 3, + 4, + 5, + 6, + 7, + 8, + 9, + 10, + 11, + 12, + 13, + 14, + 15, + 16, + 17, + 18, + 19, + 20 + ], + "xaxis": "x", + "y": [ + 0.491, + 0.9075, + 0.898, + 0.882, + 0.869, + 0.855, + 0.858, + 0.839, + 0.827, + 0.8235, + 0.8025, + 0.8015, + 0.7985, + 0.793, + 0.781, + 0.772, + 0.7675, + 0.7755, + 0.766, + 0.759 + ], + "yaxis": "y" + }, + { + "hovertemplate": "ranker=MultiSURF
<br>n_features_to_select=%{x}<br>
score=%{y}", + "legendgroup": "MultiSURF", + "line": { + "color": "#B6E880", + "dash": "solid" + }, + "marker": { + "symbol": "circle" + }, + "mode": "lines", + "name": "MultiSURF", + "orientation": "v", + "showlegend": true, + "type": "scatter", + "x": [ + null + ], + "xaxis": "x", + "y": [ + null + ], + "yaxis": "y" + }, + { + "hovertemplate": "ranker=Decision Tree
<br>n_features_to_select=%{x}<br>
score=%{y}", + "legendgroup": "Decision Tree", + "line": { + "color": "#FF97FF", + "dash": "solid" + }, + "marker": { + "symbol": "circle" + }, + "mode": "lines", + "name": "Decision Tree", + "orientation": "v", + "showlegend": true, + "type": "scatter", + "x": [ + 1, + 2, + 3, + 4, + 5, + 6, + 7, + 8, + 9, + 10, + 11, + 12, + 13, + 14, + 15, + 16, + 17, + 18, + 19, + 20 + ], + "xaxis": "x", + "y": [ + 0.7955, + 0.9075, + 0.892, + 0.888, + 0.8715, + 0.8575, + 0.85, + 0.826, + 0.8185, + 0.809, + 0.806, + 0.7965, + 0.786, + 0.7935, + 0.788, + 0.773, + 0.789, + 0.778, + 0.769, + 0.759 + ], + "yaxis": "y" + }, + { + "hovertemplate": "ranker=XGBoost
<br>n_features_to_select=%{x}<br>
score=%{y}", + "legendgroup": "XGBoost", + "line": { + "color": "#FECB52", + "dash": "solid" + }, + "marker": { + "symbol": "circle" + }, + "mode": "lines", + "name": "XGBoost", + "orientation": "v", + "showlegend": true, + "type": "scatter", + "x": [ + 1, + 2, + 3, + 4, + 5, + 6, + 7, + 8, + 9, + 10, + 11, + 12, + 13, + 14, + 15, + 16, + 17, + 18, + 19, + 20 + ], + "xaxis": "x", + "y": [ + 0.7955, + 0.9075, + 0.8975, + 0.886, + 0.879, + 0.8535, + 0.8455, + 0.83, + 0.818, + 0.8155, + 0.8055, + 0.801, + 0.8, + 0.796, + 0.788, + 0.7805, + 0.7725, + 0.776, + 0.765, + 0.759 + ], + "yaxis": "y" + } + ], + "layout": { + "legend": { + "title": { + "text": "ranker" + }, + "tracegroupgap": 0 + }, + "margin": { + "t": 60 + }, + "template": { + "data": { + "bar": [ + { + "error_x": { + "color": "#2a3f5f" + }, + "error_y": { + "color": "#2a3f5f" + }, + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "bar" + } + ], + "barpolar": [ + { + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "barpolar" + } + ], + "carpet": [ + { + "aaxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "baxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "type": "carpet" + } + ], + "choropleth": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "choropleth" + } + ], + "contour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 
0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "contour" + } + ], + "contourcarpet": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "contourcarpet" + } + ], + "heatmap": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmap" + } + ], + "heatmapgl": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmapgl" + } + ], + "histogram": [ + { + "marker": { + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "histogram" + } + ], + "histogram2d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + 
], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2d" + } + ], + "histogram2dcontour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2dcontour" + } + ], + "mesh3d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "mesh3d" + } + ], + "parcoords": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "parcoords" + } + ], + "pie": [ + { + "automargin": true, + "type": "pie" + } + ], + "scatter": [ + { + "fillpattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + }, + "type": "scatter" + } + ], + "scatter3d": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatter3d" + } + ], + "scattercarpet": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattercarpet" + } + ], + "scattergeo": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergeo" + } + ], + "scattergl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergl" + } + ], + "scattermapbox": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattermapbox" + } + ], + "scatterpolar": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolar" + } + ], + "scatterpolargl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": 
"scatterpolargl" + } + ], + "scatterternary": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterternary" + } + ], + "surface": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "surface" + } + ], + "table": [ + { + "cells": { + "fill": { + "color": "#EBF0F8" + }, + "line": { + "color": "white" + } + }, + "header": { + "fill": { + "color": "#C8D4E3" + }, + "line": { + "color": "white" + } + }, + "type": "table" + } + ] + }, + "layout": { + "annotationdefaults": { + "arrowcolor": "#2a3f5f", + "arrowhead": 0, + "arrowwidth": 1 + }, + "autotypenumbers": "strict", + "coloraxis": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "colorscale": { + "diverging": [ + [ + 0, + "#8e0152" + ], + [ + 0.1, + "#c51b7d" + ], + [ + 0.2, + "#de77ae" + ], + [ + 0.3, + "#f1b6da" + ], + [ + 0.4, + "#fde0ef" + ], + [ + 0.5, + "#f7f7f7" + ], + [ + 0.6, + "#e6f5d0" + ], + [ + 0.7, + "#b8e186" + ], + [ + 0.8, + "#7fbc41" + ], + [ + 0.9, + "#4d9221" + ], + [ + 1, + "#276419" + ] + ], + "sequential": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "sequentialminus": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 
0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ] + }, + "colorway": [ + "#636efa", + "#EF553B", + "#00cc96", + "#ab63fa", + "#FFA15A", + "#19d3f3", + "#FF6692", + "#B6E880", + "#FF97FF", + "#FECB52" + ], + "font": { + "color": "#2a3f5f" + }, + "geo": { + "bgcolor": "white", + "lakecolor": "white", + "landcolor": "#E5ECF6", + "showlakes": true, + "showland": true, + "subunitcolor": "white" + }, + "hoverlabel": { + "align": "left" + }, + "hovermode": "closest", + "mapbox": { + "style": "light" + }, + "paper_bgcolor": "white", + "plot_bgcolor": "#E5ECF6", + "polar": { + "angularaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "radialaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "scene": { + "xaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "yaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "zaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + } + }, + "shapedefaults": { + "line": { + "color": "#2a3f5f" + } + }, + "ternary": { + "aaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "baxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "caxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "title": { + "x": 0.05 + }, + 
"xaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + }, + "yaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + } + } + }, + "xaxis": { + "anchor": "y", + "domain": [ + 0, + 1 + ], + "title": { + "text": "n_features_to_select" + } + }, + "yaxis": { + "anchor": "x", + "domain": [ + 0, + 1 + ], + "title": { + "text": "score" + } + } + } + }, + "text/html": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "import plotly.express as px\n", + "\n", + "px.line(\n", + " validation_scores_with_experiment_info,\n", + " x=\"n_features_to_select\",\n", + " y=\"score\",\n", + " color=\"ranker\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Indeed, we can see XGBoost, Infinite Selection and Decision Tree are solid contenders for this dataset.\n", + "\n", + "🙌🏻" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "--- \n", + "\n", + "This has shown how easy it is to do a large benchmark with `fseval`. Cheers!" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3.9.14 64-bit", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.14" + }, + "orig_nbformat": 4, + "vscode": { + "interpreter": { + "hash": "949777d72b0d2535278d3dc13498b2535136f6dfe0678499012e853ee9abcab1" + } + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/examples/comparing-feature-selectors/benchmark.py b/examples/comparing-feature-selectors/benchmark.py new file mode 100644 index 0000000..7dcd3be --- /dev/null +++ b/examples/comparing-feature-selectors/benchmark.py @@ -0,0 +1,53 @@ +import hydra +import numpy as np +from fseval.config import PipelineConfig +from fseval.main import run_pipeline +from infinite_selection import InfFS +from sklearn.base import BaseEstimator +from sklearn.feature_selection import chi2, f_classif, mutual_info_classif +from sklearn.preprocessing import minmax_scale +from stability_selection import StabilitySelection as RealStabilitySelection + + +class StabilitySelection(RealStabilitySelection): + def fit(self, X, y): + super(StabilitySelection, self).fit(X, y) + 
self.support_ = self.get_support() + self.feature_importances_ = np.max(self.stability_scores_, axis=1) + + +class InfiniteSelectionEstimator(BaseEstimator): + def fit(self, X, y): + inf = InfFS() + [RANKED, WEIGHT] = inf.infFS(X, y, alpha=0.5, supervision=1, verbose=1) + + self.feature_importances_ = WEIGHT + self.ranking_ = RANKED + + +class Chi2Classifier(BaseEstimator): + def fit(self, X, y): + X = minmax_scale(X) + scores, _ = chi2(X, y) + self.feature_importances_ = scores + + +class ANOVAFValueClassifier(BaseEstimator): + def fit(self, X, y): + scores, _ = f_classif(X, y) + self.feature_importances_ = scores + + +class MutualInfoClassifier(BaseEstimator): + def fit(self, X, y): + scores = mutual_info_classif(X, y) + self.feature_importances_ = scores + + +@hydra.main(config_path="conf", config_name="my_config", version_base="1.1") +def main(cfg: PipelineConfig) -> None: + run_pipeline(cfg) + + +if __name__ == "__main__": + main() diff --git a/examples/comparing-feature-selectors/conf/dataset/synthetic.yaml b/examples/comparing-feature-selectors/conf/dataset/synthetic.yaml new file mode 100644 index 0000000..07d1835 --- /dev/null +++ b/examples/comparing-feature-selectors/conf/dataset/synthetic.yaml @@ -0,0 +1,13 @@ +name: My synthetic dataset +task: classification +adapter: + _target_: sklearn.datasets.make_classification + n_samples: 10000 + n_informative: 2 + n_classes: 2 + n_features: 20 + n_redundant: 0 + random_state: 0 + shuffle: false +feature_importances: + X[:, 0:2]: 1.0 diff --git a/examples/comparing-feature-selectors/conf/my_config.yaml b/examples/comparing-feature-selectors/conf/my_config.yaml new file mode 100644 index 0000000..bcbcc27 --- /dev/null +++ b/examples/comparing-feature-selectors/conf/my_config.yaml @@ -0,0 +1,9 @@ +defaults: + - base_pipeline_config + - _self_ + - override dataset: synthetic + - override validator: knn + - override /callbacks: + - to_sql + +n_bootstraps: 1 diff --git 
a/examples/comparing-feature-selectors/conf/ranker/anova.yaml b/examples/comparing-feature-selectors/conf/ranker/anova.yaml new file mode 100644 index 0000000..f6b4299 --- /dev/null +++ b/examples/comparing-feature-selectors/conf/ranker/anova.yaml @@ -0,0 +1,5 @@ +name: ANOVA F-value +estimator: + _target_: benchmark.ANOVAFValueClassifier +_estimator_type: classifier +estimates_feature_importances: true diff --git a/examples/comparing-feature-selectors/conf/ranker/boruta.yaml b/examples/comparing-feature-selectors/conf/ranker/boruta.yaml new file mode 100644 index 0000000..ef7e0e4 --- /dev/null +++ b/examples/comparing-feature-selectors/conf/ranker/boruta.yaml @@ -0,0 +1,11 @@ +name: Boruta +estimator: + _target_: boruta.boruta_py.BorutaPy + estimator: + _target_: sklearn.ensemble.RandomForestClassifier + n_estimators: auto +_estimator_type: classifier +multioutput: false +estimates_feature_importances: false +estimates_feature_support: true +estimates_feature_ranking: true \ No newline at end of file diff --git a/examples/comparing-feature-selectors/conf/ranker/chi2.yaml b/examples/comparing-feature-selectors/conf/ranker/chi2.yaml new file mode 100644 index 0000000..ad6be06 --- /dev/null +++ b/examples/comparing-feature-selectors/conf/ranker/chi2.yaml @@ -0,0 +1,6 @@ +name: Chi-Squared +estimator: + _target_: benchmark.Chi2Classifier +_estimator_type: classifier +requires_positive_X: true +estimates_feature_importances: true \ No newline at end of file diff --git a/examples/comparing-feature-selectors/conf/ranker/decision_tree_classifier.yaml b/examples/comparing-feature-selectors/conf/ranker/decision_tree_classifier.yaml new file mode 100644 index 0000000..83b379e --- /dev/null +++ b/examples/comparing-feature-selectors/conf/ranker/decision_tree_classifier.yaml @@ -0,0 +1,7 @@ +name: Decision Tree +estimator: + _target_: sklearn.tree.DecisionTreeClassifier +_estimator_type: classifier +multioutput: true +estimates_feature_importances: true +estimates_target: true 
\ No newline at end of file diff --git a/examples/comparing-feature-selectors/conf/ranker/infinite_selection.yaml b/examples/comparing-feature-selectors/conf/ranker/infinite_selection.yaml new file mode 100644 index 0000000..2dfbb23 --- /dev/null +++ b/examples/comparing-feature-selectors/conf/ranker/infinite_selection.yaml @@ -0,0 +1,6 @@ +name: Infinite Selection +estimator: + _target_: benchmark.InfiniteSelectionEstimator +_estimator_type: classifier +estimates_feature_importances: true +estimates_feature_ranking: true \ No newline at end of file diff --git a/examples/comparing-feature-selectors/conf/ranker/multisurf_classifier.yaml b/examples/comparing-feature-selectors/conf/ranker/multisurf_classifier.yaml new file mode 100644 index 0000000..a9370af --- /dev/null +++ b/examples/comparing-feature-selectors/conf/ranker/multisurf_classifier.yaml @@ -0,0 +1,6 @@ +name: MultiSURF +estimator: + _target_: skrebate.MultiSURF +_estimator_type: classifier +multioutput: false +estimates_feature_importances: true \ No newline at end of file diff --git a/examples/comparing-feature-selectors/conf/ranker/mutual_info.yaml b/examples/comparing-feature-selectors/conf/ranker/mutual_info.yaml new file mode 100644 index 0000000..bcec11c --- /dev/null +++ b/examples/comparing-feature-selectors/conf/ranker/mutual_info.yaml @@ -0,0 +1,6 @@ +name: Mutual Info +estimator: + _target_: benchmark.MutualInfoClassifier +_estimator_type: classifier +multioutput: false +estimates_feature_importances: true diff --git a/examples/comparing-feature-selectors/conf/ranker/relieff_classifier.yaml b/examples/comparing-feature-selectors/conf/ranker/relieff_classifier.yaml new file mode 100644 index 0000000..2571683 --- /dev/null +++ b/examples/comparing-feature-selectors/conf/ranker/relieff_classifier.yaml @@ -0,0 +1,5 @@ +name: ReliefF +estimator: + _target_: skrebate.ReliefF +_estimator_type: classifier +estimates_feature_importances: true \ No newline at end of file diff --git 
a/examples/comparing-feature-selectors/conf/ranker/stability_selection.yaml b/examples/comparing-feature-selectors/conf/ranker/stability_selection.yaml new file mode 100644 index 0000000..0e212cb --- /dev/null +++ b/examples/comparing-feature-selectors/conf/ranker/stability_selection.yaml @@ -0,0 +1,10 @@ +name: Stability Selection +estimator: + _target_: benchmark.StabilitySelection + base_estimator: + _target_: sklearn.linear_model.LogisticRegression + penalty: l2 + bootstrap_func: stratified +_estimator_type: classifier +estimates_feature_importances: true +estimates_feature_support: true \ No newline at end of file diff --git a/examples/comparing-feature-selectors/conf/ranker/xgb_classifier.yaml b/examples/comparing-feature-selectors/conf/ranker/xgb_classifier.yaml new file mode 100644 index 0000000..94fb1e8 --- /dev/null +++ b/examples/comparing-feature-selectors/conf/ranker/xgb_classifier.yaml @@ -0,0 +1,8 @@ +name: XGBoost +estimator: + _target_: xgboost.XGBClassifier + use_label_encoder: False +_estimator_type: classifier +multioutput: false +estimates_feature_importances: true +estimates_target: true \ No newline at end of file diff --git a/examples/comparing-feature-selectors/conf/validator/knn.yaml b/examples/comparing-feature-selectors/conf/validator/knn.yaml new file mode 100644 index 0000000..7a3b4c5 --- /dev/null +++ b/examples/comparing-feature-selectors/conf/validator/knn.yaml @@ -0,0 +1,6 @@ +name: k-NN +estimator: + _target_: sklearn.neighbors.KNeighborsClassifier +_estimator_type: classifier +multioutput: false +estimates_target: true diff --git a/examples/comparing-feature-selectors/requirements.txt b/examples/comparing-feature-selectors/requirements.txt new file mode 100644 index 0000000..0f63acd --- /dev/null +++ b/examples/comparing-feature-selectors/requirements.txt @@ -0,0 +1,6 @@ +fseval +-e git+https://github.com/dunnkers/infinite-selection.git@6c9db1d5fe1b12bc34eb2af5893a4f3ca385aaff#egg=infinite_selection +-e 
git+https://github.com/dunnkers/stability-selection.git@baf54e7526bbce57d80871fcd93cdfdd67972a43#egg=stability_selection +Boruta>=0.3 +skrebate>=0.62 +xgboost>=1 diff --git a/website/docs/recipes/comparing-feature-selectors.md b/website/docs/recipes/comparing-feature-selectors.md new file mode 100644 index 0000000..35fc8d4 --- /dev/null +++ b/website/docs/recipes/comparing-feature-selectors.md @@ -0,0 +1,617 @@ +# Comparing Feature Selectors +Hi! You want to compare the performance of multiple feature selectors? This is an example Notebook, showing you how to do such an analysis. + +## Prerequisites + +We are going to use more or less the same configuration as we did in the [Quick start](../../quick-start) example, but with more Feature Selectors. Again, start by downloading the example project: [comparing-feature-selectors.zip](pathname:///fseval/zipped-examples/comparing-feature-selectors.zip) + +### Installing the required packages + +Now, let's install the required packages. Make sure you are in the `comparing-feature-selectors` folder, containing the `requirements.txt` file, and then run the following: + +``` +pip install -r requirements.txt +``` + +## Running the experiment + +Run the following command to start the experiment: + +``` +python benchmark.py --multirun ranker="glob(*)" +callbacks.to_sql.url="sqlite:////tmp/results.sqlite" +``` + +## Analyzing the results + +There should now be a `.sqlite` file at this path: `/tmp/results.sqlite`: + + ``` + $ ls -al /tmp/results.sqlite + -rw-r--r-- 1 vscode vscode 20480 Sep 21 08:16 /tmp/results.sqlite + ``` + +Is that the case? Then let's now analyze the results! 📈 + +We will install `plotly-express`, so we can make nice plots later. + + +```python +%pip install plotly-express nbconvert --quiet +``` + + +Next, let's point to where our results are stored. In this case, they are in a local SQLite database, located at `/tmp/results.sqlite`. 
+ + +```python +import os + +con: str = "sqlite:////tmp/results.sqlite" +con +``` + + + + + 'sqlite:////tmp/results.sqlite' + + + +Now, we can read the `experiments` table. + + +```python +import pandas as pd + +experiments: pd.DataFrame = pd.read_sql_table("experiments", con=con, index_col="id") +experiments +``` + + + + +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| id | dataset | dataset/n | dataset/p | dataset/task | dataset/group | dataset/domain | ranker | validator | local_dir | date_created |
|---|---|---|---|---|---|---|---|---|---|---|
| 3lllxl48 | My synthetic dataset | 10000 | 20 | classification | None | None | ANOVA F-value | k-NN | /workspaces/fseval/examples/comparing-feature-... | 2022-10-22 14:28:27.506838 |
| 1944ropg | My synthetic dataset | 10000 | 20 | classification | None | None | Boruta | k-NN | /workspaces/fseval/examples/comparing-feature-... | 2022-10-22 14:28:31.230633 |
| 31gd56gf | My synthetic dataset | 10000 | 20 | classification | None | None | Chi-Squared | k-NN | /workspaces/fseval/examples/comparing-feature-... | 2022-10-22 14:29:19.633012 |
| a8washm5 | My synthetic dataset | 10000 | 20 | classification | None | None | Decision Tree | k-NN | /workspaces/fseval/examples/comparing-feature-... | 2022-10-22 14:29:23.459190 |
| 27i7uwg4 | My synthetic dataset | 10000 | 20 | classification | None | None | Infinite Selection | k-NN | /workspaces/fseval/examples/comparing-feature-... | 2022-10-22 14:29:27.506974 |
| 3velt3b9 | My synthetic dataset | 10000 | 20 | classification | None | None | MultiSURF | k-NN | /workspaces/fseval/examples/comparing-feature-... | 2022-10-22 14:29:31.758090 |
| 3fdrxlt6 | My synthetic dataset | 10000 | 20 | classification | None | None | Mutual Info | k-NN | /workspaces/fseval/examples/comparing-feature-... | 2022-10-22 14:35:04.289361 |
| 14lecx0g | My synthetic dataset | 10000 | 20 | classification | None | None | ReliefF | k-NN | /workspaces/fseval/examples/comparing-feature-... | 2022-10-22 14:35:08.614262 |
| 3sggjvu3 | My synthetic dataset | 10000 | 20 | classification | None | None | Stability Selection | k-NN | /workspaces/fseval/examples/comparing-feature-... | 2022-10-22 14:35:59.121416 |
| dtt8bvo5 | My synthetic dataset | 10000 | 20 | classification | None | None | XGBoost | k-NN | /workspaces/fseval/examples/comparing-feature-... | 2022-10-22 14:36:23.385401 |
+
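Before reading further tables, it can help to check which tables the `to_sql` callback actually wrote. The following is a minimal sketch using only the standard-library `sqlite3` module. The helper name `list_tables` is hypothetical (it is not part of `fseval`), and the in-memory database merely stands in for `/tmp/results.sqlite`.

```python
import sqlite3
from typing import List


def list_tables(connection: sqlite3.Connection) -> List[str]:
    """Return the names of all tables in a SQLite database, sorted by name.

    Hypothetical helper for illustration; not part of fseval.
    """
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Assumption: an in-memory database stands in for /tmp/results.sqlite here,
# with two empty tables named like the ones read in this recipe.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE experiments (id TEXT, ranker TEXT)")
demo.execute("CREATE TABLE validation_scores (id TEXT, score REAL)")

tables = list_tables(demo)
print(tables)
```

For the real results, `sqlite3.connect("/tmp/results.sqlite")` would take the place of the in-memory connection.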
+ + + +Let's also read in the `validation_scores`. + + +```python +validation_scores: pd.DataFrame = pd.read_sql_table("validation_scores", con=con, index_col="id") +validation_scores +``` + + + + +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
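| id | index | n_features_to_select | fit_time | score | bootstrap_state |
|---|---|---|---|---|---|
| 3lllxl48 | 0 | 1 | 0.004433 | 0.7955 | 1 |
| 3lllxl48 | 0 | 2 | 0.004227 | 0.7910 | 1 |
| 3lllxl48 | 0 | 3 | 0.005183 | 0.7950 | 1 |
| 3lllxl48 | 0 | 4 | 0.003865 | 0.7965 | 1 |
| 3lllxl48 | 0 | 5 | 0.002902 | 0.7950 | 1 |
| ... | ... | ... | ... | ... | ... |
| dtt8bvo5 | 0 | 16 | 0.000670 | 0.7805 | 1 |
| dtt8bvo5 | 0 | 17 | 0.000480 | 0.7725 | 1 |
| dtt8bvo5 | 0 | 18 | 0.003159 | 0.7760 | 1 |
| dtt8bvo5 | 0 | 19 | 0.000848 | 0.7650 | 1 |
| dtt8bvo5 | 0 | 20 | 0.000565 | 0.7590 | 1 |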
+

160 rows × 5 columns

+
+ + + +We can now merge them. Notice that we set the experiment ID as the _index_ of both tables, so we can use `pd.DataFrame.join` to do this. + + +```python +validation_scores_with_experiment_info = experiments.join( + validation_scores +) +validation_scores_with_experiment_info.head(1) +``` + + + + +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
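| id | dataset | dataset/n | dataset/p | dataset/task | dataset/group | dataset/domain | ranker | validator | local_dir | date_created | index | n_features_to_select | fit_time | score | bootstrap_state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14lecx0g | My synthetic dataset | 10000 | 20 | classification | None | None | ReliefF | k-NN | /workspaces/fseval/examples/comparing-feature-... | 2022-10-22 14:35:08.614262 | NaN | NaN | NaN | NaN | NaN |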
+
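The `NaN` values in the row above follow from how `pd.DataFrame.join` works: by default it is a left join on the index, so an experiment without any validation scores is kept and its score columns are filled with `NaN`. The sketch below illustrates this on made-up data; the frames `experiments_demo` and `scores_demo` and the IDs `exp_a` and `exp_b` are hypothetical and not taken from the benchmark.

```python
import pandas as pd

# Hypothetical stand-in for the `experiments` table: one row per experiment ID.
experiments_demo = pd.DataFrame(
    {"ranker": ["ReliefF", "XGBoost"]},
    index=pd.Index(["exp_a", "exp_b"], name="id"),
)

# Hypothetical stand-in for `validation_scores`: several rows per experiment ID,
# and none at all for "exp_a" (as if that run had failed).
scores_demo = pd.DataFrame(
    {"n_features_to_select": [1, 2], "score": [0.78, 0.80]},
    index=pd.Index(["exp_b", "exp_b"], name="id"),
)

# Default left join: every experiment is kept, unmatched ones get NaN scores.
joined = experiments_demo.join(scores_demo)

# Inner join: only experiments that have at least one validation score remain.
inner = experiments_demo.join(scores_demo, how="inner")

print(joined)
print(inner)
```

Whether to keep the left join or switch to `how="inner"` is a design choice: the left join keeps failed experiments visible, while the inner join gives a table with scores only.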
+ + + +Cool! That will be all the information that we need. Let's first create an overview for all the rankers we benchmarked. + + +```python +validation_scores_with_experiment_info \ + .groupby("ranker") \ + .mean(numeric_only=True) \ + .sort_values("score", ascending=False) +``` + + + + +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
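| ranker | dataset/n | dataset/p | index | n_features_to_select | fit_time | score | bootstrap_state |
|---|---|---|---|---|---|---|---|
| Infinite Selection | 10000.0 | 20.0 | 0.0 | 10.5 | 0.004600 | 0.818925 | 1.0 |
| XGBoost | 10000.0 | 20.0 | 0.0 | 10.5 | 0.002998 | 0.818575 | 1.0 |
| Decision Tree | 10000.0 | 20.0 | 0.0 | 10.5 | 0.002810 | 0.817675 | 1.0 |
| Stability Selection | 10000.0 | 20.0 | 0.0 | 10.5 | 0.002406 | 0.803325 | 1.0 |
| Chi-Squared | 10000.0 | 20.0 | 0.0 | 10.5 | 0.002548 | 0.795975 | 1.0 |
| ANOVA F-value | 10000.0 | 20.0 | 0.0 | 10.5 | 0.003745 | 0.789275 | 1.0 |
| Mutual Info | 10000.0 | 20.0 | 0.0 | 10.5 | 0.002314 | 0.786475 | 1.0 |
| Boruta | 10000.0 | 20.0 | 0.0 | 10.5 | 0.002366 | 0.518075 | 1.0 |
| MultiSURF | 10000.0 | 20.0 | NaN | NaN | NaN | NaN | NaN |
| ReliefF | 10000.0 | 20.0 | NaN | NaN | NaN | NaN | NaN |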
+
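Rows with `NaN` in an overview like this one belong to rankers that have no validation scores at all. A minimal sketch for listing them explicitly is given here, again on made-up data; `overview_demo` and its values are hypothetical, and the real input would be `validation_scores_with_experiment_info`.

```python
import pandas as pd

# Hypothetical joined table: "ReliefF" has no score, as a failed run would.
overview_demo = pd.DataFrame(
    {
        "ranker": ["XGBoost", "XGBoost", "ReliefF"],
        "score": [0.80, 0.82, float("nan")],
    }
)

# Mean score per ranker; a group that contains only NaN yields NaN.
mean_scores = overview_demo.groupby("ranker")["score"].mean()

# Rankers without any validation score.
missing = sorted(mean_scores[mean_scores.isna()].index)

# Rankers with scores, best first.
completed = mean_scores.dropna().sort_values(ascending=False)

print(missing)
print(completed)
```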
+ + + +Already, we notice that MultiSURF and ReliefF are missing. This is because the experiments failed. That can happen in a big benchmark! We will ignore this for now and continue with the other Feature Selectors. + +👀 We can already observe that the _average_ classification accuracy is the highest for Infinite Selection. Although it would be premature to say it is the best, this is an indication that it did well for this dataset. + +Let's plot the results _per_ `n_features_to_select`. Note that `n_features_to_select` means a validation step was run using a feature subset of size `n_features_to_select`. + + +```python +import plotly.express as px + +px.line( + validation_scores_with_experiment_info, + x="n_features_to_select", + y="score", + color="ranker" +) +``` + + +![feature selectors comparison plot](/img/recipes/feature-selectors-comparison-plot.png) + + +Indeed, we can see XGBoost, Infinite Selection and Decision Tree are solid contenders for this dataset. + +🙌🏻 + +--- + +This has shown how easy it is to do a large benchmark with `fseval`. Cheers! diff --git a/website/static/img/recipes/feature-selectors-comparison-plot.png b/website/static/img/recipes/feature-selectors-comparison-plot.png new file mode 100644 index 0000000..6f32139 Binary files /dev/null and b/website/static/img/recipes/feature-selectors-comparison-plot.png differ diff --git a/website/static/zipped-examples/comparing-feature-selectors.zip b/website/static/zipped-examples/comparing-feature-selectors.zip new file mode 100644 index 0000000..24e3f0c Binary files /dev/null and b/website/static/zipped-examples/comparing-feature-selectors.zip differ