diff --git a/.travis.yml b/.travis.yml
index 7ed24724..fded3a97 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -12,13 +12,6 @@ matrix:
env: PYTHON_VERSION="3.7" COVERAGE="true" DASK_ML_VERSION="1.0.0"
before_install:
- wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
- - name: "Python 3.7 on macOS"
- os: osx
- osx_image: xcode10.2 # Python 3.7.2 running on macOS 10.14.3
- language: shell # 'language: python' is an error on Travis CI macOS
- env: PYTHON_VERSION="3.7" DASK_ML_VERSION="1.0.0"
- before_install:
- - wget https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -O miniconda.sh
install: source ./ci/.travis_install.sh
script: bash ./ci/.travis_test.sh
after_success:
diff --git a/docs/examples/index.html b/docs/examples/index.html
index 75b43b91..ba807942 100644
--- a/docs/examples/index.html
+++ b/docs/examples/index.html
@@ -203,7 +203,7 @@
MAGIC Gamma Telescope |
diff --git a/docs/index.html b/docs/index.html
index aadc2af1..880d4d71 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -213,5 +213,5 @@
diff --git a/docs/search/search_index.json b/docs/search/search_index.json
index 4fe79512..11784d5d 100644
--- a/docs/search/search_index.json
+++ b/docs/search/search_index.json
@@ -12,7 +12,7 @@
},
{
"location": "/using/",
- "text": "What to expect from AutoML software\n\n\nAutomated machine learning (AutoML) takes a higher-level approach to machine learning than most practitioners are used to,\nso we've gathered a handful of guidelines on what to expect when running AutoML software such as TPOT.\n\n\nAutoML algorithms aren't intended to run for only a few minutes\n\n\n\nOf course, you \ncan\n run TPOT for only a few minutes and it will find a reasonably good pipeline for your dataset.\nHowever, if you don't run TPOT for long enough, it may not find the best possible pipeline for your dataset. It may even not\nfind any suitable pipeline at all, in which case a \nRuntimeError('A pipeline has not yet been optimized. Please call fit() first.')\n\nwill be raised.\nOften it is worthwhile to run multiple instances of TPOT in parallel for a long time (hours to days) to allow TPOT to thoroughly search\nthe pipeline space for your dataset.\n\n\nAutoML algorithms can take a long time to finish their search\n\n\n\nAutoML algorithms aren't as simple as fitting one model on the dataset; they are considering multiple machine learning algorithms\n(random forests, linear models, SVMs, etc.) in a pipeline with multiple preprocessing steps (missing value imputation, scaling,\nPCA, feature selection, etc.), the hyperparameters for all of the models and preprocessing steps, as well as multiple ways\nto ensemble or stack the algorithms within the pipeline.\n\n\nAs such, TPOT will take a while to run on larger datasets, but it's important to realize why. With the default TPOT settings\n(100 generations with 100 population size), TPOT will evaluate 10,000 pipeline configurations before finishing.\nTo put this number into context, think about a grid search of 10,000 hyperparameter combinations for a machine learning algorithm\nand how long that grid search will take. 
That is 10,000 model configurations to evaluate with 10-fold cross-validation,\nwhich means that roughly 100,000 models are fit and evaluated on the training data in one grid search.\nThat's a time-consuming procedure, even for simpler models like decision trees.\n\n\nTypical TPOT runs will take hours to days to finish (unless it's a small dataset), but you can always interrupt\nthe run partway through and see the best results so far. TPOT also \nprovides\n a \nwarm_start\n parameter that\nlets you restart a TPOT run from where it left off.\n\n\nAutoML algorithms can recommend different solutions for the same dataset\n\n\n\nIf you're working with a reasonably complex dataset or run TPOT for a short amount of time, different TPOT runs\nmay result in different pipeline recommendations. TPOT's optimization algorithm is stochastic in nature, which means\nthat it uses randomness (in part) to search the possible pipeline space. When two TPOT runs recommend different\npipelines, this means that the TPOT runs didn't converge due to lack of time \nor\n that multiple pipelines\nperform more-or-less the same on your dataset.\n\n\nThis is actually an advantage over fixed grid search techniques: TPOT is meant to be an assistant that gives\nyou ideas on how to solve a particular machine learning problem by exploring pipeline configurations that you\nmight have never considered, then leaves the fine-tuning to more constrained parameter tuning techniques such\nas grid search.\n\n\nTPOT with code\n\n\nWe've taken care to design the TPOT interface to be as similar as possible to scikit-learn.\n\n\nTPOT can be imported just like any regular Python module. To import TPOT, type:\n\n\nfrom tpot import TPOTClassifier\n\n\n\n\nthen create an instance of TPOT as follows:\n\n\npipeline_optimizer = TPOTClassifier()\n\n\n\n\nIt's also possible to use TPOT for regression problems with the \nTPOTRegressor\n class. 
Other than the class name,\na \nTPOTRegressor\n is used the same way as a \nTPOTClassifier\n. You can read more about the \nTPOTClassifier\n and \nTPOTRegressor\n classes in the \nAPI documentation\n.\n\n\nSome example code with custom TPOT parameters might look like:\n\n\npipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,\n random_state=42, verbosity=2)\n\n\n\n\nNow TPOT is ready to optimize a pipeline for you. You can tell TPOT to optimize a pipeline based on a data set with the \nfit\n function:\n\n\npipeline_optimizer.fit(X_train, y_train)\n\n\n\n\nThe \nfit\n function initializes the genetic programming algorithm to find the highest-scoring pipeline based on average k-fold cross-validation\nThen, the pipeline is trained on the entire set of provided samples, and the TPOT instance can be used as a fitted model.\n\n\nYou can then proceed to evaluate the final pipeline on the testing set with the \nscore\n function:\n\n\nprint(pipeline_optimizer.score(X_test, y_test))\n\n\n\n\nFinally, you can tell TPOT to export the corresponding Python code for the optimized pipeline to a text file with the \nexport\n function:\n\n\npipeline_optimizer.export('tpot_exported_pipeline.py')\n\n\n\n\nOnce this code finishes running, \ntpot_exported_pipeline.py\n will contain the Python code for the optimized pipeline.\n\n\nBelow is a full example script using TPOT to optimize a pipeline, score it, and export the best pipeline to a file.\n\n\nfrom tpot import TPOTClassifier\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split\n\ndigits = load_digits()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n train_size=0.75, test_size=0.25)\n\npipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,\n random_state=42, verbosity=2)\npipeline_optimizer.fit(X_train, y_train)\nprint(pipeline_optimizer.score(X_test, 
y_test))\npipeline_optimizer.export('tpot_exported_pipeline.py')\n\n\n\n\nCheck our \nexamples\n to see TPOT applied to some specific data sets.\n\n\nTPOT on the command line\n\n\nTo use TPOT via the command line, enter the following command with a path to the data file:\n\n\ntpot /path_to/data_file.csv\n\n\n\n\nAn example command-line call to TPOT may look like:\n\n\ntpot data/mnist.csv -is , -target class -o tpot_exported_pipeline.py -g 5 -p 20 -cv 5 -s 42 -v 2\n\n\n\n\nTPOT offers several arguments that can be provided at the command line. To see brief descriptions of these arguments,\nenter the following command:\n\n\ntpot --help\n\n\n\n\nDetailed descriptions of the command-line arguments are below.\n\n\n\n\n\n\nArgument\n\n\nParameter\n\n\nValid values\n\n\nEffect\n\n\n\n\n\n\n-is\n\n\nINPUT_SEPARATOR\n\n\nAny string\n\n\nCharacter used to separate columns in the input file.\n\n\n\n\n\n\n-target\n\n\nTARGET_NAME\n\n\nAny string\n\n\nName of the target column in the input file.\n\n\n\n\n\n\n-mode\n\n\nTPOT_MODE\n\n\n['classification', 'regression']\n\n\nWhether TPOT is being used for a supervised classification or regression problem.\n\n\n\n\n\n\n-o\n\n\nOUTPUT_FILE\n\n\nString path to a file\n\n\nFile to export the code for the final optimized pipeline.\n\n\n\n\n\n\n-g\n\n\nGENERATIONS\n\n\nAny positive integer or None\n\n\nNumber of iterations to run the pipeline optimization process. It must be a positive number or None. If None, the parameter max_time_mins must be defined as the runtime limit. Generally, TPOT will work better when you give it more generations (and therefore time) to optimize the pipeline.\n\n\nTPOT will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE pipelines in total.\n\n\n\n\n\n\n-p\n\n\nPOPULATION_SIZE\n\n\nAny positive integer\n\n\nNumber of individuals to retain in the GP population every generation. 
Generally, TPOT will work better when you give it more individuals (and therefore time) to optimize the pipeline.\n\n\nTPOT will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE pipelines in total.\n\n\n\n\n\n\n-os\n\n\nOFFSPRING_SIZE\n\n\nAny positive integer\n\n\nNumber of offspring to produce in each GP generation.\n\n\nBy default, OFFSPRING_SIZE = POPULATION_SIZE.\n\n\n\n\n\n\n-mr\n\n\nMUTATION_RATE\n\n\n[0.0, 1.0]\n\n\nGP mutation rate in the range [0.0, 1.0]. This tells the GP algorithm how many pipelines to apply random changes to every generation.\n\n\nWe recommend using the default parameter unless you understand how the mutation rate affects GP algorithms.\n\n\n\n\n\n\n-xr\n\n\nCROSSOVER_RATE\n\n\n[0.0, 1.0]\n\n\nGP crossover rate in the range [0.0, 1.0]. This tells the GP algorithm how many pipelines to \"breed\" every generation.\n\n\nWe recommend using the default parameter unless you understand how the crossover rate affects GP algorithms.\n\n\n\n\n\n\n-scoring\n\n\nSCORING_FN\n\n\n'accuracy', 'adjusted_rand_score', 'average_precision', 'balanced_accuracy',\n'f1',\n'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'neg_log_loss', 'neg_mean_absolute_error',\n'neg_mean_squared_error', 'neg_median_absolute_error', 'precision', 'precision_macro', 'precision_micro',\n'precision_samples', 'precision_weighted',\n'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples',\n'recall_weighted', 'roc_auc', 'my_module.scorer_name*'\n\n\nFunction used to evaluate the quality of a given pipeline for the problem. 
By default, accuracy is used for classification and mean squared error (MSE) is used for regression.\n\n\nTPOT assumes that any function with \"error\" or \"loss\" in the name is meant to be minimized, whereas any other functions will be maximized.\n\n\nmy_module.scorer_name: You can also specify your own function or a full python path to an existing one.\n\n\nSee the section on \nscoring functions\n for more details.\n\n\n\n\n\n\n-cv\n\n\nCV\n\n\nAny integer > 1\n\n\nNumber of folds to evaluate each pipeline over in k-fold cross-validation during the TPOT optimization process.\n\n\n\n\n-sub\n\n\nSUBSAMPLE\n\n\n(0.0, 1.0]\n\n\nSubsample ratio of the training instance. Setting it to 0.5 means that TPOT randomly collects half of training samples for pipeline optimization process.\n\n\n\n\n\n\n-njobs\n\n\nNUM_JOBS\n\n\nAny positive integer or -1\n\n\nNumber of CPUs for evaluating pipelines in parallel during the TPOT optimization process.\n\n\nAssigning this to -1 will use as many cores as available on the computer. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used.\n\n\n\n\n\n\n-maxtime\n\n\nMAX_TIME_MINS\n\n\nAny positive integer\n\n\nHow many minutes TPOT has to optimize the pipeline.\n\n\nHow many minutes TPOT has to optimize the pipeline.If not None, this setting will allow TPOT to run until max_time_mins minutes elapsed and then stop. 
TPOT will stop earlier if generationsis set and all generations are already evaluated.\n\n\n\n\n\n\n-maxeval\n\n\nMAX_EVAL_MINS\n\n\nAny positive float\n\n\nHow many minutes TPOT has to evaluate a single pipeline.\n\n\nSetting this parameter to higher values will allow TPOT to consider more complex pipelines but will also allow TPOT to run longer.\n\n\n\n\n\n\n-s\n\n\nRANDOM_STATE\n\n\nAny positive integer\n\n\nRandom number generator seed for reproducibility.\n\n\nSet this seed if you want your TPOT run to be reproducible with the same seed and data set in the future.\n\n\n\n\n\n\n-config\n\n\nCONFIG_FILE\n\n\nString or file path\n\n\nOperators and parameter configurations in TPOT:\n\n\n\n\n\nPath for configuration file: TPOT will use the path to a configuration file for customizing the operators and parameters that TPOT uses in the optimization process\n\n\nstring 'TPOT light', TPOT will use a built-in configuration with only fast models and preprocessors\n\n\nstring 'TPOT MDR', TPOT will use a built-in configuration specialized for genomic studies\n\n\nstring 'TPOT sparse': TPOT will use a configuration dictionary with a one-hot encoder and the operators normally included in TPOT that also support sparse matrices.\n\n\n\nSee the \nbuilt-in configurations\n section for the list of configurations included with TPOT, and the \ncustom configuration\n section for more information and examples of how to create your own TPOT configurations.\n\n\n\n\n\n\n\n-template\n\n\nTEMPLATE\n\n\nString\n\n\nTemplate of predefined pipeline structure. The option is for specifying a desired structure for the machine learning pipeline evaluated in TPOT. So far this option only supports linear pipeline structure. Each step in the pipeline should be a main class of operators (Selector, Transformer, Classifier or Regressor) or a specific operator (e.g. `SelectPercentile`) defined in TPOT operator configuration. 
If one step is a main class, TPOT will randomly assign all subclass operators (subclasses of [`SelectorMixin`](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_selection/base.py#L17), [`TransformerMixin`](https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html), [`ClassifierMixin`](https://scikit-learn.org/stable/modules/generated/sklearn.base.ClassifierMixin.html) or [`RegressorMixin`](https://scikit-learn.org/stable/modules/generated/sklearn.base.RegressorMixin.html) in scikit-learn) to that step. Steps in the template are delimited by \"-\", e.g. \"SelectPercentile-Transformer-Classifier\". By default value of template is None, TPOT generates tree-based pipeline randomly.\n\nSee the \n template option in tpot\n section for more details.\n\n\n\n\n\n\n\n-memory\n\n\nMEMORY\n\n\nString or file path\n\n\nIf supplied, pipeline will cache each transformer after calling fit. This feature is used to avoid computing the fit transformers within a pipeline if the parameters and input data are identical with another fitted pipeline during optimization process. 
Memory caching mode in TPOT:\n\n\n\n\n\nPath for a caching directory: TPOT uses memory caching with the provided directory and TPOT does NOT clean the caching directory up upon shutdown.\n\n\nstring 'auto': TPOT uses memory caching with a temporary directory and cleans it up upon shutdown.\n\n\n\n\n\n\n\n\n\n\n-cf\n\n\nCHECKPOINT_FOLDER\n\n\nFolder path\n\n\n\nIf supplied, a folder you created, in which tpot will periodically save pipelines in pareto front so far while optimizing.\n\n\nThis is useful in multiple cases:\n\n\n\nsudden death before tpot could save an optimized pipeline\n\n\nprogress tracking\n\n\ngrabbing a pipeline while tpot is working\n\n\n\n\n\nExample:\n\n\nmkdir my_checkpoints\n\n\n-cf ./my_checkpoints\n\n\n\n\n\n-es\n\n\nEARLY_STOP\n\n\nAny positive integer\n\n\n\nHow many generations TPOT checks whether there is no improvement in optimization process.\n\n\nEnd optimization process if there is no improvement in the set number of generations.\n\n\n\n\n\n-v\n\n\nVERBOSITY\n\n\n{0, 1, 2, 3}\n\n\nHow much information TPOT communicates while it is running.\n\n\n0 = none, 1 = minimal, 2 = high, 3 = all.\n\n\nA setting of 2 or higher will add a progress bar during the optimization procedure.\n\n\n\n\n\n\n--no-update-check\n\n\nFlag indicating whether the TPOT version checker should be disabled.\n\n\n\n\n\n\n--version\n\n\nShow TPOT's version number and exit.\n\n\n\n\n\n\n--help\n\n\nShow TPOT's help documentation and exit.\n\n\n\n\n\n\n\nScoring functions\n\n\nTPOT makes use of \nsklearn.model_selection.cross_val_score\n for evaluating pipelines, and as such offers the same support for scoring functions. There are two ways to make use of scoring functions with TPOT:\n\n\n\n\n\n\nYou can pass in a string to the \nscoring\n parameter from the list above. 
Any other strings will cause TPOT to throw an exception.\n\n\n\n\n\n\nYou can pass the callable object/function with signature \nscorer(estimator, X, y)\n, where \nestimator\n is trained estimator to use for scoring, \nX\n are features that will be passed to \nestimator.predict\n and \ny\n are target values for \nX\n. To do this, you should implement your own function. See the example below for further explanation.\n\n\n\n\n\n\nfrom tpot import TPOTClassifier\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.metrics.scorer import make_scorer\n\ndigits = load_digits()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n train_size=0.75, test_size=0.25)\n# Make a custom metric function\ndef my_custom_accuracy(y_true, y_pred):\n return float(sum(y_pred == y_true)) / len(y_true)\n\n# Make a custom a scorer from the custom metric function\n# Note: greater_is_better=False in make_scorer below would mean that the scoring function should be minimized.\nmy_custom_scorer = make_scorer(my_custom_accuracy, greater_is_better=True)\n\ntpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,\n scoring=my_custom_scorer)\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_digits_pipeline.py')\n\n\n\n\n\n\nmy_module.scorer_name\n: You can also use a custom \nscore_func(y_true, y_pred)\n or \nscorer(estimator, X, y)\n function through the command line by adding the argument \n-scoring my_module.scorer\n to your command-line call. TPOT will import your module and use the custom scoring function from there. 
TPOT will include your current working directory when importing the module, so you can place it in the same directory where you are going to run TPOT.\nExample: \n-scoring sklearn.metrics.auc\n will use the function auc from sklearn.metrics module.\n\n\n\n\nBuilt-in TPOT configurations\n\n\nTPOT comes with a handful of default operators and parameter configurations that we believe work well for optimizing machine learning pipelines. Below is a list of the current built-in configurations that come with TPOT.\n\n\n\n\n\n\nConfiguration Name\n\n\nDescription\n\n\nOperators\n\n\n\n\n\n\n\nDefault TPOT\n\n\nTPOT will search over a broad range of preprocessors, feature constructors, feature selectors, models, and parameters to find a series of operators that minimize the error of the model predictions. Some of these operators are complex and may take a long time to run, especially on larger datasets.\n\n\n\nNote: This is the default configuration for TPOT.\n To use this configuration, use the default value (None) for the config_dict parameter.\n\n\nClassification\n\n\n\n\nRegression\n\n\n\n\n\n\n\nTPOT light\n\n\nTPOT will search over a restricted range of preprocessors, feature constructors, feature selectors, models, and parameters to find a series of operators that minimize the error of the model predictions. Only simpler and fast-running operators will be used in these pipelines, so TPOT light is useful for finding quick and simple pipelines for a classification or regression problem.\n\n\nThis configuration works for both the TPOTClassifier and TPOTRegressor.\n\n\nClassification\n\n\n\n\nRegression\n\n\n\n\n\n\n\nTPOT MDR\n\n\nTPOT will search over a series of feature selectors and \nMultifactor Dimensionality Reduction\n models to find a series of operators that maximize prediction accuracy. 
The TPOT MDR configuration is specialized for \ngenome-wide association studies (GWAS)\n, and is described in detail online \nhere\n.\n\n\nNote that TPOT MDR may be slow to run because the feature selection routines are computationally expensive, especially on large datasets.\n\n\nClassification\n\n\n\n\nRegression\n\n\n\n\n\n\n\nTPOT sparse\n\n\nTPOT uses a configuration dictionary with a one-hot encoder and the operators normally included in TPOT that also support sparse matrices.\n\n\nThis configuration works for both the TPOTClassifier and TPOTRegressor.\n\n\nClassification\n\n\n\n\nRegression\n\n\n\n\n\n\n\n\nTo use any of these configurations, simply pass the string name of the configuration to the \nconfig_dict\n parameter (or \n-config\n on the command line). For example, to use the \"TPOT light\" configuration:\n\n\nfrom tpot import TPOTClassifier\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split\n\ndigits = load_digits()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n train_size=0.75, test_size=0.25)\n\ntpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,\n config_dict='TPOT light')\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_digits_pipeline.py')\n\n\n\n\n\nCustomizing TPOT's operators and parameters\n\n\nBeyond the default configurations that come with TPOT, in some cases it is useful to limit the algorithms and parameters that TPOT considers. For that reason, we allow users to provide TPOT with a custom configuration for its operators and parameters.\n\n\nThe custom TPOT configuration must be in nested dictionary format, where the first level key is the path and name of the operator (e.g., \nsklearn.naive_bayes.MultinomialNB\n) and the second level key is the corresponding parameter name for that operator (e.g., \nfit_prior\n). 
The second level key should point to a list of parameter values for that parameter, e.g., \n'fit_prior': [True, False]\n.\n\n\nFor a simple example, the configuration could be:\n\n\ntpot_config = {\n 'sklearn.naive_bayes.GaussianNB': {\n },\n\n 'sklearn.naive_bayes.BernoulliNB': {\n 'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],\n 'fit_prior': [True, False]\n },\n\n 'sklearn.naive_bayes.MultinomialNB': {\n 'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],\n 'fit_prior': [True, False]\n }\n}\n\n\n\n\nin which case TPOT would only consider pipelines containing \nGaussianNB\n, \nBernoulliNB\n, \nMultinomialNB\n, and tune those algorithm's parameters in the ranges provided. This dictionary can be passed directly within the code to the \nTPOTClassifier\n/\nTPOTRegressor\n \nconfig_dict\n parameter, described above. For example:\n\n\nfrom tpot import TPOTClassifier\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split\n\ndigits = load_digits()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n train_size=0.75, test_size=0.25)\n\ntpot_config = {\n 'sklearn.naive_bayes.GaussianNB': {\n },\n\n 'sklearn.naive_bayes.BernoulliNB': {\n 'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],\n 'fit_prior': [True, False]\n },\n\n 'sklearn.naive_bayes.MultinomialNB': {\n 'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],\n 'fit_prior': [True, False]\n }\n}\n\ntpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,\n config_dict=tpot_config)\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_digits_pipeline.py')\n\n\n\n\nCommand-line users must create a separate \n.py\n file with the custom configuration and provide the path to the file to the \ntpot\n call. 
For example, if the simple example configuration above is saved in \ntpot_classifier_config.py\n, that configuration could be used on the command line with the command:\n\n\ntpot data/mnist.csv -is , -target class -config tpot_classifier_config.py -g 5 -p 20 -v 2 -o tpot_exported_pipeline.py\n\n\n\n\nWhen using the command-line interface, the configuration file specified in the \n-config\n parameter \nmust\n name its custom TPOT configuration \ntpot_config\n. Otherwise, TPOT will not be able to locate the configuration dictionary.\n\n\nFor more detailed examples of how to customize TPOT's operator configuration, see the default configurations for \nclassification\n and \nregression\n in TPOT's source code.\n\n\nNote that you must have all of the corresponding packages for the operators installed on your computer, otherwise TPOT will not be able to use them. For example, if XGBoost is not installed on your computer, then TPOT will simply not import nor use XGBoost in the pipelines it considers.\n\n\nTemplate option in TPOT\n\n\nTemplate option provides a way to specify a desired structure for machine learning pipeline, which may reduce TPOT computation time and potentially provide more interpretable results. Current implementation only supports linear pipelines.\n\n\nBelow is a simple example to use \ntemplate\n option. The pipelines generated/evaluated in TPOT will follow this structure: 1st step is a feature selector (a subclass of \nSelectorMixin\n), 2nd step is a feature transformer (a subclass of \nTransformerMixin\n) and 3rd step is a classifier for classification (a subclass of \nClassifierMixin\n). The last step must be \nClassifier\n for \nTPOTClassifier\n's template but \nRegressor\n for \nTPOTRegressor\n. 
\nNote: although \nSelectorMixin\n is subclass of \nTransformerMixin\n in scikit-learn, but \nTransformer\n in this option excludes those subclasses of \nSelectorMixin\n.\n\n\ntpot_obj = TPOTClassifier(\n template='Selector-Transformer-Classifier'\n )\n\n\n\n\nIf a specific operator, e.g. \nSelectPercentile\n, is preferred for usage in the 1st step of the pipeline, the template can be defined like 'SelectPercentile-Transformer-Classifier'.\n\n\nFeatureSetSelector in TPOT\n\n\nFeatureSetSelector\n is a special new operator in TPOT. This operator enables feature selection based on \npriori\n export knowledge. For example, in RNA-seq gene expression analysis, this operator can be used to select one or more gene (feature) set(s) based on GO (Gene Ontology) terms or annotated gene sets Molecular Signatures Database (\nMSigDB\n) in the 1st step of pipeline via \ntemplate\n option above, in order to reduce dimensions and TPOT computation time. This operator requires a dataset list in csv format. In this csv file, there are only three columns: 1st column is feature set names, 2nd column is the total number of features in one set and 3rd column is a list of feature names (if input X is pandas.DataFrame) or indexes (if input X is numpy.ndarray) delimited by \";\". 
Below is a example how to use this operator in TPOT.\n\n\nPlease check our \npreprint paper\n for more details.\n\n\nfrom tpot import TPOTClassifier\nimport numpy as np\nimport pandas as pd\nfrom tpot.config import classifier_config_dict\ntest_data = pd.read_csv(\"https://raw.githubusercontent.com/EpistasisLab/tpot/master/tests/tests.csv\")\ntest_X = test_data.drop(\"class\", axis=1)\ntest_y = test_data['class']\n\n# add FeatureSetSelector into tpot configuration\nclassifier_config_dict['tpot.builtins.FeatureSetSelector'] = {\n 'subset_list': ['https://raw.githubusercontent.com/EpistasisLab/tpot/master/tests/subset_test.csv'],\n 'sel_subset': [0,1] # select only one feature set, a list of index of subset in the list above\n #'sel_subset': list(combinations(range(3), 2)) # select two feature sets\n}\n\n\ntpot = TPOTClassifier(generations=5,\n population_size=50, verbosity=2,\n template='FeatureSetSelector-Transformer-Classifier',\n config_dict=classifier_config_dict)\ntpot.fit(test_X, test_y)\n\n\n\n\nPipeline caching in TPOT\n\n\nWith the \nmemory\n parameter, pipelines can cache the results of each transformer after fitting them. This feature is used to avoid repeated computation by transformers within a pipeline if the parameters and input data are identical to another fitted pipeline during optimization process. 
TPOT allows users to specify a custom directory path or \njoblib.Memory\n in case they want to re-use the memory cache in future TPOT runs (or a \nwarm_start\n run).\n\n\nThere are three methods for enabling memory caching in TPOT:\n\n\nfrom tpot import TPOTClassifier\nfrom tempfile import mkdtemp\nfrom joblib import Memory\nfrom shutil import rmtree\n\n# Method 1, auto mode: TPOT uses memory caching with a temporary directory and cleans it up upon shutdown\ntpot = TPOTClassifier(memory='auto')\n\n# Method 2, with a custom directory for memory caching\ntpot = TPOTClassifier(memory='/to/your/path')\n\n# Method 3, with a Memory object\ncachedir = mkdtemp() # Create a temporary folder\nmemory = Memory(cachedir=cachedir, verbose=0)\ntpot = TPOTClassifier(memory=memory)\n\n# Clear the cache directory when you don't need it anymore\nrmtree(cachedir)\n\n\n\n\nNote: TPOT does NOT clean up memory caches if users set a custom directory path or Memory object. We recommend that you clean up the memory caches when you don't need it anymore.\n\n\nCrash/freeze issue with n_jobs > 1 under OSX or Linux\n\n\nInternally, TPOT uses \njoblib\n to fit estimators in parallel.\nThis is the same parallelization framework used by scikit-learn. But it may crash/freeze with n_jobs > 1 under OSX or Linux \nas scikit-learn does\n, especially with large datasets.\n\n\nOne solution is to configure Python's \nmultiprocessing\n module to use the \nforkserver\n start method (instead of the default \nfork\n) to manage the process pools. 
You can enable the \nforkserver\n mode globally for your program by putting the following codes into your main script:\n\n\nimport multiprocessing\n\n# other imports, custom code, load data, define model...\n\nif __name__ == '__main__':\n multiprocessing.set_start_method('forkserver')\n\n # call scikit-learn utils or tpot utils with n_jobs > 1 here\n\n\n\n\nMore information about these start methods can be found in the \nmultiprocessing documentation\n.\n\n\nParallel Training with Dask\n\n\nFor large problems or working on Jupyter notebook, we highly recommend that you can distribute the work on a \nDask\n cluster.\nThe \ndask-examples binder\n has a runnable example\nwith a small dask cluster.\n\n\nTo use your Dask cluster to fit a TPOT model, specify the \nuse_dask\n keyword when you create the TPOT estimator. \nNote: if \nuse_dask=True\n, TPOT will use as many cores as available on the your Dask cluster. If \nn_jobs\n is specified, then it will control the chunk size (10*\nn_jobs\n if it is less then offspring size) of parallel training. \n\n\nestimator = TPOTEstimator(use_dask=True, n_jobs=-1)\n\n\n\n\nThis will use use all the workers on your cluster to do the training, and use \nDask-ML's pipeline rewriting\n to avoid re-fitting estimators multiple times on the same set of data.\nIt will also provide fine-grained diagnostics in the \ndistributed scheduler UI\n.\n\n\nAlternatively, Dask implements a joblib backend.\nYou can instruct TPOT to use the distributed backend during training by specifying a \njoblib.parallel_backend\n:\n\n\nimport joblib\nimport distributed.joblib\nfrom dask.distributed import Client\n\n# connect to the cluster\nclient = Client('schedueler-address')\n\n# create the estimator normally\nestimator = TPOTClassifier(n_jobs=-1)\n\n# perform the fit in this context manager\nwith joblib.parallel_backend(\"dask\"):\n estimator.fit(X, y)\n\n\n\n\nSee \ndask's distributed joblib integration\n for more.",
+ "text": "What to expect from AutoML software\n\n\nAutomated machine learning (AutoML) takes a higher-level approach to machine learning than most practitioners are used to,\nso we've gathered a handful of guidelines on what to expect when running AutoML software such as TPOT.\n\n\nAutoML algorithms aren't intended to run for only a few minutes\n\n\n\nOf course, you \ncan\n run TPOT for only a few minutes and it will find a reasonably good pipeline for your dataset.\nHowever, if you don't run TPOT for long enough, it may not find the best possible pipeline for your dataset. It may even not\nfind any suitable pipeline at all, in which case a \nRuntimeError('A pipeline has not yet been optimized. Please call fit() first.')\n\nwill be raised.\nOften it is worthwhile to run multiple instances of TPOT in parallel for a long time (hours to days) to allow TPOT to thoroughly search\nthe pipeline space for your dataset.\n\n\nAutoML algorithms can take a long time to finish their search\n\n\n\nAutoML algorithms aren't as simple as fitting one model on the dataset; they are considering multiple machine learning algorithms\n(random forests, linear models, SVMs, etc.) in a pipeline with multiple preprocessing steps (missing value imputation, scaling,\nPCA, feature selection, etc.), the hyperparameters for all of the models and preprocessing steps, as well as multiple ways\nto ensemble or stack the algorithms within the pipeline.\n\n\nAs such, TPOT will take a while to run on larger datasets, but it's important to realize why. With the default TPOT settings\n(100 generations with 100 population size), TPOT will evaluate 10,000 pipeline configurations before finishing.\nTo put this number into context, think about a grid search of 10,000 hyperparameter combinations for a machine learning algorithm\nand how long that grid search will take. 
That is 10,000 model configurations to evaluate with 10-fold cross-validation,\nwhich means that roughly 100,000 models are fit and evaluated on the training data in one grid search.\nThat's a time-consuming procedure, even for simpler models like decision trees.\n\n\nTypical TPOT runs will take hours to days to finish (unless it's a small dataset), but you can always interrupt\nthe run partway through and see the best results so far. TPOT also \nprovides\n a \nwarm_start\n parameter that\nlets you restart a TPOT run from where it left off.\n\n\nAutoML algorithms can recommend different solutions for the same dataset\n\n\n\nIf you're working with a reasonably complex dataset or run TPOT for a short amount of time, different TPOT runs\nmay result in different pipeline recommendations. TPOT's optimization algorithm is stochastic in nature, which means\nthat it uses randomness (in part) to search the possible pipeline space. When two TPOT runs recommend different\npipelines, this means that the TPOT runs didn't converge due to lack of time \nor\n that multiple pipelines\nperform more-or-less the same on your dataset.\n\n\nThis is actually an advantage over fixed grid search techniques: TPOT is meant to be an assistant that gives\nyou ideas on how to solve a particular machine learning problem by exploring pipeline configurations that you\nmight have never considered, then leaves the fine-tuning to more constrained parameter tuning techniques such\nas grid search.\n\n\nTPOT with code\n\n\nWe've taken care to design the TPOT interface to be as similar as possible to scikit-learn.\n\n\nTPOT can be imported just like any regular Python module. To import TPOT, type:\n\n\nfrom tpot import TPOTClassifier\n\n\n\n\nthen create an instance of TPOT as follows:\n\n\npipeline_optimizer = TPOTClassifier()\n\n\n\n\nIt's also possible to use TPOT for regression problems with the \nTPOTRegressor\n class. 
Other than the class name,\na \nTPOTRegressor\n is used the same way as a \nTPOTClassifier\n. You can read more about the \nTPOTClassifier\n and \nTPOTRegressor\n classes in the \nAPI documentation\n.\n\n\nSome example code with custom TPOT parameters might look like:\n\n\npipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,\n random_state=42, verbosity=2)\n\n\n\n\nNow TPOT is ready to optimize a pipeline for you. You can tell TPOT to optimize a pipeline based on a data set with the \nfit\n function:\n\n\npipeline_optimizer.fit(X_train, y_train)\n\n\n\n\nThe \nfit\n function initializes the genetic programming algorithm to find the highest-scoring pipeline based on average k-fold cross-validation\nThen, the pipeline is trained on the entire set of provided samples, and the TPOT instance can be used as a fitted model.\n\n\nYou can then proceed to evaluate the final pipeline on the testing set with the \nscore\n function:\n\n\nprint(pipeline_optimizer.score(X_test, y_test))\n\n\n\n\nFinally, you can tell TPOT to export the corresponding Python code for the optimized pipeline to a text file with the \nexport\n function:\n\n\npipeline_optimizer.export('tpot_exported_pipeline.py')\n\n\n\n\nOnce this code finishes running, \ntpot_exported_pipeline.py\n will contain the Python code for the optimized pipeline.\n\n\nBelow is a full example script using TPOT to optimize a pipeline, score it, and export the best pipeline to a file.\n\n\nfrom tpot import TPOTClassifier\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split\n\ndigits = load_digits()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n train_size=0.75, test_size=0.25)\n\npipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,\n random_state=42, verbosity=2)\npipeline_optimizer.fit(X_train, y_train)\nprint(pipeline_optimizer.score(X_test, 
y_test))\npipeline_optimizer.export('tpot_exported_pipeline.py')\n\n\n\n\nCheck our \nexamples\n to see TPOT applied to some specific data sets.\n\n\nTPOT on the command line\n\n\nTo use TPOT via the command line, enter the following command with a path to the data file:\n\n\ntpot /path_to/data_file.csv\n\n\n\n\nAn example command-line call to TPOT may look like:\n\n\ntpot data/mnist.csv -is , -target class -o tpot_exported_pipeline.py -g 5 -p 20 -cv 5 -s 42 -v 2\n\n\n\n\nTPOT offers several arguments that can be provided at the command line. To see brief descriptions of these arguments,\nenter the following command:\n\n\ntpot --help\n\n\n\n\nDetailed descriptions of the command-line arguments are below.\n\n\n\n\n\n\nArgument\n\n\nParameter\n\n\nValid values\n\n\nEffect\n\n\n\n\n\n\n-is\n\n\nINPUT_SEPARATOR\n\n\nAny string\n\n\nCharacter used to separate columns in the input file.\n\n\n\n\n\n\n-target\n\n\nTARGET_NAME\n\n\nAny string\n\n\nName of the target column in the input file.\n\n\n\n\n\n\n-mode\n\n\nTPOT_MODE\n\n\n['classification', 'regression']\n\n\nWhether TPOT is being used for a supervised classification or regression problem.\n\n\n\n\n\n\n-o\n\n\nOUTPUT_FILE\n\n\nString path to a file\n\n\nFile to export the code for the final optimized pipeline.\n\n\n\n\n\n\n-g\n\n\nGENERATIONS\n\n\nAny positive integer or None\n\n\nNumber of iterations to run the pipeline optimization process. It must be a positive number or None. If None, the parameter max_time_mins must be defined as the runtime limit. Generally, TPOT will work better when you give it more generations (and therefore time) to optimize the pipeline.\n\n\nTPOT will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE pipelines in total.\n\n\n\n\n\n\n-p\n\n\nPOPULATION_SIZE\n\n\nAny positive integer\n\n\nNumber of individuals to retain in the GP population every generation. 
Generally, TPOT will work better when you give it more individuals (and therefore time) to optimize the pipeline.\n\n\nTPOT will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE pipelines in total.\n\n\n\n\n\n\n-os\n\n\nOFFSPRING_SIZE\n\n\nAny positive integer\n\n\nNumber of offspring to produce in each GP generation.\n\n\nBy default, OFFSPRING_SIZE = POPULATION_SIZE.\n\n\n\n\n\n\n-mr\n\n\nMUTATION_RATE\n\n\n[0.0, 1.0]\n\n\nGP mutation rate in the range [0.0, 1.0]. This tells the GP algorithm how many pipelines to apply random changes to every generation.\n\n\nWe recommend using the default parameter unless you understand how the mutation rate affects GP algorithms.\n\n\n\n\n\n\n-xr\n\n\nCROSSOVER_RATE\n\n\n[0.0, 1.0]\n\n\nGP crossover rate in the range [0.0, 1.0]. This tells the GP algorithm how many pipelines to \"breed\" every generation.\n\n\nWe recommend using the default parameter unless you understand how the crossover rate affects GP algorithms.\n\n\n\n\n\n\n-scoring\n\n\nSCORING_FN\n\n\n'accuracy', 'adjusted_rand_score', 'average_precision', 'balanced_accuracy',\n'f1',\n'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'neg_log_loss', 'neg_mean_absolute_error',\n'neg_mean_squared_error', 'neg_median_absolute_error', 'precision', 'precision_macro', 'precision_micro',\n'precision_samples', 'precision_weighted',\n'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples',\n'recall_weighted', 'roc_auc', 'my_module.scorer_name*'\n\n\nFunction used to evaluate the quality of a given pipeline for the problem. 
By default, accuracy is used for classification and mean squared error (MSE) is used for regression.\n\n\nTPOT assumes that any function with \"error\" or \"loss\" in the name is meant to be minimized, whereas any other functions will be maximized.\n\n\nmy_module.scorer_name: You can also specify your own function or a full python path to an existing one.\n\n\nSee the section on \nscoring functions\n for more details.\n\n\n\n\n\n\n-cv\n\n\nCV\n\n\nAny integer > 1\n\n\nNumber of folds to evaluate each pipeline over in k-fold cross-validation during the TPOT optimization process.\n\n\n\n\n-sub\n\n\nSUBSAMPLE\n\n\n(0.0, 1.0]\n\n\nSubsample ratio of the training instances. Setting it to 0.5 means that TPOT randomly selects half of the training samples for the pipeline optimization process.\n\n\n\n\n\n\n-njobs\n\n\nNUM_JOBS\n\n\nAny positive integer or -1\n\n\nNumber of CPUs for evaluating pipelines in parallel during the TPOT optimization process.\n\n\nAssigning this to -1 will use as many cores as available on the computer. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used.\n\n\n\n\n\n\n-maxtime\n\n\nMAX_TIME_MINS\n\n\nAny positive integer\n\n\nHow many minutes TPOT has to optimize the pipeline.\n\n\nIf not None, this setting will allow TPOT to run until max_time_mins minutes have elapsed and then stop. 
TPOT will stop earlier if generations is set and all generations are already evaluated.\n\n\n\n\n\n\n-maxeval\n\n\nMAX_EVAL_MINS\n\n\nAny positive float\n\n\nHow many minutes TPOT has to evaluate a single pipeline.\n\n\nSetting this parameter to higher values will allow TPOT to consider more complex pipelines but will also allow TPOT to run longer.\n\n\n\n\n\n\n-s\n\n\nRANDOM_STATE\n\n\nAny positive integer\n\n\nRandom number generator seed for reproducibility.\n\n\nSet this seed if you want your TPOT run to be reproducible with the same seed and data set in the future.\n\n\n\n\n\n\n-config\n\n\nCONFIG_FILE\n\n\nString or file path\n\n\nOperators and parameter configurations in TPOT:\n\n\n\n\n\nPath for configuration file: TPOT will use the path to a configuration file for customizing the operators and parameters that TPOT uses in the optimization process\n\n\nstring 'TPOT light': TPOT will use a built-in configuration with only fast models and preprocessors\n\n\nstring 'TPOT MDR': TPOT will use a built-in configuration specialized for genomic studies\n\n\nstring 'TPOT sparse': TPOT will use a configuration dictionary with a one-hot encoder and the operators normally included in TPOT that also support sparse matrices.\n\n\n\nSee the \nbuilt-in configurations\n section for the list of configurations included with TPOT, and the \ncustom configuration\n section for more information and examples of how to create your own TPOT configurations.\n\n\n\n\n\n\n\n-template\n\n\nTEMPLATE\n\n\nString\n\n\nTemplate of predefined pipeline structure. The option is for specifying a desired structure for the machine learning pipeline evaluated in TPOT. So far this option only supports linear pipeline structure. Each step in the pipeline should be a main class of operators (Selector, Transformer, Classifier or Regressor) or a specific operator (e.g. `SelectPercentile`) defined in TPOT operator configuration. 
If one step is a main class, TPOT will randomly assign all subclass operators (subclasses of [`SelectorMixin`](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_selection/base.py#L17), [`TransformerMixin`](https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html), [`ClassifierMixin`](https://scikit-learn.org/stable/modules/generated/sklearn.base.ClassifierMixin.html) or [`RegressorMixin`](https://scikit-learn.org/stable/modules/generated/sklearn.base.RegressorMixin.html) in scikit-learn) to that step. Steps in the template are delimited by \"-\", e.g. \"SelectPercentile-Transformer-Classifier\". By default value of template is None, TPOT generates tree-based pipeline randomly.\n\nSee the \n template option in tpot\n section for more details.\n\n\n\n\n\n\n\n-memory\n\n\nMEMORY\n\n\nString or file path\n\n\nIf supplied, pipeline will cache each transformer after calling fit. This feature is used to avoid computing the fit transformers within a pipeline if the parameters and input data are identical with another fitted pipeline during optimization process. 
Memory caching mode in TPOT:\n\n\n\n\n\nPath for a caching directory: TPOT uses memory caching with the provided directory and TPOT does NOT clean the caching directory up upon shutdown.\n\n\nstring 'auto': TPOT uses memory caching with a temporary directory and cleans it up upon shutdown.\n\n\n\n\n\n\n\n\n\n\n-cf\n\n\nCHECKPOINT_FOLDER\n\n\nFolder path\n\n\n\nIf supplied, a folder you created, in which tpot will periodically save pipelines in pareto front so far while optimizing.\n\n\nThis is useful in multiple cases:\n\n\n\nsudden death before tpot could save an optimized pipeline\n\n\nprogress tracking\n\n\ngrabbing a pipeline while tpot is working\n\n\n\n\n\nExample:\n\n\nmkdir my_checkpoints\n\n\n-cf ./my_checkpoints\n\n\n\n\n\n-es\n\n\nEARLY_STOP\n\n\nAny positive integer\n\n\n\nHow many generations TPOT checks whether there is no improvement in optimization process.\n\n\nEnd optimization process if there is no improvement in the set number of generations.\n\n\n\n\n\n-v\n\n\nVERBOSITY\n\n\n{0, 1, 2, 3}\n\n\nHow much information TPOT communicates while it is running.\n\n\n0 = none, 1 = minimal, 2 = high, 3 = all.\n\n\nA setting of 2 or higher will add a progress bar during the optimization procedure.\n\n\n\n\n\n\n--no-update-check\n\n\nFlag indicating whether the TPOT version checker should be disabled.\n\n\n\n\n\n\n--version\n\n\nShow TPOT's version number and exit.\n\n\n\n\n\n\n--help\n\n\nShow TPOT's help documentation and exit.\n\n\n\n\n\n\n\nScoring functions\n\n\nTPOT makes use of \nsklearn.model_selection.cross_val_score\n for evaluating pipelines, and as such offers the same support for scoring functions. There are two ways to make use of scoring functions with TPOT:\n\n\n\n\n\n\nYou can pass in a string to the \nscoring\n parameter from the list above. 
Any other strings will cause TPOT to throw an exception.\n\n\n\n\n\n\nYou can pass the callable object/function with signature \nscorer(estimator, X, y)\n, where \nestimator\n is trained estimator to use for scoring, \nX\n are features that will be passed to \nestimator.predict\n and \ny\n are target values for \nX\n. To do this, you should implement your own function. See the example below for further explanation.\n\n\n\n\n\n\nfrom tpot import TPOTClassifier\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.metrics.scorer import make_scorer\n\ndigits = load_digits()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n train_size=0.75, test_size=0.25)\n# Make a custom metric function\ndef my_custom_accuracy(y_true, y_pred):\n return float(sum(y_pred == y_true)) / len(y_true)\n\n# Make a custom a scorer from the custom metric function\n# Note: greater_is_better=False in make_scorer below would mean that the scoring function should be minimized.\nmy_custom_scorer = make_scorer(my_custom_accuracy, greater_is_better=True)\n\ntpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,\n scoring=my_custom_scorer)\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_digits_pipeline.py')\n\n\n\n\n\n\nmy_module.scorer_name\n: You can also use a custom \nscore_func(y_true, y_pred)\n or \nscorer(estimator, X, y)\n function through the command line by adding the argument \n-scoring my_module.scorer\n to your command-line call. TPOT will import your module and use the custom scoring function from there. 
TPOT will include your current working directory when importing the module, so you can place it in the same directory where you are going to run TPOT.\nExample: \n-scoring sklearn.metrics.auc\n will use the function auc from sklearn.metrics module.\n\n\n\n\nBuilt-in TPOT configurations\n\n\nTPOT comes with a handful of default operators and parameter configurations that we believe work well for optimizing machine learning pipelines. Below is a list of the current built-in configurations that come with TPOT.\n\n\n\n\n\n\nConfiguration Name\n\n\nDescription\n\n\nOperators\n\n\n\n\n\n\n\nDefault TPOT\n\n\nTPOT will search over a broad range of preprocessors, feature constructors, feature selectors, models, and parameters to find a series of operators that minimize the error of the model predictions. Some of these operators are complex and may take a long time to run, especially on larger datasets.\n\n\n\nNote: This is the default configuration for TPOT.\n To use this configuration, use the default value (None) for the config_dict parameter.\n\n\nClassification\n\n\n\n\nRegression\n\n\n\n\n\n\n\nTPOT light\n\n\nTPOT will search over a restricted range of preprocessors, feature constructors, feature selectors, models, and parameters to find a series of operators that minimize the error of the model predictions. Only simpler and fast-running operators will be used in these pipelines, so TPOT light is useful for finding quick and simple pipelines for a classification or regression problem.\n\n\nThis configuration works for both the TPOTClassifier and TPOTRegressor.\n\n\nClassification\n\n\n\n\nRegression\n\n\n\n\n\n\n\nTPOT MDR\n\n\nTPOT will search over a series of feature selectors and \nMultifactor Dimensionality Reduction\n models to find a series of operators that maximize prediction accuracy. 
The TPOT MDR configuration is specialized for \ngenome-wide association studies (GWAS)\n, and is described in detail online \nhere\n.\n\n\nNote that TPOT MDR may be slow to run because the feature selection routines are computationally expensive, especially on large datasets.\n\n\nClassification\n\n\n\n\nRegression\n\n\n\n\n\n\n\nTPOT sparse\n\n\nTPOT uses a configuration dictionary with a one-hot encoder and the operators normally included in TPOT that also support sparse matrices.\n\n\nThis configuration works for both the TPOTClassifier and TPOTRegressor.\n\n\nClassification\n\n\n\n\nRegression\n\n\n\n\n\n\n\n\nTo use any of these configurations, simply pass the string name of the configuration to the \nconfig_dict\n parameter (or \n-config\n on the command line). For example, to use the \"TPOT light\" configuration:\n\n\nfrom tpot import TPOTClassifier\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split\n\ndigits = load_digits()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n train_size=0.75, test_size=0.25)\n\ntpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,\n config_dict='TPOT light')\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_digits_pipeline.py')\n\n\n\n\n\nCustomizing TPOT's operators and parameters\n\n\nBeyond the default configurations that come with TPOT, in some cases it is useful to limit the algorithms and parameters that TPOT considers. For that reason, we allow users to provide TPOT with a custom configuration for its operators and parameters.\n\n\nThe custom TPOT configuration must be in nested dictionary format, where the first level key is the path and name of the operator (e.g., \nsklearn.naive_bayes.MultinomialNB\n) and the second level key is the corresponding parameter name for that operator (e.g., \nfit_prior\n). 
The second level key should point to a list of parameter values for that parameter, e.g., \n'fit_prior': [True, False]\n.\n\n\nFor a simple example, the configuration could be:\n\n\ntpot_config = {\n 'sklearn.naive_bayes.GaussianNB': {\n },\n\n 'sklearn.naive_bayes.BernoulliNB': {\n 'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],\n 'fit_prior': [True, False]\n },\n\n 'sklearn.naive_bayes.MultinomialNB': {\n 'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],\n 'fit_prior': [True, False]\n }\n}\n\n\n\n\nin which case TPOT would only consider pipelines containing \nGaussianNB\n, \nBernoulliNB\n, \nMultinomialNB\n, and tune those algorithm's parameters in the ranges provided. This dictionary can be passed directly within the code to the \nTPOTClassifier\n/\nTPOTRegressor\n \nconfig_dict\n parameter, described above. For example:\n\n\nfrom tpot import TPOTClassifier\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split\n\ndigits = load_digits()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n train_size=0.75, test_size=0.25)\n\ntpot_config = {\n 'sklearn.naive_bayes.GaussianNB': {\n },\n\n 'sklearn.naive_bayes.BernoulliNB': {\n 'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],\n 'fit_prior': [True, False]\n },\n\n 'sklearn.naive_bayes.MultinomialNB': {\n 'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],\n 'fit_prior': [True, False]\n }\n}\n\ntpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,\n config_dict=tpot_config)\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_digits_pipeline.py')\n\n\n\n\nCommand-line users must create a separate \n.py\n file with the custom configuration and provide the path to the file to the \ntpot\n call. 
For example, if the simple example configuration above is saved in \ntpot_classifier_config.py\n, that configuration could be used on the command line with the command:\n\n\ntpot data/mnist.csv -is , -target class -config tpot_classifier_config.py -g 5 -p 20 -v 2 -o tpot_exported_pipeline.py\n\n\n\n\nWhen using the command-line interface, the configuration file specified in the \n-config\n parameter \nmust\n name its custom TPOT configuration \ntpot_config\n. Otherwise, TPOT will not be able to locate the configuration dictionary.\n\n\nFor more detailed examples of how to customize TPOT's operator configuration, see the default configurations for \nclassification\n and \nregression\n in TPOT's source code.\n\n\nNote that you must have all of the corresponding packages for the operators installed on your computer, otherwise TPOT will not be able to use them. For example, if XGBoost is not installed on your computer, then TPOT will simply not import nor use XGBoost in the pipelines it considers.\n\n\nTemplate option in TPOT\n\n\nTemplate option provides a way to specify a desired structure for machine learning pipeline, which may reduce TPOT computation time and potentially provide more interpretable results. Current implementation only supports linear pipelines.\n\n\nBelow is a simple example to use \ntemplate\n option. The pipelines generated/evaluated in TPOT will follow this structure: 1st step is a feature selector (a subclass of \nSelectorMixin\n), 2nd step is a feature transformer (a subclass of \nTransformerMixin\n) and 3rd step is a classifier for classification (a subclass of \nClassifierMixin\n). The last step must be \nClassifier\n for \nTPOTClassifier\n's template but \nRegressor\n for \nTPOTRegressor\n. 
\nNote: although \nSelectorMixin\n is a subclass of \nTransformerMixin\n in scikit-learn, \nTransformer\n in this option excludes those subclasses of \nSelectorMixin\n.\n\n\ntpot_obj = TPOTClassifier(\n template='Selector-Transformer-Classifier'\n )\n\n\n\n\nIf a specific operator, e.g. \nSelectPercentile\n, is preferred for usage in the 1st step of the pipeline, the template can be defined like 'SelectPercentile-Transformer-Classifier'.\n\n\nFeatureSetSelector in TPOT\n\n\nFeatureSetSelector\n is a special new operator in TPOT. This operator enables feature selection based on \na priori\n expert knowledge. For example, in RNA-seq gene expression analysis, this operator can be used to select one or more gene (feature) set(s) based on GO (Gene Ontology) terms or annotated gene sets from the Molecular Signatures Database (\nMSigDB\n) in the 1st step of the pipeline via the \ntemplate\n option above, in order to reduce dimensions and TPOT computation time. This operator requires a dataset list in csv format. In this csv file, there are only three columns: 1st column is feature set names, 2nd column is the total number of features in one set and 3rd column is a list of feature names (if input X is pandas.DataFrame) or indexes (if input X is numpy.ndarray) delimited by \";\". 
Below is an example of how to use this operator in TPOT.\n\n\nPlease check our \npreprint paper\n for more details.\n\n\nfrom tpot import TPOTClassifier\nimport numpy as np\nimport pandas as pd\nfrom tpot.config import classifier_config_dict\ntest_data = pd.read_csv(\"https://raw.githubusercontent.com/EpistasisLab/tpot/master/tests/tests.csv\")\ntest_X = test_data.drop(\"class\", axis=1)\ntest_y = test_data['class']\n\n# add FeatureSetSelector into tpot configuration\nclassifier_config_dict['tpot.builtins.FeatureSetSelector'] = {\n 'subset_list': ['https://raw.githubusercontent.com/EpistasisLab/tpot/master/tests/subset_test.csv'],\n 'sel_subset': [0,1] # select only one feature set, a list of index of subset in the list above\n #'sel_subset': list(combinations(range(3), 2)) # select two feature sets\n}\n\n\ntpot = TPOTClassifier(generations=5,\n population_size=50, verbosity=2,\n template='FeatureSetSelector-Transformer-Classifier',\n config_dict=classifier_config_dict)\ntpot.fit(test_X, test_y)\n\n\n\n\nPipeline caching in TPOT\n\n\nWith the \nmemory\n parameter, pipelines can cache the results of each transformer after fitting them. This feature is used to avoid repeated computation by transformers within a pipeline if the parameters and input data are identical to another fitted pipeline during the optimization process. 
TPOT allows users to specify a custom directory path or \njoblib.Memory\n in case they want to re-use the memory cache in future TPOT runs (or a \nwarm_start\n run).\n\n\nThere are three methods for enabling memory caching in TPOT:\n\n\nfrom tpot import TPOTClassifier\nfrom tempfile import mkdtemp\nfrom joblib import Memory\nfrom shutil import rmtree\n\n# Method 1, auto mode: TPOT uses memory caching with a temporary directory and cleans it up upon shutdown\ntpot = TPOTClassifier(memory='auto')\n\n# Method 2, with a custom directory for memory caching\ntpot = TPOTClassifier(memory='/to/your/path')\n\n# Method 3, with a Memory object\ncachedir = mkdtemp() # Create a temporary folder\nmemory = Memory(cachedir=cachedir, verbose=0)\ntpot = TPOTClassifier(memory=memory)\n\n# Clear the cache directory when you don't need it anymore\nrmtree(cachedir)\n\n\n\n\nNote: TPOT does NOT clean up memory caches if users set a custom directory path or Memory object. We recommend that you clean up the memory caches when you don't need it anymore.\n\n\nCrash/freeze issue with n_jobs > 1 under OSX or Linux\n\n\nInternally, TPOT uses \njoblib\n to fit estimators in parallel.\nThis is the same parallelization framework used by scikit-learn. But it may crash/freeze with n_jobs > 1 under OSX or Linux \nas scikit-learn does\n, especially with large datasets.\n\n\nOne solution is to configure Python's \nmultiprocessing\n module to use the \nforkserver\n start method (instead of the default \nfork\n) to manage the process pools. 
You can enable the \nforkserver\n mode globally for your program by putting the following code into your main script:\n\n\nimport multiprocessing\n\n# other imports, custom code, load data, define model...\n\nif __name__ == '__main__':\n multiprocessing.set_start_method('forkserver')\n\n # call scikit-learn utils or tpot utils with n_jobs > 1 here\n\n\n\n\nMore information about these start methods can be found in the \nmultiprocessing documentation\n.\n\n\nParallel Training with Dask\n\n\nFor large problems or when working in a Jupyter notebook, we highly recommend distributing the work on a \nDask\n cluster.\nThe \ndask-examples binder\n has a runnable example\nwith a small dask cluster.\n\n\nTo use your Dask cluster to fit a TPOT model, specify the \nuse_dask\n keyword when you create the TPOT estimator. \nNote: if \nuse_dask=True\n, TPOT will use as many cores as are available on your Dask cluster. If \nn_jobs\n is specified, then it will control the chunk size (10*\nn_jobs\n if it is less than the offspring size) of parallel training. \n\n\nestimator = TPOTEstimator(use_dask=True, n_jobs=-1)\n\n\n\n\nThis will use all the workers on your cluster to do the training, and use \nDask-ML's pipeline rewriting\n to avoid re-fitting estimators multiple times on the same set of data.\nIt will also provide fine-grained diagnostics in the \ndistributed scheduler UI\n.\n\n\nAlternatively, Dask implements a joblib backend.\nYou can instruct TPOT to use the distributed backend during training by specifying a \njoblib.parallel_backend\n:\n\n\nimport joblib\nimport distributed.joblib\nfrom dask.distributed import Client\n\n# connect to the cluster\nclient = Client('scheduler-address')\n\n# create the estimator normally\nestimator = TPOTClassifier(n_jobs=-1)\n\n# perform the fit in this context manager\nwith joblib.parallel_backend(\"dask\"):\n estimator.fit(X, y)\n\n\n\n\nSee \ndask's distributed joblib integration\n for more.",
"title": "Using TPOT"
},
{
@@ -52,7 +52,7 @@
},
{
"location": "/using/#featuresetselector-in-tpot",
- "text": "FeatureSetSelector is a special new operator in TPOT. This operator enables feature selection based on priori export knowledge. For example, in RNA-seq gene expression analysis, this operator can be used to select one or more gene (feature) set(s) based on GO (Gene Ontology) terms or annotated gene sets Molecular Signatures Database ( MSigDB ) in the 1st step of pipeline via template option above, in order to reduce dimensions and TPOT computation time. This operator requires a dataset list in csv format. In this csv file, there are only three columns: 1st column is feature set names, 2nd column is the total number of features in one set and 3rd column is a list of feature names (if input X is pandas.DataFrame) or indexes (if input X is numpy.ndarray) delimited by \";\". Below is a example how to use this operator in TPOT. Please check our preprint paper for more details. from tpot import TPOTClassifier\nimport numpy as np\nimport pandas as pd\nfrom tpot.config import classifier_config_dict\ntest_data = pd.read_csv(\"https://raw.githubusercontent.com/EpistasisLab/tpot/master/tests/tests.csv\")\ntest_X = test_data.drop(\"class\", axis=1)\ntest_y = test_data['class']\n\n# add FeatureSetSelector into tpot configuration\nclassifier_config_dict['tpot.builtins.FeatureSetSelector'] = {\n 'subset_list': ['https://raw.githubusercontent.com/EpistasisLab/tpot/master/tests/subset_test.csv'],\n 'sel_subset': [0,1] # select only one feature set, a list of index of subset in the list above\n #'sel_subset': list(combinations(range(3), 2)) # select two feature sets\n}\n\n\ntpot = TPOTClassifier(generations=5,\n population_size=50, verbosity=2,\n template='FeatureSetSelector-Transformer-Classifier',\n config_dict=classifier_config_dict)\ntpot.fit(test_X, test_y)",
+ "text": "FeatureSetSelector is a special new operator in TPOT. This operator enables feature selection based on a priori expert knowledge. For example, in RNA-seq gene expression analysis, this operator can be used to select one or more gene (feature) set(s) based on GO (Gene Ontology) terms or annotated gene sets from the Molecular Signatures Database ( MSigDB ) in the 1st step of the pipeline via the template option above, in order to reduce dimensions and TPOT computation time. This operator requires a dataset list in csv format. In this csv file, there are only three columns: 1st column is feature set names, 2nd column is the total number of features in one set and 3rd column is a list of feature names (if input X is pandas.DataFrame) or indexes (if input X is numpy.ndarray) delimited by \";\". Below is an example of how to use this operator in TPOT. Please check our preprint paper for more details. from tpot import TPOTClassifier\nimport numpy as np\nimport pandas as pd\nfrom tpot.config import classifier_config_dict\ntest_data = pd.read_csv(\"https://raw.githubusercontent.com/EpistasisLab/tpot/master/tests/tests.csv\")\ntest_X = test_data.drop(\"class\", axis=1)\ntest_y = test_data['class']\n\n# add FeatureSetSelector into tpot configuration\nclassifier_config_dict['tpot.builtins.FeatureSetSelector'] = {\n 'subset_list': ['https://raw.githubusercontent.com/EpistasisLab/tpot/master/tests/subset_test.csv'],\n 'sel_subset': [0,1] # select only one feature set, a list of index of subset in the list above\n #'sel_subset': list(combinations(range(3), 2)) # select two feature sets\n}\n\n\ntpot = TPOTClassifier(generations=5,\n population_size=50, verbosity=2,\n template='FeatureSetSelector-Transformer-Classifier',\n config_dict=classifier_config_dict)\ntpot.fit(test_X, test_y)",
"title": "FeatureSetSelector in TPOT"
},
{
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 0c1bfebb..4e7766a0 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -4,7 +4,7 @@
http://epistasislab.github.io/tpot/
- 2019-11-05
+ 2020-01-03
daily
@@ -12,7 +12,7 @@
http://epistasislab.github.io/tpot/installing/
- 2019-11-05
+ 2020-01-03
daily
@@ -20,7 +20,7 @@
http://epistasislab.github.io/tpot/using/
- 2019-11-05
+ 2020-01-03
daily
@@ -28,7 +28,7 @@
http://epistasislab.github.io/tpot/api/
- 2019-11-05
+ 2020-01-03
daily
@@ -36,7 +36,7 @@
http://epistasislab.github.io/tpot/examples/
- 2019-11-05
+ 2020-01-03
daily
@@ -44,7 +44,7 @@
http://epistasislab.github.io/tpot/contributing/
- 2019-11-05
+ 2020-01-03
daily
@@ -52,7 +52,7 @@
http://epistasislab.github.io/tpot/releases/
- 2019-11-05
+ 2020-01-03
daily
@@ -60,7 +60,7 @@
http://epistasislab.github.io/tpot/citing/
- 2019-11-05
+ 2020-01-03
daily
@@ -68,7 +68,7 @@
http://epistasislab.github.io/tpot/support/
- 2019-11-05
+ 2020-01-03
daily
@@ -76,7 +76,7 @@
http://epistasislab.github.io/tpot/related/
- 2019-11-05
+ 2020-01-03
daily
diff --git a/docs/using/index.html b/docs/using/index.html
index a1c9d1e4..2f5bcb65 100644
--- a/docs/using/index.html
+++ b/docs/using/index.html
@@ -661,7 +661,7 @@ Template option in TPOT
If a specific operator, e.g. SelectPercentile
, is preferred for usage in the 1st step of the pipeline, the template can be defined like 'SelectPercentile-Transformer-Classifier'.
FeatureSetSelector in TPOT
-FeatureSetSelector
is a special new operator in TPOT. This operator enables feature selection based on priori export knowledge. For example, in RNA-seq gene expression analysis, this operator can be used to select one or more gene (feature) set(s) based on GO (Gene Ontology) terms or annotated gene sets Molecular Signatures Database (MSigDB) in the 1st step of pipeline via template
option above, in order to reduce dimensions and TPOT computation time. This operator requires a dataset list in csv format. In this csv file, there are only three columns: 1st column is feature set names, 2nd column is the total number of features in one set and 3rd column is a list of feature names (if input X is pandas.DataFrame) or indexes (if input X is numpy.ndarray) delimited by ";". Below is a example how to use this operator in TPOT.
+FeatureSetSelector
is a special new operator in TPOT. This operator enables feature selection based on a priori expert knowledge. For example, in RNA-seq gene expression analysis, this operator can be used to select one or more gene (feature) set(s) based on GO (Gene Ontology) terms or annotated gene sets from the Molecular Signatures Database (MSigDB) in the 1st step of the pipeline via the template
option above, in order to reduce dimensions and TPOT computation time. This operator requires a dataset list in csv format. In this csv file, there are only three columns: 1st column is feature set names, 2nd column is the total number of features in one set and 3rd column is a list of feature names (if input X is pandas.DataFrame) or indexes (if input X is numpy.ndarray) delimited by ";". Below is an example of how to use this operator in TPOT.
Please check our preprint paper for more details.
from tpot import TPOTClassifier
import numpy as np
diff --git a/docs_sources/using.md b/docs_sources/using.md
index f83a2652..1621681c 100644
--- a/docs_sources/using.md
+++ b/docs_sources/using.md
@@ -550,7 +550,7 @@ If a specific operator, e.g. `SelectPercentile`, is preferred for usage in the 1
# FeatureSetSelector in TPOT
-`FeatureSetSelector` is a special new operator in TPOT. This operator enables feature selection based on *priori* export knowledge. For example, in RNA-seq gene expression analysis, this operator can be used to select one or more gene (feature) set(s) based on GO (Gene Ontology) terms or annotated gene sets Molecular Signatures Database ([MSigDB](http://software.broadinstitute.org/gsea/msigdb/index.jsp)) in the 1st step of pipeline via `template` option above, in order to reduce dimensions and TPOT computation time. This operator requires a dataset list in csv format. In this csv file, there are only three columns: 1st column is feature set names, 2nd column is the total number of features in one set and 3rd column is a list of feature names (if input X is pandas.DataFrame) or indexes (if input X is numpy.ndarray) delimited by ";". Below is a example how to use this operator in TPOT.
+`FeatureSetSelector` is a special new operator in TPOT. This operator enables feature selection based on *a priori* expert knowledge. For example, in RNA-seq gene expression analysis, this operator can be used to select one or more gene (feature) set(s) based on GO (Gene Ontology) terms or annotated gene sets from the Molecular Signatures Database ([MSigDB](http://software.broadinstitute.org/gsea/msigdb/index.jsp)) in the 1st step of the pipeline via the `template` option above, in order to reduce dimensions and TPOT computation time. This operator requires a dataset list in csv format. In this csv file, there are only three columns: 1st column is feature set names, 2nd column is the total number of features in one set and 3rd column is a list of feature names (if input X is pandas.DataFrame) or indexes (if input X is numpy.ndarray) delimited by ";". Below is an example of how to use this operator in TPOT.
Please check our [preprint paper](https://www.biorxiv.org/content/10.1101/502484v1.article-info) for more details.
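Reviewer note: the paragraph in this hunk describes the subset file only in prose. As a rough sketch of that layout, the snippet below writes and re-reads a three-column CSV using only the standard library. The header names (`Subset`, `Size`, `Features`), the set names, and the feature names are assumptions made up for illustration; the description itself only fixes the column order and the ";" delimiter.

```python
import csv
import io

# Hypothetical feature sets; names and members are made up for illustration.
feature_sets = {
    "set_a": ["gene_1", "gene_2", "gene_3"],
    "set_b": ["gene_2", "gene_4"],
}

# Write the three columns described in the docs: set name, set size,
# and the ";"-delimited list of feature names.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Subset", "Size", "Features"])  # assumed header names
for name, features in feature_sets.items():
    writer.writerow([name, len(features), ";".join(features)])

# Read the file back and split the third column into individual features.
buffer.seek(0)
parsed = {}
for row in csv.DictReader(buffer):
    members = row["Features"].split(";")
    assert len(members) == int(row["Size"])  # size column must match the list
    parsed[row["Subset"]] = members

print(parsed)
```

Checking the second column against the length of the third is a cheap way to catch a hand-edited subset file before a long TPOT run starts.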
diff --git a/optional-requirements.txt b/optional-requirements.txt
index 3fa01ca5..27bb64c8 100644
--- a/optional-requirements.txt
+++ b/optional-requirements.txt
@@ -1,3 +1,3 @@
-xgboost==0.6a2
+xgboost==0.90
scikit-mdr==0.4.4
skrebate==0.3.4
diff --git a/requirements.txt b/requirements.txt
index 9fa9dca7..789d396e 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,10 +1,10 @@
deap>=1.2
nose==1.3.7
numpy>=1.16.3
-scikit-learn>=0.21.0
+scikit-learn>=0.22.0
scipy>=1.3.1
tqdm>=4.36.1
update-checker>=0.16
-stopit>=1.1.1
+stopit>=1.1.2
pandas>=0.24.2
joblib>=0.13.2
diff --git a/setup.py b/setup.py
index 0c98684a..5fb80ba5 100644
--- a/setup.py
+++ b/setup.py
@@ -37,7 +37,7 @@ def calculate_version():
zip_safe=True,
install_requires=['numpy>=1.16.3',
'scipy>=1.3.1',
- 'scikit-learn>=0.21.0',
+ 'scikit-learn>=0.22.0',
'deap>=1.2',
'update_checker>=0.16',
'tqdm>=4.36.1',
diff --git a/tests/driver_tests.py b/tests/driver_tests.py
index 5532d960..e9fd9207 100644
--- a/tests/driver_tests.py
+++ b/tests/driver_tests.py
@@ -296,8 +296,6 @@ def test_print_args(self):
VERBOSITY = 1
"""
- print
-
self.assertEqual(_sort_lines(expected_output), _sort_lines(output))
diff --git a/tests/export_tests.py b/tests/export_tests.py
index 0ffbbc27..a9833cb0 100644
--- a/tests/export_tests.py
+++ b/tests/export_tests.py
@@ -71,7 +71,6 @@ def test_export_random_ind():
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
-from tpot.export_utils import set_param_recursive
# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
@@ -80,14 +79,14 @@ def test_export_random_ind():
train_test_split(features, tpot_data['target'], random_state=39)
exported_pipeline = BernoulliNB(alpha=1.0, fit_prior=False)
-# Fix random state for all the steps in exported pipeline
-set_param_recursive(exported_pipeline.steps, 'random_state', 39)
+# Fix random state in exported estimator
+if hasattr(exported_pipeline, 'random_state'):
+ setattr(exported_pipeline, 'random_state', 39)
exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
"""
exported_code = export_pipeline(pipeline, tpot_obj.operators, tpot_obj._pset, random_state=tpot_obj.random_state)
-
assert expected_code == exported_code
@@ -487,18 +486,17 @@ def test_export_pipeline_6():
"""Assert that exported_pipeline() generated a compile source file with random_state and data_file_path."""
pipeline_string = (
- 'KNeighborsClassifier('
- 'input_matrix, '
- 'KNeighborsClassifier__n_neighbors=10, '
- 'KNeighborsClassifier__p=1, '
- 'KNeighborsClassifier__weights=uniform'
- ')'
+ 'DecisionTreeClassifier(SelectPercentile(input_matrix, SelectPercentile__percentile=20),'
+ 'DecisionTreeClassifier__criterion=gini, DecisionTreeClassifier__max_depth=8,'
+ 'DecisionTreeClassifier__min_samples_leaf=5, DecisionTreeClassifier__min_samples_split=5)'
)
pipeline = creator.Individual.from_string(pipeline_string, tpot_obj._pset)
expected_code = """import numpy as np
import pandas as pd
+from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import train_test_split
-from sklearn.neighbors import KNeighborsClassifier
+from sklearn.pipeline import make_pipeline
+from sklearn.tree import DecisionTreeClassifier
from tpot.export_utils import set_param_recursive
# NOTE: Make sure that the outcome column is labeled 'target' in the data file
@@ -507,7 +505,10 @@ def test_export_pipeline_6():
training_features, testing_features, training_target, testing_target = \\
train_test_split(features, tpot_data['target'], random_state=42)
-exported_pipeline = KNeighborsClassifier(n_neighbors=10, p=1, weights="uniform")
+exported_pipeline = make_pipeline(
+ SelectPercentile(score_func=f_classif, percentile=20),
+ DecisionTreeClassifier(criterion="gini", max_depth=8, min_samples_leaf=5, min_samples_split=5)
+)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 42)
diff --git a/tests/stacking_estimator_tests.py b/tests/stacking_estimator_tests.py
index e7b62a60..7e58b8e2 100644
--- a/tests/stacking_estimator_tests.py
+++ b/tests/stacking_estimator_tests.py
@@ -78,7 +78,7 @@ def test_StackingEstimator_3():
# test cv score
cv_score = np.mean(cross_val_score(sklearn_pipeline, training_features, training_target, cv=3, scoring='accuracy'))
- known_cv_score = 0.9472823753147593
+ known_cv_score = 0.9643652561247217
assert np.allclose(known_cv_score, cv_score)
@@ -101,6 +101,6 @@ def test_StackingEstimator_4():
# test cv score
cv_score = np.mean(cross_val_score(sklearn_pipeline, training_features_r, training_target_r, cv=3, scoring='r2'))
- known_cv_score = 0.7989564328211737
+ known_cv_score = 0.8216045257587923
assert np.allclose(known_cv_score, cv_score)
diff --git a/tests/tpot_tests.py b/tests/tpot_tests.py
index e1b17032..140643bd 100644
--- a/tests/tpot_tests.py
+++ b/tests/tpot_tests.py
@@ -58,7 +58,10 @@
from joblib import Memory
from sklearn.metrics import make_scorer, roc_auc_score
from sklearn.base import BaseEstimator, ClassifierMixin, RegressorMixin, TransformerMixin
-from sklearn.feature_selection.base import SelectorMixin
+try:
+ from sklearn.feature_selection._base import SelectorMixin
+except ImportError:
+ from sklearn.feature_selection.base import SelectorMixin
from deap import creator, gp
from deap.tools import ParetoFront
from nose.tools import nottest, assert_raises, assert_not_equal, assert_greater_equal, assert_equal, assert_in
@@ -965,7 +968,7 @@ def test_fit_4():
assert tpot_obj.generations == 1000000
# reset generations to 20 just in case that the failed test may take too much time
- tpot_obj.generations == 20
+ tpot_obj.generations = 20
tpot_obj.fit(training_features, training_target)
assert tpot_obj._pop == []
@@ -988,7 +991,7 @@ def test_fit_5():
assert tpot_obj.generations == 1000000
# reset generations to 20 just in case that the failed test may take too much time
- tpot_obj.generations == 20
+ tpot_obj.generations = 20
tpot_obj.fit(training_features, training_target)
assert tpot_obj._pop != []
@@ -1426,7 +1429,15 @@ def pareto_eq(ind1, ind2):
sklearn_pipeline = tpot_obj._toolbox.compile(expr=deap_pipeline)
try:
- cv_scores = cross_val_score(sklearn_pipeline, training_features, training_target, cv=5, scoring='accuracy', verbose=0)
+ with warnings.catch_warnings():
+ warnings.simplefilter('ignore')
+ cv_scores = cross_val_score(sklearn_pipeline,
+ training_features,
+ training_target,
+ cv=5,
+ scoring='accuracy',
+ verbose=0,
+ error_score='raise')
mean_cv_scores = np.mean(cv_scores)
except Exception as e:
mean_cv_scores = -float('inf')
@@ -1460,7 +1471,15 @@ def pareto_eq(ind1, ind2):
sklearn_pipeline = tpot_obj._toolbox.compile(expr=deap_pipeline)
try:
- cv_scores = cross_val_score(sklearn_pipeline, training_features, training_target, cv=5, scoring='accuracy', verbose=0)
+ with warnings.catch_warnings():
+ warnings.simplefilter('ignore')
+ cv_scores = cross_val_score(sklearn_pipeline,
+ training_features,
+ training_target,
+ cv=5,
+ scoring='accuracy',
+ verbose=0,
+ error_score='raise')
mean_cv_scores = np.mean(cv_scores)
except Exception as e:
mean_cv_scores = -float('inf')
diff --git a/tpot/_version.py b/tpot/_version.py
index 69737bab..4ceeadd6 100644
--- a/tpot/_version.py
+++ b/tpot/_version.py
@@ -23,4 +23,4 @@
"""
-__version__ = '0.11.0'
+__version__ = '0.11.1'
diff --git a/tpot/base.py b/tpot/base.py
index f46e4267..0553da11 100644
--- a/tpot/base.py
+++ b/tpot/base.py
@@ -52,7 +52,6 @@
from sklearn.preprocessing import FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
-from sklearn.metrics.scorer import _BaseScorer
from joblib import Parallel, delayed, Memory
@@ -304,7 +303,7 @@ def _setup_scoring_function(self, scoring):
args_list = inspect.getfullargspec(scoring)[0]
if args_list == ["y_true", "y_pred"] or (hasattr(module, 'startswith') and \
(module.startswith('sklearn.metrics.') or module.startswith('tpot.metrics')) and \
- not module.startswith('sklearn.metrics.scorer') and \
+ not module.startswith('sklearn.metrics._scorer') and \
not module.startswith('sklearn.metrics.tests.')):
raise ValueError(
'Scoring function {} looks like it is a metric function '
@@ -500,6 +499,35 @@ def _fit_init(self):
self._pareto_front = None
self._last_optimized_pareto_front = None
self._last_optimized_pareto_front_n_gens = 0
+ self._setup_config(self.config_dict)
+
+ self._setup_template(self.template)
+
+ self.operators = []
+ self.arguments = []
+ for key in sorted(self._config_dict.keys()):
+ op_class, arg_types = TPOTOperatorClassFactory(
+ key,
+ self._config_dict[key],
+ BaseClass=Operator,
+ ArgBaseClass=ARGType,
+ verbose=self.verbosity
+ )
+ if op_class:
+ self.operators.append(op_class)
+ self.arguments += arg_types
+ self.operators_context = {
+ 'make_pipeline': make_pipeline,
+ 'make_union': make_union,
+ 'StackingEstimator': StackingEstimator,
+ 'FunctionTransformer': FunctionTransformer,
+ 'copy': copy
+ }
+ self._setup_pset()
+ self._setup_toolbox()
+ # Dictionary of individuals that have already been evaluated in previous
+ # generations or previous runs
+ self.evaluated_individuals_ = {}
self._optimized_pipeline = None
self._optimized_pipeline_score = None
@@ -516,23 +544,6 @@ def _fit_init(self):
# any one given individual (or pair of individuals)
self._max_mut_loops = 50
- self._setup_config(self.config_dict)
-
- self._setup_template(self.template)
-
- self.operators = []
- self.arguments = []
- for key in sorted(self._config_dict.keys()):
- op_class, arg_types = TPOTOperatorClassFactory(
- key,
- self._config_dict[key],
- BaseClass=Operator,
- ArgBaseClass=ARGType,
- verbose=self.verbosity
- )
- if op_class:
- self.operators.append(op_class)
- self.arguments += arg_types
if self.max_time_mins is None and self.generations is None:
raise ValueError("Either the parameter generations should bet set or a maximum evaluation time should be defined via max_time_mins")
@@ -551,23 +562,11 @@ def _fit_init(self):
'The sum of the crossover and mutation probabilities must be <= 1.0.'
)
- self.operators_context = {
- 'make_pipeline': make_pipeline,
- 'make_union': make_union,
- 'StackingEstimator': StackingEstimator,
- 'FunctionTransformer': FunctionTransformer,
- 'copy': copy
- }
-
self._pbar = None
# Specifies where to output the progress messages (default: sys.stdout).
# Maybe open this API in future version of TPOT.(io.TextIOWrapper or io.StringIO)
self._file = sys.stdout
- # Dictionary of individuals that have already been evaluated in previous
- # generations
- self.evaluated_individuals_ = {}
-
self._setup_scoring_function(self.scoring)
if self.subsample <= 0.0 or self.subsample > 1.0:
@@ -584,9 +583,6 @@ def _fit_init(self):
else:
self._n_jobs = self.n_jobs
- self._setup_pset()
- self._setup_toolbox()
-
def fit(self, features, target, sample_weight=None, groups=None):
"""Fit an optimized machine learning pipeline.
@@ -1264,10 +1260,10 @@ def _evaluate_individuals(self, population, features, target, sample_weight=None
"""
# Evaluate the individuals with an invalid fitness
individuals = [ind for ind in population if not ind.fitness.valid]
-
+ num_population = len(population)
# update pbar for valid individuals (with fitness values)
if self.verbosity > 0:
- self._pbar.update(len(population)-len(individuals))
+ self._pbar.update(num_population-len(individuals))
operator_counts, eval_individuals_str, sklearn_pipeline_list, stats_dicts = self._preprocess_individuals(individuals)
@@ -1285,7 +1281,10 @@ def _evaluate_individuals(self, population, features, target, sample_weight=None
)
result_score_list = []
+
try:
+ # check time limit before pipeline evaluation
+ self._stop_by_max_time_mins()
# Don't use parallelization if n_jobs==1
if self._n_jobs == 1 and not self.use_dask:
for sklearn_pipeline in sklearn_pipeline_list:
@@ -1340,11 +1339,8 @@ def _evaluate_individuals(self, population, features, target, sample_weight=None
ind_str = str(ind)
ind.fitness.values = (self.evaluated_individuals_[ind_str]['operator_count'],
self.evaluated_individuals_[ind_str]['internal_cv_score'])
- # for individuals were not evaluated in this generation, TPOT will assign a bad fitness score
- for ind in individuals[num_eval_ind:]:
- ind.fitness.values = (5000.,-float('inf'))
- self._pareto_front.update(population)
+ self._pareto_front.update(individuals[:num_eval_ind])
self._pop = population
raise KeyboardInterrupt
@@ -1533,7 +1529,7 @@ def _random_mutation_operator(self, individual, allow_shrink=True):
partial(mutNodeReplacement, pset=self._pset)
]
# We can't shrink pipelines with only one primitive, so we only add it if we find more primitives.
- number_of_primitives = sum([isinstance(node, deap.gp.Primitive) for node in individual])
+ number_of_primitives = sum(isinstance(node, deap.gp.Primitive) for node in individual)
if number_of_primitives > 1 and allow_shrink:
mutation_techniques.append(partial(gp.mutShrink))
else:
@@ -1559,7 +1555,6 @@ def _random_mutation_operator(self, individual, allow_shrink=True):
break
else:
unsuccesful_mutations += 1
-
# Sometimes you have pipelines for which every shrunk version has already been explored too.
# To still mutate the individual, one of the two other mutators should be applied instead.
if ((unsuccesful_mutations == 50) and
@@ -1611,8 +1606,7 @@ def _operator_count(self, individual):
How many operators in a pipeline
"""
operator_count = 0
- for i in range(len(individual)):
- node = individual[i]
+ for node in individual:
if type(node) is deap.gp.Primitive and node.name != 'CombineDFs':
operator_count += 1
return operator_count
diff --git a/tpot/builtins/feature_set_selector.py b/tpot/builtins/feature_set_selector.py
index 428a60ac..dd168a7a 100644
--- a/tpot/builtins/feature_set_selector.py
+++ b/tpot/builtins/feature_set_selector.py
@@ -25,8 +25,10 @@
import pandas as pd
import os, os.path
from sklearn.base import BaseEstimator
-from sklearn.feature_selection.base import SelectorMixin
-from sklearn.utils.validation import check_is_fitted
+try:
+ from sklearn.feature_selection._base import SelectorMixin
+except ImportError:
+ from sklearn.feature_selection.base import SelectorMixin
class FeatureSetSelector(BaseEstimator, SelectorMixin):
@@ -142,7 +144,7 @@ def _get_support_mask(self):
An element is True iff its corresponding feature is selected for
retention.
"""
- check_is_fitted(self, 'feat_list_idx')
+
n_features = len(self.feature_names)
mask = np.zeros(n_features, dtype=bool)
mask[np.asarray(self.feat_list_idx)] = True
diff --git a/tpot/builtins/stacking_estimator.py b/tpot/builtins/stacking_estimator.py
index e742e5a5..4da1883a 100644
--- a/tpot/builtins/stacking_estimator.py
+++ b/tpot/builtins/stacking_estimator.py
@@ -89,7 +89,7 @@ def transform(self, X):
if np.all(np.isfinite(y_pred_proba)):
X_transformed = np.hstack((y_pred_proba, X))
- # add class prodiction as a synthetic feature
+ # add class prediction as a synthetic feature
X_transformed = np.hstack((np.reshape(self.estimator.predict(X), (-1, 1)), X_transformed))
return X_transformed
diff --git a/tpot/export_utils.py b/tpot/export_utils.py
index a66e3da2..7842d26c 100644
--- a/tpot/export_utils.py
+++ b/tpot/export_utils.py
@@ -219,7 +219,7 @@ def merge_imports(old_dict, new_dict):
if impute:
pipeline_text += """from sklearn.impute import SimpleImputer
"""
- if random_state is not None:
+ if random_state is not None and 'sklearn.pipeline' in pipeline_imports:
pipeline_text += """from tpot.export_utils import set_param_recursive
"""
@@ -281,10 +281,20 @@ def pipeline_code_wrapper(pipeline_code, random_state=None):
results = exported_pipeline.predict(testing_features)
""".format(pipeline_code)
else:
- exported_code = """exported_pipeline = {}
+ if pipeline_code.startswith('make_pipeline'):
+ exported_code = """exported_pipeline = {}
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', {})
+exported_pipeline.fit(training_features, training_target)
+results = exported_pipeline.predict(testing_features)
+""".format(pipeline_code, random_state)
+ else:
+ exported_code = """exported_pipeline = {}
+# Fix random state in exported estimator
+if hasattr(exported_pipeline, 'random_state'):
+ setattr(exported_pipeline, 'random_state', {})
+
exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
""".format(pipeline_code, random_state)
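Reviewer note: the new `else` branch in this hunk emits a guarded `setattr` for exported code that is a single estimator rather than a `make_pipeline(...)` call. A minimal sketch of what that generated guard does, using two hypothetical stand-in classes (`WithSeed`, `WithoutSeed`) and a hypothetical helper (`fix_random_state`) that are not part of TPOT or scikit-learn:

```python
class WithSeed:
    """Hypothetical estimator that exposes a random_state attribute."""

    def __init__(self):
        self.random_state = None


class WithoutSeed:
    """Hypothetical estimator with no random_state attribute."""

    def __init__(self):
        self.alpha = 1.0


def fix_random_state(estimator, seed):
    """Set random_state only when the estimator already has that attribute."""
    if hasattr(estimator, "random_state"):
        setattr(estimator, "random_state", seed)
    return estimator


seeded = fix_random_state(WithSeed(), 39)
unseeded = fix_random_state(WithoutSeed(), 39)
print(seeded.random_state)
print(hasattr(unseeded, "random_state"))
```

The guard matters for estimators such as the `BernoulliNB` used in `test_export_random_ind`, which take no `random_state` parameter: without it, the exported script would attach a stray attribute to them.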
diff --git a/tpot/gp_deap.py b/tpot/gp_deap.py
index a68c95e1..ae1afab7 100644
--- a/tpot/gp_deap.py
+++ b/tpot/gp_deap.py
@@ -28,7 +28,7 @@
from inspect import isclass
from .operator_utils import set_sample_weight
from sklearn.utils import indexable
-from sklearn.metrics.scorer import check_scoring
+from sklearn.metrics import check_scoring
from sklearn.model_selection._validation import _fit_and_score
from sklearn.model_selection._split import check_cv
@@ -456,10 +456,11 @@ def _wrapped_cross_val_score(sklearn_pipeline, features, target,
test=test,
verbose=0,
parameters=None,
+ error_score='raise',
fit_params=sample_weight_dict)
for train, test in cv_iter]
- CV_score = np.array(scores)[:, 0]
- return np.nanmean(CV_score)
+ CV_score = np.array(scores)[:, 0]
+ return np.nanmean(CV_score)
except TimeoutException:
return "Timeout"
except Exception as e:
diff --git a/tpot/operator_utils.py b/tpot/operator_utils.py
index b6061b51..dd297214 100644
--- a/tpot/operator_utils.py
+++ b/tpot/operator_utils.py
@@ -25,7 +25,10 @@
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, RegressorMixin, TransformerMixin
-from sklearn.feature_selection.base import SelectorMixin
+try:
+ from sklearn.feature_selection._base import SelectorMixin
+except ImportError:
+ from sklearn.feature_selection.base import SelectorMixin
import inspect
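Reviewer note: the same `try`/`except ImportError` fallback for `SelectorMixin` now appears in three files in this diff. One possible way to express the pattern once is a small helper that tries a list of module paths in order. This is only a sketch: `import_first` is a hypothetical name, and the demo uses standard-library modules (plus a deliberately missing package) so that it runs without scikit-learn installed.

```python
import collections.abc
import importlib


def import_first(candidates, attr):
    """Return `attr` from the first importable module in `candidates`."""
    last_error = None
    for module_name in candidates:
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attr)
        except (ImportError, AttributeError) as exc:
            # Remember the failure and fall through to the next candidate.
            last_error = exc
    raise ImportError(
        "none of {} provides {}".format(candidates, attr)
    ) from last_error


# The first path does not exist, so the helper falls back to the second one.
Mapping = import_first(
    ["nonexistent_pkg_for_demo.abc", "collections.abc"], "Mapping"
)
print(Mapping)
```

With scikit-learn available, the equivalent call would list `sklearn.feature_selection._base` before `sklearn.feature_selection.base`, mirroring the order used in the hunks above.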