IRisk Lab Project, UIUC, Fall 2021
Now a personally-maintained project
The project aims to create a AutoML package with special focus on insurance data (with some imbalance in nature). The pipeline is now workable with encoding, imputation, balancing, scaling, feature selection, models (regression, classification models) as pipeline components and model selection/hyperparameter optimization (HPO) process as it's core of connection among all components for tuning.
System Requirements:
-
Linux (write and test on Ubuntu 20.04) and Windows (Windows is now compatible)
-
As all trials are running parallelized, more threads (correspondingly, more memory) will increase the training efficiency
-
pip
is required for installation (git
for building from source) -
Python version: should support version >= 3.7 (write and test on
3.8
and3.9
) -
If neural network is required, GPU supported device will increase the training efficiency
Open a command terminal.
pip install InsurAutoML
Open a command terminal.
git clone https://github.com/PanyiDong/InsurAutoML.git
cd InsurAutoML
pip install -e .
This method will use setup.py
to install the dependencies, by default, if no GPU support, should install normal
version. If you wish to use a extended version with more flexibilities, use pip install -e .[extended]
; or, if GPU is supported, and you wish to test neural network related architectures, use pip install -e .[nn]
for neural network installation.
At this moment, extended
contains few more ML packages and search algorithms that allows testing on a larger model/hyperparameter space with different search strategies. The differences may becomes larger for later versions.
It's recommended to use above
setup.py
to install dependencies as it provides more flexibility to recognize your operating system, compilation issues and some optional packages.
pip install -r requirements.txt
For neural network related support (recommended for CUDA
supported devices), please use pip install -r requirements_nn.txt
. The pipeline works without any neural network support with the loss of neural network support. If no CUDA device available, please use a non-torch
environment as those neural network methods can take forever to finish.
NOTE: 1. The installation of
rpy2
is not straightforward, so it may not actually useable, but it's in-built here inrequirements.txt
. If user install throughsetup.py
, it will not be included if R is not installed (through the environment variableR_HOME
).
After putting data to the folder, few of arguments need to be recognized.
Must have arguments :
train_data
and response
are two must-have arguments corresponds to training data name (no file extension needed), and response variable name.
Your can specify the data folder (or by default at current folder), test_data name (or evaluation set will be split from train_data), evaluation metrics and all model parameters by arguments.
Optional arguments :
data_folder
: refers to the folder where the data is stored, default set to current folder.
test_data
: refers to the test_data file name (no file extension needed), default is to split from train_data
.
test_eval
: refers to evaluation metric, default is set to accuracy for classification and MSE for regression tasks.
n_estimators
: refers to number of pipelines used to form the final ensemble, default set to 5.
timeout
: refers to the time budget in seconds for experiments, default set to 360.
max_evals
: refers to number of trials for experiments, default set to 64.
Use python main.py -h
for full list of supported arguments and explanation.
Example below runs a classification task on heart.csv
file in example/example_data
folder
In the command terminal:
python main.py --data_folder example/example_data --train_data heart --response HeartDisease
And an example run regression task on insurance.csv
file in example/example_data
folder
python main.py --data_folder example/example_data --train_data insurance --response expenses
In the end, will see the test evaluation results, recordings of the training procedures and optimal model ensemble file.
Or, you can treat it like a package and follows the fit/predict workflow like jupyter notebooks in examples
.
from InsurAutoML import AutoTabular
model = AutoTabular()
model.fit(train_X, train_y)
model.predict(test_X)
~~By default, progress reporter CLIReporter
is prepared for terminal/command-like report, when using jupyter notebook, call by AutoTabular(progress_reporter = "JupyterNotebookReporter")
for overwriting previous outputs. ~~ Now, the pipeline can identify whether console terminal/Jupyter Notebook environment is used, don't need to worry about it.
One important issue I find now is that, ray.tune
does not force to stop running trials but only stop generating new trials. So, if setting timeout
too low, it's common the fitting does not end in time (long running trials are expected not to finish in short time). However, if the pipeline is used as normal cases, which the timeout
and max_evals
both should be set to a rather large number, this case should not be as significant as very short time limit. I think that force trial runtime to stop can be a solution, but with few tryouts, I haven't found anything useful yet.
Required Packages: numpy, pandas, scipy, matplotlib, ray, ray[tune], ray[rllib], tqdm, mlflow, tensorboardX, hyperopt, auto-sklearn
$^{1}$ , scikit-learn, lightgbm, xgboost, pygam, rpy2$^{2}$ , tensorflow$^{3}$ , pytorch$^{4}$ 1. auto-sklearn is a AutoML package, which provides some models for the pipeline. But it's only compatible with Linux system. So if you are using windows, this package should not be installed.
2. rpy2 is only used for reading .rda/.rdata datasets. If rpy2 is not installed, it will not cause import problems (using importlib to check), but you will not be able to read R datasets
3. tensorflow is now only used for imputation with GAIN network. If tensorflow not installed, it will not caused import problems, but the GAIN imputation method will be disabled in default hyperparameter space.
4. pytorch is required for neural network support.
Current Progress:
base: load data with some common file format (.csv, .data, .rda, etc.) given path into a database dict, and some common utilities
encoding: encoding for categorical/string type to numerical, and can perform dummy coding; category shift
impute for missing values: some of the available methods: Simple Imputer, Miss Forest Imputer, MICE, GAIN, etc.
common scaling methods: Standardize, MinMax Scaling, Winsorization, etc.
deal with imbalance data: use over-/under-sampling methods to balance the dataset, some of the available methods: Tome kLink, One Sided Selection (OSS), Smote, etc.
feature selection: PCA, AFFS, etc. And some models from sklearn will be applied.
apply
ray.tune
(with plentiful search algorithms and search schedulers) to create a pipeline of AutoML workflow. Consider the methods as a hyperparameter, and create a hyperparameter space, where we can find the optimal ML workflow. Only supervised classification/regression models supported.
The pipeline of AutoML:
Data Encoding: encode some string type and datetime types to numerical type so it can be more easily handled.
Imputation: impute the missing values in the dataset using supervised/unsupervised learning in case the following processes can not handle NaN values. Will be no_processing if no missing values
Balancing: balance the datasets using under-/over-sampling methods to emphasize the importance of extreme minority class. Will not change the datasets if the datasets are already balanced.
Scaling: scale the datasets to increase the efficiency of convergence, or allows better performance. Can be selected as no_processing.
Feature Selection: remove redundant or insignificant features to improve the performance of the model. Can be selected as no_processing.
Regression/Classification: perform regression/classification models to fit the datasets.
With the new update, customized/personalized preprocessing methods and models are supported by using the additional.py
file as additional input. In the file, there's also a template of how to write the methods and corresponding hyperparameter space.
Save and load the models: To save reproduction time, when the optimal model/hyperparameter settings are configured, all settings will be stored as a model
file (pickle
file, file name may vary by experiments, controlled by argument model_name
in Python interface, or train_data
+ "_model" in command line interface). Next time when AutoML pipeline starts training, it will detect whether the model
file exists and directly use the optimal pipeline, which can save the training time (for optimization). Fitted models, preprocessed train/test datasets, hyperparameter settings for each trials will also by stored in tmp
folders for inspection. (Both model
and tmp
are changeable arguments if you prefer other names.)
Configuration allowed for AutoTabular
(AutoTabularClassifier
, AutoTabularRegressor
) (only some common arguments here):
timeout: maximum allowed time for the tuning job.
max_evals: maximum allowed trials for the tuning job. (each trial is trained multiple times, whose training iterations are controlled by limitation time,
timeout
and performance improvements.)allow_error_prop: maximum allowed failure errors proportion (number of allowed error = proportion * max_evals)
model_name: saved model names, will be used to recognized for later runs
encoder, imputer, balancing, scaling, feature_selection, models:
auto
(use default methods) or limit as a list of allowed methods.objective: metrics use to evaluate trials' performance
search_algo: search algorithm,
GridSearch
,RandomSearch
,HyperOpt
,Optuna
,CFO
andNevergrad
are now supported. Default is set toHyperOpt
. The use ofCFO
andNevergrad
requires additional packages, see the error message for installation guide when using. This will be all algorithms that will be supported in short time.cpu_threads, use_gpu: computational resources used for the job, will use all available by default
Other files in the repository:
-
report.pdf
and presentation slides provides an introduction to the basic idea of AutoML pipeline and demonstrates test performance on some real-life datasets, andAppendix
provides test datasets in the report. -
Dockerfiles
provides a Docker environment preparation files, you can easily build a virtual environment and test your datasets on the AutoML pipeline. The dockerfiles will install necessary packages and clone this repository to workspace.
I'm still interested to expand current AutoML package, and have plans to add MLP/RNN structure NAS for NLP tasks (there's a workable algorithm for MLP NAS through nni and still in development of complete structure). Furthermore, explore the topics in explainability, fairness of ML models. But no schedules for such progress can be made.