diff --git a/doc/ant-xgboost_on_sqlflow_design.md b/doc/ant-xgboost_design.md
similarity index 100%
rename from doc/ant-xgboost_on_sqlflow_design.md
rename to doc/ant-xgboost_design.md
diff --git a/doc/ant-xgboost_user_guide.md b/doc/ant-xgboost_user_guide.md
new file mode 100644
index 0000000000..8bcde9ecbc
--- /dev/null
+++ b/doc/ant-xgboost_user_guide.md
@@ -0,0 +1,221 @@
+# _User Guide:_ Ant-XGBoost on SQLFlow
+
+## Overview
+
+[Ant-XGBoost](https://github.com/alipay/ant-xgboost) is a fork of [dmlc/xgboost](https://github.com/dmlc/xgboost) maintained by active contributors of dmlc/xgboost at Alipay Inc.
+
+Ant-XGBoost extends `dmlc/xgboost` with the ability to run on Kubernetes and with automatic hyper-parameter estimation.
+In particular, Ant-XGBoost includes an `auto_train` method for automatic training and introduces an additional parameter, `convergence_criteria`, for a generalized early stopping strategy.
+See the Supplementary section for more details on automatic training and the generalized early stopping strategy.
+
+## Tutorial
+We provide an [interactive tutorial](../example/jupyter/tutorial_antxgb.ipynb) as a Jupyter notebook, which runs out of the box in the [SQLFlow playground](https://play.sqlflow.org).
+If you want to run it locally, you need to install SQLFlow first; see the [installation guide](../doc/installation.md).
+
+## Concepts
+### Estimators
+We provide several XGBoost estimators for a better user experience.
+All estimator names are case-insensitive and share the same prefix, `xgboost`. They are listed below, followed by a short example.
+
+* xgboost.Estimator
+
+  The general estimator; `train.objective` must be defined explicitly when using it.
+
+* xgboost.Classifier
+
+  Estimator for classification tasks; works together with `train.num_class` and defaults to binary classification when `train.num_class` is not set.
+
+* xgboost.BinaryClassifier
+
+  Estimator for binary classification tasks; sets `train.objective` to `binary:logistic`.
+
+* xgboost.MultiClassifier
+
+  Estimator for multi-class classification tasks; sets `train.objective` to `multi:softprob` and requires `train.num_class` > 2.
+
+* xgboost.Regressor
+
+  Estimator for regression tasks; sets `train.objective` to `reg:squarederror` (formerly `reg:linear`).
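+
+For instance, the two training statements below are equivalent ways to set up a three-class job: the first spells out the objective with the general estimator, while the second lets `xgboost.MultiClassifier` imply it. This is a minimal sketch that reuses the Iris table and model name from the tutorial referenced above.
+
+```sql
+-- General estimator: the objective must be set explicitly.
+SELECT * FROM iris.train
+TRAIN xgboost.Estimator
+WITH
+    train.objective = "multi:softprob",
+    train.num_class = 3
+COLUMN sepal_length, sepal_width, petal_length, petal_width
+LABEL class
+INTO sqlflow_models.my_iris_xgboost_model;
+
+-- Specialized estimator: the objective is implied.
+SELECT * FROM iris.train
+TRAIN xgboost.MultiClassifier
+WITH
+    train.num_class = 3
+COLUMN sepal_length, sepal_width, petal_length, petal_width
+LABEL class
+INTO sqlflow_models.my_iris_xgboost_model;
+```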
+
+### Columns
+
+* Feature Columns
+
+  For now, two kinds of feature columns are available.
+
+  The first is the `dense schema`, which concatenates numeric table columns transparently, such as `COLUMN f1, f2, f3, f4`.
+
+  The second is the `sparse key-value schema`, which accepts LIBSVM-style key-value strings formatted like `$k1:$v1,$k2:$v2,...`.
+  This schema is marked with the keyword `SPARSE`, such as `COLUMN SPARSE(col1)`.
+
+* Label Column
+
+  Following the general SQLFlow syntax, the label clause of Ant-XGBoost is formatted as `LABEL $label_col`.
+
+* Group Column
+
+  In training mode, a group column can be declared in a separate column clause. The group column is identified by the keyword `group`, such as `COLUMN ${group_col} FOR group` (a combined example appears after this list).
+
+* Weight Column
+
+  Like the group column, the weight column is identified by the keyword `weight`, such as `COLUMN ${weight_col} FOR weight`.
+
+* Result Columns
+
+  The schema of the straightforward result (the class id for classification tasks, the score for regression tasks) follows the general SQLFlow syntax (`PREDICT ${output_table}.${result_column}`).
+
+  In addition, we provide supplementary information about the XGBoost prediction, which can be configured with `pred.attributes`:
+
+  * append columns
+
+    Columns of the prediction data table that should be appended to the result table, such as an id or label column.
+
+    The syntax is `pred.append_columns = [$col1, $col2, ...]`.
+
+  * classification probability
+
+    The probability of the chosen class; only available in classification tasks.
+
+    The syntax is `pred.prob_column = ${col}`.
+
+  * classification detail
+
+    A JSON string holding the probability distribution over all classes, formatted like `{$class_id:$class_prob,...}`; only available in classification tasks.
+
+    The syntax is `pred.detail_column = ${col}`.
+
+  * encoding of leaf indices
+
+    The predicted leaf index of each tree; the indices are joined in order into a string formatted as `$id_1,$id_2,...`.
+
+    The syntax is `pred.encoding_column = ${col}`.
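+
+To illustrate the sparse, group and weight clauses together, here is a hypothetical learning-to-rank statement. The table `ltr.train`, its columns (`features`, `qid`, `w`, `relevance`), the model name and the choice of the `rank:pairwise` objective are illustrative assumptions, not part of the shipped datasets.
+
+```sql
+SELECT features, qid, w, relevance FROM ltr.train
+TRAIN xgboost.Estimator
+WITH
+    train.objective = "rank:pairwise",
+    train.num_round = 100
+-- LIBSVM-style key-value features
+COLUMN SPARSE(features)
+-- per-query grouping and per-instance weights
+COLUMN qid FOR group
+COLUMN w FOR weight
+LABEL relevance
+INTO sqlflow_models.my_ltr_xgboost_model;
+```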
+
+### Attributes
+
+There are two kinds of attributes: `train.attributes` and `pred.attributes`.
+`train.attributes`, which start with the prefix `train.`, only work in training mode.
+`pred.attributes`, which start with the prefix `pred.`, only work in prediction mode.
+
+All attributes are optional, except that `train.objective` must be defined when training with `xgboost.Estimator`.
+
+#### Available train.attributes
+
+* [General Params](https://xgboost.readthedocs.io/en/latest/parameter.html#general-parameters)
+  * train.booster
+  * train.verbosity
+
+* [Tree Booster Params](https://xgboost.readthedocs.io/en/latest/parameter.html#parameters-for-tree-booster)
+  * train.eta
+  * train.gamma
+  * train.max_depth
+  * train.min_child_weight
+  * train.max_delta_step
+  * train.subsample
+  * train.colsample_bytree
+  * train.colsample_bylevel
+  * train.colsample_bynode
+  * train.lambda
+  * train.alpha
+  * train.tree_method
+  * train.sketch_eps
+  * train.scale_pos_weight
+  * train.grow_policy
+  * train.max_leaves
+  * train.max_bin
+  * train.num_parallel_tree
+
+* [Learning Task Params](https://xgboost.readthedocs.io/en/latest/parameter.html#learning-task-parameters)
+  * train.objective
+  * train.eval_metric
+  * train.seed
+  * train.num_round
+    * The number of boosting rounds
+  * train.num_class
+    * The number of label classes in classification tasks
+
+* AutoTrain Params
+  * train.convergence_criteria
+    * see the Supplementary section for more details
+  * train.auto_train
+    * see the Supplementary section for more details
+
+#### Available pred.attributes
+
+The following attributes work in prediction mode; a combined example follows the list.
+
+* pred.append_columns
+* pred.prob_column
+* pred.detail_column
+* pred.encoding_column
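+
+For example, the statement below requests all four kinds of supplementary output at once. It is a sketch based on the Iris tutorial; the output column names `p`, `dist` and `enc` are arbitrary choices.
+
+```sql
+SELECT * FROM iris.test
+PREDICT iris.predict.result
+WITH
+    pred.append_columns = [class],
+    pred.prob_column = p,
+    pred.detail_column = dist,
+    pred.encoding_column = enc
+USING sqlflow_models.my_iris_xgboost_model;
+```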
+
+## Overall SQL Syntax for Ant-XGBoost
+### Training Syntax
+```sql
+-- standard select clause
+SELECT ... FROM ${TABLE_NAME}
+-- train clause
+TRAIN xgboost.${estimatorType}
+WITH
+    [optional] ${train.attributes}
+    ......
+    ......
+COLUMN ${feature_columns}
+[optional] COLUMN ${group_column} FOR group
+[optional] COLUMN ${weight_column} FOR weight
+LABEL ${label_column}
+INTO ${model};
+```
+### Prediction Syntax
+```sql
+-- standard select clause
+SELECT ... FROM ${TABLE_NAME}
+-- pred clause
+PREDICT ${output_table}.${result_column}
+WITH
+    [optional] ${pred.attributes}
+    ......
+USING ${model};
+```
+
+## Supplementary
+### Generalized Early Stopping Strategy
+`dmlc/xgboost` stops training when there has been no significant improvement in the most recent `n` boosting rounds, where `n` is a configurable parameter.
+In Ant-XGBoost, we generalize this strategy and call the new strategy the convergence test:
+we keep track of the series of metric values and determine whether the series has converged.
+There are three main parameters to the convergence test: `minNumPoints`, `n` and `c`.
+Only once the series is at least `minNumPoints` long does it become eligible for the convergence test.
+We then find the index `idx` of the best metric value so far,
+and we say the series has converged if `idx + n < size * c`, where `size` is the current number of points in the series.
+The intuition is that the best metric value should have peaked (or bottomed out) with a wide margin.
+
+With `n` and `c` we can implement complex convergence rules, but there are two common cases:
+* `n > 0` and `c = 1.0`
+
+  This reduces to the standard early stopping strategy employed by dmlc/xgboost: since `size - 1 - idx` rounds have elapsed since the best value, `idx + n < size` holds exactly when at least `n` rounds have passed without improvement.
+
+* `n = 0` and `c in [0, 1]`
+
+  For example, `n = 0` and `c = 0.8` means at least 20% of the points must come after the best metric value; with 100 points recorded, the best value must have occurred before point 80. A smaller `c` leads to a more conservative convergence test. This rule tests convergence in an adaptive way: for problems where the metric values are noisy and grow slowly, it has a better chance of finding the optimal model.
+
+In addition, the convergence test understands the optimization direction of all built-in metrics, so there is no need to set the `maximize` parameter (it defaults to `false`, and forgetting to set it often leads to strange behavior when the metric should be maximized).
+
+### AutoTrain
+On top of the convergence test, we implement a simple `auto_train` method. There are several components in `auto_train`:
+* Automatic parameter validation, setting and rewriting
+
+  Setting the right parameters for XGBoost is not easy. For example, when working with `binary:logistic`, one should not set `num_class` to 2 (otherwise XGBoost fails with an exception).
+
+  In Ant-XGBoost, we validate parameters to make sure they are consistent with each other; e.g., `num_class = 3` together with `objective = binary:logistic` raises an exception.
+
+  In addition, we try our best to understand the input parameters and automatically set or rewrite some of them in `auto_train` mode.
+  For example, when the feature dimension is very high, building a single tree is very inefficient,
+  so we automatically set `colsample_bytree` to make sure at most 2000 features are used to build each tree. Note that automatic parameter rewriting is only turned on in `auto_train` mode; in standard `train` mode, we only validate parameters and the behavior is fully controlled by the user.
+
+* Automatic training
+
+  With the convergence test, the number of trees in a boosted ensemble becomes a less important parameter;
+  one can always set a very large number and rely on the convergence test to figure out the right number of trees.
+  The most important parameters to tune now become the learning rate and the maximum depth.
+  In Ant-XGBoost, we employ grid search with early stopping to efficiently search for the best model structure;
+  unpromising learning rates or depths are skipped entirely.
+
+  While the current `auto_train` method is a very simple approach, we are working on better strategies to further scale up hyper-parameter tuning in XGBoost training.
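+
+Putting these pieces together, an `auto_train` statement only needs to name an estimator and an upper bound on the number of boosting rounds; the convergence test decides when to stop. The sketch below mirrors the Boston housing example from the tutorial notebook.
+
+```sql
+SELECT * FROM boston.train
+TRAIN xgboost.Regressor
+WITH
+    train.auto_train = true,
+    train.num_round = 50
+COLUMN crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat
+LABEL medv
+INTO sqlflow_models.my_boston_xgboost_model;
+```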
diff --git a/example/jupyter/tutorial_antxgb.ipynb b/example/jupyter/tutorial_antxgb.ipynb
new file mode 100644
index 0000000000..b792a1dd6c
--- /dev/null
+++ b/example/jupyter/tutorial_antxgb.ipynb
@@ -0,0 +1,271 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Ant-XGBoost on SQLFlow Tutorial\n",
+    "This tutorial demonstrates how to\n",
+    "1. train an XGBoost model for Iris flower classification;\n",
+    "2. auto-train an XGBoost model to fit the Boston housing price.\n",
+    "\n",
+    "## The Dataset\n",
+    "#### Iris\n",
+    "The Iris dataset contains four features and one label. The four features identify the botanical characteristics of individual Iris flowers. Each feature is stored as a single float number. The label indicates the class of individual Iris flowers and is stored as an integer with possible values 0, 1 and 2.\n",
+    "#### Boston housing price\n",
+    "The Boston data frame has 506 rows and 14 columns. It contains the following columns:\n",
+    "- crim\n",
+    "  - per capita crime rate by town.\n",
+    "- zn\n",
+    "  - proportion of residential land zoned for lots over 25,000 sq.ft.\n",
+    "- indus\n",
+    "  - proportion of non-retail business acres per town.\n",
+    "- chas\n",
+    "  - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).\n",
+    "- nox\n",
+    "  - nitrogen oxides concentration (parts per 10 million).\n",
+    "- rm\n",
+    "  - average number of rooms per dwelling.\n",
+    "- age\n",
+    "  - proportion of owner-occupied units built prior to 1940.\n",
+    "- dis\n",
+    "  - weighted mean of distances to five Boston employment centres.\n",
+    "- rad\n",
+    "  - index of accessibility to radial highways.\n",
+    "- tax\n",
+    "  - full-value property-tax rate per 10,000 dollars.\n",
+    "- ptratio\n",
+    "  - pupil-teacher ratio by town.\n",
+    "- black\n",
+    "  - 1000 * (Bk - 0.63) ^ 2, where Bk is the proportion of blacks by town.\n",
+    "- lstat\n",
+    "  - lower status of the population (percent).\n",
+    "- medv (label)\n",
+    "  - median value of owner-occupied homes in units of 1,000 dollars.\n",
+    "\n",
+    "\n",
+    "We have split both datasets into train and test tables: `iris.train`, `iris.test`, `boston.train` and `boston.test`. We will use them as training data and test data respectively.\n",
+    "\n",
+    "We can have a quick peek at the data by running the following standard SQL statements."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%sqlflow\n",
+    "SELECT * FROM iris.train LIMIT 5;"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%sqlflow\n",
+    "SELECT * FROM boston.train LIMIT 5;"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Iris Classification\n",
+    "First, let's train an XGBoost model to classify Iris flowers. Since there are three kinds of Iris flowers, we use the `multi:softprob` objective and set `num_class` to 3. We also configure the tree depth, the learning rate and the number of boosting rounds. All of the above can be done by specifying the training clause of SQLFlow's extended syntax.\n",
+    "\n",
+    "```\n",
+    "TRAIN xgboost.Estimator\n",
+    "WITH\n",
+    "    train.objective = \"multi:softprob\",\n",
+    "    train.num_class = 3,\n",
+    "    train.max_depth = 4,\n",
+    "    train.eta = 0.5,\n",
+    "    train.num_round = 10\n",
+    "```\n",
+    "\n",
+    "To specify the training data, we use a standard SQL statement like `SELECT * FROM iris.train`.\n",
+    "\n",
+    "We explicitly specify which columns are used as features and which column is used as the label by writing\n",
+    "\n",
+    "```\n",
+    "COLUMN sepal_length, sepal_width, petal_length, petal_width\n",
+    "LABEL class\n",
+    "```\n",
+    "At the end of the training process, we save the trained XGBoost model into table `sqlflow_models.my_iris_xgboost_model` by writing\n",
+    "```\n",
+    "INTO sqlflow_models.my_iris_xgboost_model\n",
+    "```\n",
+    "\n",
+    "Putting it all together, the SQLFlow training statement for the Iris task is complete."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%sqlflow\n",
+    "SELECT *\n",
+    "FROM iris.train\n",
+    "TRAIN xgboost.Estimator\n",
+    "WITH\n",
+    "    train.objective = \"multi:softprob\",\n",
+    "    train.num_class = 3,\n",
+    "    train.max_depth = 4,\n",
+    "    train.eta = 0.5,\n",
+    "    train.num_round = 10\n",
+    "COLUMN sepal_length, sepal_width, petal_length, petal_width\n",
+    "LABEL class\n",
+    "INTO sqlflow_models.my_iris_xgboost_model;"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Next, let's run prediction on `iris.test`.\n",
+    "\n",
+    "To specify the prediction data, we use a standard SQL statement like `SELECT * FROM iris.test`.\n",
+    "\n",
+    "Say we want the model previously stored at `sqlflow_models.my_iris_xgboost_model` to read the prediction data and write the predicted result into column `result` of table `iris.predict`.\n",
+    "\n",
+    "We can add some supplementary outputs by setting `pred.attributes`.\n",
+    "In this case, we append the ground truth of the prediction data with `pred.append_columns = [class]`.\n",
+    "We also want to inspect the probability information,\n",
+    "so we request the probability of the chosen class with `pred.prob_column = p` and the probability distribution with `pred.detail_column = dist`.\n",
+    "\n",
+    "We can write the following SQLFlow prediction statement."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%sqlflow\n",
+    "SELECT *\n",
+    "FROM iris.test\n",
+    "PREDICT iris.predict.result\n",
+    "WITH\n",
+    "    pred.append_columns = [class],\n",
+    "    pred.prob_column = p,\n",
+    "    pred.detail_column = dist\n",
+    "USING sqlflow_models.my_iris_xgboost_model;"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "After the prediction, we can check out the prediction result."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%sqlflow\n",
+    "SELECT *\n",
+    "FROM iris.predict\n",
+    "LIMIT 5;"
+   ]
+  },
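+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Since we appended the ground-truth `class` column next to the predicted `result`, a plain SQL query gives a rough accuracy count. The query below is just a sketch; it assumes the backing database supports standard `CASE` expressions."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%sqlflow\n",
+    "SELECT COUNT(*) AS total,\n",
+    "       SUM(CASE WHEN class = result THEN 1 ELSE 0 END) AS correct\n",
+    "FROM iris.predict;"
+   ]
+  },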
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Fitting the Boston Housing Price\n",
+    "Having covered the essential SQLFlow concepts in the Iris demo, let's now try a case with auto-train, an additional feature of Ant-XGBoost.\n",
+    "\n",
+    "Since `medv` is continuous, we use the `reg:squarederror` objective to fit it. With SQLFlow, there is an alternative way to define the XGBoost objective: naming a specialized estimator. In this case, we specify `TRAIN xgboost.Regressor` instead of writing the objective explicitly.\n",
+    "\n",
+    "Altogether, we get a quite concise training statement."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%sqlflow\n",
+    "SELECT *\n",
+    "FROM boston.train\n",
+    "TRAIN xgboost.Regressor\n",
+    "WITH\n",
+    "    train.auto_train = true,\n",
+    "    train.num_round = 50\n",
+    "COLUMN crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat\n",
+    "LABEL medv\n",
+    "INTO sqlflow_models.my_boston_xgboost_model;"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Below is the corresponding prediction statement, in which we append all columns of the prediction data to the result table."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 24,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%sqlflow\n",
+    "SELECT *\n",
+    "FROM boston.test\n",
+    "PREDICT boston.predict.score\n",
+    "WITH\n",
+    "    pred.append_columns = [crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat, medv]\n",
+    "USING sqlflow_models.my_boston_xgboost_model;"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's have a glance at the prediction results."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%sqlflow\n",
+    "SELECT * FROM boston.predict;"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.7"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}