From 0333b4f882d96837251e3a9d823479ad84a90d83 Mon Sep 17 00:00:00 2001
From: "wm624@hotmail.com"
Date: Thu, 8 Dec 2016 15:37:14 -0800
Subject: [PATCH 1/5] add spark.logit vignettes

---
 R/pkg/vignettes/sparkr-vignettes.Rmd | 103 +++++++++++++++++++++++++++
 1 file changed, 103 insertions(+)

diff --git a/R/pkg/vignettes/sparkr-vignettes.Rmd b/R/pkg/vignettes/sparkr-vignettes.Rmd
index a36f8fc0c145..758ac4275c6d 100644
--- a/R/pkg/vignettes/sparkr-vignettes.Rmd
+++ b/R/pkg/vignettes/sparkr-vignettes.Rmd
@@ -768,6 +768,109 @@ newDF <- createDataFrame(data.frame(x = c(1.5, 3.2)))
 head(predict(isoregModel, newDF))
 ```
 
+### Logistic Regression Model
+
+(Coming in 2.1.0)
+
+`spark.logit` fits a [Logistic Regression Model](https://en.wikipedia.org/wiki/Logistic_regression) against a Spark DataFrame.
+Logistic regression is a widely-used model when the response is categorical. It can be seen as a special case of the [Generalized Linear Model](https://en.wikipedia.org/wiki/Generalized_linear_model).
+`spark.logit` can be used to predict a binary outcome by using binomial logistic regression, or it can be used to predict
+a multiclass outcome by using multinomial logistic regression. The `family` parameter can be used to select between the
+two algorithms, or leave it unset and Spark will infer the correct variant.
+
+#### Binomial logistic regression
+
+For the binomial model, suppose the response variable takes values in
+
+$$\mathbb{G} = \{1,2\}$$
+
+Denote
+
+$$y_i = I(g_i = 1)$$
+
+We model
+
+$$\mathbf{Pr}(\mathbf{G} = 2 | \mathbf{X} = x) = \frac{e^{\beta_0+\beta^Tx}}{1 + e^{\beta_0+\beta^Tx}}$$
+
+which can be written in the following form
+
+$$\log\left(\frac{\mathbf{Pr}(\mathbf{G}=2|X=x)}{\mathbf{Pr}(\mathbf{G}=1|X=x)}\right) = \beta_0+\beta^Tx,$$
+
+the so-called “logistic” or log-odds transformation.
+
+The objective function for penalized logistic regression uses the negative of the binomial log-likelihood, and is
+
+$$\min_{(\beta_0, \beta) \in \mathfrak{R}^{p+1}} -\left[\frac{1}{N}\sum_{i=1}^N y_i \cdot (\beta_0 + x_i^T\beta) - \log(1+e^{\beta_0+x_i^T\beta})\right] + \lambda \left[\frac{1}{2}(1-\alpha)\left\| \beta \right\|_2^2 + \alpha \left\| \beta \right\|_1\right]$$
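+
+To make the log-odds relationship concrete, here is a small base R sketch (illustrative only; the intercept, coefficient, and feature value below are made up):
+
+```{r}
+# Made-up coefficients for a single-feature binomial model
+beta0 <- -1.5
+beta <- 0.8
+x <- 2.0
+eta <- beta0 + beta * x          # linear predictor, i.e. the log-odds
+p <- exp(eta) / (1 + exp(eta))   # Pr(G = 2 | X = x) via the logistic function
+c(log_odds = eta, probability = p)
+```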
+
+#### Multinomial logistic regression
+
+Multiclass classification is supported via multinomial logistic (softmax) regression. In multinomial logistic regression,
+the algorithm produces $K$ sets of coefficients, or a matrix of dimension $K \times J$ where $K$ is the number of outcome classes
+and $J$ is the number of features. If the algorithm is fit with an intercept term then a length $K$ vector of intercepts is available.
+
+The conditional probabilities of the outcome classes $k \in 1,2, ..., K$ are modeled using the softmax function:
+
+$$Pr(Y=k|\mathbf{X}, \beta_k, \beta_{0k}) = \frac{e^{\beta_k \cdot \mathbf{X} + \beta_{0k}}}{\sum_{k'=0}^{K-1} e^{\beta_{k'} \cdot \mathbf{X} + \beta_{0k'}}}$$
+
+We minimize the weighted negative log-likelihood, using a multinomial response model, with an elastic-net penalty to control for overfitting:
+
+$$\min_{(\beta_0, \beta) \in \mathfrak{R}^{p+1}} -\left[\frac{1}{N}\sum_{i=1}^N w_i \cdot \log Pr(Y=y_i|\mathbf{x}_i)\right] + \lambda \left[\frac{1}{2}(1-\alpha)\left\| \beta \right\|_2^2 + \alpha \left\| \beta \right\|_1\right]$$
+
+For a detailed derivation please see [here](https://en.wikipedia.org/wiki/Multinomial_logistic_regression#As_a_log-linear_model).
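+
+As a quick numeric illustration of the softmax mapping (plain R; the three class scores below are made up):
+
+```{r}
+# Made-up linear-predictor scores for K = 3 classes
+scores <- c(1.2, 0.3, -0.5)
+probs <- exp(scores) / sum(exp(scores))
+probs        # conditional class probabilities
+sum(probs)   # softmax probabilities sum to 1
+```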
+
+There are several parameters `spark.logit` takes for fitting the model; an illustrative call combining several of them follows the list.
+
+* `regParam`: the regularization parameter.
+
+* `elasticNetParam`: the ElasticNet mixing parameter. For alpha = 0.0, the penalty is an L2 penalty.
+For alpha = 1.0, it is an L1 penalty. For 0.0 < alpha < 1.0, the penalty is a combination of L1 and L2.
+The default is 0.0, which is an L2 penalty.
+
+* `maxIter`: the maximum number of iterations.
+
+* `tol`: the convergence tolerance of the iterations.
+
+* `family`: the name of the family, which describes the label distribution to be used in the model. Supported options:
+"auto" (automatically select the family based on the number of classes: if the number of classes is 1 or 2, set to "binomial";
+otherwise, set to "multinomial"), "binomial" (binary logistic regression with pivoting), and "multinomial" (multinomial logistic (softmax) regression without pivoting).
+
+* `standardization`: whether to standardize the training features before fitting the model. The coefficients
+of the model are always returned on the original scale, so standardization is transparent to users. Note that with or without standardization,
+the model should always converge to the same solution when no regularization is applied. The default is TRUE, the same as glmnet.
+
+* `thresholds`: in binary classification, a threshold in the range [0, 1]. If the estimated probability of class label 1 is > threshold,
+then predict 1, else 0. A high threshold encourages the model to predict 0 more often; a low threshold encourages
+the model to predict 1 more often. Note: setting this with threshold p is equivalent to setting thresholds c(1-p, p).
+In multiclass (or binary) classification, a vector that adjusts the probability of predicting each class. The vector must have length
+equal to the number of classes, with values > 0, except that at most one value may be 0. The class with the largest value
+of p/t is predicted, where p is the original probability of that class and t is the class's threshold.
+
+* `weightCol`: the weight column name.
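+
+As an illustrative sketch (not evaluated; the parameter values are arbitrary, and `training` is the two-class `SparkDataFrame` constructed in the binomial example below), several of these parameters can be combined in a single call:
+
+```{r, eval=FALSE}
+# Hypothetical parameter settings, for illustration only
+model <- spark.logit(training, Species ~ .,
+                     regParam = 0.3, elasticNetParam = 0.5,
+                     maxIter = 50, tol = 1e-6, family = "auto")
+```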
+
+Let us look at an artificial example.
+
+Binomial logistic regression
+```{r, warning=FALSE}
+df <- createDataFrame(iris)
+training <- df[df$Species %in% c("versicolor", "virginica"), ]
+model <- spark.logit(training, Species ~ ., regParam = 0.5)
+summary <- summary(model)
+head(summary)
+```
+
+Fitted values on training data
+```{r}
+fitted <- predict(model, training)
+```
+
+Multinomial logistic regression
+```{r, warning=FALSE}
+df <- createDataFrame(iris)
+model <- spark.logit(df, Species ~ ., regParam = 0.5)
+summary <- summary(model)
+head(summary)
+```
+
 #### What's More?
 We also expect Decision Tree, Random Forest, Kolmogorov-Smirnov Test coming in the next version 2.1.0.

From fc175ff7436cfdb2ff868a4d177dddd2546ba443 Mon Sep 17 00:00:00 2001
From: "wm624@hotmail.com"
Date: Thu, 8 Dec 2016 16:03:45 -0800
Subject: [PATCH 2/5] clean up

---
 R/pkg/vignettes/sparkr-vignettes.Rmd | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/R/pkg/vignettes/sparkr-vignettes.Rmd b/R/pkg/vignettes/sparkr-vignettes.Rmd
index 758ac4275c6d..9d2a9a51e584 100644
--- a/R/pkg/vignettes/sparkr-vignettes.Rmd
+++ b/R/pkg/vignettes/sparkr-vignettes.Rmd
@@ -565,7 +565,7 @@ head(aftPredictions)
 
 #### Gaussian Mixture Model
 
-(Coming in 2.1.0)
+(Added in 2.1.0)
 
 `spark.gaussianMixture` fits multivariate [Gaussian Mixture Model](https://en.wikipedia.org/wiki/Mixture_model#Multivariate_Gaussian_mixture_model) (GMM) against a `SparkDataFrame`. [Expectation-Maximization](https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm) (EM) is used to approximate the maximum likelihood estimator (MLE) of the model.
@@ -584,7 +584,7 @@ head(select(gmmFitted, "V1", "V2", "prediction"))
 
 #### Latent Dirichlet Allocation
 
-(Coming in 2.1.0)
+(Added in 2.1.0)
 
 `spark.lda` fits a [Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) model on a `SparkDataFrame`. It is often used in topic modeling in which topics are inferred from a collection of text documents. LDA can be thought of as a clustering algorithm as follows:
@@ -657,7 +657,7 @@ perplexity
 
 #### Multilayer Perceptron
 
-(Coming in 2.1.0)
+(Added in 2.1.0)
 
 Multilayer perceptron classifier (MLPC) is a classifier based on the [feedforward artificial neural network](https://en.wikipedia.org/wiki/Feedforward_neural_network). MLPC consists of multiple layers of nodes. Each layer is fully connected to the next layer in the network. Nodes in the input layer represent the input data. All other nodes map inputs to outputs by a linear combination of the inputs with the node’s weights $w$ and bias $b$ and applying an activation function. This can be written in matrix form for MLPC with $K+1$ layers as follows:
 $$
@@ -694,7 +694,7 @@ MLPC employs backpropagation for learning the model. We use the logistic loss fu
 
 #### Collaborative Filtering
 
-(Coming in 2.1.0)
+(Added in 2.1.0)
 
 `spark.als` learns latent factors in [collaborative filtering](https://en.wikipedia.org/wiki/Recommender_system#Collaborative_filtering) via [alternating least squares](http://dl.acm.org/citation.cfm?id=1608614).
@@ -725,7 +725,7 @@ head(predicted)
 
 #### Isotonic Regression Model
 
-(Coming in 2.1.0)
+(Added in 2.1.0)
 
 `spark.isoreg` fits an [Isotonic Regression](https://en.wikipedia.org/wiki/Isotonic_regression) model against a `SparkDataFrame`. It solves a weighted univariate regression problem under a complete order constraint. Specifically, given a set of real observed responses $y_1, \ldots, y_n$, corresponding real features $x_1, \ldots, x_n$, and optionally positive weights $w_1, \ldots, w_n$, we want to find a monotone (piecewise linear) function $f$ to minimize
 $$
@@ -770,7 +770,7 @@ head(predict(isoregModel, newDF))
 
 ### Logistic Regression Model
 
-(Coming in 2.1.0)
+(Added in 2.1.0)
 
 `spark.logit` fits a [Logistic Regression Model](https://en.wikipedia.org/wiki/Logistic_regression) against a Spark DataFrame.
@@ -871,9 +871,6 @@ summary <- summary(model)
 head(summary)
 ```
 
-#### What's More?
-We also expect Decision Tree, Random Forest, Kolmogorov-Smirnov Test coming in the next version 2.1.0.
-
 ### Model Persistence
 The following example shows how to save/load an ML model by SparkR.
 ```{r, warning=FALSE}

From e7c424a3c1fe6753b208928842874cf2208bbe27 Mon Sep 17 00:00:00 2001
From: "wm624@hotmail.com"
Date: Fri, 9 Dec 2016 16:35:32 -0800
Subject: [PATCH 3/5] simplify the document

---
 R/pkg/vignettes/sparkr-vignettes.Rmd | 86 ++++------------------------
 1 file changed, 12 insertions(+), 74 deletions(-)

diff --git a/R/pkg/vignettes/sparkr-vignettes.Rmd b/R/pkg/vignettes/sparkr-vignettes.Rmd
index 9d2a9a51e584..a5dbde0bd35f 100644
--- a/R/pkg/vignettes/sparkr-vignettes.Rmd
+++ b/R/pkg/vignettes/sparkr-vignettes.Rmd
@@ -772,86 +772,23 @@ head(predict(isoregModel, newDF))
 
 (Added in 2.1.0)
 
-`spark.logit` fits a [Logistic Regression Model](https://en.wikipedia.org/wiki/Logistic_regression) against a Spark DataFrame.
-Logistic regression is a widely-used model when the response is categorical. It can be seen as a special case of the [Generalized Linear Model](https://en.wikipedia.org/wiki/Generalized_linear_model).
-`spark.logit` can be used to predict a binary outcome by using binomial logistic regression, or it can be used to predict
-a multiclass outcome by using multinomial logistic regression. The `family` parameter can be used to select between the
-two algorithms, or leave it unset and Spark will infer the correct variant.
+[Logistic regression](https://en.wikipedia.org/wiki/Logistic_regression) is a widely-used model when the response is categorical. It can be seen as a special case of the [Generalized Linear Model](https://en.wikipedia.org/wiki/Generalized_linear_model).
+There are two types of logistic regression models, namely binomial logistic regression (i.e., response is binary) and multinomial
+logistic regression (i.e., response falls into multiple classes). We provide `spark.logit` on top of `spark.glm` to support logistic regression with advanced hyper-parameters.
+It supports both binary and multiclass classification, elastic-net regularization, and feature standardization, similar to `glmnet`.
+
+
-#### Binomial logistic regression
-
-For the binomial model, suppose the response variable takes values in
+`spark.logit` fits a Logistic Regression Model against a Spark DataFrame. The `family` parameter can be used to select between the
+binomial and multinomial algorithms, or leave it unset and Spark will infer the correct variant.
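+
+For instance, both variants can be requested explicitly (an illustrative sketch, not evaluated; `training` and `df` are the DataFrames constructed in the examples below):
+
+```{r, eval=FALSE}
+# Hypothetical explicit-family calls, for illustration only
+binomialModel <- spark.logit(training, Species ~ ., family = "binomial")
+multinomialModel <- spark.logit(df, Species ~ ., family = "multinomial")
+```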
 
-$$\mathbb{G} = \{1,2\}$$
-
-Denote
-
-$$y_i = I(g_i = 1)$$
-
-We model
-
-$$\mathbf{Pr}(\mathbf{G} = 2 | \mathbf{X} = x) = \frac{e^{\beta_0+\beta^Tx}}{1 + e^{\beta_0+\beta^Tx}}$$
-
-which can be written in the following form
-
-$$\log\left(\frac{\mathbf{Pr}(\mathbf{G}=2|X=x)}{\mathbf{Pr}(\mathbf{G}=1|X=x)}\right) = \beta_0+\beta^Tx,$$
-
-the so-called “logistic” or log-odds transformation.
-
-The objective function for penalized logistic regression uses the negative of the binomial log-likelihood, and is
-
-$$\min_{(\beta_0, \beta) \in \mathfrak{R}^{p+1}} -\left[\frac{1}{N}\sum_{i=1}^N y_i \cdot (\beta_0 + x_i^T\beta) - \log(1+e^{\beta_0+x_i^T\beta})\right] + \lambda \left[\frac{1}{2}(1-\alpha)\left\| \beta \right\|_2^2 + \alpha \left\| \beta \right\|_1\right]$$
-
-#### Multinomial logistic regression
-
-Multiclass classification is supported via multinomial logistic (softmax) regression. In multinomial logistic regression,
-the algorithm produces $K$ sets of coefficients, or a matrix of dimension $K \times J$ where $K$ is the number of outcome classes
-and $J$ is the number of features. If the algorithm is fit with an intercept term then a length $K$ vector of intercepts is available.
-
-The conditional probabilities of the outcome classes $k \in 1,2, ..., K$ are modeled using the softmax function:
-
-$$Pr(Y=k|\mathbf{X}, \beta_k, \beta_{0k}) = \frac{e^{\beta_k \cdot \mathbf{X} + \beta_{0k}}}{\sum_{k'=0}^{K-1} e^{\beta_{k'} \cdot \mathbf{X} + \beta_{0k'}}}$$
-
-We minimize the weighted negative log-likelihood, using a multinomial response model, with an elastic-net penalty to control for overfitting:
-
-$$\min_{(\beta_0, \beta) \in \mathfrak{R}^{p+1}} -\left[\frac{1}{N}\sum_{i=1}^N w_i \cdot \log Pr(Y=y_i|\mathbf{x}_i)\right] + \lambda \left[\frac{1}{2}(1-\alpha)\left\| \beta \right\|_2^2 + \alpha \left\| \beta \right\|_1\right]$$
-
-For a detailed derivation please see [here](https://en.wikipedia.org/wiki/Multinomial_logistic_regression#As_a_log-linear_model).
-
-There are several parameters `spark.logit` takes for fitting the model; an illustrative call combining several of them follows the list.
-
-* `regParam`: the regularization parameter.
-
-* `elasticNetParam`: the ElasticNet mixing parameter. For alpha = 0.0, the penalty is an L2 penalty.
-For alpha = 1.0, it is an L1 penalty. For 0.0 < alpha < 1.0, the penalty is a combination of L1 and L2.
-The default is 0.0, which is an L2 penalty.
-
-* `maxIter`: the maximum number of iterations.
-
-* `tol`: the convergence tolerance of the iterations.
-
-* `family`: the name of the family, which describes the label distribution to be used in the model. Supported options:
-"auto" (automatically select the family based on the number of classes: if the number of classes is 1 or 2, set to "binomial";
-otherwise, set to "multinomial"), "binomial" (binary logistic regression with pivoting), and "multinomial" (multinomial logistic (softmax) regression without pivoting).
-
-* `standardization`: whether to standardize the training features before fitting the model. The coefficients
-of the model are always returned on the original scale, so standardization is transparent to users. Note that with or without standardization,
-the model should always converge to the same solution when no regularization is applied. The default is TRUE, the same as glmnet.
-
-* `thresholds`: in binary classification, a threshold in the range [0, 1]. If the estimated probability of class label 1 is > threshold,
-then predict 1, else 0. A high threshold encourages the model to predict 0 more often; a low threshold encourages
-the model to predict 1 more often. Note: setting this with threshold p is equivalent to setting thresholds c(1-p, p).
-In multiclass (or binary) classification, a vector that adjusts the probability of predicting each class. The vector must have length
-equal to the number of classes, with values > 0, except that at most one value may be 0. The class with the largest value
-of p/t is predicted, where p is the original probability of that class and t is the class's threshold.
-
-* `weightCol`: the weight column name.
-
-Let us look at an artificial example.
+We use a simple example to demonstrate `spark.logit` usage. In general, there are three steps of using `spark.logit`:
+1). Create a dataframe from proper data source; 2). Fit a logistic regression model using `spark.logit` with a proper parameter setting;
+and 3). Obtain the coefficient matrix of the fitted model using `summary` and use the model for prediction with `predict`.
 
 Binomial logistic regression
 ```{r, warning=FALSE}
 df <- createDataFrame(iris)
+# Create a dataframe containing two classes
 training <- df[df$Species %in% c("versicolor", "virginica"), ]
 model <- spark.logit(training, Species ~ ., regParam = 0.5)
 summary <- summary(model)
@@ -863,9 +800,10 @@ summary <- summary(model)
 head(summary)
 ```
 
-Multinomial logistic regression
+Multinomial logistic regression against three classes
 ```{r, warning=FALSE}
 df <- createDataFrame(iris)
+# Note family = "multinomial" is optional in this case since the dataset has multiple classes.
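+# The summary below gives the coefficient matrix of the fitted model (one set of coefficients per outcome class).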
 model <- spark.logit(df, Species ~ ., regParam = 0.5)
 summary <- summary(model)
 head(summary)
 ```

From 070d7fef6959f10238f288c5f677664bb20fe984 Mon Sep 17 00:00:00 2001
From: "wm624@hotmail.com"
Date: Fri, 9 Dec 2016 17:04:36 -0800
Subject: [PATCH 4/5] minor revision

---
 R/pkg/vignettes/sparkr-vignettes.Rmd | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/R/pkg/vignettes/sparkr-vignettes.Rmd b/R/pkg/vignettes/sparkr-vignettes.Rmd
index a5dbde0bd35f..33c1bf6c61fb 100644
--- a/R/pkg/vignettes/sparkr-vignettes.Rmd
+++ b/R/pkg/vignettes/sparkr-vignettes.Rmd
@@ -782,7 +782,7 @@ It supports both binary and multiclass classification, elastic-net regularizatio
 binomial and multinomial algorithms, or leave it unset and Spark will infer the correct variant.
 
 We use a simple example to demonstrate `spark.logit` usage. In general, there are three steps of using `spark.logit`:
-1). Create a dataframe from proper data source; 2). Fit a logistic regression model using `spark.logit` with a proper parameter setting;
+1). Create a dataframe from a proper data source; 2). Fit a logistic regression model using `spark.logit` with a proper parameter setting;
 and 3). Obtain the coefficient matrix of the fitted model using `summary` and use the model for prediction with `predict`.
@@ -795,7 +795,7 @@ summary <- summary(model)
 head(summary)
 ```
 
-Fitted values on training data
+Predict values on training data
 ```{r}
 fitted <- predict(model, training)
 ```

From 5fe125f11f04d481507cae246c33bc4969c43e2e Mon Sep 17 00:00:00 2001
From: "wm624@hotmail.com"
Date: Fri, 9 Dec 2016 21:24:29 -0800
Subject: [PATCH 5/5] address review comments

---
 R/pkg/vignettes/sparkr-vignettes.Rmd | 21 +++++++--------------
 1 file changed, 7 insertions(+), 14 deletions(-)

diff --git a/R/pkg/vignettes/sparkr-vignettes.Rmd b/R/pkg/vignettes/sparkr-vignettes.Rmd
index 33c1bf6c61fb..625b759626f3 100644
--- a/R/pkg/vignettes/sparkr-vignettes.Rmd
+++ b/R/pkg/vignettes/sparkr-vignettes.Rmd
@@ -772,14 +772,9 @@ head(predict(isoregModel, newDF))
 
 (Added in 2.1.0)
 
-[Logistic regression](https://en.wikipedia.org/wiki/Logistic_regression) is a widely-used model when the response is categorical. It can be seen as a special case of the [Generalized Linear Model](https://en.wikipedia.org/wiki/Generalized_linear_model).
-There are two types of logistic regression models, namely binomial logistic regression (i.e., response is binary) and multinomial
-logistic regression (i.e., response falls into multiple classes). We provide `spark.logit` on top of `spark.glm` to support logistic regression with advanced hyper-parameters.
-It supports both binary and multiclass classification, elastic-net regularization, and feature standardization, similar to `glmnet`.
-
-
-`spark.logit` fits a Logistic Regression Model against a Spark DataFrame. The `family` parameter can be used to select between the
-binomial and multinomial algorithms, or leave it unset and Spark will infer the correct variant.
+[Logistic regression](https://en.wikipedia.org/wiki/Logistic_regression) is a widely-used model when the response is categorical. It can be seen as a special case of the [Generalized Linear Model](https://en.wikipedia.org/wiki/Generalized_linear_model).
+We provide `spark.logit` on top of `spark.glm` to support logistic regression with advanced hyper-parameters.
+It supports both binary and multiclass classification with elastic-net regularization and feature standardization, similar to `glmnet`.
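+
+To sketch the relationship with `spark.glm` (illustrative only, not evaluated; `training` is the two-class DataFrame constructed below, and the unregularized `spark.logit` call plays the role of a binomial GLM with a logit link):
+
+```{r, eval=FALSE}
+# Hypothetical comparison, for illustration only
+glmModel <- spark.glm(training, Species ~ ., family = "binomial")
+logitModel <- spark.logit(training, Species ~ .)
+```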
 
 We use a simple example to demonstrate `spark.logit` usage. In general, there are three steps of using `spark.logit`:
 1). Create a dataframe from a proper data source; 2). Fit a logistic regression model using `spark.logit` with a proper parameter setting;
 and 3). Obtain the coefficient matrix of the fitted model using `summary` and use the model for prediction with `predict`.
 
 Binomial logistic regression
 ```{r, warning=FALSE}
 df <- createDataFrame(iris)
-# Create a dataframe containing two classes
+# Create a DataFrame containing two classes
 training <- df[df$Species %in% c("versicolor", "virginica"), ]
 model <- spark.logit(training, Species ~ ., regParam = 0.5)
-summary <- summary(model)
-head(summary)
+summary(model)
 ```
 
 Predict values on training data
 ```{r}
@@ -803,10 +797,9 @@ fitted <- predict(model, training)
 ```
 
 Multinomial logistic regression against three classes
 ```{r, warning=FALSE}
 df <- createDataFrame(iris)
-# Note family = "multinomial" is optional in this case since the dataset has multiple classes.
+# Note in this case, Spark infers it is multinomial logistic regression, so family = "multinomial" is optional.
 model <- spark.logit(df, Species ~ ., regParam = 0.5)
-summary <- summary(model)
-head(summary)
+summary(model)
 ```
 
 ### Model Persistence