Conversation

@wangmiao1981 (Contributor)

What changes were proposed in this pull request?

`spark.logit` was added in 2.1. We need to update the SparkR vignettes to reflect the changes. This is part of the SparkR QA work.

How was this patch tested?

Manually built the HTML. Please see the attached image for the result.
![test](https://cloud.githubusercontent.com/assets/5033592/21032237/01b565fe-bd5d-11e6-8b59-4de4b6ef611d.jpeg)

@wangmiao1981 (Contributor, Author)

cc @felixcheung

@SparkQA

SparkQA commented Dec 9, 2016

Test build #69892 has finished for PR 16222 at commit 0333b4f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • Multiclass classification is supported via multinomial logistic (softmax) regression. In multinomial logistic regression,
  • In multiclass (or binary) classification to adjust the probability of predicting each class. Array must have length

@SparkQA

SparkQA commented Dec 9, 2016

Test build #69895 has finished for PR 16222 at commit fc175ff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr (Contributor)

mengxr commented Dec 9, 2016

@wangmiao1981

  • Do we need to explain what "logistic regression" is? Maintenance could be easier if we only provide pointers.
  • We shouldn't make the vignettes repeat the content in the API doc.

@wangmiao1981 (Contributor, Author)

@mengxr

  1. Yes, I can add explanation of what "logistic regression" is.

  2. "We shouldn't make the vignettes repeat the content in the API doc." Do you mean removing the parameter explanation? I saw other algorithms have such explanations. Any suggestions on replacing them?

Thanks!

@mengxr (Contributor)

mengxr commented Dec 9, 2016

@wangmiao1981 Sorry, I was actually suggesting removing the math part. Those are standard logistic regression formulations, which could be found in many other places. We don't really need to repeat them here. Just providing some pointers should be sufficient.

The second half overlaps with the API doc. That would cause maintenance overhead, e.g., keeping the content in sync. Take dplyr for example: the reference manual contains the API doc, while the vignettes are more tutorial-like. We should move in this direction and have less duplicated content in our code base. For this one in particular, I suggest a quick introduction followed by some example code.

@wangmiao1981 (Contributor, Author)

Got it! I will make a new pass. Thanks!

@mengxr (Contributor) left a comment

made one pass


(Added in 2.1.0)

[Logistic regression](https://en.wikipedia.org/wiki/Logistic_regression) is a widely-used model when the response is categorical. It can be seen as a special case of the [Generalized Linear Model](https://en.wikipedia.org/wiki/Generalized_linear_model).
@mengxr:

model -> predictive model

There are two types of logistic regression models, namely binomial logistic regression (i.e., response is binary) and multinomial
@mengxr:

This sentence is not very necessary, which is basically explaining what logistic regression models are.

logistic regression (i.e., response falls into multiple classes). We provide `spark.logit` on top of `spark.glm` to support logistic regression with advanced hyper-parameters.
It supports both binary and multiclass classification, elastic-net regularization, and feature standardization, similar to `glmnet`.
@mengxr:

to be consistent with the text, we can say "both binomial and multinomial logistic regression models with elastic-net regularization and feature standardization, similar ..."



`spark.logit` fits a logistic regression model against a Spark DataFrame. The `family` parameter can be used to select between the
@mengxr:

This paragraph is not necessary. We can continue with the examples directly.

Binomial logistic regression
df <- createDataFrame(iris)
# Create a dataframe containing two classes
@mengxr:

dataframe -> DataFrame

training <- df[df$Species %in% c("versicolor", "virginica"), ]
model <- spark.logit(training, Species ~ ., regParam = 0.5)
summary <- summary(model)
@mengxr:

Unfortunately, we didn't implement print.summary. If summary(model) is still somewhat human-readable, we should use it. Once we implement print.summary, we won't need to change the code here.

model <- spark.logit(training, Species ~ ., regParam = 0.5)
@mengxr:

Just curious, did you check whether regParam = 0.5 returns a good model or not?

@wangmiao1981 (Author):

I changed the test into an example. I didn't check whether regParam = 0.5 returns a good model or not. I can do some experiments to check.

df <- createDataFrame(iris)
# Note family = "multinomial" is optional in this case since the dataset has multiple classes.
model <- spark.logit(df, Species ~ ., regParam = 0.5)
summary <- summary(model)
@mengxr:

ditto

Multinomial logistic regression against three classes
# Note family = "multinomial" is optional in this case since the dataset has multiple classes.
@mengxr:

This reads like family = "binomial" is required if the dataset has only two classes.

@SparkQA

SparkQA commented Dec 10, 2016

Test build #69947 has finished for PR 16222 at commit e7c424a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 10, 2016

Test build #69948 has finished for PR 16222 at commit 070d7fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangmiao1981 (Contributor, Author)

wangmiao1981 commented Dec 10, 2016

@mengxr

I used the following R code and glmnet to check whether regParam = 0.5 fits a good model. The dataset is the same as the example in the document.

```r
iris2 <- iris[iris$Species %in% c("versicolor", "virginica"), ]
iris.x <- as.matrix(iris2[, 1:4])
iris.y <- as.factor(as.character(iris2[, 5]))
cvfit <- cv.glmnet(iris.x, iris.y, family = "binomial", type.measure = "class")
cvfit$lambda.min
# [1] 0.000423808
```

Multinomial:

```r
iris.x <- as.matrix(iris[, 1:4])
iris.y <- as.factor(as.character(iris[, 5]))
cvfit <- cv.glmnet(iris.x, iris.y, family = "multinomial", type.measure = "class")
cvfit$lambda.min
# [1] 0.05618186
```

If I understand correctly, regParam = 0.5 doesn't fit a good model in either the binomial or the multinomial case, since the cross-validated optimal lambda is below 0.1 in both.
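As a quick follow-up, one way to see the effect on fit quality directly is to compare training accuracy at a small lambda against lambda = 0.5 using plain glmnet (a hedged sketch, assuming the glmnet package is installed; the s = 0.001 value is illustrative, not the exact lambda.min above):

```r
library(glmnet)

# Same two-class subset of iris as in the vignette example
iris2 <- iris[iris$Species %in% c("versicolor", "virginica"), ]
x <- as.matrix(iris2[, 1:4])
y <- as.factor(as.character(iris2[, 5]))

fit <- glmnet(x, y, family = "binomial")

# predict() interpolates along the fitted lambda path for values of s
# that were not in the original path
acc <- function(s) mean(predict(fit, x, s = s, type = "class") == as.character(y))
acc(0.001)  # lightly regularized, close to the cross-validated optimum
acc(0.5)    # the regParam value currently used in the vignette
```

A large gap between the two accuracies would confirm that 0.5 over-regularizes on this dataset.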

@SparkQA

SparkQA commented Dec 10, 2016

Test build #69954 has finished for PR 16222 at commit 5fe125f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Dec 13, 2016
## What changes were proposed in this pull request?
spark.logit is added in 2.1. We need to update spark-vignettes to reflect the changes. This is part of SparkR QA work.

## How was this patch tested?

Manual build html. Please see attached image for the result.
![test](https://cloud.githubusercontent.com/assets/5033592/21032237/01b565fe-bd5d-11e6-8b59-4de4b6ef611d.jpeg)

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16222 from wangmiao1981/veg.

(cherry picked from commit 2aa16d0)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
@mengxr (Contributor)

mengxr commented Dec 13, 2016

LGTM. Merged into master and branch-2.1. I will change the regParam value in a follow-up PR.

@asfgit asfgit closed this in 2aa16d0 Dec 13, 2016
asfgit pushed a commit that referenced this pull request Dec 15, 2016
## What changes were proposed in this pull request?

While doing the QA work, I found the following issues:

1). `spark.mlp` doesn't include an example;
2). `spark.mlp` and `spark.lda` have redundant parameter explanations;
3). `spark.lda` document misses default values for some parameters.

I also changed the `spark.logit` regParam in the examples, as we discussed in #16222.

## How was this patch tested?

Manual test

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16284 from wangmiao1981/ks.

(cherry picked from commit 3243885)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
asfgit pushed a commit that referenced this pull request Dec 15, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017