Conversation

@wangmiao1981 (Contributor)

What changes were proposed in this pull request?

`spark.logit` was added in 2.1. We need to update the SparkR vignettes to reflect the changes. This is part of the SparkR QA work.

How was this patch tested?

Manually built the HTML. Please see the attached image for the result.
![test](https://cloud.githubusercontent.com/assets/5033592/21032237/01b565fe-bd5d-11e6-8b59-4de4b6ef611d.jpeg)

@wangmiao1981 (Contributor, Author)

cc @felixcheung

@SparkQA

SparkQA commented Dec 9, 2016

Test build #69892 has finished for PR 16222 at commit 0333b4f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • Multiclass classification is supported via multinomial logistic (softmax) regression. In multinomial logistic regression,
  • In multiclass (or binary) classification to adjust the probability of predicting each class. Array must have length

@SparkQA

SparkQA commented Dec 9, 2016

Test build #69895 has finished for PR 16222 at commit fc175ff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr (Contributor)

mengxr commented Dec 9, 2016

@wangmiao1981

  • Do we need to explain what "logistic regression" is? Maintenance could be easier if we only provide pointers.
  • We shouldn't make the vignettes repeat the content in the API doc.

@wangmiao1981 (Contributor, Author)

@mengxr

  1. Yes, I can add explanation of what "logistic regression" is.

  2. "We shouldn't make the vignettes repeat the content in the API doc." Do you mean removing the parameter explanation? I saw other algorithms have such explanations. Any suggestions on replacing them?

Thanks!

@mengxr (Contributor)

mengxr commented Dec 9, 2016

@wangmiao1981 Sorry, I was actually suggesting removing the math part. Those are standard logistic regression formulations, which could be found in many other places. We don't really need to repeat them here. Just providing some pointers should be sufficient.

The second half overlaps with the API doc. That would cause maintenance overhead, e.g., keeping the content in sync. Take dplyr for example: the reference manual contains the API doc, while the vignettes are more tutorial-like. We should move in this direction and have less duplicated content in our code base. For this one in particular, I suggest a quick introduction followed by some example code.

@wangmiao1981 (Contributor, Author)

Got it! I will make a new pass. Thanks!

@mengxr (Contributor) left a comment

made one pass


(Added in 2.1.0)

[Logistic regression](https://en.wikipedia.org/wiki/Logistic_regression) is a widely-used model when the response is categorical. It can be seen as a special case of the [Generalized Linear Model](https://en.wikipedia.org/wiki/Generalized_linear_model).
@mengxr:

model -> predictive model

There are two types of logistic regression models, namely binomial logistic regression (i.e., response is binary) and multinomial
@mengxr:

This sentence is not very necessary, which is basically explaining what logistic regression models are.

logistic regression (i.e., response falls into multiple classes). We provide `spark.logit` on top of `spark.glm` to support logistic regression with advanced hyper-parameters.
It supports both binary and multiclass classification, elastic-net regularization, and feature standardization, similar to `glmnet`.
@mengxr:

to be consistent with the text, we can say "both binomial and multinomial logistic regression models with elastic-net regularization and feature standardization, similar ..."



`spark.logit` fits a logistic regression model against a Spark DataFrame. The `family` parameter can be used to select between the
@mengxr:

This paragraph is not necessary. We can continue with the examples directly.

Binomial logistic regression
df <- createDataFrame(iris)
# Create a dataframe containing two classes
@mengxr:

dataframe -> DataFrame

training <- df[df$Species %in% c("versicolor", "virginica"), ]
model <- spark.logit(training, Species ~ ., regParam = 0.5)
summary <- summary(model)
@mengxr:

Unfortunately, we didn't implement print.summary. If summary(model) is still somewhat human-readable, we should use it. Once we implement print.summary, we won't need to change the code here.

model <- spark.logit(training, Species ~ ., regParam = 0.5)
@mengxr:

Just curious, did you check whether regParam = 0.5 returns a good model or not?

@wangmiao1981 (Author):

I changed the test into an example. I didn't check whether regParam = 0.5 returns a good model or not. I can do some experiments to check.

df <- createDataFrame(iris)
# Note family = "multinomial" is optional in this case since the dataset has multiple classes.
model <- spark.logit(df, Species ~ ., regParam = 0.5)
summary <- summary(model)
@mengxr:

ditto

Multinomial logistic regression against three classes
# Note family = "multinomial" is optional in this case since the dataset has multiple classes.
@mengxr:

This reads like family = "binomial" is required if the dataset has only two classes.

@SparkQA

SparkQA commented Dec 10, 2016

Test build #69947 has finished for PR 16222 at commit e7c424a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 10, 2016

Test build #69948 has finished for PR 16222 at commit 070d7fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangmiao1981 (Contributor, Author)

wangmiao1981 commented Dec 10, 2016

@mengxr

I used the following R code and glmnet to check whether regParam = 0.5 fits a good model. The dataset is the same as the example in the document.

```r
iris2 <- iris[iris$Species %in% c("versicolor", "virginica"), ]
iris.x <- as.matrix(iris2[, 1:4])
iris.y <- as.factor(as.character(iris2[, 5]))
cvfit <- cv.glmnet(iris.x, iris.y, family = "binomial", type.measure = "class")
cvfit$lambda.min
# [1] 0.000423808
```

Multinomial:

```r
iris.x <- as.matrix(iris[, 1:4])
iris.y <- as.factor(as.character(iris[, 5]))
cvfit <- cv.glmnet(iris.x, iris.y, family = "multinomial", type.measure = "class")
cvfit$lambda.min
# [1] 0.05618186
```

If I understand correctly, regParam = 0.5 doesn't fit a good model in either the binomial or the multinomial case, since the cross-validated optimal lambda is below 0.1 in both.
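As a quick follow-up, one way to see the effect on fit quality directly is to compare training accuracy at a small lambda against lambda = 0.5 using plain glmnet (a hedged sketch, assuming the glmnet package is installed; the s = 0.001 value is illustrative, not the exact lambda.min above):

```r
library(glmnet)

# Same two-class subset of iris as in the vignette example
iris2 <- iris[iris$Species %in% c("versicolor", "virginica"), ]
x <- as.matrix(iris2[, 1:4])
y <- as.factor(as.character(iris2[, 5]))

fit <- glmnet(x, y, family = "binomial")

# predict() interpolates along the fitted lambda path for values of s
# that were not in the original path
acc <- function(s) mean(predict(fit, x, s = s, type = "class") == as.character(y))
acc(0.001)  # lightly regularized, close to the cross-validated optimum
acc(0.5)    # the regParam value currently used in the vignette
```

A large gap between the two accuracies would confirm that 0.5 over-regularizes on this dataset.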

@SparkQA

SparkQA commented Dec 10, 2016

Test build #69954 has finished for PR 16222 at commit 5fe125f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Dec 13, 2016
## What changes were proposed in this pull request?
spark.logit is added in 2.1. We need to update spark-vignettes to reflect the changes. This is part of SparkR QA work.

## How was this patch tested?

Manual build html. Please see attached image for the result.
![test](https://cloud.githubusercontent.com/assets/5033592/21032237/01b565fe-bd5d-11e6-8b59-4de4b6ef611d.jpeg)

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16222 from wangmiao1981/veg.

(cherry picked from commit 2aa16d0)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
@mengxr (Contributor)

mengxr commented Dec 13, 2016

LGTM. Merged into master and branch-2.1. I will change the regParam value in a follow-up PR.

@asfgit asfgit closed this in 2aa16d0 Dec 13, 2016
asfgit pushed a commit that referenced this pull request Dec 15, 2016
## What changes were proposed in this pull request?

While doing the QA work, I found the following issues:

1). `spark.mlp` doesn't include an example;
2). `spark.mlp` and `spark.lda` have redundant parameter explanations;
3). `spark.lda` document misses default values for some parameters.

I also changed the `spark.logit` regParam in the examples, as we discussed in #16222.

## How was this patch tested?

Manual test

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16284 from wangmiao1981/ks.

(cherry picked from commit 3243885)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
asfgit pushed a commit that referenced this pull request Dec 15, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017