[SPARK-17157][SPARKR]: Add multiclass logistic regression SparkR Wrapper by wangmiao1981 · Pull Request #15365 · apache/spark

wangmiao1981 · 2016-10-05T23:41:41Z

What changes were proposed in this pull request?

As we discussed in #14818, I added a separate R wrapper spark.logit for logistic regression.

This single interface supports both binary and multinomial logistic regression. It also has "predict" and "summary" for binary logistic regression.

How was this patch tested?

New unit tests are added.

SparkQA · 2016-10-05T23:46:52Z

Test build #66405 has finished for PR 15365 at commit 5bfd132.

This patch fails R style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-10-06T00:02:04Z

Test build #66407 has finished for PR 15365 at commit 0f4f551.

This patch fails R style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-10-06T05:10:44Z

Test build #66428 has finished for PR 15365 at commit a172bb4.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

wangmiao1981 · 2016-10-06T18:07:31Z

@felixcheung When I run check-cran, I get these errors:
Error : requireNamespace("e1071", quietly = TRUE) is not TRUE
Error : requireNamespace("e1071", quietly = TRUE) is not TRUE
Error : requireNamespace("e1071", quietly = TRUE) is not TRUE
Error : requireNamespace("e1071", quietly = TRUE) is not TRUE
Error : requireNamespace("e1071", quietly = TRUE) is not TRUE

I am trying to figure out what the problem is. Any hints?

Thanks!

wangmiao1981 · 2016-10-06T18:18:16Z

From Jenkins, I saw the error:
WARNING: There was 1 warning.
NOTE: There were 3 notes.
See
'/home/jenkins/workspace/SparkPullRequestBuilder/R/SparkR.Rcheck/00check.log'
for details.

Had CRAN check errors; see logs.

How can I access the above file?

wangmiao1981 · 2016-10-06T18:40:27Z

I see the error message on local test:

LaTeX errors when creating PDF version.
This typically indicates Rd problems.

checking PDF version of manual without hyperrefs or index ... ERROR
Re-running with no redirection of stdout/stderr.
Hmm ... looks like a package
Error in texi2dvi(file = file, pdf = TRUE, clean = clean, quiet = quiet, :
pdflatex is not available
Error in texi2dvi(file = file, pdf = TRUE, clean = clean, quiet = quiet, :
pdflatex is not available
Error in running tools::texi2pdf()
You may want to clean up by 'rm -rf /var/folders/s_/83b0sgvj2kl2kwq4stvft_pm0000gn/T//RtmpXHJrOk/Rd2pdfac8961d3ab54'
DONE

vectorijk · 2016-10-06T19:29:31Z

@wangmiao1981 I saw a similar error on Jenkins and have the same question.
Regarding e1071, I think we only need to install that package locally.

wangmiao1981 · 2016-10-06T20:58:15Z

@vectorijk Thanks for the information! I installed e1071 and a TeX package. I just want to find out what causes the error.

wangmiao1981 · 2016-10-06T21:36:29Z

My local tests passed.

Jenkins, retest this please.

wangmiao1981 · 2016-10-06T22:04:10Z

re-test please

SparkQA · 2016-10-06T23:46:38Z

Test build #66468 has finished for PR 15365 at commit 0811fc3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wangmiao1981 · 2016-10-07T02:07:57Z

@felixcheung I fixed the cran errors. It is ready to review now. Thanks!

wangmiao1981 · 2016-10-07T17:26:34Z

cc @sethah @yanboliang

felixcheung · 2016-10-07T18:48:21Z

R/pkg/R/mllib.R

why is suppressWarnings needed here?

I followed other test examples. If it is not necessary, I can remove all of them as a followup JIRA.

Not for all data frame - it is only necessary for data frame with column names containing ., eg. iris. So not for this case.

I see. Thanks for your explanation!

so let's remove this suppressWarnings here.

Done. It has been removed.

suppressWarnings still here?
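
As a rough illustration of the point above about column names containing `.`, the following base-R sketch checks whether a local data.frame has such names. The helper name `has_dotted_names` is hypothetical and not part of SparkR; it only inspects names locally and does not reproduce SparkR's own conversion or warning.

```r
# Hypothetical helper (not part of SparkR): report whether a local
# data.frame has any column name containing a literal ".".
has_dotted_names <- function(df) {
  any(grepl(".", names(df), fixed = TRUE))
}

# iris has names such as Sepal.Length, so this is the case discussed above.
has_dotted_names(iris)

# A data.frame with plain names is not affected.
has_dotted_names(data.frame(x = 1:3, y = 4:6))
```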

felixcheung · 2016-10-07T18:49:39Z

It would be great to get some feedback on the name spark.logit
What do folks think about it?

wangmiao1981 · 2016-10-07T22:40:50Z

For the name, as we previously discussed, we can't use glm because the interface would change and glm only supports binomial logistic regression. We don't use glmnet because the current spark.logit only provides logistic regression, which is a subset of glmnet.

felixcheung · 2016-10-07T23:27:02Z

Sure- I recall that discussion.

wangmiao1981 · 2016-10-08T02:41:02Z

I just picked the name for simplicity. I hope to receive feedback from the community so I can make changes accordingly. Thanks!

SparkQA · 2016-10-08T05:01:13Z

Test build #66563 has finished for PR 15365 at commit 1921221.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vectorijk · 2016-10-08T06:39:55Z

R/pkg/R/mllib.R

this group of links could be sorted

I will make changes when we agree on the name. Thanks!

vectorijk · 2016-10-08T06:40:09Z

R/pkg/R/mllib.R

felixcheung · 2016-10-09T18:41:53Z

R/pkg/R/mllib.R

roxygen2 is going to trim all the whitespace; you will need to add formatting with \item and so on

felixcheung · 2016-10-09T18:46:02Z

R/pkg/R/mllib.R

this might be confusing, in particular with parameter matching in R, should this be thresholds when length == 1 for binary and > 1 for multiclass?

I don't quite understand this comment. There is a JIRA SPARK-11543 discussing the expected relationship, but it is not implemented yet. So, we keep both threshold and thresholds on Scala side.

I'm suggesting we take one variant of the name in R, at least. It seems to me that's along the same line of SPARK-11543, where thresholds takes priority

This is significant because -

parameter matching in R

> f <- function(thresholds = NULL) { cat(thresholds) }
> f(threshold = "A")
A

everything is a vector in R

> a <- 1
> length(a)
[1] 1
> is.vector(a)
[1] TRUE
> a <- c(1)
> length(a)
[1] 1
> a <- c(1, 2)
> length(a)
[1] 2
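
To make the suggestion above concrete, here is a minimal base-R sketch of a single `thresholds` argument whose length decides between the binary and multiclass cases. The function `interpret_thresholds` and its return shape are hypothetical and are not the code in this PR; the sketch only shows how R's partial argument matching and vector lengths interact.

```r
# Hypothetical helper, not the implementation in this PR: interpret a
# single `thresholds` argument by its length.
interpret_thresholds <- function(thresholds = 0.5) {
  stopifnot(is.numeric(thresholds), length(thresholds) >= 1)
  if (length(thresholds) == 1) {
    list(mode = "binary", values = thresholds)
  } else {
    list(mode = "multiclass", values = thresholds)
  }
}

# Partial matching: the argument name `threshold` resolves to `thresholds`.
binary <- interpret_thresholds(threshold = 0.3)
binary$mode

# A longer vector is treated as the multiclass case.
multi <- interpret_thresholds(thresholds = c(0.2, 0.3, 0.5))
multi$mode
```

Because `threshold` silently matches `thresholds`, exposing both names in one R function would be ambiguous, which is the concern raised in the comment.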

felixcheung · 2016-10-09T18:58:17Z

R/pkg/R/mllib.R

do you have a more R-like representation for (Array(1-p, p))?

felixcheung · 2016-10-09T18:59:53Z

R/pkg/R/mllib.R

Array is not often used in R, could we handle vector?

Modified to c(p, 1-p)

felixcheung · 2016-10-09T19:02:21Z

R/pkg/R/mllib.R

same here for suppressWarnings

felixcheung · 2016-10-09T19:05:58Z

R/pkg/R/mllib.R

add @aliases

felixcheung · 2016-10-09T19:07:17Z

R/pkg/R/mllib.R

instead of return NULL, perhaps we should stop() with a message?

Refactored the code. It now calls stop() if the model is loaded, and the if statements in other places are removed.
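
The difference between returning NULL and calling stop() can be sketched in base R as follows. The names `get_summary` and `is_loaded`, and the error message text, are hypothetical and chosen only for illustration; they are not taken from the PR's code.

```r
# Hypothetical sketch: fail loudly with a message instead of returning NULL
# when a summary is not available for a loaded model.
get_summary <- function(is_loaded) {
  if (is_loaded) {
    stop("Loaded model has no training summary.")
  }
  list(status = "ok")
}

# Callers can still recover by catching the error and reading its message.
msg <- tryCatch(
  {
    get_summary(is_loaded = TRUE)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
msg
```

With stop(), a caller sees an explicit error where the problem occurs, rather than a NULL that fails later somewhere else.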

felixcheung · 2016-10-09T19:10:25Z

mllib/src/main/scala/org/apache/spark/ml/r/LogisticRegressionWrapper.scala

shouldn't set feature col - see checkDataColumns

felixcheung

How is this related to spark.glm(family = "binomial")?

felixcheung · 2016-10-09T19:31:33Z

R/pkg/R/mllib.R

should it support probabilityCol

probabilityCol is added. But collect(select(predict(model, df), "probability")) returns
1 <environment: 0x7fd2af79dd08>
2 <environment: 0x7fd2af654068>
3 <environment: 0x7fd2af659d58>
4 <environment: 0x7fd2af62aff8>
5 <environment: 0x7fd2af630ce8>
It is because each Row is a Vector in the scala side. Any suggestions?

perhaps turning that into a list/array on the Scala side so it becomes list on the R side?

Let me check whether it is feasible to do that, because the trait is shared by other algorithms.

@felixcheung I can't find a native method to convert the Vector to an Array when using select on the R side.
Alternatively, I can add a method getProbability in the Scala wrapper, which converts the Rows of Vectors to a DataFrame. Then, on the R side, we can call that method in the summary method. At the same time, we would remove probabilityCol from the function definition.
What do you think?

SparkQA · 2016-10-13T00:43:17Z

Test build #66857 has finished for PR 15365 at commit 08babe5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2016-10-18T18:52:01Z

R/pkg/R/mllib.R

you were using weightCol, shouldn't this be probabilityCol?

SparkQA · 2016-10-20T22:45:57Z

Test build #67293 has finished for PR 15365 at commit a222de7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wangmiao1981 · 2016-10-24T18:34:30Z

ping @felixcheung

felixcheung · 2016-10-24T22:16:38Z

re: #15365 (comment)
This is definitely an issue as Vector/SparseVector/DenseVector is mapped to environment in R, and within the DataFrame, environment is hard to operate with.

It might likely be how serde/type mapping is handling this but from my digging so far I haven't pinpointed where and how we fix this. It could be separated into another JIRA though.

I'll look at the rest of this PR today.

wangmiao1981 · 2016-10-24T22:29:46Z

Thanks for your response! I will do some research on how to bring Vectors back in my spare time.

felixcheung · 2016-10-25T05:18:11Z

R/pkg/R/mllib.R

if this is supposed to be numeric, could you change it to 0.0 or 1.0 consistently throughout this comment?

Changed to 0.0 and 1.0 in the comments.

felixcheung · 2016-10-25T05:21:20Z

R/pkg/R/mllib.R

suppressWarnings still here?

felixcheung · 2016-10-25T05:24:52Z

R/pkg/R/mllib.R

This is not the right alias..

suppressWarnings is removed. For the aliases, I followed spark.kmeans. Are there any specific rules?

this is the summary method, not spark.logit method?

Oh, I see. I looked at the wrong line number.

fixed. Thanks!

felixcheung · 2016-10-25T05:25:41Z

R/pkg/R/mllib.R

are there reasons these are all in 2 lines?

I thought it could exceed the line limit. Now, I changed it back to 1 line.

felixcheung · 2016-10-25T05:26:59Z

R/pkg/inst/tests/testthat/test_mllib.R

I would update the example to use thresholds instead

Updated tests and example.

felixcheung · 2016-10-25T05:27:07Z

R/pkg/inst/tests/testthat/test_mllib.R

ditto thresholds

felixcheung · 2016-10-25T05:29:47Z

R/pkg/R/mllib.R

is there supposed to be a numClasses parameter?

There is no numClasses parameter. The algorithm infers the number of classes. I changed the wording to "number of classes" to avoid confusion.

felixcheung · 2016-10-25T05:31:43Z

mllib/src/main/scala/org/apache/spark/ml/r/LogisticRegressionWrapper.scala

consider another val for logisticRegressionModel.summary.asInstanceOf[BinaryLogisticRegressionSummary] to avoid duplication

Added a new val blrSummary.

SparkQA · 2016-10-25T21:15:45Z

Test build #67526 has finished for PR 15365 at commit d0452ae.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-10-25T22:10:29Z

Test build #67531 has finished for PR 15365 at commit 031cf9b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2016-10-26T04:38:45Z

LGTM.

Let's see if anyone has any other comments.

Could you open a JIRA on Vector/SparseVector/DenseVector?

wangmiao1981 · 2016-10-26T05:39:11Z

Sure. I will do it. Thanks!

felixcheung · 2016-10-26T23:16:00Z

merged to master.

## What changes were proposed in this pull request?

As we discussed in apache#14818, I added a separate R wrapper spark.logit for logistic regression.

This single interface supports both binary and multinomial logistic regression. It also has "predict" and "summary" for binary logistic regression.

## How was this patch tested?

New unit tests are added.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes apache#15365 from wangmiao1981/glm.

wangmiao1981 mentioned this pull request Oct 5, 2016

[SPARK-17157][SPARKR][WIP]: Add multiclass logistic regression SparkR Wrapper #14818

Closed

felixcheung reviewed Oct 7, 2016

wangmiao1981 force-pushed the glm branch from 0811fc3 to 1921221 Compare October 8, 2016 04:00

vectorijk reviewed Oct 8, 2016

felixcheung reviewed Oct 9, 2016

felixcheung reviewed Oct 9, 2016

felixcheung requested changes Oct 9, 2016

felixcheung reviewed Oct 9, 2016

wangmiao1981 force-pushed the glm branch from 1921221 to 08babe5 Compare October 12, 2016 23:40

felixcheung reviewed Oct 18, 2016

felixcheung requested changes Oct 25, 2016

wangmiao1981 added 9 commits October 25, 2016 10:47

add spark.logit

1264b4c

add unit tests

e264d6d

fix R style

b341d77

fix R style issue

63a3ac2

fix cran warning

c9e1000

remove redudant function call

0b54f46

address review comments

e2ca496

address review comments

558dc20

address review comments

d0452ae

wangmiao1981 force-pushed the glm branch from a222de7 to d0452ae Compare October 25, 2016 20:09

fix aliases for summary

031cf9b

asfgit closed this in 29cea8f Oct 26, 2016
