Skip to content

[SPARK-17157][SPARKR]: Add multiclass logistic regression SparkR Wrapper#15365

Closed
wangmiao1981 wants to merge 10 commits intoapache:masterfrom
wangmiao1981:glm
Closed

[SPARK-17157][SPARKR]: Add multiclass logistic regression SparkR Wrapper#15365
wangmiao1981 wants to merge 10 commits intoapache:masterfrom
wangmiao1981:glm

Conversation

@wangmiao1981
Copy link
Contributor

What changes were proposed in this pull request?

As we discussed in #14818, I added a separate R wrapper spark.logit for logistic regression.

This single interface supports both binary and multinomial logistic regression. It also has "predict" and "summary" for binary logistic regression.

How was this patch tested?

New unit tests are added.

@SparkQA
Copy link

SparkQA commented Oct 5, 2016

Test build #66405 has finished for PR 15365 at commit 5bfd132.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 6, 2016

Test build #66407 has finished for PR 15365 at commit 0f4f551.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 6, 2016

Test build #66428 has finished for PR 15365 at commit a172bb4.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangmiao1981
Copy link
Contributor Author

@felixcheung When run check-cran, there are errors:
Error : requireNamespace("e1071", quietly = TRUE) is not TRUE
Error : requireNamespace("e1071", quietly = TRUE) is not TRUE
Error : requireNamespace("e1071", quietly = TRUE) is not TRUE
Error : requireNamespace("e1071", quietly = TRUE) is not TRUE
Error : requireNamespace("e1071", quietly = TRUE) is not TRUE

I am trying to figure out what is the problem. Any hints?

Thanks!

@wangmiao1981
Copy link
Contributor Author

From Jekins, I saw the error:
WARNING: There was 1 warning.
NOTE: There were 3 notes.
See
'/home/jenkins/workspace/SparkPullRequestBuilder/R/SparkR.Rcheck/00check.log'
for details.

Had CRAN check errors; see logs.

How can I access the above file?

@wangmiao1981
Copy link
Contributor Author

I see the error message on local test:

LaTeX errors when creating PDF version.
This typically indicates Rd problems.

  • checking PDF version of manual without hyperrefs or index ... ERROR
    Re-running with no redirection of stdout/stderr.
    Hmm ... looks like a package
    Error in texi2dvi(file = file, pdf = TRUE, clean = clean, quiet = quiet, :
    pdflatex is not available
    Error in texi2dvi(file = file, pdf = TRUE, clean = clean, quiet = quiet, :
    pdflatex is not available
    Error in running tools::texi2pdf()
    You may want to clean up by 'rm -rf /var/folders/s_/83b0sgvj2kl2kwq4stvft_pm0000gn/T//RtmpXHJrOk/Rd2pdfac8961d3ab54'
  • DONE

@vectorijk
Copy link
Contributor

@wangmiao1981 I saw the similar error on Jekin. Same with question with you.
Regarding to e1071, I think we only need to install that package locally.

@wangmiao1981
Copy link
Contributor Author

@vectorijk Thanks for your information! I installed e1071 and installed tex package. I just want to find what causes the error.

@wangmiao1981
Copy link
Contributor Author

My local tests passed.

Jenkins, retest this please.

@wangmiao1981
Copy link
Contributor Author

re-test please

@SparkQA
Copy link

SparkQA commented Oct 6, 2016

Test build #66468 has finished for PR 15365 at commit 0811fc3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangmiao1981
Copy link
Contributor Author

@felixcheung I fixed the cran errors. It is ready to review now. Thanks!

@wangmiao1981
Copy link
Contributor Author

cc @sethah @yanboliang

R/pkg/R/mllib.R Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is suppressWarnings needed here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I followed other test examples. If it is not necessary, I can remove all of them as a followup JIRA.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for all data frame - it is only necessary for data frame with column names containing ., eg. iris. So not for this case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Thanks for your explanation!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so let's remove this suppressWarnings here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. It has been removed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suppressWarnings still here?

@felixcheung
Copy link
Member

It would be great to get some feedback on the name spark.logit
What do folks think about it?

@wangmiao1981
Copy link
Contributor Author

For the name, as we previously discussed, we can't use glm as interface changes and glm only support binominal logistic regression. We don't use glmnet because current spark.logit only provides logistic regressions which are subset of glmnet.

@felixcheung
Copy link
Member

Sure- I recall that discussion.

@wangmiao1981
Copy link
Contributor Author

I just pick the name for simplicity. Hope to receive feedback from the community and I can make changes accordingly. Thanks!

@SparkQA
Copy link

SparkQA commented Oct 8, 2016

Test build #66563 has finished for PR 15365 at commit 1921221.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

R/pkg/R/mllib.R Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this group of links could be sorted

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will make changes when we agree on the name. Thanks!

R/pkg/R/mllib.R Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here.

R/pkg/R/mllib.R Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

roxygen2 is going to trim all the whitespaces, you will need to add formatting with \item and so on

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

R/pkg/R/mllib.R Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might be confusing, in particular with parameter matching in R, should this be thresholds when length == 1 for binary and > 1 for multiclass?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand this comment. There is a JIRA SPARK-11543 discussing the expected relationship, but it is not implemented yet. So, we keep both threshold and thresholds on Scala side.

Copy link
Member

@felixcheung felixcheung Oct 13, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm suggesting we take one variant of the name in R, at least. It seems to me that's along the same line of SPARK-11543, where thresholds takes priority

This is significant because -

  1. parameter matching in R
> f <- function(thresholds = NULL) { cat(thresholds) }
> f(threshold = "A")
A
  1. everything is a vector in R
> a <- 1
> length(a)
[1] 1
> is.vector(a)
[1] TRUE
> a <- c(1)
> length(a)
[1] 1
> a <- c(1, 2)
> length(a)
[1] 2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

R/pkg/R/mllib.R Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you have a more R-like representation for (Array(1-p, p))?

R/pkg/R/mllib.R Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Array is not often used in R, could we handle vector?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified to c(p, 1-p)

R/pkg/R/mllib.R Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here for suppressWarnings

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

R/pkg/R/mllib.R Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add @aliases

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

R/pkg/R/mllib.R Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of return NULL, perhaps we should stop() with a message?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored the code. Stop() if it is loaded and remove if statements in other places.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't set feature col - see checkDataColumns

Copy link
Member

@felixcheung felixcheung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this related to spark.glm(family = "binomial")?

R/pkg/R/mllib.R Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it support probabilityCol

Copy link
Contributor Author

@wangmiao1981 wangmiao1981 Oct 12, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probabilityCol is added. But collect(select(predict(model, df), "probability")) returns
1 <environment: 0x7fd2af79dd08>
2 <environment: 0x7fd2af654068>
3 <environment: 0x7fd2af659d58>
4 <environment: 0x7fd2af62aff8>
5 <environment: 0x7fd2af630ce8>
It is because each Row is a Vector in the scala side. Any suggestions?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps turning that into a list/array on the Scala side so it becomes list on the R side?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me check whether it is feasible to do that, because the trait is shared by other algorithms.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@felixcheung I don't find a native method to convert the Vector to Array when using select on R side.
Optionally, I can add a method getProbability in scala wrapper, which converts the Rows of Vectors to a dataframe. Then, on R side, we can call that method in summary method. At the same time, we remove the probabilityCol from the function definition.
What do you think?

@SparkQA
Copy link

SparkQA commented Oct 13, 2016

Test build #66857 has finished for PR 15365 at commit 08babe5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

R/pkg/R/mllib.R Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you were using weightCol, shouldn't this be probabilityCol?

@SparkQA
Copy link

SparkQA commented Oct 20, 2016

Test build #67293 has finished for PR 15365 at commit a222de7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangmiao1981
Copy link
Contributor Author

ping @felixcheung

@felixcheung
Copy link
Member

re: #15365 (comment)
This is definitely an issue as Vector/SparseVector/DenseVector is mapped to environment in R, and within the DataFrame, environment is hard to operate with.

It might likely be how serde/type mapping is handling this but from my digging so far I haven't pinpointed where and how we fix this. It could be separated into another JIRA though.

I'll look at the rest of this PR today.

@wangmiao1981
Copy link
Contributor Author

Thanks for your response! I will do some research on how to bring Vectors back at spare time.

R/pkg/R/mllib.R Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this is supposed to be numeric, could you change it to 0.0 or 1.0 consistently throughout this comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to 0.0 and 1.0 in the comments.

R/pkg/R/mllib.R Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suppressWarnings still here?

R/pkg/R/mllib.R Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the right alias..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suppressWarnings is removed. For the aliases, I followed spark.kmeans. Are there any specific rules?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the summary method, not spark.logit method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh. I see. I looked the wrong line number.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed. Thanks!

R/pkg/R/mllib.R Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are there reasons these are all in 2 lines?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it could exceed the the line limit. Now, I changed it back to 1 line.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would update the example to use thresholds instead

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated tests and example.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto thresholds

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

R/pkg/R/mllib.R Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there supposed to be a numClasses parameter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no numClasses parameter. The algorithm infers the number of classes. I changed to number of classes to avoid confusion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider another val for logisticRegressionModel.summary.asInstanceOf[BinaryLogisticRegressionSummary] to avoid duplication

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a new val blrSummary

@SparkQA
Copy link

SparkQA commented Oct 25, 2016

Test build #67526 has finished for PR 15365 at commit d0452ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 25, 2016

Test build #67531 has finished for PR 15365 at commit 031cf9b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Copy link
Member

LGTM.

Let's see if anyone has any other comments.

Could you open a JIRA on Vector/SparseVector/DenseVector?

@wangmiao1981
Copy link
Contributor Author

Sure. I will do it. Thanks!

@asfgit asfgit closed this in 29cea8f Oct 26, 2016
@felixcheung
Copy link
Member

merged to master.

robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016
## What changes were proposed in this pull request?

As we discussed in apache#14818, I added a separate R wrapper spark.logit for logistic regression.

This single interface supports both binary and multinomial logistic regression. It also has "predict" and "summary" for binary logistic regression.

## How was this patch tested?

New unit tests are added.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes apache#15365 from wangmiao1981/glm.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?

As we discussed in apache#14818, I added a separate R wrapper spark.logit for logistic regression.

This single interface supports both binary and multinomial logistic regression. It also has "predict" and "summary" for binary logistic regression.

## How was this patch tested?

New unit tests are added.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes apache#15365 from wangmiao1981/glm.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants