[SPARK-17157][SPARKR]: Add multiclass logistic regression SparkR Wrapper #15365
wangmiao1981 wants to merge 10 commits into apache:master from
|
Test build #66405 has finished for PR 15365 at commit
|
|
Test build #66407 has finished for PR 15365 at commit
|
|
Test build #66428 has finished for PR 15365 at commit
|
|
@felixcheung When I run check-cran, there are errors. I am trying to figure out what the problem is. Any hints? Thanks! |
|
From Jenkins, I saw the error: Had CRAN check errors; see logs. How can I access the above file? |
|
I see the error message on local test: LaTeX errors when creating PDF version.
|
|
@wangmiao1981 I saw a similar error on Jenkins and have the same question as you. |
|
@vectorijk Thanks for the information! I installed e1071 and the TeX package. I just want to find what causes the error. |
|
My local tests passed. Jenkins, retest this please. |
|
re-test please |
|
Test build #66468 has finished for PR 15365 at commit
|
|
@felixcheung I fixed the cran errors. It is ready to review now. Thanks! |
R/pkg/R/mllib.R
Outdated
why is suppressWarnings needed here?
I followed other test examples. If it is not necessary, I can remove all of them as a followup JIRA.
Not for all data frames. It is only necessary for data frames with column names containing ., e.g. iris. So not for this case.
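A minimal base-R sketch of the point above, not SparkR's actual code. Here sanitize_names is a hypothetical stand-in for the column renaming that happens when a local data frame is converted, and it is assumed to replace . with _ and emit a warning. It shows why iris triggers warnings while a data frame without dotted names does not.

```r
# Hypothetical stand-in for the renaming step; not part of SparkR.
sanitize_names <- function(df) {
  old <- names(df)
  new <- gsub(".", "_", old, fixed = TRUE)
  changed <- old != new
  if (any(changed)) {
    warning("Renamed columns containing '.': ",
            paste(old[changed], collapse = ", "))
  }
  names(df) <- new
  df
}

# iris has dotted names such as Sepal.Length, so the warning fires
# and a test would wrap the call in suppressWarnings().
iris_clean <- suppressWarnings(sanitize_names(iris))

# A data frame without dots produces no warning, so no wrapper is needed.
plain <- sanitize_names(data.frame(x = 1:3, label = c("a", "b", "a")))
```

Under these assumptions, suppressWarnings only earns its place around the iris call.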
I see. Thanks for your explanation!
so let's remove this suppressWarnings here.
Done. It has been removed.
suppressWarnings still here?
|
It would be great to get some feedback on the name |
|
For the name, as we previously discussed, we can't use glm because the interface would change and glm only supports binomial logistic regression. We don't use glmnet because the current spark.logit only provides logistic regression, which is a subset of glmnet. |
|
Sure- I recall that discussion. |
|
I just picked the name for simplicity. I hope to receive feedback from the community so I can make changes accordingly. Thanks! |
|
Test build #66563 has finished for PR 15365 at commit
|
R/pkg/R/mllib.R
Outdated
this group of links could be sorted
I will make changes when we agree on the name. Thanks!
R/pkg/R/mllib.R
Outdated
R/pkg/R/mllib.R
Outdated
roxygen2 is going to trim all the whitespace, so you will need to add formatting with \item and so on
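For illustration, a roxygen2 comment that keeps its list structure with \itemize and \item might look like the sketch below. The function name fit_example and the parameter text are hypothetical, not the wording used in this PR.

```r
#' Fit a hypothetical example model
#'
#' @param family the name of the model family. Supported options:
#' \itemize{
#'   \item auto: select the family from the number of classes.
#'   \item binomial: binary logistic regression.
#'   \item multinomial: multinomial (softmax) logistic regression.
#' }
#' @return the chosen family name
fit_example <- function(family = c("auto", "binomial", "multinomial")) {
  # match.arg() returns the first choice when the default vector is used.
  match.arg(family)
}
```

Indentation alone would be collapsed in the rendered help page, while the \item entries survive as separate bullets.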
R/pkg/R/mllib.R
Outdated
this might be confusing, in particular with parameter matching in R, should this be thresholds when length == 1 for binary and > 1 for multiclass?
I don't quite understand this comment. There is a JIRA SPARK-11543 discussing the expected relationship, but it is not implemented yet. So, we keep both threshold and thresholds on Scala side.
I'm suggesting we take one variant of the name in R, at least. It seems to me that's along the same line of SPARK-11543, where thresholds takes priority
This is significant because -
- parameter matching in R
> f <- function(thresholds = NULL) { cat(thresholds) }
> f(threshold = "A")
A
- everything is a vector in R
> a <- 1
> length(a)
[1] 1
> is.vector(a)
[1] TRUE
> a <- c(1)
> length(a)
[1] 1
> a <- c(1, 2)
> length(a)
[1] 2
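Following the two R behaviours shown above, one way to accept a single thresholds argument is to branch on its length. This is only a sketch: describe_thresholds is a hypothetical helper, and the length rule (1 for binary, more than 1 for multiclass) is the suggestion from this thread, not necessarily the merged implementation.

```r
# Hypothetical helper: interpret one `thresholds` argument by its length.
describe_thresholds <- function(thresholds = 0.5) {
  stopifnot(is.numeric(thresholds), length(thresholds) >= 1)
  if (length(thresholds) == 1) {
    list(mode = "binary", values = thresholds)
  } else {
    list(mode = "multiclass", values = thresholds)
  }
}

# A scalar is already a length-1 vector, so both calls take the binary branch.
binary_a <- describe_thresholds(0.3)
binary_b <- describe_thresholds(c(0.3))

# Partial matching lets a caller write `threshold`, and R still binds it
# to the `thresholds` formal.
partial <- describe_thresholds(threshold = 0.7)

multi <- describe_thresholds(c(0.2, 0.3, 0.5))
```

Exposing only one of the two names avoids the case where threshold silently matches thresholds.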
R/pkg/R/mllib.R
Outdated
do you have a more R-like representation for (Array(1-p, p))?
R/pkg/R/mllib.R
Outdated
Array is not often used in R, could we handle vector?
Modified to c(p, 1-p)
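As a sketch of the relationship being documented, assume the Scala-side convention quoted earlier in this thread, where a binary threshold p corresponds to thresholds Array(1-p, p). As I understand the Spark ML documentation, the scalar is then recovered as 1 / (1 + thresholds(0) / thresholds(1)). The helper names below are hypothetical.

```r
# Hypothetical helpers; names are for illustration only.
to_thresholds <- function(p) c(1 - p, p)

to_threshold <- function(t) {
  stopifnot(length(t) == 2)
  1 / (1 + t[1] / t[2])
}

p <- 0.3
round_trip <- to_threshold(to_thresholds(p))  # recovers p, up to rounding

# Swapping the two elements yields 1 - p instead, so the order matters.
swapped <- to_threshold(c(p, 1 - p))
```

Whichever order the R documentation settles on, it needs to match what the wrapper passes to the Scala side.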
R/pkg/R/mllib.R
Outdated
same here for suppressWarnings
R/pkg/R/mllib.R
Outdated
R/pkg/R/mllib.R
Outdated
instead of return NULL, perhaps we should stop() with a message?
Refactored the code: it now calls stop() if the model is loaded, and the if statements in other places are removed.
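A small sketch of the suggestion above, failing with stop() and a message instead of returning NULL. The model object and its is_loaded field are hypothetical. The real wrapper may track this state differently.

```r
# Hypothetical summary helper: refuse to summarise a model that was loaded
# from disk, instead of silently returning NULL.
summarize_model <- function(model) {
  if (isTRUE(model$is_loaded)) {
    stop("Loaded models do not support summary().")
  }
  list(coefficients = model$coefficients)
}

fresh <- list(is_loaded = FALSE, coefficients = c(0.1, -0.2))
loaded <- list(is_loaded = TRUE, coefficients = c(0.1, -0.2))

ok <- summarize_model(fresh)

# The caller sees an explicit error message rather than a NULL result.
msg <- tryCatch(summarize_model(loaded),
                error = function(e) conditionMessage(e))
```

With a single stop() at the entry point, downstream code no longer needs its own NULL checks.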
shouldn't set feature col - see checkDataColumns
felixcheung
left a comment
How is this related to spark.glm(family = "binomial")?
R/pkg/R/mllib.R
Outdated
should it support probabilityCol
probabilityCol is added. But collect(select(predict(model, df), "probability")) returns
1 <environment: 0x7fd2af79dd08>
2 <environment: 0x7fd2af654068>
3 <environment: 0x7fd2af659d58>
4 <environment: 0x7fd2af62aff8>
5 <environment: 0x7fd2af630ce8>
This is because each Row is a Vector on the Scala side. Any suggestions?
perhaps turning that into a list/array on the Scala side so it becomes list on the R side?
Let me check whether it is feasible to do that, because the trait is shared by other algorithms.
@felixcheung I didn't find a native method to convert the Vector to an Array when using select on the R side.
Optionally, I can add a method getProbability in the Scala wrapper, which converts the Rows of Vectors to a DataFrame. Then, on the R side, we can call that method in the summary method. At the same time, we remove probabilityCol from the function definition.
What do you think?
|
Test build #66857 has finished for PR 15365 at commit
|
R/pkg/R/mllib.R
Outdated
you were using weightCol, shouldn't this be probabilityCol?
|
Test build #67293 has finished for PR 15365 at commit
|
|
ping @felixcheung |
|
re: #15365 (comment) It is likely down to how serde/type mapping handles this, but from my digging so far I haven't pinpointed where and how we would fix it. It could be separated into another JIRA though. I'll look at the rest of this PR today. |
|
Thanks for your response! I will do some research on how to bring Vectors back in my spare time. |
R/pkg/R/mllib.R
Outdated
if this is supposed to be numeric, could you change it to 0.0 or 1.0 consistently throughout this comment?
Changed to 0.0 and 1.0 in the comments.
R/pkg/R/mllib.R
Outdated
suppressWarnings still here?
R/pkg/R/mllib.R
Outdated
This is not the right alias..
suppressWarnings is removed. For the aliases, I followed spark.kmeans. Are there any specific rules?
this is the summary method, not spark.logit method?
Oh, I see. I looked at the wrong line number.
fixed. Thanks!
R/pkg/R/mllib.R
Outdated
are there reasons these are all in 2 lines?
I thought it could exceed the line limit. Now I have changed it back to 1 line.
I would update the example to use thresholds instead
Updated tests and example.
R/pkg/R/mllib.R
Outdated
is there supposed to be a numClasses parameter?
There is no numClasses parameter. The algorithm infers the number of classes. I changed the wording to "number of classes" to avoid confusion.
consider another val for logisticRegressionModel.summary.asInstanceOf[BinaryLogisticRegressionSummary] to avoid duplication
Added a new val blrSummary.
|
Test build #67526 has finished for PR 15365 at commit
|
|
Test build #67531 has finished for PR 15365 at commit
|
|
LGTM. Let's see if anyone has any other comments. Could you open a JIRA on Vector/SparseVector/DenseVector? |
|
Sure. I will do it. Thanks! |
|
merged to master. |
## What changes were proposed in this pull request?
As we discussed in apache#14818, I added a separate R wrapper spark.logit for logistic regression.
This single interface supports both binary and multinomial logistic regression. It also has "predict" and "summary" for binary logistic regression.
## How was this patch tested?
New unit tests are added.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes apache#15365 from wangmiao1981/glm.
What changes were proposed in this pull request?
As we discussed in #14818, I added a separate R wrapper spark.logit for logistic regression.
This single interface supports both binary and multinomial logistic regression. It also has "predict" and "summary" for binary logistic regression.
How was this patch tested?
New unit tests are added.