[SPARK-17499][SparkR][ML][MLLib] make the default params in sparkR spark.mlp consistent with MultilayerPerceptronClassifier #15051
Conversation
Test build #65226 has finished for PR 15051 at commit

Force-pushed from ef572f7 to 8a87b86.

Test build #65228 has finished for PR 15051 at commit
R/pkg/R/mllib.R (outdated):
```diff
 setMethod("spark.mlp", signature(data = "SparkDataFrame"),
-          function(data, blockSize = 128, layers = c(3, 5, 2), solver = "l-bfgs", maxIter = 100,
-                   tol = 0.5, stepSize = 1, seed = 1) {
+          function(data, blockSize = 128, layers, solver = "l-bfgs", maxIter = 100,
```
I think the preference would be to have layers = c() - it helps to show that it should be a vector of potentially multiple values
It's also better not to make an argument required in the middle -- i.e. if we want to make layers a required argument then we should move it before blockSize.
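To see why (a toy sketch, not SparkR code):

```r
# With a required argument after optional ones, a natural positional call misbinds:
f <- function(data, blockSize = 128, layers) blockSize
f("df", c(4, 5, 3))
# [1] 4 5 3  -- the intended layers vector silently landed in blockSize
```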
@shivaram yeah... but I think changing the parameter order may break API compatibility?
@felixcheung but with layers = c(), the c() is an invalid value for the layers parameter... so I think it is better not to specify a default value for layers, so the user must specify this parameter.
If the goal is to require layers to have a value (I didn't realize this from the PR description), then we should have layers as the 2nd parameter (after data) without any default value. API compatibility shouldn't be an issue since this is new in Spark 2.1.0 (i.e. there hasn't been a release yet).

We should also make sure that when layers is later coerced to an array, its values are coerced into integers:

```r
> a <- list(1, 2, "a")
> as.integer(a)
[1]  1  2 NA
Warning message:
NAs introduced by coercion
```
All right, I will add the layers parameter validation check and move this parameter to the front. Thanks!

Also, the layers parameter must be set because it determines the structure of this classifier. If we do not set layers, or set it to an empty vector c(), training will throw an exception; you can check the input validation test in MultilayerPerceptronClassifierSuite.
thanks - could you add some tests that use these default values? (esp. layers as NULL)
R/pkg/R/mllib.R (outdated):
```diff
-          function(data, blockSize = 128, layers = c(3, 5, 2), solver = "l-bfgs", maxIter = 100,
-                   tol = 0.5, stepSize = 1, seed = 1) {
+          function(data, blockSize = 128, layers, solver = "l-bfgs", maxIter = 100,
+                   tol = 1E-6, stepSize = 0.03, seed = -763139545) {
```
It doesn't look like seed defaults to this value? Could you point out where that is specified?
Oh, the default seed uses ClassName.hashCode, so here it is "org.apache.spark.ml.classification.MultilayerPerceptronClassifier".hashCode(), which equals -763139545.
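A quick way to sanity-check that value from R (illustrative only; it emulates Java's String.hashCode, assuming an ASCII class name and 32-bit wraparound):

```r
# Recompute Java's String.hashCode in R: h = 31*h + c over each character
java_string_hashcode <- function(s) {
  h <- 0
  for (code in utf8ToInt(s)) {
    h <- (h * 31 + code) %% 2^32  # 32-bit unsigned wraparound
  }
  if (h >= 2^31) h <- h - 2^32    # reinterpret as a signed 32-bit integer
  h
}
java_string_hashcode("org.apache.spark.ml.classification.MultilayerPerceptronClassifier")
# -763139545, per the value quoted above
```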
hmm, this seems rather fragile? do you think there's another way to do this?
Yeah, it is a problem. Now I'm considering a better way:

- give the seed parameter a default value of NULL
- in MultilayerPerceptronClassifierWrapper.fit, add a null check for the seed parameter
- if it is null, do not call MultilayerPerceptronClassifier.setSeed, so the default seed is used automatically

What do you think about it?
Yeah that sounds fine.
Test build #65327 has finished for PR 15051 at commit
R/pkg/R/mllib.R (outdated):
```r
function(data, layers, blockSize = 128, solver = "l-bfgs", maxIter = 100,
         tol = 1E-6, stepSize = 0.03, seed = 0x7FFFFFFF) {
  if (length(layers) <= 1) stop("layers vector require length > 0.")
  for (i in 1 : length(layers)) {
```
do something like any(sapply(layers, function(e) !is.numeric(e))) instead
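For example (illustrative):

```r
# any() over sapply() replaces the explicit loop
layers <- list(4, 5, "a")  # a list keeps the mixed types visible
any(sapply(layers, function(e) !is.numeric(e)))
# [1] TRUE  -- the non-numeric entry fails the check
any(sapply(c(4, 5, 3), function(e) !is.numeric(e)))
# [1] FALSE
```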
Test build #65371 has finished for PR 15051 at commit

Test build #65373 has finished for PR 15051 at commit

LGTM.
R/pkg/R/mllib.R (outdated):
Perhaps as a small style nit, even though it is not flagged by lint-r: you might want to style this with brackets,

```r
if (...) {
}
```

like in here.
R/pkg/R/mllib.R (outdated):
same here
could you update the tests and add more tests for default values as discussed here
Force-pushed from 99d0e0c to ce2c2f7.
@felixcheung Now I added some tests using the default parameters, and compared the output predictions with the results generated by Scala-side code.

Test build #65542 has finished for PR 15051 at commit
R/pkg/R/mllib.R (outdated):
```r
if (length(layers) <= 1) {
  stop("layers vector require length > 0.")
}
if (any(sapply(layers, function(e) !is.numeric(e)))) {
```
just double checking - should layers be integer or numeric?
layers should be integer, but in R it seems we can't distinguish an integer vector from a numeric vector? For layers <- c(1, 2) and layers <- c(1.0, 2.0), is.integer(layers[i]) returns FALSE in both cases and as.integer(layers) succeeds for both, so is there a good way to check that it is an integer vector and not a numeric vector?
You can use numToInt from https://github.com/apache/spark/blob/master/R/pkg/R/utils.R#L368 -- it'll print a warning if it's not an integer.
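The gist of that helper (a sketch only; the actual numToInt in R/pkg/R/utils.R may differ in detail):

```r
# Sketch: warn-and-coerce using the as.integer(x) != x trick discussed below
numToInt <- function(num) {
  if (isTRUE(as.integer(num) != num)) {
    warning("coercing a non-integer numeric to integer")
  }
  as.integer(num)
}

numToInt(3)    # 3L, silently
numToInt(3.7)  # 3L, with a warning
```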
Oh, it's a clever way, using as.integer(x) != x to check whether it is an integer. Here the MLP requires layers to be an integer vector; is it better to force the user to pass an integer vector (calling stop if not), or to just print a warning?
It's because 1 by itself is actually a numeric value, whereas 1L is an integer:

```r
> is.integer(1)
[1] FALSE
> is.integer(1L)
[1] TRUE
```
@felixcheung Now I've updated the code to use as.integer(x) != x to check whether layers is an integer vector, and it works fine. Thanks!
just a question above and this: would 0x7FFFFFFF be a good placeholder value - is it possible to set seed to this in Scala?
@felixcheung

Test build #65558 has finished for PR 15051 at commit
R/pkg/R/mllib.R (outdated):
```r
if (length(layers) <= 1) {
  stop("layers vector require length > 0.")
}
if (any(sapply(layers, function(e) as.integer(e) != e))) {
```
This way layers = c(1.0, 2.0) would pass the as.integer(e) != e test.

One possible issue is with how we are handling this on the Scala side. Since we are passing it as as.array(layers), this could end up as double in the JVM - would it handle that correctly?

There are other ways to do this, but generally coercing to integer is a reasonable approach. One alternative implementation is this:

```r
layers <- as.integer(na.omit(layers))
if (length(layers) <= 1) {
  stop("layers vector must be integer values with length > 0.")
}
```

Please add some tests for this.
Well, we don't have a good solution for 64-bit integers in R.

@felixcheung For the seed default value, I currently use "", not NULL, because the R-side args serializer cannot handle NULL. And I fixed the other problem. Thanks!
You can look for examples of expect_error.
@felixcheung Negative test added, thanks!

Test build #65573 has finished for PR 15051 at commit
Test build #65577 has finished for PR 15051 at commit
```r
expect_equal(head(mlpPredictions$prediction, 10), c(1, 1, 1, 1, 0, 1, 2, 2, 1, 0))

# Test illegal parameter
expect_error(spark.mlp(df, layers = NULL))
```
it would be preferred to add the error string to check for (the message in stop()) - please see our other tests on how to do this.
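For instance, something along these lines (the expected string here is illustrative and must match the actual stop() message in spark.mlp):

```r
# Hypothetical: the pattern must match the message raised by spark.mlp's stop()
expect_error(spark.mlp(df, layers = NULL),
             "layers must be a integer vector with length > 1")
```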
R/pkg/R/mllib.R (outdated):
```diff
 setMethod("spark.mlp", signature(data = "SparkDataFrame"),
           function(data, layers, blockSize = 128, solver = "l-bfgs", maxIter = 100,
-                   tol = 1E-6, stepSize = 0.03, seed = 0x7FFFFFFF) {
+                   tol = 1E-6, stepSize = 0.03, seed = "") {
```
this is a bit unusual in R, and could also cause confusion as to what should be passed in (string? integer?) - how about we default to NULL and then check inside spark.mlp to map a NULL seed to "" (only that)?

Also, if seed is not NULL, it should be coerced into an integer first to be safe, since you might get a double or other stuff: as.character(as.integer(seed)).
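A minimal sketch of that mapping (placement and comments are illustrative, not the final patch):

```r
# Inside spark.mlp, before handing seed to the JVM wrapper (sketch):
if (is.null(seed)) {
  seed <- ""                              # "" signals the Scala wrapper to keep its default seed
} else {
  seed <- as.character(as.integer(seed))  # coerce to a 32-bit integer, pass as string
}
```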
We should have the R spark.mlp take an integer seed parameter but convert it into a string before passing it to the JVM.
@felixcheung
Oh... here is the problem: if the parameter is passed in as an integer, an integer exceeding the 32-bit INT_MAX may turn into a numeric - for example 1000000000000000L becomes 1e+15.
So how do we resolve this problem properly?
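This is easy to reproduce (illustrative):

```r
# R integer literals are 32-bit; out-of-range values fall back to numeric
is.integer(2147483647L)        # TRUE  -- fits in 32 bits (INT_MAX)
is.integer(1000000000000000L)  # FALSE -- parsed as the numeric 1e+15, with a parser warning
```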
@felixcheung OK, let's just discard 64-bit integers and only support 32-bit seeds in SparkR.
I think that's ok. We have similar restrictions in other cases.
```diff
 df <- read.df("data/mllib/sample_multiclass_classification_data.txt", source = "libsvm")
 model <- spark.mlp(df, blockSize = 128, layers = c(4, 5, 4, 3), solver = "l-bfgs", maxIter = 100,
-                   tol = 0.5, stepSize = 1, seed = 1)
+                   tol = 0.5, stepSize = 1, seed = "1")
```
ditto - this should still be integer
Test build #65596 has finished for PR 15051 at commit

could you add a test for passing seed and a test for not passing seed - I'm not sure if we have a good way to check for a specific result (likely there is, since we choose the seed value)

LGTM - waiting for test results. Thanks!

Test build #65718 has finished for PR 15051 at commit
What changes were proposed in this pull request?

- Update the MultilayerPerceptronClassifierWrapper.fit parameter types: layers: Array[Int], seed: String.
- Update several default params in SparkR spark.mlp: tol -> 1e-6, stepSize -> 0.03, seed -> NULL (when seed == NULL, the Scala-side wrapper regards it as a null value and the default seed is used).
- The R-side seed only supports 32-bit integers.
- Remove the layers default value, and move it in front of the parameters that have default values.
- Add a layers parameter validation check.

How was this patch tested?

Tests added.
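For reference, a hypothetical call under the new API (the data path comes from the test shown earlier; defaults are as listed above):

```r
# layers is now required and follows data directly; everything else has defaults
df <- read.df("data/mllib/sample_multiclass_classification_data.txt", source = "libsvm")
model <- spark.mlp(df, layers = c(4, 5, 4, 3))                  # tol = 1e-6, stepSize = 0.03, seed = NULL
modelSeeded <- spark.mlp(df, layers = c(4, 5, 4, 3), seed = 1)  # explicit 32-bit seed for reproducibility
```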