Conversation

@WeichenXu123
Contributor

@WeichenXu123 WeichenXu123 commented Sep 11, 2016

What changes were proposed in this pull request?

Update MultilayerPerceptronClassifierWrapper.fit parameter types:
layers: Array[Int]
seed: String

Update several default params in SparkR spark.mlp:
tol --> 1e-6
stepSize --> 0.03
seed --> NULL (when seed is NULL, the Scala-side wrapper treats it as unset and the default seed is used)
The R-side seed only supports 32-bit integers.

Remove the layers default value, and move layers in front of the parameters that have defaults.
Add a layers parameter validation check.
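As a usage sketch of the resulting API (illustrative only; the dataset path follows the example used later in this PR's tests):

```r
# Sketch of calling the updated spark.mlp API (SparkR session assumed).
# layers is now required and comes right after data; tol/stepSize/seed
# take the new defaults when omitted.
df <- read.df("data/mllib/sample_multiclass_classification_data.txt", source = "libsvm")

# seed is omitted (default NULL), so the Scala side uses its own default seed.
model <- spark.mlp(df, layers = c(4, 5, 4, 3))
predictions <- predict(model, df)
```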

How was this patch tested?

Tests added.

@SparkQA

SparkQA commented Sep 11, 2016

Test build #65226 has finished for PR 15051 at commit ef572f7.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 11, 2016

Test build #65228 has finished for PR 15051 at commit 8a87b86.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

R/pkg/R/mllib.R Outdated
 setMethod("spark.mlp", signature(data = "SparkDataFrame"),
-          function(data, blockSize = 128, layers = c(3, 5, 2), solver = "l-bfgs", maxIter = 100,
-                   tol = 0.5, stepSize = 1, seed = 1) {
+          function(data, blockSize = 128, layers, solver = "l-bfgs", maxIter = 100,
Member

I think the preference would be to have layers = c() - it helps to show that it should be a vector of potentially multiple values

Contributor

Its also better to not make an argument required in the middle -- i.e. if we want to make layers a required argument then we should move it before blockSize

Contributor Author

@shivaram yeah.. but I think changing the parameter order may break API compatibility?

Contributor Author

@felixcheung but with layers = c(), c() is an invalid value for the layers parameter... so I think it is better not to give layers a default value, so the user must specify this parameter.

Member

@felixcheung felixcheung Sep 12, 2016

If the goal is to require layers to have a value (I didn't realize this from our PR description), then we should have layers as the 2nd parameter (after data) without any default value. API compatibility shouldn't be an issue since this is new in Spark 2.1.0 (i.e. there hasn't been a release yet).

We should also make sure when layers is later coerced to array that its values are coerced into integer?

> a <- list(1, 2, "a")
> as.integer(a)
[1]  1  2 NA
Warning message:
NAs introduced by coercion

Contributor Author

All right, I will add a layers parameter validation check and move the parameter to the front. Thanks!

Also, the layers parameter must be set because it determines the structure of the classifier.
If we do not set layers, or set it to an empty vector c(), training will throw an exception...
You can check the MultilayerPerceptronClassifierSuite input validation test.

@felixcheung
Member

thanks - could you add some tests that use these default values? (esp. layers as NULL)

R/pkg/R/mllib.R Outdated
-          function(data, blockSize = 128, layers = c(3, 5, 2), solver = "l-bfgs", maxIter = 100,
-                   tol = 0.5, stepSize = 1, seed = 1) {
+          function(data, blockSize = 128, layers, solver = "l-bfgs", maxIter = 100,
+                   tol = 1E-6, stepSize = 0.03, seed = -763139545) {
Member

doesn't look like seed defaults to this value? could you point out where that is specified?

Contributor Author

Oh, the default seed uses ClassName.hashCode, so here it is "org.apache.spark.ml.classification.MultilayerPerceptronClassifier".hashCode(), which equals -763139545.

Member

hmm, this seems rather fragile? do you think there's another way to do this?

Contributor Author

Yeah, it is a problem.
Now I'm considering a better way:

give the seed parameter a default value of NULL;
add a null check for the seed parameter in MultilayerPerceptronClassifierWrapper.fit;
if it is null, do not call MultilayerPerceptronClassifier.setSeed, so it will automatically use the default seed.

What do you think?

Contributor

Yeah that sounds fine.

@SparkQA

SparkQA commented Sep 13, 2016

Test build #65327 has finished for PR 15051 at commit 50bf5e8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

R/pkg/R/mllib.R Outdated
function(data, layers, blockSize = 128, solver = "l-bfgs", maxIter = 100,
tol = 1E-6, stepSize = 0.03, seed = 0x7FFFFFFF) {
if (length(layers) <= 1) stop("layers vector require length > 0.")
for (i in 1 : length(layers)) {
Member

do something like any(sapply(layers, function(e) !is.numeric(e))) instead

@SparkQA

SparkQA commented Sep 14, 2016

Test build #65371 has finished for PR 15051 at commit c3e9bf3.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123 WeichenXu123 changed the title [SPARK-17499][ML][MLLib] make the default params in sparkR spark.mlp consistent with MultilayerPerceptronClassifier [SPARK-17499][SparkR][ML][MLLib] make the default params in sparkR spark.mlp consistent with MultilayerPerceptronClassifier Sep 14, 2016
@SparkQA

SparkQA commented Sep 14, 2016

Test build #65373 has finished for PR 15051 at commit 99d0e0c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@thunterdb
Contributor

LGTM.

R/pkg/R/mllib.R Outdated
Member

Perhaps as a small style nit, even though it is not flagged by lint-r:
you might want to style this with brackets

if (...) {
}

like in here

R/pkg/R/mllib.R Outdated
Member

same here

@felixcheung
Member

could you update the tests and add more tests for default values as discussed here

@WeichenXu123
Contributor Author

@felixcheung Now I've added some tests using default parameters and compared the output predictions with the results generated by the Scala-side code.
Thanks!

@SparkQA

SparkQA commented Sep 17, 2016

Test build #65542 has finished for PR 15051 at commit ce2c2f7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

R/pkg/R/mllib.R Outdated
if (length(layers) <= 1) {
stop("layers vector require length > 0.")
}
if (any(sapply(layers, function(e) !is.numeric(e)))) {
Member

@felixcheung felixcheung Sep 17, 2016

just double checking - should layers be integer or numeric?

Contributor Author

layers should be integer, but in R it seems we can't distinguish a numeric vector from an integer vector?
For layers <- c(1, 2) or layers <- c(1.0, 2.0), is.integer(layers[i]) returns FALSE in both cases, and as.integer(layers) succeeds in both cases,
so is there a good way to check that it is an integer vector and not a numeric vector?
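The behavior in question can be seen in a plain R session (standard R semantics, a sketch):

```r
layers <- c(1, 2)          # numeric (double) vector: 1 and 2 are double literals
is.integer(layers)         # FALSE
is.integer(c(1L, 2L))      # TRUE -- the L suffix makes integer literals

# A "whole number" check that works for both forms compares against the
# coerced value instead of testing the storage type:
all(layers == as.integer(layers))          # TRUE  for c(1, 2)
all(c(1.5, 2) == as.integer(c(1.5, 2)))    # FALSE for c(1.5, 2)
```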

Contributor

You can use numToInt from https://github.com/apache/spark/blob/master/R/pkg/R/utils.R#L368 -- it'll print a warning if it's not an integer

Contributor Author

Oh, it's a clever way: using as.integer(x) != x to check whether it is an integer.
Here mlp requires layers to be an integer vector;
is it better to force the user to pass an integer vector (calling stop if not), or to just print a warning?

Member

it's because 1 by itself is actually a numeric value, whereas 1L is integer

> is.integer(1)
[1] FALSE
> is.integer(1L)
[1] TRUE

Contributor Author

@felixcheung Now I've updated the code to use as.integer(x) != x to check whether layers is an integer vector; it works fine. Thanks!

@felixcheung
Member

felixcheung commented Sep 17, 2016

just a question above and this: would 0x7FFFFFFF be a good placeholder value - is it possible to set seed to this in Scala?

@WeichenXu123
Contributor Author

@felixcheung
Yeah, in fact 0x7FFFFFFF is not ideal because it is itself also a valid seed.
And there is another problem: in Scala, seed is a long, but on the R side there is no long type, so the seed value range on the R side is already smaller than on the Scala side.
But I think that is a minor problem, because an int-range seed is large enough.

@SparkQA

SparkQA commented Sep 18, 2016

Test build #65558 has finished for PR 15051 at commit 160caf1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

R/pkg/R/mllib.R Outdated
if (length(layers) <= 1) {
stop("layers vector require length > 0.")
}
if (any(sapply(layers, function(e) as.integer(e) != e))) {
Member

@felixcheung felixcheung Sep 18, 2016

This way layers = c(1.0, 2.0) would pass the as.integer(e) != e test.
One possible issue is with how we are handling this on the Scala side. Since we are passing it as as.array(layers), this could end up as double in JVM - would it handle that correctly?

There are other ways to do this but generally coercing to integer is a reasonable approach.

One alternative implementation is this:

layers <- as.integer(na.omit(layers))
if (length(layers) <= 1) {
    stop("layers vector must be integer values with length > 0.")
}

Please add some test for this

@felixcheung
Member

well, we don't have a good solution for 64-bit integers in R.
one possible approach is to pass seed as a string to the Scala wrapper - this way we could at least take the full 32-bit value of seed, use NULL as the default value in the R function signature, and let the Scala side set the default value

@WeichenXu123
Contributor Author

WeichenXu123 commented Sep 18, 2016

@felixcheung
Now I've updated the Scala-side wrapper arg types as follows:
layers: Array[Int],
seed: String

The seed default value is currently "", not NULL, because the R-side argument serializer cannot handle an as.character(NULL) object correctly.

I also changed the layers parameter validation to:

layers <- as.integer(na.omit(layers))
if (length(layers) <= 1) {
    stop("layers vector must be integer values with length > 1.")
}

Note that the layers array must satisfy length > 1, not length > 0.

Another question: how do I test R code that contains a stop() call? I want to test a piece of R code and make sure it indeed calls stop.

thanks!
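For the testing question above, testthat's expect_error can assert that a call raises a stop() error (a sketch using the validation logic from this comment; the standalone function name is hypothetical):

```r
library(testthat)

# Hypothetical standalone version of the layers validation described above.
validate_layers <- function(layers) {
  layers <- as.integer(na.omit(layers))
  if (length(layers) <= 1) {
    stop("layers vector must be integer values with length > 1.")
  }
  layers
}

test_that("invalid layers stop with an error", {
  expect_error(validate_layers(NULL), "length > 1")   # empty input
  expect_error(validate_layers(c(3)), "length > 1")   # single layer
  expect_equal(validate_layers(c(4, 5, 3)), c(4L, 5L, 3L))
})
```

The second argument to expect_error is a regular expression matched against the error message, which covers the "check for the message in stop()" request made later in this review.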

@felixcheung
Member

felixcheung commented Sep 18, 2016 via email

@WeichenXu123
Contributor Author

@felixcheung negative test added, thanks!

@SparkQA

SparkQA commented Sep 18, 2016

Test build #65573 has finished for PR 15051 at commit 8952a36.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 18, 2016

Test build #65577 has finished for PR 15051 at commit 36cccb6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

expect_equal(head(mlpPredictions$prediction, 10), c(1, 1, 1, 1, 0, 1, 2, 2, 1, 0))

# Test illegal parameter
expect_error(spark.mlp(df, layers = NULL))
Member

it would be preferred to add the error string to check for (the message in stop()) - please see our other tests on how to do this.

R/pkg/R/mllib.R Outdated
 setMethod("spark.mlp", signature(data = "SparkDataFrame"),
           function(data, layers, blockSize = 128, solver = "l-bfgs", maxIter = 100,
-                   tol = 1E-6, stepSize = 0.03, seed = 0x7FFFFFFF) {
+                   tol = 1E-6, stepSize = 0.03, seed = "") {
Member

@felixcheung felixcheung Sep 19, 2016

this is a bit unusual in R, and could also cause confusion as to what should be passed in (string? integer?) - how about we default to NULL and then check inside spark.mlp to map a NULL seed to "" (only that)?

also, if seed is not NULL it should be coerced into an integer first to be safe, since you might get a double or other values (as.character(as.integer(seed)))
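A minimal sketch of the proposed mapping (assumed helper, not the exact PR code):

```r
# Map the R-side seed to the string the Scala wrapper expects:
# NULL -> "" (meaning "use the Scala-side default"), otherwise coerce
# through integer so doubles like 4.0 become "4".
seed_to_string <- function(seed) {
  if (is.null(seed)) {
    ""
  } else {
    as.character(as.integer(seed))
  }
}

seed_to_string(NULL)  # ""
seed_to_string(123)   # "123"
seed_to_string(4.0)   # "4"
```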

Member

@felixcheung felixcheung Sep 19, 2016

we should have the R spark.mlp take an integer seed parameter but convert it into a string before passing it to the JVM.

Contributor Author

@felixcheung
Oh... here is the problem: if the parameter is passed in as an integer, values exceeding the 32-bit int max may turn into numeric; for example, 1000000000000000L becomes 1e+15.
So how do we resolve this properly?

Contributor Author

@felixcheung OK, let's just discard 64-bit integers and only support 32-bit seeds in SparkR.

Member

I think that's ok. We have similar restrictions in our cases.

 df <- read.df("data/mllib/sample_multiclass_classification_data.txt", source = "libsvm")
 model <- spark.mlp(df, blockSize = 128, layers = c(4, 5, 4, 3), solver = "l-bfgs", maxIter = 100,
-                   tol = 0.5, stepSize = 1, seed = 1)
+                   tol = 0.5, stepSize = 1, seed = "1")
Member

ditto - this should still be integer

@SparkQA

SparkQA commented Sep 19, 2016

Test build #65596 has finished for PR 15051 at commit c9d3dc4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

could you add a test for passing seed and a test for not passing seed - I'm not sure if we have a good way to check for a specific result (likely there is, since we choose the seed value),
but we should at least make sure that passing a seed gives a different result than not passing one (the default seed)?
we have had some problems with that in the past, so it'd be better to test for them.
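Such a test could be sketched roughly like this (hypothetical: assumes a SparkR session and the test data used elsewhere in this PR; exact assertions would depend on the determinism of the solver):

```r
df <- read.df("data/mllib/sample_multiclass_classification_data.txt", source = "libsvm")

# Same explicit seed twice: predictions should be reproducible.
m1 <- spark.mlp(df, layers = c(4, 5, 4, 3), seed = 1)
m2 <- spark.mlp(df, layers = c(4, 5, 4, 3), seed = 1)
p1 <- collect(select(predict(m1, df), "prediction"))
p2 <- collect(select(predict(m2, df), "prediction"))
expect_equal(p1, p2)

# Default seed (seed not passed): the model should still train,
# and may produce a different result from the seeded runs.
m3 <- spark.mlp(df, layers = c(4, 5, 4, 3))
expect_true(!is.null(predict(m3, df)))
```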

@felixcheung
Member

LGTM - waiting for test results. Thanks!

@SparkQA

SparkQA commented Sep 21, 2016

Test build #65718 has finished for PR 15051 at commit decbf6c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in f89808b Sep 23, 2016
@WeichenXu123 WeichenXu123 deleted the update_py_mlp_default branch April 24, 2019 21:19