[SPARK-17499][SparkR][ML][MLLib] make the default params in sparkR spark.mlp consistent with MultilayerPerceptronClassifier #15051
Conversation
Test build #65226 has finished for PR 15051 at commit

Force-pushed from ef572f7 to 8a87b86.

Test build #65228 has finished for PR 15051 at commit
R/pkg/R/mllib.R (outdated):
```diff
 setMethod("spark.mlp", signature(data = "SparkDataFrame"),
-          function(data, blockSize = 128, layers = c(3, 5, 2), solver = "l-bfgs", maxIter = 100,
-                   tol = 0.5, stepSize = 1, seed = 1) {
+          function(data, blockSize = 128, layers, solver = "l-bfgs", maxIter = 100,
```
I think the preference would be to have layers = c() - it helps to show that it should be a vector of potentially multiple values
It's also better not to make an argument required in the middle -- i.e. if we want to make layers a required argument then we should move it before blockSize.
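To see why (a toy sketch, not SparkR code):

```r
# With a required argument after optional ones, a natural positional call misbinds:
f <- function(data, blockSize = 128, layers) blockSize
f("df", c(4, 5, 3))
# [1] 4 5 3  -- the intended layers vector silently landed in blockSize
```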
@shivaram yeah... but I think changing the parameter order may break API compatibility?
@felixcheung but with layers = c(), the c() is an invalid value for the layers parameter... so I think it is better not to specify a default value for layers, so the user must specify this parameter.
If the goal is to require layers to have a value (I didn't realize this from the PR description), then we should have layers as the 2nd parameter (after data) without any default value. API compatibility shouldn't be an issue since this is new in Spark 2.1.0 (i.e. there hasn't been a release yet).

We should also make sure that when layers is later coerced to an array, its values are coerced into integers:

```r
> a <- list(1, 2, "a")
> as.integer(a)
[1]  1  2 NA
Warning message:
NAs introduced by coercion
```
All right, I will add the layers parameter validation check and move this parameter to the front. Thanks!

Also, the layers parameter must be set because it determines the structure of this classifier. If we do not set layers, or set it to an empty vector c(), training will throw an exception; you can check the input validation test in MultilayerPerceptronClassifierSuite.
thanks - could you add some tests that use these default values? (esp. layers as NULL)
R/pkg/R/mllib.R (outdated):
```diff
-          function(data, blockSize = 128, layers = c(3, 5, 2), solver = "l-bfgs", maxIter = 100,
-                   tol = 0.5, stepSize = 1, seed = 1) {
+          function(data, blockSize = 128, layers, solver = "l-bfgs", maxIter = 100,
+                   tol = 1E-6, stepSize = 0.03, seed = -763139545) {
```
It doesn't look like seed defaults to this value? Could you point out where that is specified?
Oh, the default seed uses ClassName.hashCode, so here it is "org.apache.spark.ml.classification.MultilayerPerceptronClassifier".hashCode(), which equals -763139545.
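A quick way to sanity-check that value from R (illustrative only; it emulates Java's String.hashCode, assuming an ASCII class name and 32-bit wraparound):

```r
# Recompute Java's String.hashCode in R: h = 31*h + c over each character
java_string_hashcode <- function(s) {
  h <- 0
  for (code in utf8ToInt(s)) {
    h <- (h * 31 + code) %% 2^32  # 32-bit unsigned wraparound
  }
  if (h >= 2^31) h <- h - 2^32    # reinterpret as a signed 32-bit integer
  h
}
java_string_hashcode("org.apache.spark.ml.classification.MultilayerPerceptronClassifier")
# -763139545, per the value quoted above
```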
hmm, this seems rather fragile? do you think there's another way to do this?
Yeah, it is a problem. Now I'm considering a better way:

- give the seed parameter a default value of NULL
- in MultilayerPerceptronClassifierWrapper.fit, add a null check for the seed parameter
- if it is null, do not call MultilayerPerceptronClassifier.setSeed, so the default seed is used automatically

What do you think about it?
Yeah that sounds fine.
Test build #65327 has finished for PR 15051 at commit
R/pkg/R/mllib.R (outdated):
```r
function(data, layers, blockSize = 128, solver = "l-bfgs", maxIter = 100,
         tol = 1E-6, stepSize = 0.03, seed = 0x7FFFFFFF) {
  if (length(layers) <= 1) stop("layers vector require length > 0.")
  for (i in 1 : length(layers)) {
```
do something like any(sapply(layers, function(e) !is.numeric(e))) instead
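For example (illustrative):

```r
# any() over sapply() replaces the explicit loop
layers <- list(4, 5, "a")  # a list keeps the mixed types visible
any(sapply(layers, function(e) !is.numeric(e)))
# [1] TRUE  -- the non-numeric entry fails the check
any(sapply(c(4, 5, 3), function(e) !is.numeric(e)))
# [1] FALSE
```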
Test build #65371 has finished for PR 15051 at commit

Test build #65373 has finished for PR 15051 at commit

LGTM.
R/pkg/R/mllib.R (outdated):
Perhaps as a small style nit, even though it is not flagged by lint-r: you might want to style this with brackets,

```r
if (...) {
}
```

like in here.
R/pkg/R/mllib.R (outdated):
same here
could you update the tests and add more tests for default values as discussed here
Force-pushed from 99d0e0c to ce2c2f7.
@felixcheung Now I added some tests using the default parameters, and compared the output predictions with the results generated by Scala-side code.

Test build #65542 has finished for PR 15051 at commit
R/pkg/R/mllib.R (outdated):
```r
if (length(layers) <= 1) {
  stop("layers vector require length > 0.")
}
if (any(sapply(layers, function(e) !is.numeric(e)))) {
```
just double checking - should layers be integer or numeric?
layers should be integer, but in R it seems we can't distinguish an integer vector from a numeric vector? For layers <- c(1, 2) and layers <- c(1.0, 2.0), is.integer(layers[i]) returns FALSE in both cases and as.integer(layers) succeeds for both, so is there a good way to check that it is an integer vector and not a numeric vector?
You can use numToInt from https://github.com/apache/spark/blob/master/R/pkg/R/utils.R#L368 -- it'll print a warning if it's not an integer.
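The gist of that helper (a sketch only; the actual numToInt in R/pkg/R/utils.R may differ in detail):

```r
# Sketch: warn-and-coerce using the as.integer(x) != x trick discussed below
numToInt <- function(num) {
  if (isTRUE(as.integer(num) != num)) {
    warning("coercing a non-integer numeric to integer")
  }
  as.integer(num)
}

numToInt(3)    # 3L, silently
numToInt(3.7)  # 3L, with a warning
```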
Oh, it's a clever way, using as.integer(x) != x to check whether it is an integer. Here the MLP requires layers to be an integer vector; is it better to force the user to pass an integer vector (calling stop if not), or to just print a warning?
It's because 1 by itself is actually a numeric value, whereas 1L is an integer:

```r
> is.integer(1)
[1] FALSE
> is.integer(1L)
[1] TRUE
```
@felixcheung Now I've updated the code to use as.integer(x) != x to check whether layers is an integer vector, and it works fine. Thanks!
just a question above and this: would 0x7FFFFFFF be a good placeholder value - is it possible to set seed to this in Scala?
@felixcheung

Test build #65558 has finished for PR 15051 at commit
R/pkg/R/mllib.R (outdated):
```r
if (length(layers) <= 1) {
  stop("layers vector require length > 0.")
}
if (any(sapply(layers, function(e) as.integer(e) != e))) {
```
This way layers = c(1.0, 2.0) would pass the as.integer(e) != e test.

One possible issue is with how we are handling this on the Scala side. Since we are passing it as as.array(layers), this could end up as double in the JVM - would it handle that correctly?

There are other ways to do this, but generally coercing to integer is a reasonable approach. One alternative implementation is this:

```r
layers <- as.integer(na.omit(layers))
if (length(layers) <= 1) {
  stop("layers vector must be integer values with length > 0.")
}
```

Please add some tests for this.
Well, we don't have a good solution for 64-bit integers in R.

@felixcheung For the seed default value, I currently use "", not NULL, because the R-side args serializer cannot handle NULL. And I fixed the other problem. Thanks!
You can look for examples of expect_error.
@felixcheung Negative test added, thanks!

Test build #65573 has finished for PR 15051 at commit
Test build #65577 has finished for PR 15051 at commit
```r
expect_equal(head(mlpPredictions$prediction, 10), c(1, 1, 1, 1, 0, 1, 2, 2, 1, 0))

# Test illegal parameter
expect_error(spark.mlp(df, layers = NULL))
```
it would be preferred to add the error string to check for (the message in stop()) - please see our other tests on how to do this.
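For instance, something along these lines (the expected string here is illustrative and must match the actual stop() message in spark.mlp):

```r
# Hypothetical: the pattern must match the message raised by spark.mlp's stop()
expect_error(spark.mlp(df, layers = NULL),
             "layers must be a integer vector with length > 1")
```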
R/pkg/R/mllib.R (outdated):
```diff
 setMethod("spark.mlp", signature(data = "SparkDataFrame"),
           function(data, layers, blockSize = 128, solver = "l-bfgs", maxIter = 100,
-                   tol = 1E-6, stepSize = 0.03, seed = 0x7FFFFFFF) {
+                   tol = 1E-6, stepSize = 0.03, seed = "") {
```
this is a bit unusual in R, and could also cause confusion as to what should be passed in (string? integer?) - how about we default to NULL and then check inside spark.mlp to map a NULL seed to "" (only that)?

Also, if seed is not NULL, it should be coerced into an integer first to be safe, since you might get a double or other stuff: as.character(as.integer(seed)).
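A minimal sketch of that mapping (placement and comments are illustrative, not the final patch):

```r
# Inside spark.mlp, before handing seed to the JVM wrapper (sketch):
if (is.null(seed)) {
  seed <- ""                              # "" signals the Scala wrapper to keep its default seed
} else {
  seed <- as.character(as.integer(seed))  # coerce to a 32-bit integer, pass as string
}
```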
We should have the R spark.mlp take an integer seed parameter but convert it into a string before passing it to the JVM.
@felixcheung
Oh... here is the problem: if the parameter is passed in as an integer, an integer exceeding the 32-bit INT_MAX may turn into a numeric - for example 1000000000000000L becomes 1e+15.
So how do we resolve this problem properly?
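This is easy to reproduce (illustrative):

```r
# R integer literals are 32-bit; out-of-range values fall back to numeric
is.integer(2147483647L)        # TRUE  -- fits in 32 bits (INT_MAX)
is.integer(1000000000000000L)  # FALSE -- parsed as the numeric 1e+15, with a parser warning
```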
@felixcheung OK, let's just discard 64-bit integers and only support 32-bit seeds in SparkR.
I think that's ok. We have similar restrictions in other cases.
```diff
 df <- read.df("data/mllib/sample_multiclass_classification_data.txt", source = "libsvm")
 model <- spark.mlp(df, blockSize = 128, layers = c(4, 5, 4, 3), solver = "l-bfgs", maxIter = 100,
-                   tol = 0.5, stepSize = 1, seed = 1)
+                   tol = 0.5, stepSize = 1, seed = "1")
```
ditto - this should still be integer
Test build #65596 has finished for PR 15051 at commit

could you add a test for passing seed and a test for not passing seed - I'm not sure if we have a good way to check for a specific result (likely there is, since we choose the seed value)

LGTM - waiting for test results. Thanks!

Test build #65718 has finished for PR 15051 at commit
What changes were proposed in this pull request?

- Update the MultilayerPerceptronClassifierWrapper.fit parameter types: layers: Array[Int], seed: String.
- Update several default params in SparkR spark.mlp: tol -> 1e-6, stepSize -> 0.03, seed -> NULL (when seed == NULL, the Scala-side wrapper regards it as a null value and the default seed is used).
- The R-side seed only supports 32-bit integers.
- Remove the layers default value, and move it in front of the parameters that have default values.
- Add a layers parameter validation check.

How was this patch tested?

Tests added.
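For reference, a hypothetical call under the new API (the data path comes from the test shown earlier; defaults are as listed above):

```r
# layers is now required and follows data directly; everything else has defaults
df <- read.df("data/mllib/sample_multiclass_classification_data.txt", source = "libsvm")
model <- spark.mlp(df, layers = c(4, 5, 4, 3))                  # tol = 1e-6, stepSize = 0.03, seed = NULL
modelSeeded <- spark.mlp(df, layers = c(4, 5, 4, 3), seed = 1)  # explicit 32-bit seed for reproducibility
```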