[SPARK-19781][ML] Handle NULLs as well as NaNs in Bucketizer when handleInvalid is on #17123
Conversation
Test build #3590 has finished for PR 17123 at commit

Fixed style errors during the unit tests.
    val bucketizer: UserDefinedFunction = udf { (row: Row) =>
      Bucketizer.binarySearchForBuckets(
        $(splits),
        row.getAs[java.lang.Double]($(inputCol)),
can you use Double instead of java.lang.Double? It should be the scala Double type.
Hi, Scala's Double will convert null to zero. For example:
scala> val a: Double = null.asInstanceOf[Double]
a: Double = 0.0
So I use Java's Double instead to hold NULLs.
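This difference can be checked in plain Scala, with no Spark dependency. A minimal sketch (the object name is just for illustration) of why `scala.Double` cannot represent SQL NULL, while `java.lang.Double` can:

```scala
// Plain-Scala sketch: unboxing null into a primitive Double silently
// yields 0.0, while java.lang.Double is a reference type and stays null.
object NullVsDouble {
  def main(args: Array[String]): Unit = {
    val asScalaDouble: Double = null.asInstanceOf[Double]
    println(asScalaDouble)         // prints 0.0: the NULL is lost

    val asJavaDouble: java.lang.Double = null
    println(asJavaDouble == null)  // prints true: the NULL is preserved
  }
}
```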
Ideally we should use row.getDouble(index) together with row.isNullAt(index) to read primitive values, but technically a Row is just an Array[Object], so there is no performance penalty in using java.lang.Double. (This may change in the future; if possible we should prefer isNullAt and getDouble.)
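The point that a Row is conceptually an array of boxed objects can be sketched with a tiny stand-in class (MiniRow is hypothetical, not Spark's Row, but it mimics the relevant accessors):

```scala
// Hypothetical minimal stand-in for Spark's Row, backed by Array[Any].
// Illustrates why isNullAt + getDouble is the safe access pattern for
// nullable primitive columns: getDouble unboxes, so a NULL becomes 0.0.
class MiniRow(values: Array[Any]) {
  def isNullAt(i: Int): Boolean = values(i) == null
  def getDouble(i: Int): Double = values(i).asInstanceOf[Double] // unboxing: null turns into 0.0!
  def getAs[T](i: Int): T = values(i).asInstanceOf[T]            // reference types keep the null
}

object MiniRowDemo {
  def main(args: Array[String]): Unit = {
    val row = new MiniRow(Array[Any](1.5, null))
    // Safe pattern: check for NULL before unboxing.
    println(if (row.isNullAt(1)) "NULL" else row.getDouble(1).toString)
    // getAs[java.lang.Double] preserves the NULL as a null reference.
    println(row.getAs[java.lang.Double](1) == null)
  }
}
```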
        throw new SparkException("Bucketizer encountered NaN/NULL values. " +
          "To handle or skip NaNs/NULLs, try setting Bucketizer.handleInvalid.")
      }
    } else if (feature == splits.last) {
could you please add some tests to validate that NULL values can now be handled in addition to NaN values by the bucketizer?
My fault! I'll do it now!
    private[feature] def binarySearchForBuckets(
        splits: Array[Double],
-       feature: Double,
+       feature: java.lang.Double,
Double here as well
Also change to Option[Double] here.
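A sketch of what an Option[Double]-based helper might look like, in plain Scala with no Spark dependency. The method name and the invalid-bucket index mirror the discussion, but the details (and the omitted out-of-range handling) are assumptions, not the final Spark code:

```scala
import java.util.Arrays

object BucketizerSketch {
  // splits of length n define n - 1 regular buckets. Invalid values
  // (NULL, i.e. None, or NaN) go to an extra bucket at index n - 1
  // when keepInvalid is set; otherwise an error is raised.
  // Out-of-range features are not handled in this sketch.
  def binarySearchForBuckets(
      splits: Array[Double],
      feature: Option[Double],
      keepInvalid: Boolean): Double = {
    feature match {
      case Some(f) if !f.isNaN =>
        if (f == splits.last) {
          splits.length - 2 // top boundary belongs to the last regular bucket
        } else {
          val idx = Arrays.binarySearch(splits, f)
          if (idx >= 0) idx else -idx - 2 // negative result encodes the insertion point
        }
      case _ => // None (NULL) or Some(NaN)
        if (keepInvalid) {
          splits.length - 1 // special bucket for invalid values
        } else {
          throw new RuntimeException(
            "Bucketizer encountered NaN/NULL values. " +
              "To handle or skip NaNs/NULLs, try setting Bucketizer.handleInvalid.")
        }
    }
  }
}
```

With splits `Array(0.0, 1.0, 2.0)`, a feature of 0.5 lands in bucket 0, while both `None` and `Some(Double.NaN)` land in the special bucket 2 when `keepInvalid` is true.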
@crackcell thank you for the nice fix. I've added a few comments. Please add test cases for the change.
@imatiach-msft Hi, Ilya. I have added two tests based on the original tests for NaN data. Please review my code again. Thanks for your time. :-)

@srowen @cloud-fan Please review my code. Thanks. :-)

@imatiach-msft @cloud-fan I updated the code, replacing java.lang.Double with isNullAt() and getDouble().
      }
    }

    val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
Actually, can we just use java.lang.Double as the type for feature? Then we don't need to change https://github.com/apache/spark/pull/17123/files#diff-37f2c93b88c73b91cdc9e40fc8c45fc5R121
Using both Java and Scala types seems less graceful. Instead, would it be better to pass a Row to bucketizer() and then check NULLs with isNullAt() and getDouble()?
See the documentation of ScalaUDF: if you don't like mixing Java and Scala types, you can use Option[Double].
Thanks a lot. Option[Double] is much better. :-)
As @cloud-fan suggested, Option[Double] is better. :-)
@cloud-fan Would you please review my code again? I'm now using …
    }

-   val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
+   val bucketizer: UserDefinedFunction = udf { (row: Row) =>
I believe you should try to avoid using a udf on a row because the serialization costs will be more expensive... hmm how could we make this perform well and handle nulls? Does it work with Option[Double] instead of Row?
Thanks for pointing out the performance problem. Maybe my original approach of using java.lang.Double instead of Scala's Double to hold NULLs would work better.
        feature: Option[Double],
        keepInvalid: Boolean): Double = {
-     if (feature.isNaN) {
+     if (feature.getOrElse(Double.NaN).isNaN) {
I think you can equivalently write this as:
if (feature.isEmpty) { ....
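The two predicates can be compared directly in plain Scala. Note that they agree on None but differ on Some(Double.NaN), so feature.isEmpty alone is only equivalent if NaN features are handled on a separate path (a quick check, not Spark code; the names are just for illustration):

```scala
// Compare the two ways of detecting an invalid feature value.
object InvalidCheckDemo {
  // Treats both NULL (None) and NaN as invalid.
  def byGetOrElse(feature: Option[Double]): Boolean =
    feature.getOrElse(Double.NaN).isNaN

  // Treats only NULL (None) as invalid.
  def byIsEmpty(feature: Option[Double]): Boolean =
    feature.isEmpty

  def main(args: Array[String]): Unit = {
    println(byGetOrElse(None))             // true
    println(byIsEmpty(None))               // true: they agree on NULL
    println(byGetOrElse(Some(Double.NaN))) // true
    println(byIsEmpty(Some(Double.NaN)))   // false: they differ on NaN
  }
}
```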
@crackcell I'm not sure about changing the UDF to be on a row instead of a column; I've found that the serialization costs are much higher and the Spark code performs much worse. Maybe an expert like @cloud-fan can comment more here? Can you keep the UDF on a column instead of a row?
WeichenXu123 left a comment:
Confirmed with @jkbradley offline. It would be nice to fix. Thanks!
But please resolve conflicts first. :) Bucketizer added multiple-column support, so the code is different now.

@WeichenXu123 Sorry for missing the message for two days; I'm working on it.
     * otherwise, values outside the splits specified will be treated as errors.
     *
-    * See also [[handleInvalid]], which can optionally create an additional bucket for NaN values.
+    * See also [[handleInvalid]], which can optionally create an additional bucket for NaN/NULL
This sounds like a behavior change, we should add an item in migration guide of ML docs.
@viirya done.
@WeichenXu123 I have finished my work, please review it. Any suggestions are welcome. :-)
docs/ml-guide.md (outdated)
    We are now setting the default parallelism used in `OneVsRest` to be 1 (i.e. serial), in 2.2 and earlier version,
    the `OneVsRest` parallelism would be parallelism of the default threadpool in scala.
    * [SPARK-19781](https://issues.apache.org/jira/browse/SPARK-19781):
      `Bucketizer` handles NULL values the same way as NaN when handleInvalid is skip or keep.
Hmm, I think for skip, dataset.na.drop already drops NULLs beforehand. We didn't change its behavior.
Yep, you are right. :-p
Can one of the admins verify this patch?

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
The original Bucketizer puts NaNs into a special bucket when handleInvalid is on, but leaves NULLs untouched.
This PR unifies the handling of NULLs and NaNs.
BTW, this is my first commit to Spark code. I'm not sure whether my code or my way of doing things is appropriate. Please point it out if I'm doing anything wrong. :-)
How was this patch tested?
New unit tests.