[SPARK-19781][ML] Handle NULLs as well as NaNs in Bucketizer when handleInvalid is on #17123
Conversation
Test build #3590 has finished for PR 17123 at commit

Fixed style errors during the unit tests.
    val bucketizer: UserDefinedFunction = udf { (row: Row) =>
      Bucketizer.binarySearchForBuckets(
        $(splits),
        row.getAs[java.lang.Double]($(inputCol)),
can you use Double instead of java.lang.Double? It should be the scala Double type.
Hi, Scala's Double will convert null to zero. For example:
scala> val a: Double = null.asInstanceOf[Double]
a: Double = 0.0
So I use Java's Double instead to hold NULLs.
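This difference can be checked in plain Scala, with no Spark dependency. A minimal sketch (the object name is just for illustration) of why `scala.Double` cannot represent SQL NULL, while `java.lang.Double` can:

```scala
// Plain-Scala sketch: unboxing null into a primitive Double silently
// yields 0.0, while java.lang.Double is a reference type and stays null.
object NullVsDouble {
  def main(args: Array[String]): Unit = {
    val asScalaDouble: Double = null.asInstanceOf[Double]
    println(asScalaDouble)         // prints 0.0: the NULL is lost

    val asJavaDouble: java.lang.Double = null
    println(asJavaDouble == null)  // prints true: the NULL is preserved
  }
}
```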
Ideally we should use row.getDouble(index) together with row.isNullAt(index) to read primitive values, but technically a Row is just an Array[Object], so there is no performance penalty in using java.lang.Double. (This may change in the future; if possible we should prefer isNullAt and getDouble.)
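The point that a Row is conceptually an array of boxed objects can be sketched with a tiny stand-in class (MiniRow is hypothetical, not Spark's Row, but it mimics the relevant accessors):

```scala
// Hypothetical minimal stand-in for Spark's Row, backed by Array[Any].
// Illustrates why isNullAt + getDouble is the safe access pattern for
// nullable primitive columns: getDouble unboxes, so a NULL becomes 0.0.
class MiniRow(values: Array[Any]) {
  def isNullAt(i: Int): Boolean = values(i) == null
  def getDouble(i: Int): Double = values(i).asInstanceOf[Double] // unboxing: null turns into 0.0!
  def getAs[T](i: Int): T = values(i).asInstanceOf[T]            // reference types keep the null
}

object MiniRowDemo {
  def main(args: Array[String]): Unit = {
    val row = new MiniRow(Array[Any](1.5, null))
    // Safe pattern: check for NULL before unboxing.
    println(if (row.isNullAt(1)) "NULL" else row.getDouble(1).toString)
    // getAs[java.lang.Double] preserves the NULL as a null reference.
    println(row.getAs[java.lang.Double](1) == null)
  }
}
```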
        throw new SparkException("Bucketizer encountered NaN/NULL values. " +
          "To handle or skip NaNs/NULLs, try setting Bucketizer.handleInvalid.")
      }
    } else if (feature == splits.last) {
could you please add some tests to validate that NULL values can now be handled in addition to NaN values by the bucketizer?
My fault! I'll do it now!
    private[feature] def binarySearchForBuckets(
        splits: Array[Double],
-       feature: Double,
+       feature: java.lang.Double,
Double here as well
Also change to Option[Double] here.
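A sketch of what an Option[Double]-based helper might look like, in plain Scala with no Spark dependency. The method name and the invalid-bucket index mirror the discussion, but the details (and the omitted out-of-range handling) are assumptions, not the final Spark code:

```scala
import java.util.Arrays

object BucketizerSketch {
  // splits of length n define n - 1 regular buckets. Invalid values
  // (NULL, i.e. None, or NaN) go to an extra bucket at index n - 1
  // when keepInvalid is set; otherwise an error is raised.
  // Out-of-range features are not handled in this sketch.
  def binarySearchForBuckets(
      splits: Array[Double],
      feature: Option[Double],
      keepInvalid: Boolean): Double = {
    feature match {
      case Some(f) if !f.isNaN =>
        if (f == splits.last) {
          splits.length - 2 // top boundary belongs to the last regular bucket
        } else {
          val idx = Arrays.binarySearch(splits, f)
          if (idx >= 0) idx else -idx - 2 // negative result encodes the insertion point
        }
      case _ => // None (NULL) or Some(NaN)
        if (keepInvalid) {
          splits.length - 1 // special bucket for invalid values
        } else {
          throw new RuntimeException(
            "Bucketizer encountered NaN/NULL values. " +
              "To handle or skip NaNs/NULLs, try setting Bucketizer.handleInvalid.")
        }
    }
  }
}
```

With splits `Array(0.0, 1.0, 2.0)`, a feature of 0.5 lands in bucket 0, while both `None` and `Some(Double.NaN)` land in the special bucket 2 when `keepInvalid` is true.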
@crackcell thank you for the nice fix. I've added a few comments. Please add test cases for the change.
@imatiach-msft Hi, Ilya. I have added two tests based on the original tests for NaN data. Please review my code again. Thanks for your time. :-)

@srowen @cloud-fan Please review my code. Thanks. :-)

@imatiach-msft @cloud-fan I updated the code, replacing java.lang.Double with isNullAt() and getDouble().
      }
    }

    val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
Actually, can we just use java.lang.Double as the type for feature? Then we don't need to change https://github.com/apache/spark/pull/17123/files#diff-37f2c93b88c73b91cdc9e40fc8c45fc5R121
Using both Java and Scala types seems less graceful. Instead, would it be better to pass a Row to bucketizer() and then check NULLs with isNullAt() and getDouble()?
See the documentation of ScalaUDF: if you don't like mixing Java and Scala types, you can use Option[Double].
Thanks a lot. Option[Double] is much better. :-)
As @cloud-fan suggested, Option[Double] is better. :-)
@cloud-fan Would you please review my code again? I'm now using …
    }

-   val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
+   val bucketizer: UserDefinedFunction = udf { (row: Row) =>
I believe you should try to avoid using a udf on a row because the serialization costs will be more expensive... hmm how could we make this perform well and handle nulls? Does it work with Option[Double] instead of Row?
Thanks for pointing out the performance problem. Maybe my original approach of using java.lang.Double instead of Scala's Double to hold NULLs would work better.
        feature: Option[Double],
        keepInvalid: Boolean): Double = {
-     if (feature.isNaN) {
+     if (feature.getOrElse(Double.NaN).isNaN) {
I think you can equivalently write this as:
if (feature.isEmpty) { ....
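The two predicates can be compared directly in plain Scala. Note that they agree on None but differ on Some(Double.NaN), so feature.isEmpty alone is only equivalent if NaN features are handled on a separate path (a quick check, not Spark code; the names are just for illustration):

```scala
// Compare the two ways of detecting an invalid feature value.
object InvalidCheckDemo {
  // Treats both NULL (None) and NaN as invalid.
  def byGetOrElse(feature: Option[Double]): Boolean =
    feature.getOrElse(Double.NaN).isNaN

  // Treats only NULL (None) as invalid.
  def byIsEmpty(feature: Option[Double]): Boolean =
    feature.isEmpty

  def main(args: Array[String]): Unit = {
    println(byGetOrElse(None))             // true
    println(byIsEmpty(None))               // true: they agree on NULL
    println(byGetOrElse(Some(Double.NaN))) // true
    println(byIsEmpty(Some(Double.NaN)))   // false: they differ on NaN
  }
}
```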
@crackcell I'm not sure about changing the UDF to be on a row instead of a column; I've found that the serialization costs are much higher and the Spark code performs much worse. Maybe an expert like @cloud-fan can comment more here? Can you keep the UDF on a column instead of a row?
WeichenXu123 left a comment:
Confirmed with @jkbradley offline. It would be nice to fix. Thanks!
But please resolve conflicts first. :) Bucketizer added multiple-column support, so the code is different now.

@WeichenXu123 Sorry for missing the message for two days; I'm working on it.
     * otherwise, values outside the splits specified will be treated as errors.
     *
-    * See also [[handleInvalid]], which can optionally create an additional bucket for NaN values.
+    * See also [[handleInvalid]], which can optionally create an additional bucket for NaN/NULL
This sounds like a behavior change, we should add an item in migration guide of ML docs.
@viirya done.
@WeichenXu123 I have finished my work, please review it. Any suggestions are welcome. :-)
docs/ml-guide.md (outdated)
    We are now setting the default parallelism used in `OneVsRest` to be 1 (i.e. serial), in 2.2 and earlier version,
    the `OneVsRest` parallelism would be parallelism of the default threadpool in scala.
    * [SPARK-19781](https://issues.apache.org/jira/browse/SPARK-19781):
      `Bucketizer` handles NULL values the same way as NaN when handleInvalid is skip or keep.
Hmm, I think for skip, dataset.na.drop already drops NULLs beforehand. We didn't change its behavior.
Yep, you are right. :-p
Can one of the admins verify this patch?

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
The original Bucketizer puts NaNs into a special bucket when handleInvalid is on, but leaves NULLs untouched.
This PR unifies the handling of NULLs and NaNs.
BTW, this is my first commit to Spark code. I'm not sure whether my code or my way of doing things is appropriate. Please point it out if I'm doing anything wrong. :-)
How was this patch tested?
New unit tests.