
Conversation


@crackcell crackcell commented Mar 1, 2017

What changes were proposed in this pull request?

The original Bucketizer can put NaNs into a special bucket when handleInvalid is on, but it leaves NULLs untouched.
This PR unifies the handling of NULLs and NaNs.

By the way, this is my first commit to Spark. I'm not sure whether my code or my way of doing things is appropriate. Please point it out if I'm doing anything wrong. :-)

How was this patch tested?

new unit tests

@SparkQA

SparkQA commented Mar 1, 2017

Test build #3590 has finished for PR 17123 at commit b3f98b6.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@crackcell
Author

Fixed style errors during the unit tests.

val bucketizer: UserDefinedFunction = udf { (row: Row) =>
  Bucketizer.binarySearchForBuckets(
    $(splits),
    row.getAs[java.lang.Double]($(inputCol)),
Contributor

Can you use Double instead of java.lang.Double? It should be the Scala Double type.

Author
@crackcell crackcell Mar 2, 2017


Hi, Scala's Double converts null to zero. For example:

scala> val a: Double = null.asInstanceOf[Double]
a: Double = 0.0

So I use Java's Double instead to hold NULLs.
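A standalone sketch of the difference (independent of the Spark code under review): Scala's primitive Double silently turns null into 0.0, while the boxed java.lang.Double keeps the null, and Option converts it into None.

```scala
object NullDoubleDemo extends App {
  // Casting null to Scala's primitive Double silently yields 0.0.
  val scalaD: Double = null.asInstanceOf[Double]
  assert(scalaD == 0.0)

  // java.lang.Double is a reference type, so it can actually hold null.
  val javaD: java.lang.Double = null
  assert(javaD == null)

  // Option(...) turns the boxed null into None, which is the safe
  // representation this review converges on later.
  assert(Option[java.lang.Double](javaD).isEmpty)
  println("null handling behaves as described")
}
```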

Contributor


Ideally we should use row.getDouble(index) and row.isNullAt(index) together to get values of primitive types, but technically Row is just an Array[Object], so there is no performance penalty in using java.lang.Double. (This may change in the future; where possible we should prefer isNullAt and getDouble.)
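A minimal illustration of that access pattern (FakeRow is a stand-in defined here for demonstration, not Spark's actual Row):

```scala
// Illustration only: FakeRow mimics the two org.apache.spark.sql.Row methods
// mentioned above. The real Row is backed by an Array[AnyRef], which is why
// reading through java.lang.Double currently costs nothing extra.
object RowAccessDemo extends App {
  final class FakeRow(values: Array[AnyRef]) {
    def isNullAt(i: Int): Boolean = values(i) == null
    def getDouble(i: Int): Double =
      values(i).asInstanceOf[java.lang.Double].doubleValue
  }

  // Preferred pattern: check isNullAt before touching the primitive getter.
  def readFeature(row: FakeRow, i: Int): Option[Double] =
    if (row.isNullAt(i)) None else Some(row.getDouble(i))

  val row = new FakeRow(Array(null, java.lang.Double.valueOf(1.5)))
  assert(readFeature(row, 0).isEmpty)       // NULL column -> None
  assert(readFeature(row, 1).contains(1.5)) // non-null column -> Some(value)
  println("isNullAt/getDouble pattern behaves as described")
}
```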

throw new SparkException("Bucketizer encountered NaN/NULL values. " +
"To handle or skip NaNs/NULLs, try setting Bucketizer.handleInvalid.")
}
} else if (feature == splits.last) {
Contributor


Could you please add some tests to validate that NULL values can now be handled by the bucketizer in addition to NaN values?

Author


My fault! I'll do it now!

private[feature] def binarySearchForBuckets(
splits: Array[Double],
feature: Double,
feature: java.lang.Double,
Contributor


Double here as well

Contributor


Also change to Option[Double] here.

@imatiach-msft
Contributor

@crackcell thank you for the nice fix. I've added a few comments. Please add a test case(s) for the change.

@crackcell
Author

@imatiach-msft Hi, Ilya. I have added two tests based on the original tests for NaN data. Please review my code again. Thanks for your time. :-)

@crackcell
Author

@srowen @cloud-fan Please review my code. Thanks. :-)

@crackcell
Author

@imatiach-msft @cloud-fan I updated the code, replaced java.lang.Double with isNullAt() and getDouble().

}
}

val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
Contributor


Actually, can we just use java.lang.Double as the type for feature? Then we don't need to change https://github.com/apache/spark/pull/17123/files#diff-37f2c93b88c73b91cdc9e40fc8c45fc5R121

Author


Using both Java and Scala types seems less graceful. Instead, would it be better to pass a Row to bucketizer() and then check for NULLs with isNullAt() and getDouble()?

Contributor


See the documentation of ScalaUDF; if you don't like mixing Java and Scala types, you can use Option[Double].

Author


Thanks a lot. Option[Double] is much better. :-)

Contributor


As @cloud-fan suggested, Option[Double] is better. :-)

@crackcell
Author

@cloud-fan Would you please review my code again? I'm now using Option to handle NULLs. :-)

}

val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
val bucketizer: UserDefinedFunction = udf { (row: Row) =>
Contributor


I believe you should avoid using a UDF on a Row, because the serialization costs will be higher... hmm, how could we make this perform well and still handle nulls? Does it work with Option[Double] instead of Row?

Author


Thanks for pointing out the performance problem. Maybe my original approach, using java.lang.Double instead of Scala's Double to hold NULLs, would work better.

feature: Option[Double],
keepInvalid: Boolean): Double = {
if (feature.isNaN) {
if (feature.getOrElse(Double.NaN).isNaN) {
Contributor


I think you can equivalently write this as:
if (feature.isEmpty) { ....
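Putting the Option[Double] suggestion together, a self-contained sketch (assignBucket and its simplified logic are assumptions for illustration, not Spark's actual binarySearchForBuckets) of how bucket lookup can treat None (SQL NULL) and NaN uniformly:

```scala
// splits define bucket edges [s0, s1), [s1, s2), ..., [s(n-1), sn].
// None models a SQL NULL and is treated exactly like NaN, per the
// unified behaviour this PR proposes.
object BucketSketch extends App {
  def assignBucket(splits: Array[Double],
                   feature: Option[Double],
                   keepInvalid: Boolean): Double = feature match {
    case Some(v) if !v.isNaN =>
      if (v == splits.last) {
        (splits.length - 2).toDouble  // top edge belongs to the last bucket
      } else {
        val idx = java.util.Arrays.binarySearch(splits, v)
        // binarySearch returns -(insertionPoint) - 1 when the value is not
        // an exact split, so -idx - 2 recovers the bucket index.
        if (idx >= 0) idx.toDouble else (-idx - 2).toDouble
      }
    case _ => // None (NULL) or Some(NaN)
      if (keepInvalid) (splits.length - 1).toDouble  // the extra bucket
      else throw new IllegalArgumentException(
        "encountered NULL/NaN; set handleInvalid to keep or skip")
  }

  val splits = Array(0.0, 1.0, 2.0)
  assert(assignBucket(splits, Some(0.5), keepInvalid = true) == 0.0)
  assert(assignBucket(splits, Some(2.0), keepInvalid = true) == 1.0)
  assert(assignBucket(splits, Some(Double.NaN), keepInvalid = true) == 2.0)
  assert(assignBucket(splits, None, keepInvalid = true) == 2.0) // NULL, like NaN
  println("sketch behaves as described")
}
```

With this shape, `feature.isEmpty` distinguishes only the NULL case; a `Some(NaN)` still has to be caught separately, which is why the match above tests `!v.isNaN` rather than emptiness alone.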

@imatiach-msft
Contributor

@crackcell I'm not sure about changing the UDF to be on a row instead of a column; I've found that the serialization costs are much higher and the Spark code performs much worse. Maybe an expert like @cloud-fan can comment more here? Can you keep the UDF on a column instead of a row?

@cloud-fan
Contributor

cc @WeichenXu123

Contributor

@WeichenXu123 WeichenXu123 left a comment


Confirmed with @jkbradley offline. It would be nice to fix. Thanks!

}
}

val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
Contributor


As @cloud-fan suggested, Option[Double] is better. :-)

private[feature] def binarySearchForBuckets(
splits: Array[Double],
feature: Double,
feature: java.lang.Double,
Contributor


Also change to Option[Double] here.

@WeichenXu123
Contributor

But please resolve the conflicts first. :) Bucketizer has added multiple-column support, so the code is different now.

@crackcell
Author

@WeichenXu123 Sorry, I missed the message for two days; I'm working on it.

* otherwise, values outside the splits specified will be treated as errors.
*
* See also [[handleInvalid]], which can optionally create an additional bucket for NaN values.
* See also [[handleInvalid]], which can optionally create an additional bucket for NaN/NULL
Member


This sounds like a behavior change, we should add an item in migration guide of ML docs.

Author


@viirya done.

@crackcell
Copy link
Author

@WeichenXu123 I have finished my work, please review it. Any suggestion is welcome. :-)

docs/ml-guide.md Outdated
We are now setting the default parallelism used in `OneVsRest` to 1 (i.e. serial); in 2.2 and earlier versions,
the `OneVsRest` parallelism was the parallelism of the default thread pool in Scala.
* [SPARK-19781](https://issues.apache.org/jira/browse/SPARK-19781):
`Bucketizer` handles NULL values the same way as NaN when handleInvalid is skip or keep.
Member

@viirya viirya Jan 22, 2018


hmm, I think for skip, dataset.na.drop drops NULL before. We didn't change its behavior.

Author


Yep, you are right. :-p

@AmplabJenkins

Can one of the admins verify this patch?

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jan 16, 2020
@github-actions github-actions bot closed this Jan 17, 2020


8 participants