[SPARK-14891][ML] Add schema validation for ALS #12762
Conversation
So it seems like the only other numeric type we support is Long, maybe it would be better to say that? Someone might try to pass in BigInts or Doubles and expect this to work.
We "support" all numeric types in the sense that the input col can be any numeric type. But it is cast to Int. It is a "safe" cast though, if it is > Int.MaxValue or < Int.MinValue it throws an exception. "Safe" in the sense that it won't mangle the user's input ids (e.g. if Longs are passed in they will now get a failure on fit rather than a silent cast of those Long ids into Ints).
Ah yes, I didn't notice the first cast from input type to Long - it seems like that would be OK[ish] most of the time (except with floats/doubles). But with certain BigDecimal values you could end up throwing away the high bits when going to a Long, and a very out-of-range value would then pass the range check.
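For instance (a plain Scala illustration of the truncation hazard, not Spark SQL's exact cast semantics):

```scala
// Truncating to Long keeps only the low 64 bits, so a wildly out-of-range id
// can wrap around to a small value that then passes an Int-range check.
val huge = BigInt(2).pow(64) + 7   // far beyond Long (and Int) range
val truncated = huge.toLong        // 7 -- high bits silently discarded
assert(truncated <= Int.MaxValue && truncated >= Int.MinValue)  // check passes, id is wrong
```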
Ah ok. Could cast to double or float here... I was just concerned about any storage / performance impact, but if everything is pipelined through the cast -> udf then no problem.
I think keeping the cast to Integer is good for performance - but maybe we should just avoid supporting input types that we might silently fail on or produce junk results for.
cc @jkbradley
Personally as per the related JIRA, I would actually like to support Int, Long and String for ids in ALS (with appropriate warnings about performance impact for Long/String ids). For the vast majority of use cases I believe the user-friendliness of supporting String in particular outweighs the performance impact. For those users who need performance at scale, they can stick to Int.
But for now, since only Int ids are supported in the DF API, some validation is better than nothing. I am actually slightly more in favor of only supporting Int or Long for the id columns in this PR, since the real-world occurrence of a Double or other more esoteric numeric type for the id column is, IMO, highly unlikely, and in that case requiring users to do the cast explicitly themselves is ok I would say.
So we could support only Ints and Longs (within Integer range) as a simpler alternative here - it would just require updating the type checks in transformSchema and the tests.
@jkbradley @srowen @holdenk thoughts?
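For concreteness, a sketch of what the stricter Int/Long-only alternative could look like (hypothetical helper, not the code in this PR):

```scala
import org.apache.spark.sql.types.{IntegerType, LongType, StructType}

// Hypothetical alternative check: only accept IntegerType or LongType id columns,
// instead of any numeric type.
def checkIdColumn(schema: StructType, colName: String): Unit = {
  val dt = schema(colName).dataType
  require(dt == IntegerType || dt == LongType,
    s"Column $colName must be of type IntegerType or LongType but was $dt.")
}
```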
ping @mengxr also in case you have a chance to take a look, and consider this question of whether to only support Int/Long for ids or support all numeric types (with "safe" cast to Int in both cases)
Sorry for the slow answer. I like supporting all Numeric types, with checking. I agree we should support String IDs at some point, with automatic indexing; that can be part of this discussion: [https://issues.apache.org/jira/browse/SPARK-11106]
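As an aside on the String-id direction, a user-side workaround today is to index String ids before ALS; a rough sketch (assumes a `ratings` DataFrame with String `user`/`item` columns and a numeric `rating` column):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.recommendation.ALS

// Map String ids to numeric indices, then feed the indexed columns to ALS.
val userIndexer = new StringIndexer().setInputCol("user").setOutputCol("userIndex")
val itemIndexer = new StringIndexer().setInputCol("item").setOutputCol("itemIndex")
val als = new ALS()
  .setUserCol("userIndex")
  .setItemCol("itemIndex")
  .setRatingCol("rating")

val pipeline = new Pipeline().setStages(Array(userIndexer, itemIndexer, als))
// val model = pipeline.fit(ratings)
```

StringIndexer emits Double indices, which is exactly the kind of non-Int numeric column a checked cast to Int can accept safely.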
Test build #57267 has finished for PR 12762 at commit
not sure this is used
ah yes that's a left-over, will remove.
LGTM except for a few minors.
Test build #57319 has finished for PR 12762 at commit
This looks good to me. Thanks!
Force-pushed from fb443ef to 2a7bb9f
@yanboliang @holdenk @BenFradet @jkbradley I went ahead and just cast the user/item col to double before the checked cast, to support all numeric types for user/item ids.
Test build #57759 has finished for PR 12762 at commit
Force-pushed from 70a29f5 to f0ecde7
Test build #58395 has finished for PR 12762 at commit
LGTM, except for maybe the generics in
Test build #58455 has finished for PR 12762 at commit
Force-pushed from 150321f to 1fd1f87
Test build #58633 has finished for PR 12762 at commit
jenkins retest this please
Test build #58776 has finished for PR 12762 at commit
This PR adds schema validation to `ml`'s ALS and ALSModel. Currently, no schema validation is performed, as `transformSchema` is never called in `ALS.fit` or `ALSModel.transform`. Furthermore, because there is no schema validation, if users pass in Long (or Float, etc.) ids, they are silently cast to Int with no warning or error thrown.

With this PR, ALS now supports all numeric types for the `user`, `item`, and `rating` columns. The rating column is cast to `Float` and the user and item cols are cast to `Int` (as is the case currently); however, for user/item, the cast throws an error if the value is outside Integer range. Behavior for the rating col is unchanged (as it is not an issue).

## How was this patch tested?

New test cases in `ALSSuite`.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #12762 from MLnick/SPARK-14891-als-validate-schema.

(cherry picked from commit e8b79af)
Signed-off-by: Nick Pentreath <nickp@za.ibm.com>
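A simplified sketch of the kind of behavior the new tests check (hypothetical code, assuming an active SparkSession `spark` inside a ScalaTest suite; not the actual `ALSSuite` contents): numeric ids inside Integer range fit fine, while out-of-range ids now fail fast.

```scala
import org.apache.spark.ml.recommendation.ALS
import spark.implicits._

val als = new ALS()
  .setMaxIter(1).setRank(2)
  .setUserCol("user").setItemCol("item").setRatingCol("rating")

// Long ids within Integer range: accepted via the checked cast.
val ok = Seq((1L, 10L, 1.0f), (2L, 20L, 2.0f)).toDF("user", "item", "rating")
als.fit(ok)

// Long id outside Integer range: fit fails with an exception instead of silently truncating.
val bad = Seq((Int.MaxValue.toLong + 1L, 10L, 1.0f)).toDF("user", "item", "rating")
intercept[Exception] { als.fit(bad) }
```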
Merged to master/branch-2.0
.join(userFactors, dataset($(userCol)) === userFactors("id"), "left")
.join(itemFactors, dataset($(itemCol)) === itemFactors("id"), "left")
.join(userFactors,
  checkedCast(dataset($(userCol)).cast(DoubleType)) === userFactors("id"), "left")
Shouldn't this and the next line's cast both use IntegerType?
@jkbradley the existing code did the cast to Int - that means passing in Long or Double (say) would silently cast and potentially lose precision and give weird results, with no exception or warning. That's why here we cast to DoubleType and use the checkedCast udf to do a safe cast to Int if the value is within Int range; if not, we throw an exception with a helpful message.
This is so we can allow any numeric type for the user/item columns (providing some form of "backward compatibility" with the old version that didn't check types at all), but we can still only support actual values that are Ints.
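A tiny illustration of why the intermediate cast is to DoubleType rather than IntegerType (illustrative snippet, assuming a SparkSession `spark` and Spark's default non-ANSI cast behavior):

```scala
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{DoubleType, IntegerType}

val id = 3000000000L  // a Long id just above Int.MaxValue

// Casting straight to Int silently wraps (keeps the low 32 bits), so a later
// range check would see a "valid" but wrong Int value.
spark.range(1).select(lit(id).cast(IntegerType)).show()  // -1294967296

// Casting to Double preserves the magnitude, so the checkedCast udf can see the
// value is outside Integer range and fail with a helpful error instead.
spark.range(1).select(lit(id).cast(DoubleType)).show()   // 3.0E9
```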
Right, makes sense, thanks!