Skip to content

Conversation

@MLnick
Copy link
Contributor

@MLnick MLnick commented Apr 28, 2016

This PR adds schema validation to ml's ALS and ALSModel. Currently, no schema validation was performed as transformSchema was never called in ALS.fit or ALSModel.transform. Furthermore, due to no schema validation, if users passed in Long (or Float etc) ids, they would be silently cast to Int with no warning or error thrown.

With this PR, ALS now supports all numeric types for user, item, and rating columns. The rating column is cast to Float and the user and item cols are cast to Int (as is the case currently) - however for user/item, the cast throws an error if the value is outside integer range. Behavior for rating col is unchanged (as it is not an issue).

How was this patch tested?

New test cases in ALSSuite.

@MLnick
Copy link
Contributor Author

MLnick commented Apr 28, 2016

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it seems like the only other numeric type we support is Long, maybe it would be better to say that? Someone might try and pass in BigInts or Doubles and expect this work.

Copy link
Contributor Author

@MLnick MLnick Apr 28, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We "support" all numeric types in the sense that the input col can be any numeric type. But it is cast to Int. It is a "safe" cast though, if it is > Int.MaxValue or < Int.MinValue it throws an exception. "Safe" in the sense that it won't mangle the user's input ids (e.g. if Longs are passed in they will now get a failure on fit rather than a silent cast of those Long ids into Ints).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, I didn't notice the first cast from input type to Long - it seems like that would be OK[ish] most of the time (except with floats/doubles), but also with certain BigDecimal you could end up throwing away the high bits when going to a Long and a very out of range value would pass the range check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok. Could cast to double or float here... I was just concerned about any
storage / performance impact, but if everything is pipelines through the
cast -> udf then no problem
On Thu, 28 Apr 2016 at 21:27, Holden Karau notifications@github.com wrote:

In mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
#12762 (comment):

/** @group getParam */
def getUserCol: String = $(userCol)

/**

  • * Param for the column name for item ids.
  • * Param for the column name for item ids. Ids must be integers. Other
  • * numeric types are supported for this column, but will be cast to integers as long as they

Ah yes, I didn't notice the first cast from input type to Long - it seems
like that would be OK[ish] most of the time (except with floats/doubles),
but also with certain BigDecimal you could end up throwing away the high
bits when going to a Long and a very out of range value would pass the
range check.


You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub
https://github.com/apache/spark/pull/12762/files/73ea0b62f1c0ae6a9897ec83f5c8dfedea86f3f9#r61487974

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think keeping cast to Integer is good for performance - but maybe just avoiding supporting input types that we might silently fail on/produce junk results for.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@MLnick MLnick Apr 29, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally as per the related JIRA, I would actually like to support Int, Long and String for ids in ALS (with appropriate warnings about performance impact for Long/String ids). For the vast majority of use cases I believe the user-friendliness of supporting String in particular outweighs the performance impact. For those users who need performance at scale, they can stick to Int.

But for now, since only Int ids are supported in the DF API, some validation is better than nothing. I am actually slightly more in favor of only supporting Int or Long for the id columns in this PR, since the real-world occurrence of a Double or other more esoteric numeric type for the id column is, IMO, highly unlikely, and in that case requiring users to do the cast explicitly themselves is ok I would say.

So we can support only Int and Longs (within Integer range) as a simpler alternative here - it would just require to update the type checks in transformSchema and the tests.

@jkbradley @srowen @holdenk thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping @mengxr also in case you have a chance to take a look, and consider this question of whether to only support Int/Long for ids or support all numeric types (with "safe" cast to Int in both cases)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the slow answer. I like supporting all Numeric types, with checking. I agree we should support String IDs at some point, with automatic indexing; that can be part of this discussion: [https://issues.apache.org/jira/browse/SPARK-11106]

@SparkQA
Copy link

SparkQA commented Apr 28, 2016

Test build #57267 has finished for PR 12762 at commit 73ea0b6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure this is used

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yes that's a left-over, will remove.

@BenFradet
Copy link
Contributor

LGTM except for a few minors.

@SparkQA
Copy link

SparkQA commented Apr 29, 2016

Test build #57319 has finished for PR 12762 at commit fb443ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Copy link
Contributor

This looks good to me. Thanks!

@MLnick MLnick force-pushed the SPARK-14891-als-validate-schema branch from fb443ef to 2a7bb9f Compare May 4, 2016 13:02
@MLnick
Copy link
Contributor Author

MLnick commented May 4, 2016

@yanboliang @holdenk @BenFradet @jkbradley I went ahead and just cast user/item col to double before checked cast to support all numeric types for user/item ids.

@SparkQA
Copy link

SparkQA commented May 4, 2016

Test build #57759 has finished for PR 12762 at commit 70a29f5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick MLnick force-pushed the SPARK-14891-als-validate-schema branch from 70a29f5 to f0ecde7 Compare May 11, 2016 19:10
@SparkQA
Copy link

SparkQA commented May 11, 2016

Test build #58395 has finished for PR 12762 at commit f0ecde7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BenFradet
Copy link
Contributor

LGTM, except for maybe the generics in checkNumericTypesALS but that's really minor.

@SparkQA
Copy link

SparkQA commented May 12, 2016

Test build #58455 has finished for PR 12762 at commit 150321f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick MLnick force-pushed the SPARK-14891-als-validate-schema branch from 150321f to 1fd1f87 Compare May 16, 2016 09:10
@SparkQA
Copy link

SparkQA commented May 16, 2016

Test build #58633 has finished for PR 12762 at commit 1fd1f87.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Copy link
Contributor Author

MLnick commented May 18, 2016

jenkins retest this please

@SparkQA
Copy link

SparkQA commented May 18, 2016

Test build #58776 has finished for PR 12762 at commit 1fd1f87.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request May 18, 2016
This PR adds schema validation to `ml`'s ALS and ALSModel. Currently, no schema validation was performed as `transformSchema` was never called in `ALS.fit` or `ALSModel.transform`. Furthermore, due to no schema validation, if users passed in Long (or Float etc) ids, they would be silently cast to Int with no warning or error thrown.

With this PR, ALS now supports all numeric types for `user`, `item`, and `rating` columns. The rating column is cast to `Float` and the user and item cols are cast to `Int` (as is the case currently) - however for user/item, the cast throws an error if the value is outside integer range. Behavior for rating col is unchanged (as it is not an issue).

## How was this patch tested?
New test cases in `ALSSuite`.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #12762 from MLnick/SPARK-14891-als-validate-schema.

(cherry picked from commit e8b79af)
Signed-off-by: Nick Pentreath <nickp@za.ibm.com>
@MLnick
Copy link
Contributor Author

MLnick commented May 18, 2016

Merged to master/branch-2.0

@asfgit asfgit closed this in e8b79af May 18, 2016
.join(userFactors, dataset($(userCol)) === userFactors("id"), "left")
.join(itemFactors, dataset($(itemCol)) === itemFactors("id"), "left")
.join(userFactors,
checkedCast(dataset($(userCol)).cast(DoubleType)) === userFactors("id"), "left")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this and the next line's cast both use IntegerType?

Copy link
Contributor Author

@MLnick MLnick May 27, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkbradley the existing code did the cast to Int - that means passing in Long or Double (say) would silently cast and potentially lose precison and give weird results, with no exception or warning. That's why here we cast to DoubleType and use the checkedCast udf to do a safe cast to Int if the value is within Int value range. If not we throw an exception with a helpful message.

This is so we can allow any numeric type for the user/item columns (providing some form of "backward compatability" with the old version that didn't check types at all), but we can still only support actual values that are Ints.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, makes sense, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants