[SPARK-20307][ML][SPARKR][FOLLOW-UP] RFormula should handle invalid for both features and label column. #18613
|
Test build #79561 has finished for PR 18613 at commit
|
  model <- spark.randomForest(traindf, clicked ~ ., type = "classification",
                              maxDepth = 10, maxBins = 10, numTrees = 10,
-                             handleInvalid = "skip")
+                             handleInvalid = "keep")
Because R always sets forceIndexLabel, the label is indexed whether it is of numeric or string type, so 0.0 and 0 are treated as different labels in R. If we choose skip, every label becomes unseen. I think this is a bug; maybe we should fix it in a separate PR.
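To illustrate the comment above, here is a minimal pure-Python model of string-based label indexing with a handleInvalid-style policy. It is not Spark code: the function names `fit_label_index` and `transform` are hypothetical, the alphabetical tie-break is an assumption, and which side renders the label as "0.0" versus "0" is assumed only for illustration.

```python
from collections import Counter


def fit_label_index(labels):
    """Map each distinct label string to an index, most frequent first."""
    counts = Counter(labels)
    # Ties are broken alphabetically here only to keep the sketch deterministic.
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    return {label: i for i, label in enumerate(ordered)}


def transform(labels, index, handle_invalid="error"):
    """Index labels, applying the chosen policy to labels not seen during fit."""
    out = []
    for label in labels:
        if label in index:
            out.append(index[label])
        elif handle_invalid == "skip":
            continue  # the row is dropped
        elif handle_invalid == "keep":
            out.append(len(index))  # one extra bucket for all unseen labels
        else:
            raise ValueError("Unseen label: " + label)
    return out


# Assumed for illustration: fitting saw "0.0"/"1.0", scoring sees "0"/"1".
train_labels = ["0.0", "1.0", "0.0"]
test_labels = ["0", "1", "0"]

index = fit_label_index(train_labels)
skipped = transform(test_labels, index, "skip")  # every row is dropped
kept = transform(test_labels, index, "keep")     # every row lands in the extra bucket
```

Under these assumptions, "skip" leaves nothing to predict on because no test label matches a fitted one, while "keep" retains every row in the unseen-label bucket.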
|
In #18496 we discussed the behavior of the output prediction (#18496 (comment)), and similarly in #18613 (comment). I'd suggest we step back and review how handleInvalid should work in Scala first. I think we can still make progress in this PR and #18605, but we will likely need some changes in Scala.
|
Test build #79566 has finished for PR 18613 at commit
|
|
@felixcheung I agree. We should make the changes on the Scala side.
|
@felixcheung @wangmiao1981 In Scala, we set |
    assert(result1.collect() === expected1.collect())
    assert(result2.collect() === expected2.collect())

    // Handle unseen labels.
The following test cases failed before this PR.
|
isn't it confusing to silently drop features? |
|
@felixcheung We don't silently drop features, we use |
|
@yanboliang That's what I mean. To elaborate, I got that part in #18496; I actually asked about it at https://github.com/apache/spark/pull/18496/files#r125154606 because I thought it was confusing. OK, I agree with your assessment on starting with the same policy.
|
Merged into master. Thanks to everyone for reviewing.
What changes were proposed in this pull request?
RFormula should handle invalid values for both the features and label columns. #18496 only handles invalid values in the features column. This PR adds handling of invalid values for the label column, along with test cases.
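As a rough sketch of what applying one policy to both columns means, the pure-Python model below indexes a string feature and a string label per row. It is an assumption-laden illustration, not the RFormula implementation: `index_rows`, the sample indexes, and the sample rows are all hypothetical.

```python
def index_rows(rows, feature_index, label_index, handle_invalid="error"):
    """Apply a single handle_invalid policy to both the feature and the label."""
    out = []
    for feature, label in rows:
        f = feature_index.get(feature)
        l = label_index.get(label)
        if f is None or l is None:
            if handle_invalid == "skip":
                continue  # drop the whole row if either column is unseen
            if handle_invalid != "keep":
                raise ValueError("Unseen value in row: %r" % ((feature, label),))
            # "keep": each column gets its own extra index for unseen values.
            f = len(feature_index) if f is None else f
            l = len(label_index) if l is None else l
        out.append((f, l))
    return out


# Hypothetical fitted indexes and rows to score.
feature_index = {"a": 0, "b": 1}
label_index = {"no": 0, "yes": 1}
rows = [("a", "yes"), ("c", "no"), ("b", "maybe")]

skipped = index_rows(rows, feature_index, label_index, "skip")
kept = index_rows(rows, feature_index, label_index, "keep")
```

In this model, "skip" keeps only the first row, since the second has an unseen feature and the third an unseen label, while "keep" retains all three rows with the unseen values mapped to the extra index.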
How was this patch tested?
Add test cases.