-
Notifications
You must be signed in to change notification settings - Fork 29k
SPARK-14139 Dataset loses nullability in operations with RowEncoder #11980
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SPARK-14139 Dataset loses nullability in operations with RowEncoder #11980
Conversation
|
Test build #54264 has finished for PR 11980 at commit
|
|
Test build #54265 has finished for PR 11980 at commit
|
|
cc @cloud-fan |
| dataType: DataType, | ||
| arguments: Seq[Expression] = Nil) extends Expression with NonSQLExpression { | ||
| arguments: Seq[Expression] = Nil, | ||
| val nullable: Boolean = true) extends Expression with NonSQLExpression { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks tricky to me, invoking a method on an object should be always nullable, as we have no idea what that method does. To fix the nullability lose issue, how about we create a new expression called GetExternalRowField? We can setup corrected nullability there and generate simpler code for this case.
|
@cloud-fan i tried to do that, but i don't think i am familiar enough with the code gen, because it breaks other unit tests. it seems to me i am messing up with internal vs external types. |
…test RowEncoderSuite:encode/decode:Product
|
@cloud-fan i pushed at attempt at this, but i am having trouble with RowEncoderSuite encode/decode: Product |
|
Test build #54457 has finished for PR 11980 at commit
|
| f.dataType | ||
| ) | ||
| if (f.nullable) { | ||
| If( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can do the null check inside GetExternalRowField
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i had that before (basically without the If, since GetExternalRowField already does null checks inside), but the issue is that extractorsFor code path then also runs for nulls , and that causes some weird runtime errors. i will try again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we do the null check inside GetExternalRowField then the code for serializerFor also needs to be pushed into it (to be inside the null check), and i can not figure out how to do that...
|
Test build #54809 has finished for PR 11980 at commit
|
|
(This is trivial but might be better if the title follows |
## What changes were proposed in this pull request? The problem is: In `RowEncoder`, we use `Invoke` to get the field of an external row, which lose the nullability information. This PR creates a `GetExternalRowField` expression, so that we can preserve the nullability info. TODO: simplify the null handling logic in `RowEncoder`, to remove so many if branches, in follow-up PR. ## How was this patch tested? new tests in `RowEncoderSuite` Note that, This PR takes over #11980, with a little simplification, so all credits should go to koertkuipers Author: Wenchen Fan <wenchen@databricks.com> Author: Koert Kuipers <koert@tresata.com> Closes #12364 from cloud-fan/nullable. (cherry picked from commit 55cc1c9) Signed-off-by: Cheng Lian <lian@databricks.com>
## What changes were proposed in this pull request? The problem is: In `RowEncoder`, we use `Invoke` to get the field of an external row, which lose the nullability information. This PR creates a `GetExternalRowField` expression, so that we can preserve the nullability info. TODO: simplify the null handling logic in `RowEncoder`, to remove so many if branches, in follow-up PR. ## How was this patch tested? new tests in `RowEncoderSuite` Note that, This PR takes over apache#11980, with a little simplification, so all credits should go to koertkuipers Author: Wenchen Fan <wenchen@databricks.com> Author: Koert Kuipers <koert@tresata.com> Closes apache#12364 from cloud-fan/nullable.
|
@koertkuipers this can be closed now right? |
What changes were proposed in this pull request?
RowEncoder now respects nullability for struct fields when creating extractor expressions in the extractorsFor method.
Note that to get the correct value for nullable for the returned expression i chose to drop the If statement checking for nulls if the field has nullable=false. If this is undesired because we should defensively be checking for nulls anyhow with the If statement then that can be achieved as well, by modifying the If class, however to me that solution seems less clear/elegant.
How was this patch tested?
Added new unit test in DataFrameSuite for the bug described in the jira issue.