[SPARK-15192][SQL] null check for SparkSession.createDataFrame #13008
Conversation
Test build #58147 has finished for PR 13008 at commit
…Encoder

## What changes were proposed in this pull request?

SPARK-15241: We now support Java decimal and Catalyst decimal in external rows; it makes sense to also support Scala decimal.

SPARK-15242: This is a long-standing bug, exposed after #12364, which eliminates the `If` expression if the field is not nullable:

```
val fieldValue = serializerFor(
  GetExternalRowField(inputObject, i, externalDataTypeForInput(f.dataType)),
  f.dataType)
if (f.nullable) {
  If(
    Invoke(inputObject, "isNullAt", BooleanType, Literal(i) :: Nil),
    Literal.create(null, f.dataType),
    fieldValue)
} else {
  fieldValue
}
```

Previously, we always used `DecimalType.SYSTEM_DEFAULT` as the output type of a converted decimal field, which is wrong because it doesn't match the real decimal type. However, it used to work because we always put the converted field into an `If` expression to do the null check, and `If` uses its `trueValue`'s data type as its output type. Now, if we have a non-nullable decimal field, the converted field's output type will be `DecimalType.SYSTEM_DEFAULT`, and we will write wrong data into the unsafe row.

The fix is simple: just use the given decimal type as the output type of the converted decimal field.

These 2 issues were found at #13008.

## How was this patch tested?

New tests in `RowEncoderSuite`.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13019 from cloud-fan/encoder-decimal.

(cherry picked from commit d8935db)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
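The typing bug described in the commit message above can be modeled with a small, Spark-free sketch (the classes below are hypothetical stand-ins, not Catalyst's real expression classes): because `If` reports its `trueValue`'s data type, a wrong output type on the converted field was masked whenever the field was nullable, and only leaked through for non-nullable fields.

```scala
// Hypothetical miniature of the Catalyst typing issue (not Spark's code).
sealed trait DataType
case class DecimalT(precision: Int, scale: Int) extends DataType

sealed trait Expr { def dataType: DataType }
// a literal with an explicit data type, like Literal.create(null, f.dataType)
case class Lit(value: Any, dataType: DataType) extends Expr
// the converted decimal field, carrying whatever output type it was given
case class Converted(dataType: DataType) extends Expr
// like Catalyst's If: the output type comes from the true branch
case class IfExpr(trueValue: Expr, falseValue: Expr) extends Expr {
  val dataType: DataType = trueValue.dataType
}

val fieldType     = DecimalT(10, 2)   // the field's real decimal type
val systemDefault = DecimalT(38, 18)  // stand-in for DecimalType.SYSTEM_DEFAULT

// nullable field: the If wrapper accidentally reports the correct type,
// because the null literal in the true branch carries the real field type
val nullableField = IfExpr(Lit(null, fieldType), Converted(systemDefault))

// non-nullable field (no If wrapper after #12364): the wrong type leaks out
val nonNullableField = Converted(systemDefault)
```

Here `nullableField.dataType` is `DecimalT(10, 2)` while `nonNullableField.dataType` is the wrong `DecimalT(38, 18)`, which is why the bug only surfaced once the `If` wrapper was eliminated for non-nullable fields.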
-    case BooleanType | ByteType | ShortType | IntegerType | LongType |
-         FloatType | DoubleType | BinaryType => true
+    case NullType | BooleanType | ByteType | ShortType | IntegerType | LongType |
+         FloatType | DoubleType | BinaryType | CalendarIntervalType => true
Why CalendarIntervalType?
Because we don't have an external representation of it.
Test build #58437 has finished for PR 13008 at commit
     StructField("f", StructType(Seq(
       StructField("a", StringType, nullable = true),
-      StructField("b", IntegerType, nullable = false)
+      StructField("b", IntegerType, nullable = true)
Why change this?
With the new null check, we trigger the error earlier than this test expected. This test is testing the `AssertNotNull` expression, which is used for converting a nullable column to a non-nullable object field (like a primitive int).
OK, so the new test (row nullability mismatch) is effectively covering this case? Then, should we rename this test? Will we hit the exception checked by this test in any other cases?
(just want to make sure we are not losing test coverage)
Yeah, the row nullability mismatch test checks the error where we pass in a null column while that column is declared as not nullable.
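The check under discussion can be sketched with a small, self-contained model (names like `Field` and `validateRow` are hypothetical, not Spark's API): a null value in a field whose schema declares it non-nullable is rejected up front, which is what makes the error fire earlier than the old `AssertNotNull`-based test expected.

```scala
// Hypothetical sketch of the null check this PR adds (not Spark's actual code):
// reject a null in any field declared non-nullable, before the row ever
// reaches the encoder.
case class Field(name: String, nullable: Boolean)

def validateRow(schema: Seq[Field], row: Seq[Any]): Unit =
  schema.zip(row).foreach { case (field, value) =>
    if (value == null && !field.nullable)
      throw new IllegalArgumentException(
        s"The '${field.name}' field is null but is declared non-nullable")
  }

val schema = Seq(Field("a", nullable = true), Field("b", nullable = false))

// ok: null appears only where nullable = true
validateRow(schema, Seq(null, 1))

// passing null for the non-nullable field "b" now fails fast:
// validateRow(schema, Seq("x", null))  // throws IllegalArgumentException
```

With this in place, a nullability mismatch surfaces as an explicit error at `createDataFrame` time instead of a later failure deep inside the encoder.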
thanks!
test this please
LGTM pending jenkins
Test build #58450 has finished for PR 13008 at commit
legitimate issue?
Yeah, the new commit should fix it.
what is the cause of those failed tests?
Unlike
I see. Seems like an API change that we at least need to document. Is there any performance implication? Also cc @mengxr
Looks like we haven't documented what kinds of field object types are allowed in a
Oh, actually we did document it in the Java doc of
Test build #58451 has finished for PR 13008 at commit
Test build #58461 has finished for PR 13008 at commit
Test build #58462 has finished for PR 13008 at commit
Test build #58680 has finished for PR 13008 at commit
Test build #58678 has finished for PR 13008 at commit
retest this please
Test build #58695 has finished for PR 13008 at commit
Test build #58741 has finished for PR 13008 at commit
Test build #58771 has finished for PR 13008 at commit
     val schema = StructType(fields)
     val rowDataRDD = model.freqItemsets.map { x =>
-      Row(x.items, x.freq)
+      Row(x.items.toSeq, x.freq)
Do we need to call `toSeq` here?
We do. This is a special case: `FPGrowthModel` has a type parameter and we use `FPGrowthModel[_]` here, so `x.items` returns `Object[]` instead of the `T[]` we expected, and it doesn't match the schema.
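A tiny, Spark-free illustration of the erasure issue (the `Model` class below is a hypothetical stand-in for `FPGrowthModel[_]`): inside generic code with no `ClassTag`, an array of `T` can only be materialized as `Array[AnyRef]`, i.e. a Java `Object[]`, whose runtime class would not match a schema expecting, say, an array of strings. Converting with `toSeq` keeps the elements but stops exposing the array's runtime class at all.

```scala
// Hypothetical stand-in for FPGrowthModel[_]: the type parameter is erased
// and no ClassTag is in scope, so the only array we can build is
// Array[AnyRef] -- a Java Object[].
class Model[T](elems: Seq[T]) {
  val items: Array[AnyRef] = elems.map(_.asInstanceOf[AnyRef]).toArray
}

val m = new Model(Seq("a", "b"))

// The runtime component type is Object, not String:
m.items.getClass.getComponentType == classOf[AnyRef]  // true

// A Seq carries no array runtime class, so there is nothing to mismatch:
m.items.toSeq
```

This is why the diff above wraps `x.items` in `toSeq` before putting it into the `Row`: the schema comparison then sees a `Seq` of the elements rather than an `Object[]`.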
Thanks! Merging to master and branch 2.0.
## What changes were proposed in this pull request?

This PR adds a null check in `SparkSession.createDataFrame`, so that we can make sure the passed-in rows match the given schema.

## How was this patch tested?

New tests in `DatasetSuite`.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13008 from cloud-fan/row-encoder.

(cherry picked from commit ebfe3a1)
Signed-off-by: Yin Huai <yhuai@databricks.com>