
Conversation

@cloud-fan (Contributor) commented May 9, 2016

What changes were proposed in this pull request?

This PR adds a null check in `SparkSession.createDataFrame`, so that we can make sure the passed-in rows match the given schema.

How was this patch tested?

new tests in DatasetSuite
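
A minimal sketch of the behavior the new check adds, assuming a local SparkSession; the exact exception type and message are not reproduced here:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local").appName("null-check-sketch").getOrCreate()

// Field "i" is declared non-nullable, but the row contains a null for it.
val schema = StructType(Seq(
  StructField("s", StringType, nullable = true),
  StructField("i", IntegerType, nullable = false)))
val rows = spark.sparkContext.parallelize(Seq(Row("a", null)))

// With this PR, evaluating the rows fails because they do not match the schema,
// instead of silently writing bad data into the resulting DataFrame.
spark.createDataFrame(rows, schema).collect()
```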

@cloud-fan cloud-fan changed the title [SPARK-15192][] null check for SparkSession.createDataFrame [SPARK-15192][SQL] null check for SparkSession.createDataFrame May 9, 2016
@SparkQA commented May 9, 2016

Test build #58147 has finished for PR 13008 at commit 8f0a0bf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request May 11, 2016
…Encoder

## What changes were proposed in this pull request?

SPARK-15241: We now support Java decimal and Catalyst decimal in external rows, so it makes sense to also support Scala decimal.

SPARK-15242: This is a long-standing bug, and it is exposed after #12364, which eliminates the `If` expression if the field is not nullable:
```scala
val fieldValue = serializerFor(
  GetExternalRowField(inputObject, i, externalDataTypeForInput(f.dataType)),
  f.dataType)
if (f.nullable) {
  If(
    Invoke(inputObject, "isNullAt", BooleanType, Literal(i) :: Nil),
    Literal.create(null, f.dataType),
    fieldValue)
} else {
  fieldValue
}
```

Previously, we always used `DecimalType.SYSTEM_DEFAULT` as the output type of the converted decimal field, which is wrong because it doesn't match the real decimal type. However, it worked because we always put the converted field into an `If` expression to do the null check, and `If` uses its `trueValue`'s data type as its output type.
Now, if we have a non-nullable decimal field, the converted field's output type will be `DecimalType.SYSTEM_DEFAULT`, and we will write wrong data into the unsafe row.

The fix is simple: just use the given decimal type as the output type of the converted decimal field.
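
As a minimal standalone illustration of the mismatch (the precision and scale below are arbitrary):

```scala
import org.apache.spark.sql.types.DecimalType

val declared = DecimalType(10, 2)          // the decimal type declared in the schema
val default  = DecimalType.SYSTEM_DEFAULT  // DecimalType(38, 18)

// If the converted field reports `default` while the schema declares `declared`,
// the unsafe row writer lays out the decimal for the wrong precision and scale.
// The fix carries `declared` through as the converted field's output type.
println(declared == default)  // false: the mismatch that the If wrapper used to hide
```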

These two issues were found in #13008.

## How was this patch tested?

new tests in RowEncoderSuite

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13019 from cloud-fan/encoder-decimal.

(cherry picked from commit d8935db)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
- case BooleanType | ByteType | ShortType | IntegerType | LongType |
-      FloatType | DoubleType | BinaryType => true
+ case NullType | BooleanType | ByteType | ShortType | IntegerType | LongType |
+      FloatType | DoubleType | BinaryType | CalendarIntervalType => true
Contributor:

Why CalendarIntervalType?

Contributor Author:

Because we don't have an external representation of it.

@SparkQA commented May 12, 2016

Test build #58437 has finished for PR 13008 at commit 7419a52.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

StructField("f", StructType(Seq(
StructField("a", StringType, nullable = true),
StructField("b", IntegerType, nullable = false)
StructField("b", IntegerType, nullable = true)
Contributor:

Why change this?

Contributor Author:

With the new null check, we will trigger the error earlier than this test expected. This test is testing the `AssertNotNull` expression, which is used for converting a nullable column to a non-nullable object field (like a primitive int).
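
A hedged sketch of the case that test exercises, assuming a local session; the class and field names here are illustrative:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

case class Boxed(a: String, b: Int)  // primitive Int field: cannot hold null

val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._

// The column "b" is declared nullable, so the new row-level null check passes...
val schema = StructType(Seq(
  StructField("a", StringType, nullable = true),
  StructField("b", IntegerType, nullable = true)))
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row("x", null))), schema)

// ...but converting it to a typed Dataset with a non-nullable (primitive) field
// trips AssertNotNull when the data is materialized.
df.as[Boxed].collect()
```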

Contributor:

OK, so the new test (row nullability mismatch) effectively covers this case? Then should we change the name of this test? Will we hit the exception checked by this test in any other cases?

Contributor:

(just want to make sure we are not losing test coverage)

Contributor Author:

Yea, the row nullability mismatch test checks the error thrown when we pass in a null for a column that is declared as not nullable.

Contributor:

thanks!

@yhuai (Contributor) commented May 12, 2016

test this please

@yhuai (Contributor) commented May 12, 2016

LGTM pending jenkins

@SparkQA commented May 12, 2016

Test build #58450 has finished for PR 13008 at commit 7419a52.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai (Contributor) commented May 12, 2016

legitimate issue?

@cloud-fan (Contributor Author)

Yea, the new commit should fix it.

@yhuai (Contributor) commented May 12, 2016

what is the cause of those failed tests?

@cloud-fan (Contributor Author)

Unlike CatalystConverter, RowEncoder is stricter about the external input types: e.g. users must use Seq for ArrayType, while CatalystConverter also allows Array.
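
A small sketch of the stricter expectation, assuming a local session; the failure surfaces once the rows are actually evaluated:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local").getOrCreate()
val schema = StructType(Seq(StructField("items", ArrayType(StringType), nullable = false)))

// Matches the documented external type for ArrayType: a Scala Seq.
spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(Seq("a", "b")))), schema).collect()

// A JVM array was tolerated by the old converter, but the stricter RowEncoder
// path rejects it when the rows are converted.
spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(Array("a", "b")))), schema).collect()
```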

@yhuai (Contributor) commented May 12, 2016

I see. Seems like an API change that we at least need to document.

Is there any performance implication?

also cc @mengxr

@cloud-fan (Contributor Author)

Looks like we haven't documented what kinds of field object types are allowed in a Row; let me find a place to document it.

@cloud-fan (Contributor Author)

Oh, actually we did document it in the java doc of Row, and it says users should use Seq for array types. See https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala#L139-L154
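
For reference, a short sketch of the mapping that the linked doc describes (external Scala values on the left, Catalyst types in the comments; illustrative, not exhaustive):

```scala
import org.apache.spark.sql.Row

val row = Row(
  "str",            // StringType  -> String
  42,               // IntegerType -> Int
  Seq(1, 2, 3),     // ArrayType   -> scala.collection.Seq
  Map("k" -> 1),    // MapType     -> scala.collection.Map
  Row("nested", 1)  // StructType  -> Row
)
```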

@SparkQA commented May 12, 2016

Test build #58451 has finished for PR 13008 at commit 0915a71.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 12, 2016

Test build #58461 has finished for PR 13008 at commit 114f362.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 12, 2016

Test build #58462 has finished for PR 13008 at commit 3acf24f.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 17, 2016

Test build #58680 has finished for PR 13008 at commit 225128d.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 17, 2016

Test build #58678 has finished for PR 13008 at commit 0fd8a90.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor Author)

retest this please

@SparkQA commented May 17, 2016

Test build #58695 has finished for PR 13008 at commit 225128d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 18, 2016

Test build #58741 has finished for PR 13008 at commit 57efddb.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 18, 2016

Test build #58771 has finished for PR 13008 at commit f533188.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  val schema = StructType(fields)
  val rowDataRDD = model.freqItemsets.map { x =>
-   Row(x.items, x.freq)
+   Row(x.items.toSeq, x.freq)
Contributor:

Do we need to call toSeq here?

Contributor Author:

We do. This is a special case: FPGrowthModel has a type parameter and we use FPGrowthModel[_] here, so x.items returns Object[] instead of the expected T[] and doesn't match the schema.
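
A small standalone sketch of the erasure effect described here; the helper name is made up for illustration:

```scala
import scala.reflect.ClassTag

// Mimics code that builds an item array under a ClassTag, as the FP-growth code does.
def makeItems[T: ClassTag](xs: T*): Array[T] = xs.toArray

val typed = makeItems("a", "b")
println(typed.getClass.getSimpleName)   // String[]: the element type is known at the call site

// When the type parameter is erased to Any (as with FPGrowthModel[_]), the same
// call yields Object[], which no longer matches an ArrayType(StringType) schema;
// converting with .toSeq sidesteps the element-class mismatch.
val erased = makeItems[Any]("a", "b")
println(erased.getClass.getSimpleName)  // Object[]
```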

@yhuai (Contributor) commented May 19, 2016

Thanks! Merging to master and branch 2.0.

asfgit pushed a commit that referenced this pull request May 19, 2016
## What changes were proposed in this pull request?

This PR adds null check in `SparkSession.createDataFrame`, so that we can make sure the passed in rows matches the given schema.

## How was this patch tested?

new tests in `DatasetSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13008 from cloud-fan/row-encoder.

(cherry picked from commit ebfe3a1)
Signed-off-by: Yin Huai <yhuai@databricks.com>
@asfgit asfgit closed this in ebfe3a1 May 19, 2016