
Conversation

@cloud-fan (Contributor) commented May 9, 2016

What changes were proposed in this pull request?

This PR adds a null check in `SparkSession.createDataFrame`, so that we can make sure the passed-in rows match the given schema.

How was this patch tested?

new tests in DatasetSuite
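
A minimal sketch of the behavior the new check adds, assuming a local SparkSession; the exact exception type and message are not reproduced here:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local").appName("null-check-sketch").getOrCreate()

// Field "i" is declared non-nullable, but the row contains a null for it.
val schema = StructType(Seq(
  StructField("s", StringType, nullable = true),
  StructField("i", IntegerType, nullable = false)))
val rows = spark.sparkContext.parallelize(Seq(Row("a", null)))

// With this PR, evaluating the rows fails because they do not match the schema,
// instead of silently writing bad data into the resulting DataFrame.
spark.createDataFrame(rows, schema).collect()
```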

@cloud-fan cloud-fan changed the title [SPARK-15192][] null check for SparkSession.createDataFrame [SPARK-15192][SQL] null check for SparkSession.createDataFrame May 9, 2016
@SparkQA commented May 9, 2016

Test build #58147 has finished for PR 13008 at commit 8f0a0bf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request May 11, 2016
…Encoder

## What changes were proposed in this pull request?

SPARK-15241: We now support Java decimal and Catalyst decimal in external rows, so it makes sense to also support Scala decimal.

SPARK-15242: This is a long-standing bug, and it is exposed after #12364, which eliminates the `If` expression if the field is not nullable:
```scala
val fieldValue = serializerFor(
  GetExternalRowField(inputObject, i, externalDataTypeForInput(f.dataType)),
  f.dataType)
if (f.nullable) {
  If(
    Invoke(inputObject, "isNullAt", BooleanType, Literal(i) :: Nil),
    Literal.create(null, f.dataType),
    fieldValue)
} else {
  fieldValue
}
```

Previously, we always used `DecimalType.SYSTEM_DEFAULT` as the output type of the converted decimal field, which is wrong because it doesn't match the real decimal type. However, it worked because we always put the converted field into an `If` expression to do the null check, and `If` uses its `trueValue`'s data type as its output type.
Now, if we have a non-nullable decimal field, the converted field's output type will be `DecimalType.SYSTEM_DEFAULT`, and we will write wrong data into the unsafe row.

The fix is simple: just use the given decimal type as the output type of the converted decimal field.
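
As a minimal standalone illustration of the mismatch (the precision and scale below are arbitrary):

```scala
import org.apache.spark.sql.types.DecimalType

val declared = DecimalType(10, 2)          // the decimal type declared in the schema
val default  = DecimalType.SYSTEM_DEFAULT  // DecimalType(38, 18)

// If the converted field reports `default` while the schema declares `declared`,
// the unsafe row writer lays out the decimal for the wrong precision and scale.
// The fix carries `declared` through as the converted field's output type.
println(declared == default)  // false: the mismatch that the If wrapper used to hide
```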

These two issues were found in #13008.

## How was this patch tested?

new tests in RowEncoderSuite

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13019 from cloud-fan/encoder-decimal.

(cherry picked from commit d8935db)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
- case BooleanType | ByteType | ShortType | IntegerType | LongType |
-      FloatType | DoubleType | BinaryType => true
+ case NullType | BooleanType | ByteType | ShortType | IntegerType | LongType |
+      FloatType | DoubleType | BinaryType | CalendarIntervalType => true
Contributor:

Why CalendarIntervalType?

Contributor Author:

Because we don't have an external representation of it.

@SparkQA commented May 12, 2016

Test build #58437 has finished for PR 13008 at commit 7419a52.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

StructField("f", StructType(Seq(
StructField("a", StringType, nullable = true),
StructField("b", IntegerType, nullable = false)
StructField("b", IntegerType, nullable = true)
Contributor:

Why change this?

Contributor Author:

With the new null check, we will trigger the error earlier than this test expected. This test is testing the `AssertNotNull` expression, which is used for converting a nullable column to a non-nullable object field (like a primitive int).
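
A hedged sketch of the case that test exercises, assuming a local session; the class and field names here are illustrative:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

case class Boxed(a: String, b: Int)  // primitive Int field: cannot hold null

val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._

// The column "b" is declared nullable, so the new row-level null check passes...
val schema = StructType(Seq(
  StructField("a", StringType, nullable = true),
  StructField("b", IntegerType, nullable = true)))
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row("x", null))), schema)

// ...but converting it to a typed Dataset with a non-nullable (primitive) field
// trips AssertNotNull when the data is materialized.
df.as[Boxed].collect()
```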

Contributor:

OK, so the new test (row nullability mismatch) effectively covers this case? Then should we change the name of this test? Will we hit the exception checked by this test in any other cases?

Contributor:

(just want to make sure we are not losing test coverage)

Contributor Author:

Yea, the row nullability mismatch test checks the error thrown when we pass in a null for a column that is declared as not nullable.

Contributor:

thanks!

@yhuai (Contributor) commented May 12, 2016

test this please

@yhuai (Contributor) commented May 12, 2016

LGTM pending jenkins

@SparkQA commented May 12, 2016

Test build #58450 has finished for PR 13008 at commit 7419a52.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai (Contributor) commented May 12, 2016

legitimate issue?

@cloud-fan (Contributor Author)

Yea, the new commit should fix it.

@yhuai (Contributor) commented May 12, 2016

what is the cause of those failed tests?

@cloud-fan (Contributor Author)

Unlike CatalystConverter, RowEncoder is stricter about the external input types: e.g. users must use Seq for ArrayType, while CatalystConverter also allows Array.
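
A small sketch of the stricter expectation, assuming a local session; the failure surfaces once the rows are actually evaluated:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local").getOrCreate()
val schema = StructType(Seq(StructField("items", ArrayType(StringType), nullable = false)))

// Matches the documented external type for ArrayType: a Scala Seq.
spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(Seq("a", "b")))), schema).collect()

// A JVM array was tolerated by the old converter, but the stricter RowEncoder
// path rejects it when the rows are converted.
spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(Array("a", "b")))), schema).collect()
```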

@yhuai (Contributor) commented May 12, 2016

I see. Seems like an API change that we at least need to document.

Is there any performance implication?

also cc @mengxr

@cloud-fan (Contributor Author)

Looks like we haven't documented what kinds of field object types are allowed in a Row; let me find a place to document it.

@cloud-fan (Contributor Author)

Oh, actually we did document it in the java doc of Row, and it says users should use Seq for array types. See https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala#L139-L154
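
For reference, a short sketch of the mapping that the linked doc describes (external Scala values on the left, Catalyst types in the comments; illustrative, not exhaustive):

```scala
import org.apache.spark.sql.Row

val row = Row(
  "str",            // StringType  -> String
  42,               // IntegerType -> Int
  Seq(1, 2, 3),     // ArrayType   -> scala.collection.Seq
  Map("k" -> 1),    // MapType     -> scala.collection.Map
  Row("nested", 1)  // StructType  -> Row
)
```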

@SparkQA commented May 12, 2016

Test build #58451 has finished for PR 13008 at commit 0915a71.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 12, 2016

Test build #58461 has finished for PR 13008 at commit 114f362.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 12, 2016

Test build #58462 has finished for PR 13008 at commit 3acf24f.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 17, 2016

Test build #58680 has finished for PR 13008 at commit 225128d.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 17, 2016

Test build #58678 has finished for PR 13008 at commit 0fd8a90.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor Author)

retest this please

@SparkQA commented May 17, 2016

Test build #58695 has finished for PR 13008 at commit 225128d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 18, 2016

Test build #58741 has finished for PR 13008 at commit 57efddb.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 18, 2016

Test build #58771 has finished for PR 13008 at commit f533188.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  val schema = StructType(fields)
  val rowDataRDD = model.freqItemsets.map { x =>
-   Row(x.items, x.freq)
+   Row(x.items.toSeq, x.freq)
Contributor:

Do we need to call toSeq here?

Contributor Author:

We do. This is a special case: FPGrowthModel has a type parameter and we use FPGrowthModel[_] here, so x.items returns Object[] instead of the expected T[] and doesn't match the schema.
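
A small standalone sketch of the erasure effect described here; the helper name is made up for illustration:

```scala
import scala.reflect.ClassTag

// Mimics code that builds an item array under a ClassTag, as the FP-growth code does.
def makeItems[T: ClassTag](xs: T*): Array[T] = xs.toArray

val typed = makeItems("a", "b")
println(typed.getClass.getSimpleName)   // String[]: the element type is known at the call site

// When the type parameter is erased to Any (as with FPGrowthModel[_]), the same
// call yields Object[], which no longer matches an ArrayType(StringType) schema;
// converting with .toSeq sidesteps the element-class mismatch.
val erased = makeItems[Any]("a", "b")
println(erased.getClass.getSimpleName)  // Object[]
```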

@yhuai (Contributor) commented May 19, 2016

Thanks! Merging to master and branch 2.0.

asfgit pushed a commit that referenced this pull request May 19, 2016
## What changes were proposed in this pull request?

This PR adds null check in `SparkSession.createDataFrame`, so that we can make sure the passed in rows matches the given schema.

## How was this patch tested?

new tests in `DatasetSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13008 from cloud-fan/row-encoder.

(cherry picked from commit ebfe3a1)
Signed-off-by: Yin Huai <yhuai@databricks.com>
@asfgit asfgit closed this in ebfe3a1 May 19, 2016