
Conversation

Contributor

@brkyvz brkyvz commented Nov 20, 2016

What changes were proposed in this pull request?

It turns out we are a bit too enthusiastic about handing users the partition columns when they read data, even if they didn't specify them in their schema. This causes an assertion error in Streaming jobs, because the Attributes of a given trigger don't match the Attributes returned by the DataSource: the DataSource always returns the additional partition columns.

This is weird behavior for batch as well IMHO, because someone asked for a specific schema but we returned them something else; apparently it has behaved this way since Spark 1.6 (I didn't try older versions). Anyway, I tried fixing this by not enforcing a strict size check, and instead picking out the columns that we want from the batch DataSource, as sketched below.
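
The following is a minimal, hypothetical sketch of that "pick out the columns we want" idea, not the actual patch; the dataset path, the column names, and `userSchema` are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("projection-sketch").getOrCreate()

// The user asks only for `value`; the file source may still surface extra
// inferred partition columns (e.g. `date`) in its output.
val userSchema = StructType(Seq(StructField("value", StringType)))
val df = spark.read.schema(userSchema).json("/tmp/events")

// Instead of asserting that the source output has exactly the requested columns,
// project down to the columns the user actually asked for.
val projected = df.select(userSchema.fieldNames.map(df.col): _*)
```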

How was this patch tested?

Regression test


SparkQA commented Nov 20, 2016

Test build #68899 has finished for PR 15942 at commit b4efee9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

brkyvz commented Nov 21, 2016

Closing this in favor of #15951

@brkyvz brkyvz closed this Nov 21, 2016
asfgit pushed a commit that referenced this pull request Nov 23, 2016
…Types

## What changes were proposed in this pull request?

### The Issue

If I specify my schema when doing
```scala
spark.read
  .schema(someSchemaWherePartitionColumnsAreStrings)
```
but partition inference infers those columns as IntegerType (or presumably LongType or DoubleType, basically any fixed-size type), then once UnsafeRows are generated, your data will be corrupted.
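
A hedged sketch of that scenario follows, with made-up paths and column names: the partition values look numeric, so inference would pick IntegerType for `part`, while the user declares it as a string.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("partition-type-sketch").getOrCreate()
import spark.implicits._

// Write a tiny dataset partitioned by a column whose values look numeric.
Seq((1, "a"), (2, "b")).toDF("part", "value")
  .write.partitionBy("part").mode("overwrite").parquet("/tmp/partition-type-sketch")

// Read it back declaring the partition column `part` as StringType, even though
// partition inference would produce IntegerType for it.
val userSchema = StructType(Seq(
  StructField("part", StringType),
  StructField("value", StringType)))

// Before the fix, the inferred fixed-size type could silently win over the
// declared type, corrupting the data once UnsafeRows were produced.
spark.read.schema(userSchema).parquet("/tmp/partition-type-sketch").show()
```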

### Proposed solution

The partition handling code path is kind of a mess. In my fix I'm probably adding to the mess, but at least trying to standardize the code path.

The real issue is that a user who uses the `spark.read` code path can never clearly specify what the partition columns are. If they try to specify the fields in `schema`, we practically ignore what they provide and fall back to our inferred data types. The end result is data corruption.

My solution tries to fix this by always trying to infer partition columns the first time you specify the table. Once we find what the partition columns are, we try to find them in the user-specified schema and use the dataType provided there, or fall back to the smallest common data type.

We will ALWAYS append partition columns to the user's schema, even if they didn't ask for them. We will use the data type the user provided only if they explicitly specified one. While this is confusing, it has been the behavior since Spark 1.6, and I didn't want to change it during the QA period of Spark 2.1. We may revisit this decision later.
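
Roughly, the merging rule above could be sketched as follows; this is only an illustrative simplification (the helper name `combineSchemas` is made up, and for brevity the fallback here simply uses the inferred type rather than the "smallest common data type" resolution described above):

```scala
import org.apache.spark.sql.types.{StructField, StructType}

// Merge the user-specified schema with the inferred partition schema:
// the user's declared type wins for a partition column when present,
// and partition columns are always appended after the data columns.
def combineSchemas(userSchema: StructType, inferredPartitionSchema: StructType): StructType = {
  val userFieldsByName = userSchema.fields.map(f => f.name -> f).toMap

  val partitionFields = inferredPartitionSchema.fields.map { inferred =>
    // Prefer the dataType the user declared for this partition column, if any.
    userFieldsByName.getOrElse(inferred.name, inferred)
  }

  // Data columns come first; partition columns are ALWAYS appended,
  // even if the user did not ask for them.
  val partitionNames = inferredPartitionSchema.fieldNames.toSet
  val dataFields = userSchema.fields.filterNot(f => partitionNames.contains(f.name))
  StructType(dataFields ++ partitionFields)
}
```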

A side effect of this PR is that we won't need #15942 if this PR goes in.

## How was this patch tested?

Regression tests

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #15951 from brkyvz/partition-corruption.

(cherry picked from commit 0d1bf2b)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
asfgit pushed a commit that referenced this pull request Nov 23, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 2, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
@brkyvz brkyvz deleted the filesource-part-bug branch February 3, 2019 20:58