Skip to content

Conversation

@liancheng
Copy link
Contributor

This PR fixes strict mode schema merging. To merge two PrimitiveType t1 and t2, they must satisfy the following conditions:

  1. t1 and t2 have the same primitive type name
  2. t1 and t2 either
    • don't have original type, or
    • have the same original type
  3. If t1 and t2 are both FIXED_LEN_BYTE_ARRAY, they should have the same length

Also, merged schema now preserves original name if there's any.

@liancheng
Copy link
Contributor Author

Not quite sure whether we should also deprecate the strict argument of MessageType.union. I tend to do so since Parquet itself never uses non-strict schema merging internally. Also, non-strict schema merging doesn't make sense and may lead to wrong query result as discussed in PARQUET-379.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since originalType is an Enum, seems we can just do like:

if (getOriginalType() != toMerge.getOriginalType())
  reportSchemaMergeError(toMerge);

which covers:

  1. if both are null, the condition evals to false
  2. if one is null and the other is not null, the condition evals to true
  3. if both are not null, then the condition evals to false if they take different values, and evals to true otherwise.

@liancheng what do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Being in the Scala world for so long a time, I really need to re-learn Java again :)

@lw-lin
Copy link
Contributor

lw-lin commented Jan 29, 2016

Having examined all existing 20 OriginalTypes, and 2 OriginalTypes to be added, I think this PR works well.

So:
+1, LGTM.

@julienledem
Copy link
Member

+1

@julienledem
Copy link
Member

@liancheng can you remove the square brackets in your pr title? That trips up the merge script.

@liancheng liancheng changed the title [PARQUET-385][PARQUET-379] Fixes strict schema merging PARQUET-385 PARQUET-379: Fixes strict schema merging Feb 3, 2016
@liancheng
Copy link
Contributor Author

@julienledem Done.

@asfgit asfgit closed this in c26fa78 Feb 6, 2016
piyushnarang pushed a commit to piyushnarang/parquet-mr that referenced this pull request Jun 15, 2016
This PR fixes strict mode schema merging. To merge two `PrimitiveType` `t1` and `t2`, they must satisfy the following conditions:

1. `t1` and `t2` have the same primitive type name
1. `t1` and `t2` either
   - don't have original type, or
   - have the same original type
1. If `t1` and `t2` are both `FIXED_LEN_BYTE_ARRAY`, they should have the same length

Also, merged schema now preserves original name if there's any.

Author: Cheng Lian <lian@databricks.com>

Closes apache#315 from liancheng/fix-strict-schema-merge and squashes the following commits:

a29138c [Cheng Lian] Addresses PR comment
1ac804e [Cheng Lian] Fixes strict schema merging
rdblue pushed a commit to rdblue/parquet-mr that referenced this pull request Jan 6, 2017
This PR fixes strict mode schema merging. To merge two `PrimitiveType` `t1` and `t2`, they must satisfy the following conditions:

1. `t1` and `t2` have the same primitive type name
1. `t1` and `t2` either
   - don't have original type, or
   - have the same original type
1. If `t1` and `t2` are both `FIXED_LEN_BYTE_ARRAY`, they should have the same length

Also, merged schema now preserves original name if there's any.

Author: Cheng Lian <lian@databricks.com>

Closes apache#315 from liancheng/fix-strict-schema-merge and squashes the following commits:

a29138c [Cheng Lian] Addresses PR comment
1ac804e [Cheng Lian] Fixes strict schema merging
@liancheng liancheng deleted the fix-strict-schema-merge branch February 24, 2017 20:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants