[SPARK-9170][SQL] Use OrcStructInspector to be case preserving when writing ORC files #7520
Conversation
Test build #37794 has finished for PR 7520 at commit
cc @zhzhan
I didn't see any usage for this variable.
Removed it. Thanks.
Test build #38300 has finished for PR 7520 at commit
Looks like this call will create a new object for each row written instead of reusing reusableOutputBuffer. Is that a concern?
Agreed, it could be a concern. I will update this part to create an OrcStruct first and reuse it.
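The object-reuse pattern discussed above can be sketched in plain Scala. This is an illustrative toy only, not the actual Hive/ORC serializer: ReusableStruct and serializeAll are hypothetical stand-ins for OrcStruct and the write path. The idea is to allocate one mutable record before the loop and overwrite its fields for each row, rather than allocating a fresh object per row.

```scala
// Toy sketch of the reuse pattern discussed above. ReusableStruct is a
// hypothetical stand-in for OrcStruct; the real serializer uses Hive classes.
final class ReusableStruct(numFields: Int) {
  private val fields = new Array[Any](numFields)
  def setField(i: Int, v: Any): Unit = fields(i) = v
  def getField(i: Int): Any = fields(i)
}

object ReuseDemo {
  // Serialize every row through a single pre-allocated struct instead of
  // allocating a new object for each row written.
  def serializeAll(rows: Iterator[Seq[Any]], numFields: Int)(write: ReusableStruct => Unit): Unit = {
    val reused = new ReusableStruct(numFields) // created once, reused per row
    rows.foreach { row =>
      var i = 0
      while (i < numFields) { reused.setField(i, row(i)); i += 1 }
      write(reused) // the writer must consume the struct before the next overwrite
    }
  }
}
```

The caveat in the last comment applies to the real code too: reuse is only safe if the writer fully consumes the struct before it is overwritten for the next row.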
LGTM with the comments answered or resolved.
Test build #38327 has finished for PR 7520 at commit
Conflicts: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala
Test build #38423 has finished for PR 7520 at commit
Conflicts: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala
Test build #38505 has finished for PR 7520 at commit
The failed test
retest this please.
Test build #127 has finished for PR 7520 at commit
Test build #38628 has finished for PR 7520 at commit
ping @zhzhan @liancheng
Conflicts: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala
LGTM. Will let @liancheng take a final look.
Test build #39643 has finished for PR 7520 at commit
Just curious whether we have to make this change.
Previously we created a reusable array associated with a compatible StructObjectInspector.
Now we create a reusable OrcStruct object attached to an OrcStructInspector.
It seems to make no difference to the ORC serializer, doesn't it?
StandardStructObjectInspector will implicitly lowercase column names, while OrcStruct (with OrcStructInspector) preserves them. I think there are no other significant differences.
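The difference described above can be illustrated with a small Scala sketch. The helper names here are hypothetical; the real behavior lives in Hive's StandardStructObjectInspector (which normalizes field names to lowercase) versus a case-preserving inspector such as OrcStructInspector.

```scala
// Toy illustration of the field-name handling described above. These helpers
// are hypothetical stand-ins for the two Hive object inspectors.
object InspectorDemo {
  // Mirrors StandardStructObjectInspector, which lowercases field names.
  def standardFieldNames(schema: Seq[String]): Seq[String] =
    schema.map(_.toLowerCase)

  // Mirrors a case-preserving inspector (e.g. OrcStructInspector).
  def preservingFieldNames(schema: Seq[String]): Seq[String] =
    schema
}
```

With a schema like Seq("userName", "ID"), the first helper yields Seq("username", "id") while the second returns the names unchanged, which is exactly the case-preservation property this PR is after.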
Conflicts: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala
cc @liancheng as this patch has been LGTM'd by @zhzhan for a while. Is it ok to merge this now?
@zhzhan Looks like it should work.
Also we need to change
@zhzhan @chenghao-intel Thanks for the comments. I've updated the PR title and code. Please check if it is ok for you.
@liancheng may have more insights on this part.
Ok. Thanks. I will wait for @liancheng's review.
Test build #41677 has finished for PR 7520 at commit
One thing to note is that case sensitivity of Spark SQL is configurable (see here). So I don't think we should make

If I understand this issue correctly, the root problem here is that, while writing schema information to physical ORC files, our current approach isn't case preserving.

As suggested by @chenghao-intel, when saving a DataFrame as Hive metastore tables using ORC, Spark SQL 1.5 now saves it in a Hive compatible approach, so that we can read the data back using Hive. This implies that changes made in this PR should also be compatible with Hive.

After investigating Hive's behavior for a while, I got some interesting findings. Snippets below were executed against Hive 1.2.1 (with a PostgreSQL metastore) and Spark SQL 1.5-SNAPSHOT (revision 0eeee5c).

Firstly, let's prepare a Hive ORC table: So Hive is neither case sensitive nor case preserving. We can further prove this by checking the metastore table. (I cleared my local Hive warehouse, so the only column record here is the one created above.)

Now let's read the physical ORC files directly using Spark: Huh? Why it's Surprise! So, when writing ORC files, Hive doesn't even preserve the column names.

Conclusions:
Because case sensitivity is configurable in Spark SQL. I further verified this by creating ORC files using Spark SQL and then importing them into Hive ORC tables. Didn't bother posting the results because of limited space. And I think this is the task this PR should aim at.
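The finding above — that Hive-written ORC files don't carry the real column names at all — can be sketched as a toy model. This assumes, as described in the investigation, that Hive writes positional names (_col0, _col1, ...) into the physical file and keeps the real, lowercased names only in the metastore; the helper names here are illustrative, not real Hive APIs.

```scala
// Toy model of the behavior described above: physical ORC files written by
// Hive carry positional column names, and the metastore supplies the real
// names. Resolution back to real names is purely positional.
object HiveNamingDemo {
  // Column names as Hive writes them into the physical ORC file.
  def physicalNames(numColumns: Int): Seq[String] =
    (0 until numColumns).map(i => s"_col$i")

  // Map each physical name to the (lowercased) metastore name by position.
  def resolve(metastoreNames: Seq[String]): Map[String, String] =
    physicalNames(metastoreNames.length).zip(metastoreNames).toMap
}
```

This is why reading the physical files directly (without the metastore) surfaces the surprising _colN names, while reading through the Hive table shows the lowercased originals.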
@liancheng Thanks for the clear investigation and explanation. If I understand it correctly, it means that the original direction of this PR is correct.
@viirya Yeah, I agree with you.
Test build #41686 has finished for PR 7520 at commit
@liancheng Is this ready to merge?
@viirya Oh sorry. It would be nice if you ping me after you update your PR next time :)
@liancheng Thanks. So is there any concern or review for this patch?
o => struct
@viirya This LGTM now except for a few minor issues. I can merge this once those are fixed. Thanks again for working on this and your patience!
@liancheng Thanks for reviewing. I've fixed them. Waiting for the test to pass.
Test build #42097 has finished for PR 7520 at commit
ping @liancheng
@viirya Thanks! Merging to master.
JIRA: https://issues.apache.org/jira/browse/SPARK-9170
StandardStructObjectInspector will implicitly lowercase column names, but I think the ORC format doesn't have such a requirement. In fact, there is an OrcStructInspector specified for the ORC format. We should use it when serializing rows to ORC files, so that column names are preserved as-is when writing ORC files.