[SPARK-9170][SQL] Use OrcStructInspector to be case preserving when writing ORC files #7520
Conversation
Test build #37794 has finished for PR 7520 at commit
cc @zhzhan
I didn't see any usage for this variable.
Removed it. Thanks.
Test build #38300 has finished for PR 7520 at commit
Looks like this call will create a new object for each row written instead of reusing reusableOutputBuffer. Is that a concern?
Agreed, it could be a concern. I will update this part to create an OrcStruct first and reuse it.
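The object-reuse pattern discussed above can be sketched in plain Scala. This is an illustrative toy only, not the actual Hive/ORC serializer: ReusableStruct and serializeAll are hypothetical stand-ins for OrcStruct and the write path. The idea is to allocate one mutable record before the loop and overwrite its fields for each row, rather than allocating a fresh object per row.

```scala
// Toy sketch of the reuse pattern discussed above. ReusableStruct is a
// hypothetical stand-in for OrcStruct; the real serializer uses Hive classes.
final class ReusableStruct(numFields: Int) {
  private val fields = new Array[Any](numFields)
  def setField(i: Int, v: Any): Unit = fields(i) = v
  def getField(i: Int): Any = fields(i)
}

object ReuseDemo {
  // Serialize every row through a single pre-allocated struct instead of
  // allocating a new object for each row written.
  def serializeAll(rows: Iterator[Seq[Any]], numFields: Int)(write: ReusableStruct => Unit): Unit = {
    val reused = new ReusableStruct(numFields) // created once, reused per row
    rows.foreach { row =>
      var i = 0
      while (i < numFields) { reused.setField(i, row(i)); i += 1 }
      write(reused) // the writer must consume the struct before the next overwrite
    }
  }
}
```

The caveat in the last comment applies to the real code too: reuse is only safe if the writer fully consumes the struct before it is overwritten for the next row.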
LGTM with the comments answered or resolved.
Test build #38327 has finished for PR 7520 at commit
Conflicts: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala
Test build #38423 has finished for PR 7520 at commit
Conflicts: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala
Test build #38505 has finished for PR 7520 at commit
The failed test
retest this please.
Test build #127 has finished for PR 7520 at commit
Test build #38628 has finished for PR 7520 at commit
ping @zhzhan @liancheng
Conflicts: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala
LGTM. Will let @liancheng take a final look.
Test build #39643 has finished for PR 7520 at commit
Just curious whether we have to make this change.
Previously we created a reusable array associated with a compatible StructObjectInspector.
Now we create a reusable OrcStruct object attached to an OrcStructInspector.
It seems to make no difference to the ORC serializer, doesn't it?
StandardStructObjectInspector will implicitly lowercase column names, while OrcStruct (with OrcStructInspector) preserves them. I think there are no other significant differences.
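The difference described above can be illustrated with a small Scala sketch. The helper names here are hypothetical; the real behavior lives in Hive's StandardStructObjectInspector (which normalizes field names to lowercase) versus a case-preserving inspector such as OrcStructInspector.

```scala
// Toy illustration of the field-name handling described above. These helpers
// are hypothetical stand-ins for the two Hive object inspectors.
object InspectorDemo {
  // Mirrors StandardStructObjectInspector, which lowercases field names.
  def standardFieldNames(schema: Seq[String]): Seq[String] =
    schema.map(_.toLowerCase)

  // Mirrors a case-preserving inspector (e.g. OrcStructInspector).
  def preservingFieldNames(schema: Seq[String]): Seq[String] =
    schema
}
```

With a schema like Seq("userName", "ID"), the first helper yields Seq("username", "id") while the second returns the names unchanged, which is exactly the case-preservation property this PR is after.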
Conflicts: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala
cc @liancheng as this patch has been LGTM'd by @zhzhan for a while. Is it ok to merge this now?
@zhzhan Looks like it should work.
Also we need to change
@zhzhan @chenghao-intel Thanks for the comments. I've updated the PR title and code. Please check if it is ok for you.
@liancheng may have more insights on this part.
Ok. Thanks. I will wait for @liancheng's review.
Test build #41677 has finished for PR 7520 at commit
One thing to note is that case sensitivity of Spark SQL is configurable (see here). So I don't think we should make

If I understand this issue correctly, the root problem here is that, while writing schema information to physical ORC files, our current approach isn't case preserving.

As suggested by @chenghao-intel, when saving a DataFrame as Hive metastore tables using ORC, Spark SQL 1.5 now saves it in a Hive compatible approach, so that we can read the data back using Hive. This implies that changes made in this PR should also be compatible with Hive.

After investigating Hive's behavior for a while, I got some interesting findings. Snippets below were executed against Hive 1.2.1 (with a PostgreSQL metastore) and Spark SQL 1.5-SNAPSHOT (revision 0eeee5c).

Firstly, let's prepare a Hive ORC table: So Hive is neither case sensitive nor case preserving. We can further prove this by checking the metastore table. (I cleared my local Hive warehouse, so the only column record here is the one created above.)

Now let's read the physical ORC files directly using Spark: Huh? Why it's Surprise! So, when writing ORC files, Hive doesn't even preserve the column names.

Conclusions:
Because case sensitivity is configurable in Spark SQL. I further verified this by creating ORC files using Spark SQL and then importing them into Hive ORC tables. Didn't bother posting the results because of limited space. And I think this is the task this PR should aim at.
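The finding above — that Hive-written ORC files don't carry the real column names at all — can be sketched as a toy model. This assumes, as described in the investigation, that Hive writes positional names (_col0, _col1, ...) into the physical file and keeps the real, lowercased names only in the metastore; the helper names here are illustrative, not real Hive APIs.

```scala
// Toy model of the behavior described above: physical ORC files written by
// Hive carry positional column names, and the metastore supplies the real
// names. Resolution back to real names is purely positional.
object HiveNamingDemo {
  // Column names as Hive writes them into the physical ORC file.
  def physicalNames(numColumns: Int): Seq[String] =
    (0 until numColumns).map(i => s"_col$i")

  // Map each physical name to the (lowercased) metastore name by position.
  def resolve(metastoreNames: Seq[String]): Map[String, String] =
    physicalNames(metastoreNames.length).zip(metastoreNames).toMap
}
```

This is why reading the physical files directly (without the metastore) surfaces the surprising _colN names, while reading through the Hive table shows the lowercased originals.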
@liancheng Thanks for the clear investigation and explanation. If I understand it correctly, it means that the original direction of this PR is correct.
@viirya Yeah, I agree with you.
Test build #41686 has finished for PR 7520 at commit
@liancheng Is this ready to merge?
@viirya Oh sorry. It would be nice if you ping me after you update your PR next time :)
@liancheng Thanks. So is there any concern or review for this patch?
o => struct
@viirya This LGTM now except for a few minor issues. I can merge this once those are fixed. Thanks again for working on this and your patience!
@liancheng Thanks for reviewing. I've fixed them. Waiting for the test to pass.
Test build #42097 has finished for PR 7520 at commit
ping @liancheng
@viirya Thanks! Merging to master.
JIRA: https://issues.apache.org/jira/browse/SPARK-9170
StandardStructObjectInspector will implicitly lowercase column names, but I think the ORC format doesn't have such a requirement. In fact, there is an OrcStructInspector specified for the ORC format. We should use it when serializing rows to ORC files, so that column names are preserved as-is when writing ORC files.