Conversation

@viirya
Member

@viirya viirya commented Jul 20, 2015

JIRA: https://issues.apache.org/jira/browse/SPARK-9170

StandardStructObjectInspector will implicitly lowercase column names, but I don't think the ORC format has such a requirement. In fact, there is an OrcStructInspector specified for the ORC format. We should use it when serializing rows to ORC files; it can be case preserving when writing ORC files.
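The lowercasing behavior under discussion can be illustrated with a minimal, self-contained Scala sketch. The two helpers below are simplified stand-ins for Hive's StandardStructObjectInspector and OrcStructInspector, not the real API:

```scala
// Mimics StandardStructObjectInspector: field names are normalized to
// lowercase when the struct layout is built.
def standardLayout(names: Seq[String]): Seq[String] = names.map(_.toLowerCase)

// Mimics an OrcStruct-based layout: field names are kept verbatim.
def orcLayout(names: Seq[String]): Seq[String] = names

val columns = Seq("CoL", "UserId")
assert(standardLayout(columns) == Seq("col", "userid")) // case lost
assert(orcLayout(columns) == columns)                   // case preserved
```

The point of the PR is to route writes through the case-preserving path so the names the user provided survive into the ORC file.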

@SparkQA

SparkQA commented Jul 20, 2015

Test build #37794 has finished for PR 7520 at commit c51394f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Jul 23, 2015

cc @liancheng @marmbrus

@liancheng
Contributor

cc @zhzhan

Contributor

I didn't see any usage for this variable.

Member Author

Removed it. Thanks.

@SparkQA

SparkQA commented Jul 24, 2015

Test build #38300 has finished for PR 7520 at commit e827e49.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ChangePrecision(child: Expression) extends UnaryExpression
    • abstract class AlgebraicAggregate extends AggregateFunction2 with Serializable with Unevaluable
    • abstract class AggregateFunction1 extends LeafExpression with Serializable
    • case class DecimalType(precision: Int, scale: Int) extends FractionalType
    • case class DecimalConversion(precision: Int, scale: Int) extends JDBCConversion

Contributor

Looks like this call will create a new object for each row written instead of reusing reusableOutputBuffer. Is it a concern?

Member Author

Agreed, it is a concern. I will update this part to create an OrcStruct first and reuse it.
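The reuse pattern being agreed on here can be sketched in a few lines of self-contained Scala (the class and loop are illustrative, not the actual Spark code): allocate one mutable row buffer up front and overwrite its fields for every record, instead of allocating a fresh object per row.

```scala
// A minimal stand-in for a reusable OrcStruct-like row buffer.
final class ReusableStruct(size: Int) {
  private val fields = new Array[Any](size)
  def setField(i: Int, v: Any): Unit = fields(i) = v
  def getField(i: Int): Any = fields(i)
}

val struct = new ReusableStruct(2)   // created once, outside the write loop
val rows = Seq(Seq[Any](1, "a"), Seq[Any](2, "b"))
for (row <- rows) {                  // per row: mutate fields, then serialize
  row.zipWithIndex.foreach { case (v, i) => struct.setField(i, v) }
  // a call like serializer.serialize(struct, inspector) would go here
}
```

This trades one long-lived mutable object for per-row garbage, which matters on write paths that process millions of rows.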

@zhzhan
Contributor

zhzhan commented Jul 24, 2015

LGTM with the comments answered or resolved.

@SparkQA

SparkQA commented Jul 24, 2015

Test build #38327 has finished for PR 7520 at commit 96796da.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Conflicts:
	sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala
@SparkQA

SparkQA commented Jul 25, 2015

Test build #38423 has finished for PR 7520 at commit 4e40931.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Conflicts:
	sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala
@SparkQA

SparkQA commented Jul 27, 2015

Test build #38505 has finished for PR 7520 at commit ab7fb08.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Jul 27, 2015

The failed test CTAS with serde in org.apache.spark.sql.hive.execution.SQLQuerySuite also fails on the current master branch. It may be caused by other commits; I will check it later.

@viirya
Member Author

viirya commented Jul 28, 2015

retest this please.

@SparkQA

SparkQA commented Jul 28, 2015

Test build #127 has finished for PR 7520 at commit ab7fb08.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 28, 2015

Test build #38628 has finished for PR 7520 at commit ab7fb08.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Jul 28, 2015

ping @zhzhan @liancheng

Conflicts:
	sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala
@zhzhan
Contributor

zhzhan commented Aug 4, 2015

LGTM. Will let @liancheng take a final look.

@SparkQA

SparkQA commented Aug 4, 2015

Test build #39643 has finished for PR 7520 at commit d4676a7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Just curious whether this is a change we have to make. Previously we created a reusable array associated with the compatible StructObjectInspector. Now we create a reusable OrcStruct object attached to the OrcStructInspector.

It seems to make no difference to an ORC serializer, doesn't it?

Member Author

StandardStructObjectInspector will implicitly lowercase column names; OrcStructInspector does not. I think there are no other significant differences.

Conflicts:
	sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala
@viirya
Member Author

viirya commented Aug 26, 2015

cc @liancheng, as this patch has been LGTM'd by @zhzhan for a while. Is it ok to merge this now?

@viirya
Member Author

viirya commented Aug 27, 2015

@zhzhan Looks like it should work.

@zhzhan
Contributor

zhzhan commented Aug 27, 2015

Also we need to change
private lazy val nameToField: Map[String, StructField] = fields.map(f => f.name.toLowerCase -> f).toMap
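What @zhzhan's suggested change would look like can be sketched in self-contained Scala. The idea is that the lookup map is keyed by lowercased names, so resolution is case insensitive, while the stored StructField values keep the user-provided, case-preserved names. `StructField` here is a simplified stand-in for Spark SQL's class, and `lookup` is a hypothetical helper:

```scala
// Simplified stand-in for org.apache.spark.sql.types.StructField.
final case class StructField(name: String, dataType: String)

val fields = Seq(StructField("CoL", "int"), StructField("UserId", "string"))

// Keys are lowercased for case-insensitive resolution; values preserve case.
val nameToField: Map[String, StructField] =
  fields.map(f => f.name.toLowerCase -> f).toMap

def lookup(name: String): Option[StructField] =
  nameToField.get(name.toLowerCase)

assert(lookup("COL").map(_.name) == Some("CoL"))       // resolved insensitively
assert(lookup("userid").map(_.name) == Some("UserId")) // original case kept
```

Note that, as discussed below, Spark SQL's case sensitivity is configurable, so whether the lowercased lookup is appropriate depends on that setting.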

@viirya viirya changed the title [SPARK-9170][SQL] Instead of StandardStructObjectInspector, use OrcStructInspector for Orc format [SPARK-9170][SQL] User-provided columns should work for both lowercase and uppercase Aug 27, 2015
@viirya
Member Author

viirya commented Aug 27, 2015

@zhzhan @chenghao-intel Thanks for the comments. I've updated the PR title and code. Please check if it is ok for you.

@zhzhan
Contributor

zhzhan commented Aug 27, 2015

@liancheng has more insight on this part.

@viirya
Member Author

viirya commented Aug 27, 2015

Ok. Thanks. Wait for @liancheng's review.

@SparkQA

SparkQA commented Aug 27, 2015

Test build #41677 has finished for PR 7520 at commit a389746.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

One thing to note is that case sensitivity in Spark SQL is configurable (see here). So I don't think we should make StructType completely case insensitive (even if case preserving).

If I understand this issue correctly, the root problem here is that, while writing schema information to physical ORC files, our current approach isn't case preserving. As suggested by @chenghao-intel, when saving a DataFrame as a Hive metastore table using ORC, Spark SQL 1.5 now saves it in a Hive compatible way, so that we can read the data back using Hive. This implies that changes made in this PR should also be compatible with Hive. After investigating Hive's behavior for a while, I got some interesting findings.

Snippets below were executed against Hive 1.2.1 (with a PostgreSQL metastore) and Spark SQL 1.5-SNAPSHOT (revision 0eeee5c). Firstly, let's prepare a Hive ORC table:

hive> CREATE TABLE orc_test STORED AS ORC AS SELECT 1 AS CoL;
...
hive> SELECT col FROM orc_test;
OK
1
Time taken: 0.056 seconds, Fetched: 1 row(s)

hive> SELECT COL FROM orc_test;
OK
1
Time taken: 0.056 seconds, Fetched: 1 row(s)

hive> DESC orc_test;
OK
col                     int
Time taken: 0.047 seconds, Fetched: 1 row(s)

So Hive is neither case sensitive nor case preserving. We can further prove this by checking the metastore table COLUMNS_V2:

metastore_hive121> SELECT * FROM "COLUMNS_V2"
+---------+-----------+---------------+-------------+---------------+
|   CD_ID |   COMMENT | COLUMN_NAME   | TYPE_NAME   |   INTEGER_IDX |
|---------+-----------+---------------+-------------+---------------|
|      22 |    <null> | col           | int         |             0 |
+---------+-----------+---------------+-------------+---------------+

(I cleared my local Hive warehouse, so the only column record here is the one created above.)

Now let's read the physical ORC files directly using Spark:

scala> sqlContext.read.orc("hdfs://localhost:9000/user/hive/warehouse_hive121/orc_test").printSchema()
root
 |-- _col0: integer (nullable = true)

scala> sqlContext.read.orc("hdfs://localhost:9000/user/hive/warehouse_hive121/orc_test").show()
+-----+
|_col0|
+-----+
|    1|
+-----+

Huh? Why is it _col0 instead of col? Let's inspect the physical ORC file written by Hive:

$ hive --orcfiledump /user/hive/warehouse_hive121/orc_test/000000_0

Structure for /user/hive/warehouse_hive121/orc_test/000000_0
File Version: 0.12 with HIVE_8732
15/08/27 19:07:15 INFO orc.ReaderImpl: Reading ORC rows from /user/hive/warehouse_hive121/orc_test/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
15/08/27 19:07:15 INFO orc.RecordReaderFactory: Schema is not specified on read. Using file schema.
Rows: 1
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:int>         <---- !!!
...

Surprise! So, when writing ORC files, Hive doesn't even preserve the column names.

Conclusions:

  1. Making StructType completely case insensitive is unacceptable, because case sensitivity is configurable in Spark SQL.
  2. Concrete column names written into ORC files by Spark SQL don't affect interoperability with Hive. I further verified this by creating ORC files using Spark SQL and then importing them into Hive ORC tables. (Didn't bother posting the results because of limited space.)
  3. It would be good for Spark SQL to be case preserving when writing ORC files. I think this is the task this PR should aim for.

@viirya
Member Author

viirya commented Aug 27, 2015

@liancheng Thanks for the clear investigation and explanation.

If I understand it correctly, it means that the original direction of this PR is correct.

@viirya viirya changed the title [SPARK-9170][SQL] User-provided columns should work for both lowercase and uppercase [SPARK-9170][SQL] Use OrcStructInspector to be case preserving when writing ORC files Aug 27, 2015
@liancheng
Contributor

@viirya Yeah, I agree with you.

@SparkQA

SparkQA commented Aug 27, 2015

Test build #41686 has finished for PR 7520 at commit dc8bd26.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Sep 3, 2015

@liancheng Is this ready to merge?

@liancheng
Contributor

@viirya Oh sorry. It would be nice if you could ping me after you update your PR next time :)

@viirya
Member Author

viirya commented Sep 6, 2015

@liancheng Thanks. So is there any concern or review for this patch?

Contributor

o => struct

@liancheng
Contributor

@viirya This LGTM now except for a few minor issues. I can merge this once those are fixed.

Thanks again for working on this and your patience!

@viirya
Member Author

viirya commented Sep 7, 2015

@liancheng Thanks for reviewing. I've fixed them. Waiting for the test to pass.

@SparkQA

SparkQA commented Sep 7, 2015

Test build #42097 has finished for PR 7520 at commit 0d582b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Sep 8, 2015

ping @liancheng

@liancheng
Contributor

@viirya Thanks! Merging to master.

@asfgit asfgit closed this in 990c9f7 Sep 8, 2015
@viirya viirya deleted the use_orcstruct branch December 27, 2023 18:32