Conversation

@dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Sep 14, 2017

What changes were proposed in this pull request?

This PR includes #14471 to enable spark.sql.hive.convertMetastoreOrc for SPARK-19459 (ORC tables cannot be read when they contain char/varchar columns). All credit should go to @rajeshbalamohan .

For SPARK-19459, the padding is handled on read by the Hive-side HiveCharWritable via HiveBaseChar.java. So OrcFileFormat can handle this, unlike ParquetFileFormat. For Parquet, please see SPARK-21997.
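For readers unfamiliar with the Hive-side behavior referenced above, here is a minimal Java sketch of the read-side padding that HiveBaseChar-style logic performs for CHAR(n) columns. The class and method names are illustrative, not the actual Hive API:

```java
public class CharPadding {
    // Pad a stored CHAR(n) value with trailing spaces up to maxLength,
    // mirroring the read-path padding Hive applies for char columns.
    static String getPaddedValue(String value, int maxLength) {
        if (value.length() >= maxLength) {
            return value;
        }
        StringBuilder sb = new StringBuilder(maxLength);
        sb.append(value);
        for (int i = value.length(); i < maxLength; i++) {
            sb.append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A CHAR(6) column storing "ab" reads back as "ab" plus four spaces.
        System.out.println("[" + getPaddedValue("ab", 6) + "]");
    }
}
```

Because this padding lives on the Hive read path, a reader that bypasses it (as Spark's native path did here) returns unpadded values for char columns.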

How was this patch tested?

Pass the newly added test cases.

@gatorsmile
Member

I think this is not a fix. What is the root cause?

@SparkQA

SparkQA commented Sep 14, 2017

Test build #81789 has finished for PR 19235 at commit 57332c3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Sep 14, 2017

Won't this cause Hive's parquet code to be used by default instead of Spark's? I'm not sure that's what we want.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Sep 14, 2017

Thank you for the review, @gatorsmile and @vanzin . Yes, this is not the direction Apache Spark is heading; it's slow.

Here, I wanted to raise the issue of the convertMetastoreXXX options, since convertMetastoreXXX does not seem to be tested completely for both ORC and Parquet. In particular, as you can see here, the 14 failures mean the test cases implicitly depend on that option. I think we need to test all those cases with both true and false. Otherwise, we need to add an explicit withSQLConf.
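The testing pattern suggested above (run the same assertion under both values of a config flag, rather than relying on the default) can be sketched as follows. Spark's real withSQLConf helper is Scala; this Java sketch with a hypothetical withConf helper only mimics its set-then-restore shape:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

public class BothConfigValues {
    static final Map<String, String> conf = new HashMap<>();

    // Run a check with the given config key temporarily set,
    // restoring the previous value afterwards.
    static boolean withConf(String key, String value, Supplier<Boolean> check) {
        String old = conf.get(key);
        conf.put(key, value);
        try {
            return check.get();
        } finally {
            if (old == null) conf.remove(key); else conf.put(key, old);
        }
    }

    public static void main(String[] args) {
        // Exercise the same assertion for both "false" and "true",
        // so the test no longer depends implicitly on the default.
        for (String value : new String[] {"false", "true"}) {
            boolean ok = withConf("spark.sql.hive.convertMetastoreOrc", value,
                () -> true /* placeholder for the real checkAnswer call */);
            System.out.println(value + ": " + ok);
        }
    }
}
```

Restoring the previous value in a finally block matters: otherwise one test's config leaks into the next, which is exactly the kind of implicit dependency the 14 failures revealed.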

@dongjoon-hyun
Member Author

dongjoon-hyun commented Sep 14, 2017

I'm not sure where I hit this before; perhaps it was in another JIRA issue. But I updated the 'WHERE' clause examples. Maybe there was a discussion about this difference.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-21997][SQL] Turn off spark.sql.hive.convertMetastoreParquet by default [SPARK-21997][SQL][WIP] Turn off spark.sql.hive.convertMetastoreParquet by default Sep 14, 2017
@vanzin
Contributor

vanzin commented Sep 14, 2017

I wanted to raise this issue of convertMetastoreXXX things since convertMetastoreXXX seems not to be tested completely

That's fine, but that's not what your change is proposing. If you look at the test failures, at least the one I looked at is because the wrong exception is being thrown, not because the feature is broken. That doesn't excuse the fact that it's not tested, but maybe the situation is not that dire.

But as Xiao has mentioned, you're not fixing the problem you describe in the change summary; you haven't even root-caused it. You just found out that using Hive classes to read the Parquet data works around the problem, but that's not an acceptable fix for the problem. It's just a workaround that those affected by the issue can use.

@dongjoon-hyun
Member Author

I agree with you. Okay. I'll refocus this.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Sep 15, 2017

@gatorsmile and @vanzin .

I'm comparing with ORC now. Previously, ORC failed for a different reason; I'll make another PR for that. I found that #14471 is enough for ORC.

In the case of ORC, ORC itself handles truncation on write, and the padding is handled on read by the Hive-side HiveCharWritable via HiveBaseChar.java. In the case of Parquet, I guess the write side is the same, but there is no padding logic like HiveCharWritable in Spark.
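The write-side half of the round trip described above (truncation to the declared length before storing) can be sketched in a few lines. This is an illustration of the behavior, not the actual ORC writer code, and the method name is hypothetical:

```java
public class CharTruncate {
    // Write-path enforcement for CHAR(n)/VARCHAR(n): values longer
    // than the declared maximum are truncated before being stored.
    static String enforceMaxLength(String value, int maxLength) {
        return value.length() > maxLength
            ? value.substring(0, maxLength)
            : value;
    }

    public static void main(String[] args) {
        System.out.println(enforceMaxLength("abcdef", 4)); // truncated to 4 chars
        System.out.println(enforceMaxLength("ab", 4));     // short value stored as-is
    }
}
```

With truncation on write handled by the format itself, the only missing piece on Spark's native Parquet path is the read-side padding, which is why the char/varchar bug shows up there but not on the Hive path.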

@dongjoon-hyun
Member Author

Since this PR is invalid, I'll reuse this PR instead of creating a new one.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-21997][SQL][WIP] Turn off spark.sql.hive.convertMetastoreParquet by default [SPARK-14387][SPARK-19459][SQL] Enable Hive-1.x ORC compatibility with spark.sql.hive.convertMetastoreOrc Sep 15, 2017
Seq("false", "true").foreach { value =>
  withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> value) {
    checkAnswer(spark.table("hive_orc"), result)
    checkAnswer(spark.table("spark_orc"), result)
  }
}
Member Author


This is a test case for the ORC file format. Based on #14471, I'm enabling this.
For Parquet, I think we can proceed separately once ORC is finished.

@SparkQA

SparkQA commented Sep 15, 2017

Test build #81806 has finished for PR 19235 at commit c6d2c35.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-21997 branch March 23, 2018 04:06