[SPARK-45265][SQL][WIP] Supporting Hive 4.0 metastore #43064

Closed
wants to merge 4 commits into from

attilapiros
Contributor

@attilapiros attilapiros commented Sep 22, 2023

What changes were proposed in this pull request?

Supporting the Hive 4.0 metastore, where partition filters can be pushed down even for CHAR and VARCHAR types.

Hive 4.0 is still beta! This is why this is a work-in-progress PR.

Why are the changes needed?

Supporting more Hive versions (with extra performance improvement) is good for our users.

Does this PR introduce any user-facing change?

Yes. The documentation is updated to cover Hive 4.0 metastore support.

How was this patch tested?

Manually

I used the apache/hive:4.0.0-beta-1 Docker image to start a metastore and a hiveserver2 (along with a hadoop3 Docker image).

Created a table:

CREATE EXTERNAL TABLE testTable1 ( 
  column1 String 
) PARTITIONED BY (partColumn1 CHAR(30), partColumn2 VARCHAR(30)) LOCATION 'hdfs://hadoop3:8020/tmp/hive_external/';

Inserted some values in beeline:

insert into table testtable1 values ("column1_v1", "partcolumn1_v1", "partcolumn2_v1"), ("column1_v2", "partcolumn1_v2", "partcolumn2_v2");

Started Spark in the hiveserver2 container with:

./bin/spark-shell --conf spark.sql.hive.metastore.version=4.0.0 --conf spark.sql.hive.metastore.jars="/opt/hive/lib/*"

Ran the query:

scala> sql("select * from testtable1 where partcolumn1 = 'partcolumn1_v1' and partcolumn2 = 'partcolumn2_v1'").show
Hive Session ID = 6846fe0e-968a-474d-afec-4f67b3a2a274
+----------+--------------------+--------------+
|   column1|         partcolumn1|   partcolumn2|
+----------+--------------------+--------------+
|column1_v1|partcolumn1_v1   ...|partcolumn2_v1|
+----------+--------------------+--------------+

Then checked the HMS calls in the metastore container, in the file /tmp/hive/hive.log:

...
2023-09-22T21:06:34,293  INFO [Metastore-Handler-Pool: Thread-1356] HiveMetaStore.audit: ugi=hive       ip=172.30.0.5   cmd=source:172.30.0.5 get_partitions_by_filter : tbl=hive.default.testtable1
...

The log contains the expected get_partitions_by_filter call.
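The manual log check above could also be scripted. The following is only a minimal sketch: the function name `count_hms_calls` and its argument order are made up for illustration, and it assumes the audit lines keep the `<call> : tbl=<table>` shape shown in the excerpt above.

```shell
#!/bin/sh
# Sketch: count metastore audit entries for a given HMS call and table.
# count_hms_calls is a hypothetical helper, not part of Hive or Spark.
# It prints the number of matching lines; like grep, it returns a
# non-zero status when nothing matches.
count_hms_calls() {
  log_file="$1"   # e.g. /tmp/hive/hive.log
  hms_call="$2"   # e.g. get_partitions_by_filter
  table="$3"      # e.g. hive.default.testtable1
  # Keep only audit lines, then match the literal "<call> : tbl=<table>" text.
  # Matching is by fixed string, so a table name that is a prefix of another
  # table name would also be counted.
  grep -F 'HiveMetaStore.audit' "$log_file" \
    | grep -c -F "${hms_call} : tbl=${table}"
}
```

For the run described here it would be called as `count_hms_calls /tmp/hive/hive.log get_partitions_by_filter hive.default.testtable1`; a count of at least 1 after running the query suggests the filter reached the metastore.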

Was this patch authored or co-authored using generative AI tooling?

No.

Member

@dongjoon-hyun dongjoon-hyun left a comment

Nice. I have a few comments.

  • Are you using the current beta-1?
  • Is there a timeline for Hive 4.0 GA?
  • Although I know that you filed this as a Bug for some old releases, I believe this PR should be a subtask for Apache Spark 4.0.0 because there are no existing Spark users with an Apache Hive 4.0.0 Metastore.

@attilapiros
Contributor Author

@dongjoon-hyun

Thanks!

Are you using the current beta-1?

Yes.

Is there a timeline for Hive 4.0 GA?

I will ask around, but as far as I know they still have some blockers.

Although I know that you filed this as a Bug for some old releases, I believe this PR should be a subtask for Apache Spark 4.0.0 because there are no existing Spark users with an Apache Hive 4.0.0 Metastore.

Sorry, that was my mistake. Thanks for fixing it in Jira.

@dongjoon-hyun
Member

Thank you. And, if you are fine with Apache Spark 4.0, that's great! I was worried. 😄

@HyukjinKwon
Member

cc @wangyum too

@attilapiros
Contributor Author

@dongjoon-hyun
Regarding Hive 4.0, testing with the TPC-DS benchmark is still to be done, but I will update this PR when the release is out.

@dongjoon-hyun
Member

Thank you so much for keeping us up-to-date, @attilapiros !

Regarding Hive 4.0, testing with the TPC-DS benchmark is still to be done, but I will update this PR when the release is out.

@dongjoon-hyun dongjoon-hyun marked this pull request as draft October 1, 2023 23:38
@dongjoon-hyun
Member

Is there any update for Apache Hive 4.0, @attilapiros ?

@attilapiros
Contributor Author

Is there any update for Apache Hive 4.0, @attilapiros ?

@dongjoon-hyun they still have some more issues to solve (as far as I can see, performance issues with some TPC-DS queries):
https://lists.apache.org/thread/3okjgw3y6tso7l2rg3hhy8lccp6d6mmy

@dongjoon-hyun
Member

Thank you for the updates and the link, @attilapiros .

github-actions bot commented Mar 1, 2024

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Mar 1, 2024
@github-actions github-actions bot closed this Mar 2, 2024
dongjoon-hyun pushed a commit that referenced this pull request Nov 14, 2024
### What changes were proposed in this pull request?

This PR continues the work from #43064 and #45801 to support Hive Metastore Server 4.0. CHAR/VARCHAR type partition filter pushdown is not included in this PR, as it requires further investment.

### Why are the changes needed?

Enhance the multiple hive metastore server support feature

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?

Passing HiveClient*Suites w/ 4.0

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #48823 from yaooqinn/SPARK-45265.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>