Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-7209][BEAM-9351][BEAM-9428] Upgrade Hive to version 3.1.3 #17749

Merged
merged 5 commits into from
Jun 15, 2022

Conversation

Abacn
Copy link
Contributor

@Abacn Abacn commented May 24, 2022

  • This eliminated the pentaho dependency

Please add a meaningful description for your change here


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests

See CI.md for more information about GitHub Actions CI.

* This eliminated the pentaho dependency
@asf-ci
Copy link

asf-ci commented May 24, 2022

Can one of the admins verify this patch?

2 similar comments
@asf-ci
Copy link

asf-ci commented May 24, 2022

Can one of the admins verify this patch?

@asf-ci
Copy link

asf-ci commented May 24, 2022

Can one of the admins verify this patch?

@Abacn Abacn changed the title [Do not merge] Test: Upgrade Hive to version 3.1.2 [BEAM-7209][Do not merge] Test: Upgrade Hive to version 3.1.3 May 24, 2022
@Abacn
Copy link
Contributor Author

Abacn commented Jun 9, 2022

Finally able to make the integration test run:

image

The current integration test is written in a way that needs both metastore and hiveserver2 services running and essentially operating the same database. Did not find a way to make both services run locally (in local embedded mode, metastore or hiveserver2 will occupy the same Derby database and cannot be setup at the same time.) The test is conducted by manually doing setup of HCataogIOIT (thus commenting out these parts in HCataogIOIT.java), running a metastore service and then run the integration test.

@Abacn
Copy link
Contributor Author

Abacn commented Jun 9, 2022

retest this please

@Abacn Abacn changed the title [BEAM-7209][Do not merge] Test: Upgrade Hive to version 3.1.3 [BEAM-7209][BEAM-9351][BEAM-9428] Upgrade Hive to version 3.1.3 Jun 9, 2022
@Abacn Abacn marked this pull request as ready for review June 9, 2022 13:50
@Abacn
Copy link
Contributor Author

Abacn commented Jun 9, 2022

@codecov
Copy link

codecov bot commented Jun 9, 2022

Codecov Report

Merging #17749 (0b038cb) into master (5bb3127) will increase coverage by 0.04%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #17749      +/-   ##
==========================================
+ Coverage   74.01%   74.05%   +0.04%     
==========================================
  Files         695      696       +1     
  Lines       91798    92195     +397     
==========================================
+ Hits        67941    68274     +333     
- Misses      22611    22675      +64     
  Partials     1246     1246              
Flag Coverage Δ
python 83.75% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ache_beam/runners/dataflow/dataflow_job_service.py 50.00% <0.00%> (-12.17%) ⬇️
...s/interactive/dataproc/dataproc_cluster_manager.py 77.41% <0.00%> (-6.80%) ⬇️
...eam/transforms/py_dataflow_distribution_counter.py 91.42% <0.00%> (-4.87%) ⬇️
.../python/apache_beam/testing/test_stream_service.py 88.09% <0.00%> (-4.77%) ⬇️
sdks/python/apache_beam/dataframe/io.py 88.78% <0.00%> (-3.26%) ⬇️
...eam/runners/portability/fn_api_runner/fn_runner.py 87.51% <0.00%> (-2.50%) ⬇️
sdks/python/apache_beam/utils/interactive_utils.py 95.12% <0.00%> (-2.44%) ⬇️
...nners/portability/fn_api_runner/worker_handlers.py 77.89% <0.00%> (-1.45%) ⬇️
sdks/python/apache_beam/utils/counters.py 85.39% <0.00%> (-1.36%) ⬇️
...hon/apache_beam/runners/direct/test_stream_impl.py 93.28% <0.00%> (-0.75%) ⬇️
... and 44 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5bb3127...0b038cb. Read the comment docs.

@Abacn
Copy link
Contributor Author

Abacn commented Jun 9, 2022

#19554 #20102 #19991

@Abacn
Copy link
Contributor Author

Abacn commented Jun 9, 2022

Run GoPortable PreCommit

@Abacn
Copy link
Contributor Author

Abacn commented Jun 9, 2022

Known build flake "warning: Memory constraints are impeding performance; please increase max heap size." unrelated to this change.

@TheNeuralBit
Copy link
Member

Do we expose hive in any of our public APIs? I was under the impression that we do, so we can't just bump a major version without risking breaking users.

@Abacn
Copy link
Contributor Author

Abacn commented Jun 9, 2022

Do we expose hive in any of our public APIs? I was under the impression that we do, so we can't just bump a major version without risking breaking users.

Yes we do. The HCatalog read and write transforms is in the type of org.apache.hive.hcatalog.data.HCatRecord. However the user are allowed to provide their own version of Hive as part of Beam (hive dependencies are configured as "provided" in gradle build), and the source code of HCatalogIO (other than test and test helper codes) are unchanged.

One of the motivation of this change is the old pentaho dependencies that hive 2.x depended on has been removed from repos gradually and generated build issues to the Beam several times ([BEAM-6811]; #17734 (BEAM-14298)) and this upgrade may be considered to increase its priority.

@TheNeuralBit
Copy link
Member

Do we expose hive in any of our public APIs? I was under the impression that we do, so we can't just bump a major version without risking breaking users.

Yes we do. The HCatalog read and write transforms is in the type of org.apache.hive.hcatalog.data.HCatRecord. However the user are allowed to provide their own version of Hive as part of Beam (hive dependencies are configured as "provided" in gradle build), and the source code of HCatalogIO (other than test and test helper codes) are unchanged.

I see, please also update CHANGES.md to note this as a breaking change and recommend users override the Hive version if they need to.

A couple more questions:

@Abacn
Copy link
Contributor Author

Abacn commented Jun 9, 2022

  • I see some code changes, will that code still work with Hive 2.x?

The changes are only involved in unit tests. Unit tests needs to run with Hive 3.x. The artifact built should not be affected.

EmbeddedMetastoreService.java: this file is only used in unit test. The file itself is a trimmed-down version of a test in Hive source, as recorded in its comment. The change updated the file content to that in Hive 3.1.3 release.

resources/hive-site.xml: hive config file used for EmbeddedMetastoreService which is only used in unit test. Also updated the file content to that in Hive 3.1.3 release.

HCatalogIOTest.java and BeamSqlHiveSchemaTest.java are unit tests. The change reflects a breaking change of Hive 3 which removed CommandNeedRetryException and thus removed from EmbeddedMetastoreService.executeQuery's signature.

  • Do we need to release another vendored Calcite? Note that release process is separate from the rest of Beam

Probably not. Hive 2.1.0 used Calcite 1.6.0 and now Hive 3.1.3 uses Calcite 1.16.0 as dependencies. Looked at Beam release history vendored Calcite started on v1.20.0 since the release of Beam 2.16.0 and the reason of vendoring were not related to Hive. Will confirm with others.

@Abacn
Copy link
Contributor Author

Abacn commented Jun 10, 2022

retest this please

@Abacn
Copy link
Contributor Author

Abacn commented Jun 10, 2022

commenting retest this please not triggering retest, trying do a push

@johnjcasey
Copy link
Contributor

is it possible to move hive-site.xml to the test resources folder to make sure all hive dependencies are in the test side of the world?

@Abacn
Copy link
Contributor Author

Abacn commented Jun 10, 2022

is it possible to move hive-site.xml to the test resources folder to make sure all hive dependencies are in the test side of the world?

Agree. Those files were marked as @Internal and should not be used outside. It was in main module because :sdks:java:extensions:sql:hcatalog project unit tests also used it. However this can be avoided by claiming test deps in the latter.

@Abacn
Copy link
Contributor Author

Abacn commented Jun 14, 2022

retest this please

@Abacn
Copy link
Contributor Author

Abacn commented Jun 14, 2022

R: @johnjcasey

@johnjcasey
Copy link
Contributor

LGTM

@TheNeuralBit
Copy link
Member

Run PythonLint PreCommit

@TheNeuralBit TheNeuralBit merged commit 9adec2a into apache:master Jun 15, 2022
@Abacn Abacn deleted the hive3.1 branch June 17, 2022 14:32
bullet03 pushed a commit to akvelon/beam that referenced this pull request Jun 20, 2022
…he#17749)

* [BEAM-9351] Upgrade Hive to version 3.1.2

* This eliminated the pentaho dependency

* fix auth issue in test

* Add change log

* move internal test only files to test

* clean up original workaround: Hive 3.1.3 upgraded to log4j 2.17.1
prodriguezdefino pushed a commit to prodriguezdefino/beam-pabs that referenced this pull request Jun 21, 2022
…he#17749)

* [BEAM-9351] Upgrade Hive to version 3.1.2

* This eliminated the pentaho dependency

* fix auth issue in test

* Add change log

* move internal test only files to test

* clean up original workaround: Hive 3.1.3 upgraded to log4j 2.17.1
Abacn added a commit to Abacn/beam that referenced this pull request Jul 14, 2022
…he#17749)

* [BEAM-9351] Upgrade Hive to version 3.1.2

* This eliminated the pentaho dependency

* fix auth issue in test

* Add change log

* move internal test only files to test

* clean up original workaround: Hive 3.1.3 upgraded to log4j 2.17.1
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants