Add Hive MetaStore support to kafka-connect-s3 #572

Open
wants to merge 5 commits into
base: master
from


frankgrimes97

This PR attempts to address #237
We would very much like to work towards having this contributed back upstream rather than maintain our own fork.
Feedback is most welcome!

I was getting the following output, with no unit tests being run:
  '[INFO] Tests run: 0, Failures: 0, Errors: 0, Skipped: 0'

After updating the maven-surefire-plugin to add the surefire-junit4
dependency, the unit tests are now being executed.

Unrelated to my changes, but unnecessary noise that should be fixed.
These two tests seem to have been failing on upstream master since
the following commit/merge:

confluentinc@c633f08

Updated the test expectations to match the current code.

Currently supports Avro and Parquet formats only.

The functionality was ported over from kafka-connect-hdfs
with the following simplifications:

1) No listing of files in storage to determine missing partitions on
   startup
2) No write-ahead log (WAL) used for that same purpose

Those features were deemed too complex to port over relative to their
added benefits.
While there likely is a small window where some partitions may not be
added in the case of a crash or shutdown, we believe that any missing
partitions in the Hive MetaStore can be corrected/added out-of-band
without both the code complexity and potential startup costs of
reconciling those discrepancies.
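
As a rough illustration of what such an out-of-band correction could look like (this is a sketch, not code from this PR), the snippet below builds a HiveQL `ALTER TABLE ... ADD IF NOT EXISTS PARTITION` statement from an S3 prefix. It assumes Hive-style `key=value` path segments, string-typed partition columns, and an `s3a://` location scheme; the class name, table name, bucket, and prefix are all hypothetical. Hive's `MSCK REPAIR TABLE` is another option for tables laid out this way.

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionRepairSketch {

    // Builds a HiveQL statement that registers a single partition.
    // Assumption: the prefix contains Hive-style key=value segments,
    // e.g. topics/events/year=2022/month=10/day=19
    static String addPartitionStatement(String table, String bucket, String prefix) {
        List<String> specs = new ArrayList<>();
        for (String segment : prefix.split("/")) {
            int eq = segment.indexOf('=');
            if (eq > 0) {
                String key = segment.substring(0, eq);
                String value = segment.substring(eq + 1);
                // Values are quoted as strings (assumes string partition columns).
                specs.add(key + "='" + value + "'");
            }
        }
        if (specs.isEmpty()) {
            throw new IllegalArgumentException("no key=value segments in: " + prefix);
        }
        // Assumption: the table data is addressed through the s3a:// scheme.
        String location = "s3a://" + bucket + "/" + prefix;
        return "ALTER TABLE " + table + " ADD IF NOT EXISTS PARTITION ("
                + String.join(", ", specs) + ") LOCATION '" + location + "'";
    }

    public static void main(String[] args) {
        // Hypothetical table, bucket, and prefix, for illustration only.
        System.out.println(addPartitionStatement(
                "events", "my-bucket", "topics/events/year=2022/month=10/day=19"));
    }
}
```

The generated statement could then be run through any Hive client. Because of `IF NOT EXISTS`, re-running it for a partition that is already registered has no effect.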

N.B. A major overhaul of dependencies was required to
avoid conflicts due to Hadoop/Hive jars containing
non-shaded copies of misc. dependencies.
@frankgrimes97 frankgrimes97 requested a review from a team as a code owner October 19, 2022 14:43
@frankgrimes97
Author

Hi, it's not clear to me that the Jenkins-public-CI integration test failures are due to my code changes.
"AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied;"
Is that CI check broken? (I see most other outside-contributed PRs have similar failed checks)

@frankgrimes97 frankgrimes97 force-pushed the feature/add-hive-metastore-support-hive-3.1.3 branch from 1632914 to 27b04fd Compare November 14, 2022 17:30
It uses the maven-shade-plugin to prune out Hive/Hadoop related
duplicate classes so that we don't get version mismatches at runtime

Example usage:
`mvn package -Phive`
@frankgrimes97 frankgrimes97 force-pushed the feature/add-hive-metastore-support-hive-3.1.3 branch from 27b04fd to ed5abc3 Compare November 14, 2022 19:26
@mattssll

Hi @frankgrimes97, I don't want to bother you, but wanted to ask anyway:
do you think this integration will also work with the Glue Data Catalog (the Glue Data Catalog is an implementation of the Hive MetaStore)? That would be useful, since there would be no need to deploy a separate Hive MetaStore service and users could use the fully managed Glue one instead.

Thanks for the work on this feature.

@frankgrimes97
Author

@mattssll I'm not very familiar with AWS's Glue Data Catalog but in my brief searching/reading I found the following:

The AWS Glue Data Catalog is a fully managed, Apache Hive Metastore compatible, metadata repository.
https://aws.amazon.com/about-aws/whats-new/2019/02/source-code-for-the-aws-glue-data-catalog-client-for-apache-hive-metatore-is-now-available-for-download/

Looking at the code, it's not clear whether a stock Apache Hive client can actually talk to Glue Data Catalog: https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore/tree/master/aws-glue-datacatalog-hive2-client/src/main/java/com/amazonaws/glue/catalog/metastore

@frankgrimes97
Author

We are still interested in working to get this work accepted upstream. Are any current Confluent maintainers available to help us accomplish that?

2 participants