Fixes #5448: Implement initial Iceberg Connector using PyIceberg #14825

Merged: 36 commits into open-metadata:main from the issue-5448-iceberg-connector branch, Jan 29, 2024

Contributor

@IceS2 IceS2 commented Jan 23, 2024

Describe your changes:

Fixes #5448

Initial implementation of the Iceberg connector, leveraging PyIceberg to connect to supported catalogs and extract Iceberg table metadata.

I'm forcing some requirements to make it work with the rest of the dependencies. It's important that someone double-checks this 👀

Notes

  • Uses PyIceberg 0.4.0 because it is the last version that uses Pydantic v1

Covers

  • Overall Support for Hive, REST, Glue and DynamoDB as Catalogs
  • Overall Support for Local, AWS S3 and Azure Blob Storage as FileStorage

Does not (yet) Support

  • SQL Catalog (not supported by PyIceberg 0.4.0)
  • Account Key, Connection String, SAS Token on Azure Blob Storage
  • Signer, Proxy Uri, profileName, assumeRoleArn, assumeRoleSessionName, assumeRoleSourceIdentity on AWS S3
  • Google Storage
  • HDFS
  • Partition transformations are not tracked (but source columns are)
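
The support lists above can be read as a simple lookup. A minimal sketch, assuming hypothetical names (`SUPPORTED_CATALOGS`, `SUPPORTED_FILE_SYSTEMS`, `check_supported`) that are not taken from this PR; the string labels are illustrative and may not match the connector's actual configuration values:

```python
from typing import List

# Hypothetical labels derived from the "Covers" list; the connector's real
# configuration values may differ.
SUPPORTED_CATALOGS = {"hive", "rest", "glue", "dynamodb"}
SUPPORTED_FILE_SYSTEMS = {"local", "s3", "azure_blob"}


def check_supported(catalog: str, file_system: str) -> List[str]:
    """Return a list of problems; an empty list means the pair is covered."""
    problems = []
    if catalog.lower() not in SUPPORTED_CATALOGS:
        problems.append(f"catalog '{catalog}' is not supported yet")
    if file_system.lower() not in SUPPORTED_FILE_SYSTEMS:
        problems.append(f"file system '{file_system}' is not supported yet")
    return problems


# A covered combination, then one that hits two unsupported items.
print(check_supported("glue", "s3"))
print(check_supported("sql", "hdfs"))
```

Returning a list of problems rather than raising on the first one is only a design choice for this sketch; it lets a caller report every unsupported setting at once.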

Type of change:

  • New feature

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.
  • The issue properly describes why the new feature is needed, what's the goal, and how we are building it. Any discussion
    or decision-making process is reflected in the issue.
  • I have updated the documentation.
  • I have added tests around the new logic.

@github-actions github-actions bot added the UI, Ingestion, backend, and safe to test labels Jan 23, 2024
@IceS2 IceS2 changed the title from "Fixes <issue-5448>: Implement initial Iceberg Connector using PyIceberg" to "Fixes #5448: Implement initial Iceberg Connector using PyIceberg" Jan 23, 2024
Contributor

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

Contributor

github-actions bot commented Jan 23, 2024

Jest test Coverage

UI tests summary

Lines: 52% (coverage badge)
Statements: 52.79% (27112/51359)
Branches: 35.26% (10848/30770)
Functions: 33.68% (3164/9394)

},
"warehouseLocation": {
"title": "Warehouse Location",
"description": "Warehouse Location. Used to specify a custom warehouse location.",
Collaborator

maybe put an example for each value
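
One way to act on this suggestion is to include an example in each property's description. The sketch below writes the property as a Python dict so it can be checked and dumped to JSON. The location `s3://my-bucket/warehouse/` is made up, the `"type": "string"` entry is an assumption (it is not visible in the fragment above), and whether the project would prefer the description text or the JSON Schema `examples` keyword is also an assumption:

```python
import json

# Hypothetical variant of the property above with an example value added.
warehouse_location = {
    "title": "Warehouse Location",
    "description": (
        "Warehouse Location. Used to specify a custom warehouse location. "
        "Example: s3://my-bucket/warehouse/"
    ),
    # Assumed type; not shown in the original fragment.
    "type": "string",
    # `examples` is a standard JSON Schema annotation keyword.
    "examples": ["s3://my-bucket/warehouse/"],
}

print(json.dumps({"warehouseLocation": warehouse_location}, indent=2))
```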

Contributor

What do you think of making IcebergCatalogFactory either an instantiable class or just breaking it down with catalog_type_map being a global and from_connection being a simple function? I understand the intent but we don't really need that object if we won't use it as an object.

Do you see any advantage, in that context, in defining this piece of code directly in __init__.py?

Contributor Author

Hey, to be honest I don't have a really strong opinion... Been messing around with both ways actually xD

I believe that trying to wrap things with specific classes tends to make the code a bit more readable because it gives you context straight away, but if we avoid doing that in this project I wouldn't mind having just a from_connection function.

As for defining it in __init__.py, it made sense to me because IcebergCatalogFactory is basically the entry point for the submodule.

Again, if it's against our common practices, I wouldn't mind defining it in another file; to make the import statement more direct, we can "re-export" it in __init__.py.
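
The function-based alternative discussed in this thread might look like the following. This is a minimal sketch, not the PR's code: `catalog_type_map` and `from_connection` are the names used in the comments above, but the builder functions, the dict-shaped config, and the returned strings are stand-ins so the example runs without PyIceberg installed.

```python
from typing import Callable, Dict


# Stand-in builders; real ones would return PyIceberg catalog objects.
def _build_hive(name: str, config: dict) -> str:
    return f"hive catalog '{name}' at {config['uri']}"


def _build_rest(name: str, config: dict) -> str:
    return f"rest catalog '{name}' at {config['uri']}"


# Module-level registry, replacing an attribute on a never-instantiated class.
catalog_type_map: Dict[str, Callable[[str, dict], str]] = {
    "hive": _build_hive,
    "rest": _build_rest,
}


def from_connection(name: str, config: dict) -> str:
    """Dispatch to the builder registered for config['type']."""
    catalog_type = config["type"]
    try:
        builder = catalog_type_map[catalog_type]
    except KeyError:
        raise NotImplementedError(
            f"Unsupported catalog type: {catalog_type}"
        ) from None
    return builder(name, config)


print(from_connection("demo", {"type": "rest", "uri": "http://localhost:8181"}))
```

If this lived in its own module, the package's `__init__.py` could re-export `from_connection` so callers keep a short import path, as suggested above.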

Contributor

The Java checkstyle failed.

Please run mvn spotless:apply in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Java code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

Quality Gate passed for 'open-metadata-ui'

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

Quality Gate passed for 'open-metadata-ingestion'

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
74.6% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

Collaborator

@pmbrull pmbrull left a comment

Thank you, @IceS2, for the effort here. Not an easy first connector to tackle.

Congratulations on your first contribution

@pmbrull pmbrull merged commit 373cafc into open-metadata:main Jan 29, 2024
31 of 32 checks passed
@IceS2 IceS2 deleted the issue-5448-iceberg-connector branch January 29, 2024 06:29
Labels
backend, documentation, Ingestion, safe to test, UI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create Generic Iceberg Source Connector
4 participants