187: Adding apache hudi support to dbt #210
Conversation
@vingov What is the progress on this? Are you still working on it?
Yes @atul016, I will fix the integration test in a few days. The code works in Spark 3 but hits an edge case in Spark 2; the fix should land in the Apache Hudi repo, or possibly as a config change here. Apart from the integration tests, do you have any other questions or comments about this PR?
@atul016 There is a lot of user interest in this. (I am the PMC chair for Apache Hudi.) Please let us know how we can help take this forward.
@vingov How are you? Do you have any news about this PR? This is a very useful feature :)
@vingov Thank you for this amazing contribution, and for the detailed testing you're adding along the way! I would love to include this. It looks like you're running into a tight spot around the integration tests, related to Spark 2. I'd have no opposition to upgrading the containerized cluster we run in CI so that it uses Spark 3 instead, but we've struggled with that update in the past (#145). Also, as a fair warning, we're likely to move integration tests from CircleCI to GitHub Actions (as we've done for other plugins), and to lightly refactor the way we're setting up those integration tests (#227). I don't think that should impact the majority of your changes; I just don't want a big merge conflict to come as a surprise.
Hey @rubenssoto - Sorry, I was on vacation for the last few weeks. I'm back and I'll get this landed soon.
@jtcohen6 - Thanks for the insights. I was about to ask you about updating the CI to use Spark 3; I will dig deeper into the gaps on #145. Thanks for the heads-up on GitHub Actions. I will iterate on Spark 3, and after my findings we can work together to get this PR landed. I'm on the dbt Slack as well; you can reach me over there to iterate faster.
@vingov Don't be sorry, thank you so much for your work :)
Force-pushed from 35f5fc7 to 68936d5
@jtcohen6 - Hey, can you please approve the CI workflow to run the integration tests? It's stopped with a message that it needs a maintainer to approve running workflows.
@vingov Approved to run unit tests and code checks via GitHub Actions. We're still mid-cutover between CircleCI and GHA.
Force-pushed from 68936d5 to ae3bfe3
Hey @jtcohen6 - can you please approve the CI workflow to run the integration tests? I rebased and fixed the integration tests, and ran the CircleCI build locally to test it out as well.
Force-pushed from b7e4890 to 0723de9
@jtcohen6 - I'm really sorry to bug you again. Last time I checked and fixed only the integration-spark-thrift CircleCI tests; the Databricks tests were not running locally, so I could not test them. I have now fixed that Databricks error as well. Can you please approve the workflow again? Thanks in advance.
@vingov Sure thing! It looks like the one failing test may be related to the lack of hudi support on Databricks. I'd recommend disabling that model for those tests or, if you see fit, setting up the `persist_docs` test case to run on Apache Spark + Hudi as well.
@@ -0,0 +1,2 @@
{{ config(materialized='table', file_format='hudi') }}
Am I right in thinking that this is failing on Databricks because the `hudi` file format is not available there?
This specific test case (`tests.integration.persist_docs`) isn't running on Apache Spark right now. You're welcome to either:
- add a test to `test_persist_docs.py`, separate from `test_delta_comments`, with `@use_profile("apache_spark")`, and configure this model to be enabled only for that test and disabled when running with a Databricks profile
- disable this model for the time being
Yes, you are right. Since there have already been many iterations on this PR, I'll disable the model for now to keep it simple and merge this PR; I'll bring back both tests in the next iteration.
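For illustration, disabling the model is a one-line config change. A minimal sketch, using dbt's standard `enabled` config (the flag itself is not part of this PR's diff):

```sql
-- Sketch only: disable this model so profiles without hudi support
-- (e.g. the Databricks test profile) skip it entirely.
{{ config(materialized='table', file_format='hudi', enabled=false) }}
```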
@@ -77,6 +81,26 @@ def run_and_test(self):
    def test_delta_strategies_databricks_cluster(self):
        self.run_and_test()
# Uncomment this hudi integration test after the hudi 0.10.0 release to make it work. |
Neat! Out of curiosity, what's the change coming in v0.10 that will make this sail smoothly?
Spark SQL DML support was added to Apache Hudi in the recent 0.9.0 release, but a few gaps were fixed after that version shipped; those fixes are scheduled for the next release in a few weeks.
More specifically, these commits are the ones relevant to making these tests run smoothly.
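For context, the kind of Spark SQL DML in question looks like the statement below. This is an illustrative sketch only; the table and column names are made up, not taken from the Hudi test suite:

```sql
-- Hypothetical example of a Spark SQL MERGE against a Hudi table,
-- the DML surface that Hudi 0.9.0 introduced.
MERGE INTO hudi_target AS t
USING staged_updates AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```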
@jtcohen6 - Please approve the workflow one last time, thanks!
@jtcohen6 - All the integration tests finally passed. I guess it still needs your approval to run the Python 3.8 unit tests.
@vingov Thank you for the contribution! Very neat to be able to include this in time for v1 :)
* Refactor seed macros, clearer sql param logging (#250)
  * Try refactoring seed macros
  * Add changelog entry
* 187: Adding apache hudi support to dbt (#210)
  * initial working version
  * Rebased and resolve all the merge conflicts.
  * Rebased and resolved merge conflicts.
  * Removed hudi dep jar and used the released version via packages option
  * Added insert overwrite unit tests for hudi
  * Used unique_key as default value for hudi primaryKey option
  * Updated changelog.md with this new update.
  * Final round of testing and few minor fixes
  * Fixed lint issues
  * Fixed the integration tests
  * Fixed the circle ci env to add hudi packages
  * Updated hudi spark bundle to use scala 2.11
  * Fixed Hudi incremental strategy integration tests and other integration tests
  * Fixed the hudi hive sync hms integration test issues
  * Added sql HMS config to fix the integration tests.
  * Added hudi hive sync mode conf to CI
  * Set the hms schema verification to false
  * Removed the merge update columns hence its not supported.
  * Passed the correct hiveconf to the circle ci build script
  * Disabled few incremental tests for spark2 and reverted to spark2 config
  * Added hudi configs to the circle ci build script
  * Commented out the Hudi integration test until we have the hudi 0.10.0 version
  * Fixed the macro which checks the table type.
  * Disabled this model since hudi is not supported in databricks runtime, will be added later
* Update profile_template.yml for v1 (#247)
  * Update profile_template.yml for v1
  * PR feedback, fix indentation issues
  * It was my intention to remove the square brackets
  * Fixup changelog entry
  * Merge main, update changelog
* Bump version to 1.0.0rc2 (#259)
  * bumpversion 1.0.0rc2
  * Update changelog
  * Use pytest-dbt-adapter==0.6.0
* Corrected definition for set full_refresh_mode (#262)
  * Replaced definition for set full_refresh_mode
  * Updated changelog
  * Edit changelog
* `get_response` -> `AdapterResponse` (#265)
  * Return AdapterResponse from get_response
  * fix flake

Co-authored-by: Jeremy Cohen <jeremy@dbtlabs.com>
Co-authored-by: Vinoth Govindarajan <vinothg@uber.com>
Co-authored-by: Sindre Grindheim <sindre.grindheim@pm.me>
resolves #187
Description
Apache Hudi brings ACID transactions, record-level updates/deletes, and change streams to data lakes. Both Hudi and dbt are great technologies; this PR integrates Apache Hudi file-format support into dbt so users can create and model Hudi datasets using dbt.
This PR adds one more file format that supports the incremental merge strategy, so users can now use this feature in all Spark environments, in addition to the Delta format, which works only in the Databricks runtime environment.
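As a sketch of what this enables, a dbt incremental model using the new file format might look like the following. The model, source, and column names are hypothetical; `file_format='hudi'` and the `unique_key`-to-primaryKey mapping are what this PR wires up:

```sql
-- Hypothetical incremental model using the Hudi file format.
-- Per this PR, unique_key doubles as Hudi's primaryKey option.
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    file_format='hudi',
    unique_key='id'
) }}

select id, event_ts, payload
from {{ source('raw', 'events') }}
{% if is_incremental() %}
  where event_ts > (select max(event_ts) from {{ this }})
{% endif %}
```

On incremental runs, dbt generates a Spark SQL merge into the existing Hudi table keyed on `unique_key`; on the first run it builds the table from scratch.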
Tested locally:
Checklist
I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.