
Bug/databricks sql incremental #49

Merged
12 commits merged into main on Jul 31, 2024

Conversation

fivetran-catfritz
Contributor

@fivetran-catfritz fivetran-catfritz commented Jul 24, 2024

PR Overview

This PR will address the following Issue/Feature:

This PR will result in the following new package version:

  • v0.10.0 since we are changing file formats and materializations.

Please provide the finalized CHANGELOG entry which details the relevant changes included in this PR:

🚨 Breaking Changes 🚨

⚠️ Since the following changes result in the table format changing, we recommend running a --full-refresh after upgrading to this version to avoid possible incremental failures.

  • For Databricks All-Purpose clusters, incremental models will now be materialized using the delta table format (previously parquet).

    • Delta tables are generally more performant than parquet and are also more widely available for Databricks users. This will also prevent compilation issues on customers' managed tables.
  • For Databricks SQL Warehouses, incremental materialization will not be used due to the incompatibility of the insert_overwrite strategy.

Under the Hood

  • The is_incremental_compatible macro has been added to return true if the target warehouse supports our chosen incremental strategy.
    • This update was applied because other Databricks runtimes have been discovered (i.e., an endpoint and an external runtime) that do not support the insert_overwrite incremental strategy used.
  • Added integration testing for Databricks SQL Warehouse.
  • Added consistency tests for models:
    • mixpanel__daily_events
    • mixpanel__event
    • mixpanel__monthly_events
    • mixpanel__sessions
  • Updated logic for macro mixpanel_lookback to align with logic used in similar macros in other packages.
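
For readers unfamiliar with the pattern, here is a minimal sketch of what a macro like is_incremental_compatible could look like. It is not the code merged in this PR. It assumes that a Databricks SQL Warehouse can be recognized by "/warehouses/" appearing in target.http_path, and that every non-Databricks target supports the chosen strategy. Both are assumptions for illustration, and the sketch does not cover the endpoint or external runtimes mentioned above.

```sql
{% macro is_incremental_compatible() %}
    {# Sketch only: the detection rule below is an assumption, not the package's actual logic. #}
    {% if target.type == 'databricks' %}
        {# Assumption: SQL Warehouse connections expose an http_path containing "/warehouses/". #}
        {% set http_path = target.http_path | default('', true) %}
        {% if '/warehouses/' in http_path %}
            {{ return(false) }}
        {% else %}
            {{ return(true) }}
        {% endif %}
    {% else %}
        {# Assumption: all other targets are treated as compatible. #}
        {{ return(true) }}
    {% endif %}
{% endmacro %}
```

Returning a boolean keeps the decision in one place, so each model can switch materialization with a single conditional expression in its config.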

PR Checklist

Basic Validation

Please acknowledge that you have successfully performed the following commands locally:

  • dbt run --full-refresh && dbt test
  • dbt run (if incremental models are present) && dbt test

Before marking this PR as "ready for review" the following have been applied:

  • The appropriate issue has been linked, tagged, and properly assigned.
  • All necessary documentation and version upgrades have been applied.
  • docs were regenerated (unless this PR does not include any code or yml updates).
  • BuildKite integration tests are passing.
  • Detailed validation steps have been provided below.

Detailed Validation

Please share any and all of your validation steps:

Consistency tests pass

  • Screenshot 2024-07-26 at 11 22 03 AM

Parquet file format

  • Our Databricks SQL instance is "managed", so before I could address the insert_overwrite incompatibility, I was getting the below error for the parquet file format.

    • Screenshot 2024-07-24 at 4 50 01 PM
  • Updating to delta format resolves this issue.

Insert-overwrite incremental strategy

  • Once the file format issue was resolved, I was able to reproduce the reported issue and get the below error.

    • Screenshot 2024-07-24 at 4 52 16 PM
  • Updating with the materialized='incremental' if is_incremental_compatible() else 'table' approach resolved this issue. When running a non-full-refresh run with a target of Databricks SQL, confirm there is no error and tables are created instead of an incremental run.

    • Screenshot 2024-07-24 at 4 55 48 PM
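
To make the two fixes above concrete, the following is a rough sketch of a model config that combines the conditional materialization with the delta file format. The materialized expression, the delta format, and the insert_overwrite strategy come from this PR description. The partition column, the upstream model name, and the incremental filter are hypothetical placeholders, not the package's real values.

```sql
-- Sketch only: partition column, ref name, and filter are hypothetical.
{{
    config(
        materialized='incremental' if is_incremental_compatible() else 'table',
        incremental_strategy='insert_overwrite',
        partition_by=['date_day'],
        file_format='delta'
    )
}}

select *
from {{ ref('hypothetical_upstream_model') }}

{% if is_incremental() %}
-- Only applies when the model is actually materialized incrementally.
where date_day >= (select max(date_day) from {{ this }})
{% endif %}
```

On a SQL Warehouse target the first expression evaluates to 'table', so the model is rebuilt in full on every run and the is_incremental() branch is never taken.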

If you had to summarize this PR in an emoji, which would it be?

💃

@fivetran-catfritz fivetran-catfritz linked an issue Jul 24, 2024 that may be closed by this pull request
4 tasks
* Update mixpanel_lookback.sql

* update changelog
@fivetran-catfritz fivetran-catfritz self-assigned this Jul 24, 2024
Contributor

@fivetran-joemarkiewicz fivetran-joemarkiewicz left a comment

@fivetran-catfritz these changes look great! Just a few suggestions and one code update request which is the same as my HubSpot comment. Let me know if you have any questions.

Once these updates are applied the re-review should be good for approval!

CHANGELOG.md Outdated
@@ -1,3 +1,26 @@
# dbt_mixpanel v0.10.0

[PR #48](https://github.com/fivetran/dbt_mixpanel/pull/48) includes the following updates:
Contributor

Update to the PR link reference

Suggested change
- [PR #48](https://github.com/fivetran/dbt_mixpanel/pull/48) includes the following updates:
+ [PR #49](https://github.com/fivetran/dbt_mixpanel/pull/49) includes the following updates:

Contributor Author

Thanks 😅. Updated.

CHANGELOG.md Outdated
- For Databricks SQL Warehouses, incremental materialization will not be used due to the incompatibility of the `insert_overwrite` strategy.

## Under the Hood
- The `is_incremental_compatible` macro has been added to return `true` if the target warehouse supports our chosen incremental strategy.
Contributor

Suggested change
- - The `is_incremental_compatible` macro has been added to return `true` if the target warehouse supports our chosen incremental strategy.
+ - The `is_incremental_compatible` macro has been added and will return `true` if the target warehouse supports our chosen incremental strategy.

Contributor Author

updated!

@@ -6,30 +6,13 @@

{% macro default__mixpanel_lookback(from_date, datepart, interval, safety_date='2010-01-01') %}
Contributor

In Fivetran Log I noticed we provide default values for datepart and interval. Should we do the same here or is that not necessary?

Contributor Author

Ahh good callout. It's not necessary in this package because I defined those variables, but it would make sense to keep things consistent, and I think having the defaults is the better version. Updated!
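
For illustration, the change discussed in this thread might look roughly like the sketch below. The default values 'day' and 7 are assumptions chosen for the example (the thread does not state them), and the macro body is a simplified guess at a typical lookback pattern built on dbt's cross-database dateadd macro, not the code merged here. Depending on the warehouse, the fallback date may also need an explicit cast.

```sql
{% macro default__mixpanel_lookback(from_date, datepart='day', interval=7, safety_date='2010-01-01') %}
    {# Sketch only: default values and body are illustrative assumptions. #}
    coalesce(
        (select {{ dbt.dateadd(datepart=datepart, interval=-interval, from_date_or_timestamp=from_date) }}
         from {{ this }}),
        {{ "'" ~ safety_date ~ "'" }}
    )
{% endmacro %}
```

With defaults in place, a caller that only passes from_date still compiles, while callers that already pass datepart and interval explicitly are unaffected.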

Contributor Author

@fivetran-catfritz fivetran-catfritz left a comment

@fivetran-joemarkiewicz This one is ready for re-review as well!

Contributor

@fivetran-joemarkiewicz fivetran-joemarkiewicz left a comment

@fivetran-catfritz great work on this PR and thanks for addressing my review notes. This PR looks good for release review! I did encounter a few issues when running the validations with internal data, but they were pretty small and we have an easy workaround. Please take a look and address my remaining notes, but no need for those to be a blocker from initiating the release review process.

Great work!

Co-authored-by: Joe Markiewicz <74217849+fivetran-joemarkiewicz@users.noreply.github.com>
Contributor Author

@fivetran-catfritz fivetran-catfritz left a comment

Thank you! I have updated the tests, so now proceeding with release!

Contributor

@fivetran-reneeli fivetran-reneeli left a comment

lgtm!

@fivetran-catfritz fivetran-catfritz merged commit 153590a into main Jul 31, 2024
9 checks passed
Development

Successfully merging this pull request may close these issues.

[Feature] Update incremental strategy for Databricks SQL Warehouse
3 participants