Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Set unique table suffix to allow parallel incremental executions #650

Merged
merged 6 commits into from
May 23, 2024

Conversation

pierrebzl
Copy link
Contributor

@pierrebzl pierrebzl commented May 13, 2024

Description

For some specific cases (eg. backfill very large amount of data), we need to execute parallel multiple dbt build of specific incremental model in which we pass the date (or batch number) as var argument.
For example, we have a model we run every hour using Airflow for which we pass the a date relative to the Airflow scheduler.

If we want to process by batch of N hours in parallel using Airflow concurrency, we need the tmp table create by each of the dbt run to be unique. Else, you are going to end up with N insert attempting to run with the same __dbt_tmp name, creating conflict and ultimately creating failure.

Similar issue open here: Tomme/dbt-athena#62

Models used to test - Optional

Checklist

  • You followed contributing section
  • You kept your Pull Request small and focused on a single feature or bug fix.
  • You added unit testing when necessary
  • You added functional testing when necessary

Second (incremental) run model output:

-- /* {"app": "dbt", "dbt_version": "1.8.0", "profile_name": "test", "target_name": "default", "node_id": "model.test.unique_tmp_table_suffix"} */

create table "awsdatacatalog"."test17157222480613300988_test_unique_tmp_table_suffix"."unique_tmp_table_suffix__dbt_tmp_375973fa-ecdd-44dc-9a63-788abaa41832"
    with (
      table_type='hive',
    is_external=true,external_location='s3://XXX/test17157222480613300988_test_unique_tmp_table_suffix/unique_tmp_table_suffix__dbt_tmp_375973fa-ecdd-44dc-9a63-788abaa41832/1058bc6a-9eb1-4b0b-b23a-cce26c28b669',
      format='parquet'
    )
    as
      
select
    random() as rnd,
    cast(from_iso8601_date('2024-01-02') as date) as date_column

@nicor88
Copy link
Contributor

nicor88 commented May 13, 2024

The general idea of using unique tmp table location is good.
Few notes and concerns:

  • parallel transactions on the same iceberg table might fail, e.g if you run in parallel backfilling by a date, with a delete/insert operations leads to a failure, due to glue optimistic logic. With some retries might work
  • running parallel insert+overwrites with an hive/parquet table should just work fine, because each dbt run will operate on a specific "partition" with its own unique prefix, and tmp table

@pierrebzl I left few comments, please have a look.

@pierrebzl pierrebzl force-pushed the unique-tmp-table-name branch from 7a4520a to 8e41849 Compare May 14, 2024 21:39
@pierrebzl
Copy link
Contributor Author

Hi @nicor88,
Thanks a lot for your feedback, definitely makes a lot of sense. I've attempted to implement the logic you mentioned.
Couple of points I don't really know how to address:

  1. Any idea how in the functional test, assert the table name generate in the second run / CTAS query? Ideally, I would like to check a uuid is present and not the default __dbt_tmp value.
  2. Is the logic I put in the incremental.sql at the right place and looks good with you in terms of when this unique id should apply? In my use case, I only need it on insert_overwrite + hive.
    Thanks again for your time!

@pierrebzl pierrebzl force-pushed the unique-tmp-table-name branch from 8e41849 to 2658f77 Compare May 14, 2024 21:48
@pierrebzl pierrebzl marked this pull request as ready for review May 14, 2024 21:48
@pierrebzl pierrebzl force-pushed the unique-tmp-table-name branch from 2658f77 to 135ab4c Compare May 14, 2024 21:50
@nicor88
Copy link
Contributor

nicor88 commented May 15, 2024

@Jrmyy @svdimchenko could you assist @pierrebzl?
I will be away from keyboard till next week. Thanks

@nicor88
Copy link
Contributor

nicor88 commented May 15, 2024

@pierrebzl could you lint your changes? Regarding functional testing, what you proposed seems fine. It's hard to intercept the tmp table name in the test, I'm fine with what you proposed

@pierrebzl pierrebzl force-pushed the unique-tmp-table-name branch from 135ab4c to 54229d0 Compare May 15, 2024 12:53
@pierrebzl
Copy link
Contributor Author

@pierrebzl could you lint your changes? Regarding functional testing, what you proposed seems fine. It's hard to intercept the tmp table name in the test, I'm fine with what you proposed

Yes sorry for this. Should be done:

(venv) dbt-athena $ pre-commit run mypy --show-diff-on-failure --color=always --all-files
mypy.....................................................................Passed

@pierrebzl pierrebzl requested a review from nicor88 May 21, 2024 13:16
@pierrebzl
Copy link
Contributor Author

Hi @nicor88, let me know if anything missing to go ahead with this change. Thanks!

Jrmyy
Jrmyy previously approved these changes May 21, 2024
Copy link
Contributor

@Jrmyy Jrmyy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good ✅

@nicor88
Copy link
Contributor

nicor88 commented May 22, 2024

@pierrebzl there are some conflicts in the Readme, due to the latest merge, could you please resolve that?

@pierrebzl pierrebzl force-pushed the unique-tmp-table-name branch from 34a6f3e to fc61a5e Compare May 22, 2024 16:49
Jrmyy
Jrmyy previously approved these changes May 23, 2024
@nicor88
Copy link
Contributor

nicor88 commented May 23, 2024

@pierrebzl could you lint again? please, I left a tiny comment, looks good.

@pierrebzl
Copy link
Contributor Author

pierrebzl commented May 23, 2024

@pierrebzl could you lint again? please, I left a tiny comment, looks good.

Sorry for that.
Done, I must be something wrong but my local pre-commit check always fails on flake8 as it seems to check other files than changed ones.

@pierrebzl pierrebzl force-pushed the unique-tmp-table-name branch from c1b663d to 4a52129 Compare May 23, 2024 14:07
@pierrebzl
Copy link
Contributor Author

I had to change the regex expression format: 4a52129#diff-58cf52251bceac0bf774fa9511051801f5c8d6b34e5af08e3af56dbe271c6df8R65-R68 as it got broke by the auto lintting. Should pass tests now.

@nicor88
Copy link
Contributor

nicor88 commented May 23, 2024

@pierrebzl thanks for fixing, I was wondering why the CI was failing, but it was a genuine failure, and I just re-trigger it.

@pierrebzl
Copy link
Contributor Author

@pierrebzl thanks for fixing, I was wondering why the CI was failing, but it was a genuine failure, and I just re-trigger it.

Yes, I think we are good now 🥳

@nicor88
Copy link
Contributor

nicor88 commented May 23, 2024

@pierrebzl great job 💯

@nicor88 nicor88 merged commit 8813f9e into dbt-labs:main May 23, 2024
10 checks passed
@pierrebzl
Copy link
Contributor Author

@pierrebzl great job 💯

Thanks for your support!

kodiakhq bot referenced this pull request in cloudquery/policies Jun 5, 2024
This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [dbt-athena-community](https://togithub.com/dbt-athena/dbt-athena) | minor | `==1.7.2` -> `==1.8.2` |

---

### Release Notes

<details>
<summary>dbt-athena/dbt-athena (dbt-athena-community)</summary>

### [`v1.8.2`](https://togithub.com/dbt-athena/dbt-athena/releases/tag/v1.8.2)

[Compare Source](https://togithub.com/dbt-athena/dbt-athena/compare/v1.8.1...v1.8.2)

### What's Changed

#### Fixes

-   fix: Add wait_random_exponential for query retries by [@&#8203;svdimchenko](https://togithub.com/svdimchenko) in [https://github.com/dbt-athena/dbt-athena/pull/655](https://togithub.com/dbt-athena/dbt-athena/pull/655)
-   fix: Resolve error when cloning Python models ([#&#8203;645](https://togithub.com/dbt-athena/dbt-athena/issues/645)) by [@&#8203;jeancochrane](https://togithub.com/jeancochrane) in [https://github.com/dbt-athena/dbt-athena/pull/651](https://togithub.com/dbt-athena/dbt-athena/pull/651)
-   fix: Fixed table_type for GOVERNED tables by [@&#8203;svdimchenko](https://togithub.com/svdimchenko) in [https://github.com/dbt-athena/dbt-athena/pull/661](https://togithub.com/dbt-athena/dbt-athena/pull/661)

#### Features

-   feat: Set unique table suffix to allow parallel incremental executions by [@&#8203;pierrebzl](https://togithub.com/pierrebzl) in [https://github.com/dbt-athena/dbt-athena/pull/650](https://togithub.com/dbt-athena/dbt-athena/pull/650)
-   feat: Allow custom schema def for tmp tables generated by incremental by [@&#8203;pierrebzl](https://togithub.com/pierrebzl) in [https://github.com/dbt-athena/dbt-athena/pull/659](https://togithub.com/dbt-athena/dbt-athena/pull/659)
-   feat: Implement iceberg retry logic by [@&#8203;svdimchenko](https://togithub.com/svdimchenko) in [https://github.com/dbt-athena/dbt-athena/pull/657](https://togithub.com/dbt-athena/dbt-athena/pull/657)

#### Dependencies

-   chore: Update moto requirement from ~=5.0.7 to ~=5.0.8 by [@&#8203;dependabot](https://togithub.com/dependabot) in [https://github.com/dbt-athena/dbt-athena/pull/660](https://togithub.com/dbt-athena/dbt-athena/pull/660)
-   chore: Bumped version to 1.8.2 for release by [@&#8203;svdimchenko](https://togithub.com/svdimchenko) in [https://github.com/dbt-athena/dbt-athena/pull/663](https://togithub.com/dbt-athena/dbt-athena/pull/663)

#### Docs

-   docs: Cleanup README grammar, punctuation, and capitalisation by [@&#8203;dfsnow](https://togithub.com/dfsnow) in [https://github.com/dbt-athena/dbt-athena/pull/654](https://togithub.com/dbt-athena/dbt-athena/pull/654)

#### New Contributors

-   [@&#8203;jeancochrane](https://togithub.com/jeancochrane) made their first contribution in [https://github.com/dbt-athena/dbt-athena/pull/651](https://togithub.com/dbt-athena/dbt-athena/pull/651)
-   [@&#8203;pierrebzl](https://togithub.com/pierrebzl) made their first contribution in [https://github.com/dbt-athena/dbt-athena/pull/650](https://togithub.com/dbt-athena/dbt-athena/pull/650)

**Full Changelog**: dbt-labs/dbt-athena@v1.8.1...v1.8.2

### [`v1.8.1`](https://togithub.com/dbt-athena/dbt-athena/releases/tag/v1.8.1)

[Compare Source](https://togithub.com/dbt-athena/dbt-athena/compare/v1.7.2...v1.8.1)

#### What's Changed

##### Relevant notes

⚠️ 1.8.1 version is equivalent to 1.8.0 in term of features and fixes.

You can install the changes from this release via

    pip install dbt-athena-community==1.8.1

##### Features

-   feat: Add column meta to glue column parameters by [@&#8203;SoumayaMauthoorMOJ](https://togithub.com/SoumayaMauthoorMOJ) in [https://github.com/dbt-athena/dbt-athena/pull/644](https://togithub.com/dbt-athena/dbt-athena/pull/644)

##### Dependencies

-   chore: Update moto requirement from ~=5.0.6 to ~=5.0.7 by [@&#8203;dependabot](https://togithub.com/dependabot) in [https://github.com/dbt-athena/dbt-athena/pull/648](https://togithub.com/dbt-athena/dbt-athena/pull/648)

##### Docs

-   docs: Cleanup Python models section of README by [@&#8203;dfsnow](https://togithub.com/dfsnow) in [https://github.com/dbt-athena/dbt-athena/pull/643](https://togithub.com/dbt-athena/dbt-athena/pull/643)

#### New Contributors

-   [@&#8203;dfsnow](https://togithub.com/dfsnow) made their first contribution in [https://github.com/dbt-athena/dbt-athena/pull/643](https://togithub.com/dbt-athena/dbt-athena/pull/643)

</details>

---

### Configuration

📅 **Schedule**: Branch creation - "before 4am on the first day of the month" (UTC), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Renovate Bot](https://togithub.com/renovatebot/renovate).
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy4zODMuMCIsInVwZGF0ZWRJblZlciI6IjM3LjM4My4wIiwidGFyZ2V0QnJhbmNoIjoibWFpbiIsImxhYmVscyI6WyJhdXRvbWVyZ2UiXX0=-->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants