
dbt: move to materialize-boilerplate, only trigger after delay #2082

Open · wants to merge 1 commit into main

Conversation

mdibaiee
Member

@mdibaiee mdibaiee commented Oct 24, 2024

Description:

  • Move the dbt job trigger logic to materialize-boilerplate, where it can be better integrated with the sync frequency logic and where materialization connectors can more easily adopt the functionality in general

Workflow steps:

(How does one use this feature, and how has it changed)

Documentation links affected:

(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)

Notes for reviewers:

(anything that might help someone review this PR)



@mdibaiee added the change:planned ("This is a planned change") label on Oct 24, 2024
Comment on lines +210 to +211
// We trigger dbt job only if the delay is taking effect, to avoid bursting
// dbt job triggers during backfills
mdibaiee (Member Author)
One question worth posing here: are there cases where a materialization has so much data that it always skips the default delay? In those cases it might look like the dbt job trigger is not working at all, when in fact it is always being postponed... 🤔

Member
There are definitely active tasks where a materialization has so much continuous data that, combined with the backpressure of the destination system, transactions are always greater than 1MM documents and the delay would always be skipped.

Fortunately these are also cases where the tasks are configured with an explicit delay of 0s, so it doesn't really matter. I doubt this will always be the case, though. Our current delay-triggering mechanism, based on the number of documents in a transaction, is fundamentally flawed, but it's the best we can do right now.

@williamhbaker williamhbaker (Member) left a comment
Here's a scenario to consider: a user may have a batch capture that runs every 24 hours, producing a large number of documents once a day and nothing at any other time. They are materializing from that capture, so the materialization only does any work every 24 hours. When a materialization first starts up, it runs a fixed number of commits (5) without applying any delay, to avoid accidentally tripping the delay due to the artificially small commits the runtime sometimes produces when a connector restarts. If the materialization connector restarts during that ~24-hour idle period, it will have 5 instant commits queued up for when it next sees data. If it can then work through the batch capture's 24 hours' worth of data in 5 or fewer transactions, it will never trigger the update delay, and thus would never trigger the dbt job. It might eventually trigger the dbt job during the next 24-hour cycle, or it might not even then, if the materialization is restarted again during the idle period.


Development

Successfully merging this pull request may close these issues.

dbt cloud trigger: avoid bursts of triggers when backfilling materializations