Create and run accurate SQL statements when using ExecutionMode.AIRFLOW_ASYNC #1474

Merged
merged 42 commits into from
Feb 5, 2025

Conversation

pankajkoti
Contributor

@pankajkoti pankajkoti commented Jan 21, 2025

Overview

This PR introduces a reliable way to extract SQL statements run by dbt-core so Airflow asynchronous operators can use them. It fixes the experimental BQ implementation of ExecutionMode.AIRFLOW_ASYNC introduced in Cosmos 1.7 (#1230).

Previously, in #1230, we studied how dbt-core runs --full-refresh for BQ and hard-coded the resulting SQL header in Cosmos as an experimental feature. Since then, we have realised that this approach is prone to errors (e.g. #1260) and that it is unrealistic for Cosmos to recreate the logic dbt-core and its adapters use to generate all the SQL statements for different operations, data warehouses, and types of materialisation.

With this PR, we use dbt-core to create the complete SQL statements without dbt-core running those transformations. This enables better compatibility with various dbt-core features while ensuring correctness in running models.

The drawback of the current approach is that it relies on monkey patching, a technique used to dynamically update the behaviour of a piece of code at run-time. Cosmos monkey patches dbt-core adapter methods at the point where they would normally execute SQL statements. Cosmos modifies this behaviour so that the SQL statements are written to disk without performing any operations on the actual data warehouse.
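
As a rough illustration of the technique described above (not the code in this PR), the sketch below monkey patches an `execute` method so that SQL is written to disk instead of being run. `WarehouseConnection`, `patch_execute_to_write_sql`, and the file naming are hypothetical stand-ins invented for this example; they are not dbt-core or Cosmos APIs.

```python
import tempfile
from pathlib import Path


class WarehouseConnection:
    """Hypothetical stand-in for an adapter's connection manager (not dbt's API)."""

    def execute(self, sql: str) -> str:
        # A real adapter would send the statement to the data warehouse here.
        raise RuntimeError("this would run SQL against the warehouse")


def patch_execute_to_write_sql(target_dir: Path) -> None:
    """Rebind WarehouseConnection.execute so SQL is saved to disk, not run."""
    counter = {"n": 0}

    def _write_instead_of_execute(self, sql: str) -> str:
        counter["n"] += 1
        path = target_dir / f"statement_{counter['n']}.sql"
        path.write_text(sql)
        return str(path)

    # The monkey patch itself: replace the attribute on the class at run-time.
    WarehouseConnection.execute = _write_instead_of_execute


target = Path(tempfile.mkdtemp())
patch_execute_to_write_sql(target)

# Any caller that still invokes execute() now only produces a file.
saved = WarehouseConnection().execute("CREATE OR REPLACE TABLE t AS SELECT 1")
captured_sql = Path(saved).read_text()
```

Because the patch replaces the attribute on the class, code that calls `execute` does not need to change; that is also why a change to the patched interface would break the approach silently unless it is tested.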

The main risk of this strategy is that dbt may change its interface. For this reason, we logged the follow-up ticket #1489 to make sure we test the latest version of dbt and its adapters and confirm that the monkey patching works as expected regardless of the version being used. That said, since the method being monkey patched is part of the interface between dbt-core and its adapters, we believe the risk of breaking changes is low.

The other challenge with the current approach is that every Cosmos task relies on the following:

  1. dbt-core being installed alongside the Airflow installation
  2. the execution of a significant part of the dbtRunner logic

We have logged a follow-up ticket to evaluate the possibility of overcoming these challenges: #1477

Key Changes

  1. Mocked BigQuery Adapter Execution:
    • Introduced _mock_bigquery_adapter() to override BigQueryConnectionManager.execute, ensuring SQL is only written to the target directory and skipping execution in the warehouse.
    • The generated SQL is then submitted using Airflow’s BigQueryInsertJobOperator in deferrable mode.
  2. Refactoring AbstractDbtBaseOperator:
    • Previously, AbstractDbtBaseOperator inherited from BaseOperator, causing conflicts when combined with BigQueryInsertJobOperator in our ExecutionMode.AIRFLOW_ASYNC classes and with the interface built in Add structure to support multiple db for async operator execution #1483.
    • Refactored to AbstractDbtBase (no longer inheriting from BaseOperator), requiring explicit BaseOperator initialization in all derived operators.
    • Updated the following existing operators so that they initialise BaseOperator themselves:
      • DbtAzureContainerInstanceBaseOperator
      • DbtDockerBaseOperator
      • DbtGcpCloudRunJobBaseOperator
      • DbtKubernetesBaseOperator
  3. Changes to dbt Compilation Workflow
    • Removed _add_dbt_compile_task, which previously pre-generated SQL and uploaded it to remote storage; subsequent tasks then downloaded this compiled SQL for their execution.
    • Instead, dbt run is now directly invoked in each task using the mocked adapter to generate the full SQL.
    • A future issue will assess whether we should reintroduce a compile task using the mocked adapter for SQL generation and upload, reducing redundant dbt calls in each task.
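
The refactoring in item 2 can be pictured with the minimal sketch below. It assumes simplified stand-in classes (`FakeBaseOperator`, `FakeInsertJobOperator`, `DbtBase`, `DbtLocalOperator`, `DbtAsyncOperator`), all hypothetical and much smaller than the real Airflow and Cosmos classes. It only shows the general pattern: the dbt-specific base no longer inherits from the operator base, so each derived class initialises the operator side explicitly.

```python
class FakeBaseOperator:
    """Hypothetical stand-in for Airflow's BaseOperator."""

    def __init__(self, task_id: str, **kwargs):
        self.task_id = task_id


class FakeInsertJobOperator(FakeBaseOperator):
    """Hypothetical stand-in for an existing operator with its own __init__."""

    def __init__(self, task_id: str, configuration: dict, **kwargs):
        super().__init__(task_id=task_id, **kwargs)
        self.configuration = configuration


class DbtBase:
    """Plain class holding dbt settings; it inherits from no operator."""

    def __init__(self, project_dir: str, **kwargs):
        self.project_dir = project_dir


class DbtLocalOperator(DbtBase, FakeBaseOperator):
    def __init__(self, task_id: str, project_dir: str, **kwargs):
        DbtBase.__init__(self, project_dir=project_dir)
        # Explicit operator initialisation, now required in derived classes.
        FakeBaseOperator.__init__(self, task_id=task_id)


class DbtAsyncOperator(FakeInsertJobOperator, DbtBase):
    def __init__(self, task_id: str, project_dir: str, **kwargs):
        DbtBase.__init__(self, project_dir=project_dir)
        # The operator side is initialised once, through the concrete operator.
        FakeInsertJobOperator.__init__(
            self, task_id=task_id, configuration={"query": {}}
        )


local_op = DbtLocalOperator(task_id="run_model", project_dir="/dbt/project")
async_op = DbtAsyncOperator(task_id="run_model_async", project_dir="/dbt/project")
```

With this layout the dbt base can be mixed into a class that already derives from a concrete operator without the operator base appearing twice in the hierarchy.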

Issue updates

The PR fixes the following issues:

  1. closes: [bug] Fix ExecutionMode.AIRFLOW_ASYNC query #1260
    • Previously, we only supported --full-refresh dbt run with static SQL headers (e.g., CREATE/DROP TABLE).
    • Now, we support dynamic SQL headers based on materializations, including CREATE OR REPLACE TABLE, CREATE OR REPLACE VIEW, etc.
  2. closes: [async] Evaluate possibility of supporting macros when using ExecutionMode.AIRFLOW_ASYNC #1271
    • dbt macros are evaluated at runtime during the dbt run invocation using the mocked adapter, and this PR lays the groundwork for supporting them in async execution mode.
  3. closes: [async] Support running models without --full-refresh when using ExecutionMode.AIRFLOW_ASYNC #1265
    • Now, large datasets can avoid full drops and recreations, enabling incremental model updates.
  4. closes: [async] Support different materializations for BQ #1261
    • Previously, only tables (--full-refresh) were supported; this PR implements logic for handling different materializations that dbt supports like table, view, incremental, ephemeral, and materialized views.
  5. closes: [async] Evaluate the possiblity of using dbt itself to create the full SQL command #1266
    • Instead of relying on dbt compile (which only outputs SELECT statements), we now let dbt generate complete SQL queries, including SQL headers/DDL statements for the queries corresponding to the resource nodes and state of tables/views in the backend warehouse
  6. closes: [async] Emit datasets when using ExecutionMode.AIRFLOW_ASYNC #1264
    • With this PR, we also support emitting datasets for ExecutionMode.AIRFLOW_ASYNC.

Example DAG showing ExecutionMode.AIRFLOW_ASYNC deferring tasks and the dynamic query submitted in the logs

Screenshot 2025-02-04 at 1 02 42 PM

Next Steps & Considerations:

@pankajkoti pankajkoti changed the title Monkeypatch BiqQuery adapter for retriveing SQL for async execution Monkeypatch BiqQuery adapter to retrive SQL for async execution Jan 21, 2025
Collaborator

@tatiana tatiana left a comment

@pankajkoti I'm very excited that we now have a more reliable way of calculating the full dbt SQL query. This approach fixes #1260 and solves many of the async tickets we have open.

Monkey-patching always carries a risk, but it is worth it at this stage.

It would be great if, either as part of this PR or as a priority follow-up PR, we had an efficient way of testing that the monkey patching works in multiple versions of dbt, including the latest releases, and that the transformation is not being executed when we run the dbt command. I believe this must be done before we release this feature in 1.9.0.

I've logged two follow-up tickets that are relevant:

It would be great if these could be accomplished before the 1.9.0 release, but I'm also happy with us sticking to this approach if time does not allow further analysis / implementation.

@pankajkoti
Contributor Author

cc: @joppevos for visibility on the ongoing work

@pankajkoti pankajkoti marked this pull request as ready for review February 4, 2025 07:29
@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. area:execution Related to the execution environment/mode, like Docker, Kubernetes, Local, VirtualEnv, etc dbt:run Primarily related to dbt run command or functionality profile:bigquery Related to BigQuery ProfileConfig labels Feb 4, 2025
@pankajkoti pankajkoti requested a review from tatiana February 4, 2025 07:33
Contributor

@pankajastro pankajastro left a comment

LGTM 👍

@pankajkoti pankajkoti changed the title Use dbt to generate the full SQL and support different materializations for BQ for ExecutionMode.AIRFLOW_ASYNC Create and run accurate SQL statements when using ExecutionMode.AIRFLOW_ASYNC Feb 4, 2025
@tatiana tatiana changed the title Create and run accurate SQL statements when using ExecutionMode.AIRFLOW_ASYNC Create and run accurate SQL statements when using ExecutionMode.AIRFLOW_ASYNC Feb 4, 2025
Collaborator

@tatiana tatiana left a comment

@pankajkoti Congratulations on the outstanding work in this PR and on your patience in addressing and fixing each bug that popped up during the development of this feature. I can't wait to see this feature used in production.

In addition to the feedback that I gave previously, there are two change requests:

Given the size of this PR and all the challenges already overcome, I do not want my design requests to block its merging. So, your PR is approved. However, please create a follow-up ticket so the interfaces can be simplified as soon as possible. Other tasks planned for the 1.9 release will depend on these interface changes, so please prioritise this ticket over any other work so we can wrap this up.

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Feb 4, 2025
Co-authored-by: Tatiana Al-Chueyr <tatiana.alchueyr@gmail.com>
@pankajkoti pankajkoti merged commit 24108f0 into main Feb 5, 2025
65 of 66 checks passed
@pankajkoti pankajkoti deleted the monkeypatch-bq-adapter branch February 5, 2025 06:45
Labels
area:execution Related to the execution environment/mode, like Docker, Kubernetes, Local, VirtualEnv, etc dbt:run Primarily related to dbt run command or functionality lgtm This PR has been approved by a maintainer priority:high High priority issues are blocking or critical issues without a workaround and large impact profile:bigquery Related to BigQuery ProfileConfig size:XXL This PR changes 1000+ lines, ignoring generated files.
Projects
None yet
3 participants