Tasks are running full parse each time #722

Closed
DanMawdsleyBA opened this issue Nov 28, 2023 · 10 comments · Fixed by #800 or #904
Labels
  • area:parsing - Related to parsing DAG/DBT improvement, issues, or fixes
  • area:performance - Related to performance, like memory usage, CPU usage, speed, etc
  • dbt:parse - Primarily related to dbt parse command or functionality
  • parsing:dbt_manifest - Issues, questions, or features related to dbt_manifest parsing
  • priority:medium - Medium priority issues are important issues that may have a workaround and medium impact

Comments

@DanMawdsleyBA
Contributor

I've noticed that each job runs a full parse every time. I upload the manifest as part of CI/CD, so I'm not sure why it isn't recognizing the existing manifest file. Fixing this could significantly improve job performance; currently the full parse step takes 7 seconds per task.


dosubot bot commented Nov 28, 2023

🤖

Hello @DanMawdsleyBA,

Welcome to the Cosmos repository! I'm Dosu, a bot designed to assist you with solving bugs, answering questions, and guiding you through your contributions. While we wait for a human maintainer, I'm here to help. Let's get your issue addressed!

The issue you're experiencing is due to the way Cosmos is currently loading your dbt project. In the load method of the DbtGraph class, Cosmos checks if a manifest file is available. If it is, it loads the project from the manifest. If not, it tries to load the project using the dbt ls command or a custom parser.

However, even when the manifest is available, Cosmos doesn't use it to avoid re-parsing the project on every job run. Instead, it parses the project again in order to apply filters (based on self.exclude and self.select), which can be time-consuming.

To improve job performance, you can modify the load method to skip parsing if the manifest is available and no filters are applied. Here's an example of how you can do this:

```
def load(
    self,
    method: LoadMode = LoadMode.AUTOMATIC,
    execution_mode: ExecutionMode = ExecutionMode.LOCAL,
) -> None:
    if method == LoadMode.AUTOMATIC:
        # Load from the pre-built manifest when it exists and no select/exclude
        # filters are set, so the project is not re-parsed on every run.
        if self.project.is_manifest_available() and not self.render_config.select and not self.render_config.exclude:
            self.load_from_dbt_manifest()
        else:
            if execution_mode == ExecutionMode.LOCAL and self.profile_config:
                try:
                    self.load_via_dbt_ls()
                except FileNotFoundError:
                    self.load_via_custom_parser()
            else:
                self.load_via_custom_parser()
    else:
        load_method[method]()
```

In this modified version of the load method, Cosmos will load the project from the manifest and skip parsing if the manifest is available and no filters are applied. If filters are applied, Cosmos will continue to parse the project as it currently does. This should improve job performance by avoiding unnecessary parsing.

You can find more details about the load method and its current implementation in the DbtGraph class.

Please let me know if you have any other questions or need further clarification.


@dosubot dosubot bot added area:parsing Related to parsing DAG/DBT improvement, issues, or fixes area:performance Related to performance, like memory usage, CPU usage, speed, etc dbt:parse Primarily related to dbt parse command or functionality parsing:dbt_manifest Issues, questions, or features related to dbt_manifest parsing priority:medium Medium priority issues are important issues that may have a workaround and medium impact labels Nov 28, 2023
@DanMawdsleyBA
Contributor Author

DanMawdsleyBA commented Nov 30, 2023

I think this is due to the symlink again, as it does not copy across the target path by default. We could add a step to check if the manifest file is present and then symlink just the manifest file. This would mean each task does not need to be fully re-parsed each time if the manifest file is already present. @tatiana any thoughts on the above?
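
Roughly what I have in mind, as a sketch only (the helper name and paths below are made up for illustration and are not actual Cosmos internals):

```
from pathlib import Path


def link_existing_manifest(source_project_dir: Path, tmp_project_dir: Path) -> None:
    """If a pre-built manifest exists in the source project, symlink it into the
    temporary project directory so dbt can reuse it instead of doing a full parse."""
    source_manifest = source_project_dir / "target" / "manifest.json"
    if source_manifest.exists():
        tmp_target_dir = tmp_project_dir / "target"
        tmp_target_dir.mkdir(parents=True, exist_ok=True)
        (tmp_target_dir / "manifest.json").symlink_to(source_manifest)
```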

@tatiana
Collaborator

tatiana commented Nov 30, 2023

@DanMawdsleyBA I agree that we should allow users to cache the manifest, since it would improve performance.

What happens if users are using LoadMode.DBT_LS and the manifest is outdated?

I believe we should symlink the manifest when users opt for LoadMode.DBT_MANIFEST and define manifest_path. Perhaps for the other load modes we shouldn't do this by default, but give users the option to do so.
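
For reference, the manifest-based flow I'm referring to looks roughly like this from the user's side (a sketch only; the paths are illustrative and argument names may vary slightly between Cosmos versions):

```
from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig, RenderConfig
from cosmos.constants import LoadMode
from cosmos.profiles import SnowflakeUserPasswordProfileMapping

# Sketch: render the DAG from a pre-built manifest.json instead of running `dbt ls`.
my_dag = DbtDag(
    dag_id="my_dbt_dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    project_config=ProjectConfig(
        dbt_project_path="/usr/local/airflow/dags/dbt/my_project",
        manifest_path="/usr/local/airflow/dags/dbt/my_project/target/manifest.json",
    ),
    profile_config=ProfileConfig(
        profile_name="my_profile",
        target_name="dev",
        profile_mapping=SnowflakeUserPasswordProfileMapping(conn_id="snowflake_default"),
    ),
    render_config=RenderConfig(load_method=LoadMode.DBT_MANIFEST),
)
```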

@tatiana tatiana modified the milestones: 1.3.1, 1.4.0 Jan 9, 2024
@lrcuwicRSG1

Hi all, I am new to Astronomer Cosmos and I have the same performance issues that @DanMawdsleyBA noticed. When will it be fixed? @tatiana, milestone 1.4.0 was updated for the release date.

In the meantime I want to fix it temporarily with the code snippet from dosubot. What is the easiest way to modify the graph.py file and carry it over to my local Docker installation based on astronomer-runtime and astro-cli? I am looking forward to your suggestions. Thanks, Chris

@dwreeves dwreeves mentioned this issue Feb 12, 2024
@tatiana
Collaborator

tatiana commented Mar 21, 2024

@DanMawdsleyBA @lrcuwicRSG1 Around two weeks ago we published an alpha release with #800 by @dwreeves, which seems to solve the problem. Could you try it out using 1.4.0a1? We're planning to release the stable 1.4 with this feature next week, and it would be great to confirm whether (and how much) it helped you.

@lrcuwicRSG1

@tatiana Thanks for your reply. The issue #800 was on my watchlist ;) I will try it soon. Cheers Chris

@DanMawdsleyBA
Contributor Author

DanMawdsleyBA commented Mar 21, 2024

I am using 1.4.0a1 and wasn't able to get it to work because dbt was saying the profile and env vars had changed.

```
[2024-03-14, 17:04:56 GMT] {{subprocess.py:94}} INFO - 17:04:56 Running with dbt=1.7.9
[2024-03-14, 17:04:57 GMT] {{subprocess.py:94}} INFO - 17:04:57 Registered adapter: snowflake=1.7.2
[2024-03-14, 17:04:57 GMT] {{subprocess.py:94}} INFO - 17:04:57 Unable to do partial parsing because config vars, config profile, or config target have changed
[2024-03-14, 17:04:57 GMT] {{subprocess.py:94}} INFO - 17:04:57 Unable to do partial parsing because profile has changed
[2024-03-14, 17:04:57 GMT] {{subprocess.py:94}} INFO - 17:04:57 Unable to do partial parsing because env vars used in profiles.yml have changed
[2024-03-14, 17:05:00 GMT] {{subprocess.py:94}} INFO - 17:05:00 Found 47 models, 12 snapshots, 0 sources, 0 exposures, 0 metrics, 554 macros, 0 groups, 0 semantic models
```

In my CI/CD pipeline I use a dummy target to generate the manifest. As Cosmos uses a different target and generates its own in env vars, I'm not sure if it will work? (I could be wrong; I also use env vars for other things as well.)

We may also need to write a guide on how to get this to work for others. For example, the dbt version in the CI/CD has to be exactly the same as the one in Airflow. Possibly the env vars as well? And the dbt target?

Haven't had time to do a proper investigation as we're upgrading our Airflow instance to 2.8.

It uses the Cosmos-managed profiles for Snowflake, which use env vars.

@dwreeves
Collaborator

I was able to get it to work but it's a lot of effort. Later today I'll post the full guide. I think I actually posted it elsewhere in a comment in this repo.

@tatiana
Collaborator

tatiana commented Mar 21, 2024

Thank you very much for the feedback, @DanMawdsleyBA !

To confirm: you generated both manifest.json and partial_parse.msgpack in the CI. The DAG rendered fine in Airflow with Cosmos 1.4.0a1, but when you attempted to run the tasks, you got these logs, meaning that the tasks ran without benefiting from partial parsing. Is this the case?

I was just reading dbt docs, and it seems when dbt attempts to use partial parsing, it needs to ensure that the environment variables used in profiles.yml remain consistent:
https://docs.getdbt.com/reference/parsing#known-limitations

If certain inputs change between runs, dbt will trigger a full re-parse. The results will be correct, but the full re-parse may be quite slow. Today those inputs are:

  • --vars
  • profiles.yml content (or env_var values used within)
  • dbt_project.yml content (or env_var values used within)
  • installed packages
  • dbt version
  • certain widely-used macros (for example, builtins, overrides, or generate_x_name for database/schema/alias)

If you're triggering CI job runs, the benefits of partial parsing are not applicable to new pull requests (PR) or new branches. However, they are applied on subsequent commits to the new PR or branch.

So, for users to fully benefit from this feature (partial parsing), we'd need to have the same profiles.yml in the CI, when parsing the DAG, and when running the tasks. At this moment, this would mean not using Cosmos profile mapping and necessarily specifying ProfileConfig(profiles_yml_filepath=).
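
Concretely, that would mean something like the following on the Airflow side (a minimal sketch; the path and names are illustrative and must match whatever the CI uses when running dbt parse):

```
from cosmos import ProfileConfig

# Sketch: reuse the exact profiles.yml that CI used for `dbt parse`, instead of a
# Cosmos ProfileMapping, so the profile content (including env_var() lookups) is
# identical when the tasks run and dbt can keep its partial parse artifact valid.
profile_config = ProfileConfig(
    profile_name="my_profile_name",   # must match the profile used in CI
    target_name="my_target_name",     # must match the target used in CI
    profiles_yml_filepath="/usr/local/airflow/dags/dbt/profile/profiles.yml",
)
```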

@dwreeves, should we improve the docs to make this requirement more explicit? Let me know your thoughts: #898

@dwreeves
Collaborator

I do agree that specifying a profiles_yml_filepath is the best way to go for most users. It should be what the docs suggest.

It's also possible to do something like this:

  1. Go to one of your DbtOperator task logs in Airflow, and copy the profile YAML you see in the logs.
    It will look something like this (example of a Snowflake profile):

    my_profile_name:
        outputs:
            airflow_target:
                account: myaccount
                database: mydatabase
                password: '{{ env_var(''COSMOS_CONN_SNOWFLAKE_PASSWORD'') }}'
                # ... etc. ...
        target: my_target_name
  2. Add this file to your Airflow project somewhere, e.g. dags/dbt/profile/profiles.yml.

  3. In your CICD, set the DBT_PROFILES_DIR env var to the folder you placed that file, e.g. dags/dbt/profile.

  4. Add --profile [my_profile_name] --target [my_target_name] to the dbt parse command you run in your CICD (replace [my_profile_name] and [my_target_name] with the appropriate names). You must add these flags and you cannot just rely on defaults. This is because dbt partial parsing does a checksum on a subset of the flags used in the CLI, and both --profile and --target are included in the checksum. Failing to add these flags means it won't work.

    dbt deps
    dbt parse --profile [my_profile_name] --target [my_target_name]
  5. You must set up your CI/CD to use the exact same connection you use in Cosmos, meaning that you need to set COSMOS_CONN_SNOWFLAKE_PASSWORD to the password you use in production. Again, similar deal as above: dbt does checksums when parsing, and every call to the env_var() macro has a side effect where the key-value pair of the env var eventually ends up in the hash.

  6. Any --vars and env vars you use must also be included the exact same way you use them in Airflow. You cannot rely on DagRun-specific or DAG-specific parameterized vars / env vars, otherwise partial parsing won't work. (The 1.5.0 or later updates to partial parsing support should address this limitation, though.)

tatiana added a commit that referenced this issue May 1, 2024
…ct (#904)

Improve the performance of running the benchmark DAG with 100 tasks by 34%
and the benchmark DAG with 10 tasks by 22% by persisting the dbt
partial parse artifact in Airflow nodes. The improvement can be even
higher for dbt projects that take more time to be parsed.

With the introduction of #800, Cosmos supports using dbt partial parsing
files. This feature has led to a substantial performance improvement,
particularly for large dbt projects, both during Airflow DAG parsing
(using LoadMode.DBT_LS) and also Airflow task execution (when using
`ExecutionMode.LOCAL` and `ExecutionMode.VIRTUALENV`).

There were two limitations with the initial support for partial parsing,
which the current PR aims to address:

1. DAGs using Cosmos `ProfileMapping` classes could not leverage this
feature. This is because the partial parsing relies on profile files not
changing, and by default, Cosmos would mock the dbt profile in several
parts of the code. The consequence is that users trying Cosmos 1.4.0a1
will see the following message:
```
13:33:16  Unable to do partial parsing because profile has changed
13:33:16  Unable to do partial parsing because env vars used in profiles.yml have changed
```

2. The user had to explicitly provide a `partial_parse.msgpack` file in
the original project folder for their Airflow deployment - and if, for
any reason, this became outdated, the user would not leverage the
partial parsing feature. Since Cosmos runs dbt tasks from within a
temporary directory, the partial parse would become stale for some users:
it would be updated in the temporary directory, but the next time the task
was run, Cosmos/dbt would not leverage the recently updated
`partial_parse.msgpack` file.

The current PR addresses these two issues respectively by:

1. Allowing users that want to leverage Cosmos `ProfileMapping` and
partial parsing to use `RenderConfig(enable_mock_profile=False)`

2. Introducing a Cosmos cache directory where we are persisting partial
parsing files. This feature is enabled by default, but users can opt out
by setting the Airflow configuration `[cosmos][enable_cache] = False`
(exporting the environment variable `AIRFLOW__COSMOS__ENABLE_CACHE=0`).
Users can also define the temporary directory used to store these files
using the `[cosmos][cache_dir]` Airflow configuration. By default,
Cosmos will create and use a folder `cosmos` inside the system's
temporary directory:
https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir .

This PR affects both DAG parsing and task execution. Although it does
not introduce an optimisation per se, it makes the partial parse feature
implemented #800 available to more users.

Closes: #722

I updated the documentation in the PR: #898

Some future optimization steps related to caching, to be addressed
in separate PRs:
i. Change how we create mocked profiles, to create the file itself in
the same way, referencing an environment variable with the same name -
and only changing the value of the environment variable (#924)
ii. Extend caching to the `profiles.yml` created by Cosmos in the newly
introduced `tmp/cosmos` without the need to recreate it every time
(#925).
iii. Extend caching to the Airflow DAG/Task group as a pickle file -
this approach is more generic and would work for every type of DAG
parsing and executor. (#926)
iv. Support persisting/fetching the cache from remote storage so we
don't have to replicate it for every Airflow scheduler and worker node.
(#927)
v. Cache the dbt deps lock file/avoid installing dbt deps every time. We
can leverage `package-lock.yml`, introduced in dbt 1.7
(https://docs.getdbt.com/reference/commands/deps#predictable-package-installs),
but ideally, we'd have a strategy to support older versions of dbt as
well. (#930)
vi. Support caching `partial_parse.msgpack` even when vars change:
https://medium.com/@sebastian.daum89/how-to-speed-up-single-dbt-invocations-when-using-changing-dbt-variables-b9d91ce3fb0d
vii. Support partial parsing in Docker and Kubernetes Cosmos executors
(#929)
viii. Centralise all the Airflow-based config into Cosmos settings.py &
create a dedicated docs page containing information about these (#928)

**How to validate this change**

Run the performance benchmark against this and the `main` branch,
checking the value of `/tmp/performance_results.txt`.

Example of commands run locally:

```
# Setup
AIRFLOW_HOME=`pwd` AIRFLOW_CONN_AIRFLOW_DB="postgres://postgres:postgres@0.0.0.0:5432/postgres" PYTHONPATH=`pwd` AIRFLOW_HOME=`pwd` AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT=20000 AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT=20000 hatch run tests.py3.11-2.7:test-performance-setup

# Run test for 100 dbt models per DAG:
MODEL_COUNT=100 AIRFLOW_HOME=`pwd` AIRFLOW_CONN_AIRFLOW_DB="postgres://postgres:postgres@0.0.0.0:5432/postgres" PYTHONPATH=`pwd` AIRFLOW_HOME=`pwd` AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT=20000 AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT=20000 hatch run tests.py3.11-2.7:test-performance
```

An example of output when running with 100 models on the main branch:
```
NUM_MODELS=100
TIME=114.18614888191223
MODELS_PER_SECOND=0.8757629623135543
DBT_VERSION=1.7.13
```

And with the current PR:
```
NUM_MODELS=100
TIME=75.17766404151917
MODELS_PER_SECOND=1.33018232576064
DBT_VERSION=1.7.13
```
tatiana added a commit that referenced this issue May 1, 2024
Improves docs to highlight the limitation of the partial parsing
approach (introduced in #800), following up on the feedback on #722 and the changes introduced in #904
arojasb3 pushed a commit to arojasb3/astronomer-cosmos that referenced this issue Jul 14, 2024
arojasb3 pushed a commit to arojasb3/astronomer-cosmos that referenced this issue Jul 14, 2024