Tasks are running full parse each time #722

Closed
DanMawdsleyBA opened this issue Nov 28, 2023 · 10 comments · Fixed by #800 or #904
Labels
  • area:parsing - Related to parsing DAG/DBT improvement, issues, or fixes
  • area:performance - Related to performance, like memory usage, CPU usage, speed, etc
  • dbt:parse - Primarily related to dbt parse command or functionality
  • parsing:dbt_manifest - Issues, questions, or features related to dbt_manifest parsing
  • priority:medium - Medium priority issues are important issues that may have a workaround and medium impact

Comments

@DanMawdsleyBA
Contributor

I've noticed that each job runs a full parse every time. I upload the manifest as part of CI/CD, so I'm not sure why it isn't recognizing the existing manifest file. Fixing this could significantly improve job performance; currently the full parse step takes 7 seconds per task.


dosubot bot commented Nov 28, 2023

🤖

Hello @DanMawdsleyBA,

Welcome to the Cosmos repository! I'm Dosu, a bot designed to assist you with solving bugs, answering questions, and guiding you through your contributions. While we wait for a human maintainer, I'm here to help. Let's get your issue addressed!

The issue you're experiencing is due to the way Cosmos is currently loading your dbt project. In the load method of the DbtGraph class, Cosmos checks if a manifest file is available. If it is, it loads the project from the manifest. If not, it tries to load the project using the dbt ls command or a custom parser.

However, even when the manifest is available, Cosmos doesn't use it to avoid re-parsing the project on every job run. Instead, it parses the project again in order to apply filters (based on self.exclude and self.select), which can be time-consuming.

To improve job performance, you can modify the load method to skip parsing if the manifest is available and no filters are applied. Here's an example of how you can do this:

```
def load(
    self,
    method: LoadMode = LoadMode.AUTOMATIC,
    execution_mode: ExecutionMode = ExecutionMode.LOCAL,
) -> None:
    if method == LoadMode.AUTOMATIC:
        # Load from the pre-built manifest when it exists and no select/exclude
        # filters are set, so the project is not re-parsed on every run.
        if self.project.is_manifest_available() and not self.render_config.select and not self.render_config.exclude:
            self.load_from_dbt_manifest()
        else:
            if execution_mode == ExecutionMode.LOCAL and self.profile_config:
                try:
                    self.load_via_dbt_ls()
                except FileNotFoundError:
                    self.load_via_custom_parser()
            else:
                self.load_via_custom_parser()
    else:
        load_method[method]()
```

In this modified version of the load method, Cosmos will load the project from the manifest and skip parsing if the manifest is available and no filters are applied. If filters are applied, Cosmos will continue to parse the project as it currently does. This should improve job performance by avoiding unnecessary parsing.

You can find more details about the load method and its current implementation in the DbtGraph class.

Please let me know if you have any other questions or need further clarification.


@dosubot dosubot bot added area:parsing Related to parsing DAG/DBT improvement, issues, or fixes area:performance Related to performance, like memory usage, CPU usage, speed, etc dbt:parse Primarily related to dbt parse command or functionality parsing:dbt_manifest Issues, questions, or features related to dbt_manifest parsing priority:medium Medium priority issues are important issues that may have a workaround and medium impact labels Nov 28, 2023
@DanMawdsleyBA
Contributor Author

DanMawdsleyBA commented Nov 30, 2023

I think this is due to the symlink again, as it does not copy across the target path by default. We could add a step to check if the manifest file is present and then symlink just the manifest file. This would mean each task does not need to be fully re-parsed each time if the manifest file is already present. @tatiana any thoughts on the above?
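
Roughly what I have in mind, as a sketch only (the helper name and paths below are made up for illustration and are not actual Cosmos internals):

```
from pathlib import Path


def link_existing_manifest(source_project_dir: Path, tmp_project_dir: Path) -> None:
    """If a pre-built manifest exists in the source project, symlink it into the
    temporary project directory so dbt can reuse it instead of doing a full parse."""
    source_manifest = source_project_dir / "target" / "manifest.json"
    if source_manifest.exists():
        tmp_target_dir = tmp_project_dir / "target"
        tmp_target_dir.mkdir(parents=True, exist_ok=True)
        (tmp_target_dir / "manifest.json").symlink_to(source_manifest)
```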

@tatiana
Collaborator

tatiana commented Nov 30, 2023

@DanMawdsleyBA I agree that we should allow users to cache the manifest, since it would improve performance.

What happens if users are using LoadMode.DBT_LS and the manifest is outdated?

I believe we should symlink the manifest when users opt for LoadMode.DBT_MANIFEST and define manifest_path. Perhaps for the other load modes we shouldn't do this by default, but give users the option to do so.
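
For reference, the manifest-based flow I'm referring to looks roughly like this from the user's side (a sketch only; the paths are illustrative and argument names may vary slightly between Cosmos versions):

```
from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig, RenderConfig
from cosmos.constants import LoadMode
from cosmos.profiles import SnowflakeUserPasswordProfileMapping

# Sketch: render the DAG from a pre-built manifest.json instead of running `dbt ls`.
my_dag = DbtDag(
    dag_id="my_dbt_dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    project_config=ProjectConfig(
        dbt_project_path="/usr/local/airflow/dags/dbt/my_project",
        manifest_path="/usr/local/airflow/dags/dbt/my_project/target/manifest.json",
    ),
    profile_config=ProfileConfig(
        profile_name="my_profile",
        target_name="dev",
        profile_mapping=SnowflakeUserPasswordProfileMapping(conn_id="snowflake_default"),
    ),
    render_config=RenderConfig(load_method=LoadMode.DBT_MANIFEST),
)
```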

@tatiana tatiana modified the milestones: 1.3.1, 1.4.0 Jan 9, 2024
@lrcuwicRSG1

Hi all, I am new to Astronomer Cosmos and I have the same performance issues that @DanMawdsleyBA noticed. When will it be fixed? @tatiana, milestone 1.4.0 was updated for the release date.

In the meantime I want to fix it temporarily with the code snippet from dosubot. What is the easiest way to modify the graph.py file and carry it over to my local Docker installation based on astronomer-runtime and astro-cli? I am looking forward to your suggestions. Thanks, Chris

@dwreeves dwreeves mentioned this issue Feb 12, 2024
@tatiana
Collaborator

tatiana commented Mar 21, 2024

@DanMawdsleyBA @lrcuwicRSG1 Around two weeks ago we published an alpha release with #800 by @dwreeves, which seems to solve the problem. Could you try it out using 1.4.0a1? We're planning to release the stable 1.4 with this feature next week, and it would be great to confirm whether (and how much) it helped you.

@lrcuwicRSG1

@tatiana Thanks for your reply. The issue #800 was on my watchlist ;) I will try it soon. Cheers Chris

@DanMawdsleyBA
Contributor Author

DanMawdsleyBA commented Mar 21, 2024

I am using 1.4.0a1 and wasn't able to get it to work because dbt was saying the profile and env vars had changed.

```
[2024-03-14, 17:04:56 GMT] {{subprocess.py:94}} INFO - 17:04:56 Running with dbt=1.7.9
[2024-03-14, 17:04:57 GMT] {{subprocess.py:94}} INFO - 17:04:57 Registered adapter: snowflake=1.7.2
[2024-03-14, 17:04:57 GMT] {{subprocess.py:94}} INFO - 17:04:57 Unable to do partial parsing because config vars, config profile, or config target have changed
[2024-03-14, 17:04:57 GMT] {{subprocess.py:94}} INFO - 17:04:57 Unable to do partial parsing because profile has changed
[2024-03-14, 17:04:57 GMT] {{subprocess.py:94}} INFO - 17:04:57 Unable to do partial parsing because env vars used in profiles.yml have changed
[2024-03-14, 17:05:00 GMT] {{subprocess.py:94}} INFO - 17:05:00 Found 47 models, 12 snapshots, 0 sources, 0 exposures, 0 metrics, 554 macros, 0 groups, 0 semantic models
```

In my CI/CD pipeline I use a dummy target to generate the manifest. As Cosmos uses a different target and generates its own in env vars, I'm not sure if it will work? (I could be wrong; I also use env vars for other things as well.)

We may also need to write a guide on how to get this to work for others. For example, the dbt version in the CI/CD has to be exactly the same as the one in Airflow. Possibly the env vars as well? And the dbt target?

Haven't had time to do a proper investigation as we're upgrading our Airflow instance to 2.8.

It uses the Cosmos-managed profiles for Snowflake, which use env vars.

@dwreeves
Collaborator

I was able to get it to work but it's a lot of effort. Later today I'll post the full guide. I think I actually posted it elsewhere in a comment in this repo.

@tatiana
Collaborator

tatiana commented Mar 21, 2024

Thank you very much for the feedback, @DanMawdsleyBA !

To confirm: you generated both manifest.json and partial_parse.msgpack in the CI. The DAG rendered fine in Airflow with Cosmos 1.4.0a1, but when you attempted to run the tasks, you got these logs, meaning that the tasks ran without benefiting from partial parsing. Is this the case?

I was just reading dbt docs, and it seems when dbt attempts to use partial parsing, it needs to ensure that the environment variables used in profiles.yml remain consistent:
https://docs.getdbt.com/reference/parsing#known-limitations

If certain inputs change between runs, dbt will trigger a full re-parse. The results will be correct, but the full re-parse may be quite slow. Today those inputs are:

  • --vars
  • profiles.yml content (or env_var values used within)
  • dbt_project.yml content (or env_var values used within)
  • installed packages
  • dbt version
  • certain widely-used macros (for example, builtins, overrides, or generate_x_name for database/schema/alias)

If you're triggering CI job runs, the benefits of partial parsing are not applicable to new pull requests (PR) or new branches. However, they are applied on subsequent commits to the new PR or branch.

So, for users to fully benefit from this feature (partial parsing), we'd need to have the same profiles.yml in the CI, when parsing the DAG, and when running the tasks. At this moment, this would mean not using Cosmos profile mapping and necessarily specifying ProfileConfig(profiles_yml_filepath=).
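
Concretely, that would mean something like the following on the Airflow side (a minimal sketch; the path and names are illustrative and must match whatever the CI uses when running dbt parse):

```
from cosmos import ProfileConfig

# Sketch: reuse the exact profiles.yml that CI used for `dbt parse`, instead of a
# Cosmos ProfileMapping, so the profile content (including env_var() lookups) is
# identical when the tasks run and dbt can keep its partial parse artifact valid.
profile_config = ProfileConfig(
    profile_name="my_profile_name",   # must match the profile used in CI
    target_name="my_target_name",     # must match the target used in CI
    profiles_yml_filepath="/usr/local/airflow/dags/dbt/profile/profiles.yml",
)
```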

@dwreeves, should we improve the docs to make this requirement more explicit? Let me know your thoughts: #898

@dwreeves
Collaborator

I do agree that specifying a profiles_yml_filepath is the best way to go for most users. It should be what the docs suggest.

It's also possible to do something like this:

  1. Go to one of your DbtOperator task logs in Airflow, and copy the profile YAML you see in the logs.
    It will look something like this (example of a Snowflake profile):

    my_profile_name:
        outputs:
            airflow_target:
                account: myaccount
                database: mydatabase
                password: '{{ env_var(''COSMOS_CONN_SNOWFLAKE_PASSWORD'') }}'
                # ... etc. ...
        target: my_target_name
  2. Add this file to your Airflow project somewhere, e.g. dags/dbt/profile/profiles.yml.

  3. In your CICD, set the DBT_PROFILES_DIR env var to the folder you placed that file, e.g. dags/dbt/profile.

  4. Add --profile [my_profile_name] --target [my_target_name] to the dbt parse command you run in your CICD (replace [my_profile_name] and [my_target_name] with the appropriate names). You must add these flags and you cannot just rely on defaults. This is because dbt partial parsing does a checksum on a subset of the flags used in the CLI, and both --profile and --target are included in the checksum. Failing to add these flags means it won't work.

    dbt deps
    dbt parse --profile [my_profile_name] --target [my_target_name]
  5. You must set up your CI/CD to use the exact same connection you use in Cosmos, meaning that you need to set COSMOS_CONN_SNOWFLAKE_PASSWORD to the password you use in production. Again, similar deal as above: dbt does checksums when parsing, and every call to the env_var() macro has a side effect where the key-value pair of the env var eventually ends up in the hash.

  6. Any --vars and env vars you use must also be included the exact same way you use them in Airflow. You cannot rely on DagRun-specific or DAG-specific parameterized vars / env vars, otherwise partial parsing won't work. (The 1.5.0 or later updates to partial parsing support should address this limitation, though.)

tatiana added a commit that referenced this issue May 1, 2024
…ct (#904)

Improve the performance of running the benchmark DAG with 100 tasks by 34%
and the benchmark DAG with 10 tasks by 22% by persisting the dbt
partial parse artifact in Airflow nodes. The improvement can be even
higher for dbt projects that take more time to be parsed.

With the introduction of #800, Cosmos supports using dbt partial parsing
files. This feature has led to a substantial performance improvement,
particularly for large dbt projects, both during Airflow DAG parsing
(using LoadMode.DBT_LS) and also Airflow task execution (when using
`ExecutionMode.LOCAL` and `ExecutionMode.VIRTUALENV`).

There were two limitations with the initial support for partial parsing,
which the current PR aims to address:

1. DAGs using Cosmos `ProfileMapping` classes could not leverage this
feature. This is because the partial parsing relies on profile files not
changing, and by default, Cosmos would mock the dbt profile in several
parts of the code. The consequence is that users trying Cosmos 1.4.0a1
will see the following message:
```
13:33:16  Unable to do partial parsing because profile has changed
13:33:16  Unable to do partial parsing because env vars used in profiles.yml have changed
```

2. The user had to explicitly provide a `partial_parse.msgpack` file in
the original project folder for their Airflow deployment - and if, for
any reason, this became outdated, the user would not leverage the
partial parsing feature. Since Cosmos runs dbt tasks from within a
temporary directory, the partial parse would become stale for some users:
it would be updated in the temporary directory, but the next time the task
was run, Cosmos/dbt would not leverage the recently updated
`partial_parse.msgpack` file.

The current PR addresses these two issues respectively by:

1. Allowing users that want to leverage Cosmos `ProfileMapping` and
partial parsing to use `RenderConfig(enable_mock_profile=False)`

2. Introducing a Cosmos cache directory where we are persisting partial
parsing files. This feature is enabled by default, but users can opt out
by setting the Airflow configuration `[cosmos][enable_cache] = False`
(exporting the environment variable `AIRFLOW__COSMOS__ENABLE_CACHE=0`).
Users can also define the temporary directory used to store these files
using the `[cosmos][cache_dir]` Airflow configuration. By default,
Cosmos will create and use a folder `cosmos` inside the system's
temporary directory:
https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir .

This PR affects both DAG parsing and task execution. Although it does
not introduce an optimisation per se, it makes the partial parse feature
implemented #800 available to more users.

Closes: #722

I updated the documentation in the PR: #898

Some future optimization steps related to caching, to be addressed
in separate PRs:
i. Change how we create mocked profiles, to create the file itself in
the same way, referencing an environment variable with the same name -
and only changing the value of the environment variable (#924)
ii. Extend caching to the `profiles.yml` created by Cosmos in the newly
introduced `tmp/cosmos` without the need to recreate it every time
(#925).
iii. Extend caching to the Airflow DAG/Task group as a pickle file -
this approach is more generic and would work for every type of DAG
parsing and executor. (#926)
iv. Support persisting/fetching the cache from remote storage so we
don't have to replicate it for every Airflow scheduler and worker node.
(#927)
v. Cache the dbt deps lock file/avoid installing dbt deps every time. We
can leverage `package-lock.yml`, introduced in dbt 1.7
(https://docs.getdbt.com/reference/commands/deps#predictable-package-installs),
but ideally, we'd have a strategy to support older versions of dbt as
well. (#930)
vi. Support caching `partial_parse.msgpack` even when vars change:
https://medium.com/@sebastian.daum89/how-to-speed-up-single-dbt-invocations-when-using-changing-dbt-variables-b9d91ce3fb0d
vii. Support partial parsing in Docker and Kubernetes Cosmos executors
(#929)
viii. Centralise all the Airflow-based config into Cosmos settings.py &
create a dedicated docs page containing information about these (#928)

**How to validate this change**

Run the performance benchmark against this and the `main` branch,
checking the value of `/tmp/performance_results.txt`.

Example of commands run locally:

```
# Setup
AIRFLOW_HOME=`pwd` AIRFLOW_CONN_AIRFLOW_DB="postgres://postgres:postgres@0.0.0.0:5432/postgres" PYTHONPATH=`pwd` AIRFLOW_HOME=`pwd` AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT=20000 AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT=20000 hatch run tests.py3.11-2.7:test-performance-setup

# Run test for 100 dbt models per DAG:
MODEL_COUNT=100 AIRFLOW_HOME=`pwd` AIRFLOW_CONN_AIRFLOW_DB="postgres://postgres:postgres@0.0.0.0:5432/postgres" PYTHONPATH=`pwd` AIRFLOW_HOME=`pwd` AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT=20000 AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT=20000 hatch run tests.py3.11-2.7:test-performance
```

An example of output when running with 100 models on the main branch:
```
NUM_MODELS=100
TIME=114.18614888191223
MODELS_PER_SECOND=0.8757629623135543
DBT_VERSION=1.7.13
```

And with the current PR:
```
NUM_MODELS=100
TIME=75.17766404151917
MODELS_PER_SECOND=1.33018232576064
DBT_VERSION=1.7.13
```
tatiana added a commit that referenced this issue May 1, 2024
Improves docs to highlight the limitation of the partial parsing
approach (introduced in #800), following up on the feedback on #722 and the changes introduced in #904
arojasb3 pushed a commit to arojasb3/astronomer-cosmos that referenced this issue Jul 14, 2024
arojasb3 pushed a commit to arojasb3/astronomer-cosmos that referenced this issue Jul 14, 2024