Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor LoadMethod.LOCAL to use symlinks instead of copying directory #660

Merged
merged 22 commits into from
Nov 14, 2023

Conversation

jbandoro
Copy link
Collaborator

@jbandoro jbandoro commented Nov 9, 2023

Description

This PR refactors the create_symlinks function that was previously used in load via dbt ls so that it can be used in DbtLocalBaseOperator.run_command instead of copying the entire directory.

Related Issue(s)

closes #614

Breaking Change?

None

Checklist

  • I have made corresponding changes to the documentation (if required)
  • I have added tests that prove my fix is effective or that my feature works

DanMawdsleyBA and others added 13 commits November 8, 2023 13:07
This allows you to fully refresh a model from the console.
Full-refresh/backfill is a common task. Using Airflow parameters makes
this easy. Without this, you'd have to trigger an entire deployment. In our
setup, company analysts manage their models without modifying
the DAG code. This empowers such users.

Example of usage:
```python
with DAG(
        dag_id="jaffle",
        params={"full_refresh": Param(default=False, type="boolean")},
        render_template_as_native_obj=True
):
    task = DbtTaskGroup(
        operator_args={"full_refresh": "{{ params.get('full_refresh') }}", "install_deps": True},
    )

```

Closes: astronomer#151
@jbandoro jbandoro requested a review from a team as a code owner November 9, 2023 00:22
@jbandoro jbandoro requested a review from a team November 9, 2023 00:22
Copy link

netlify bot commented Nov 9, 2023

👷 Deploy Preview for amazing-pothos-a3bca0 processing.

Name Link
🔨 Latest commit a9e69fe
🔍 Latest deploy log https://app.netlify.com/sites/amazing-pothos-a3bca0/deploys/65537e388e1214000833aa8c

Copy link

codecov bot commented Nov 9, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (0f16d15) 92.75% compared to head (a9e69fe) 92.76%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #660   +/-   ##
=======================================
  Coverage   92.75%   92.76%           
=======================================
  Files          54       55    +1     
  Lines        2235     2238    +3     
=======================================
+ Hits         2073     2076    +3     
  Misses        162      162           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jbandoro
Copy link
Collaborator Author

jbandoro commented Nov 9, 2023

It looks like symlinking this file: https://github.com/astronomer/astronomer-cosmos/blob/main/dev/dags/dbt/simple/data/imdb.db is locking the database causing the Sqlite integration test to fail.

I think a workaround might be to ignore .db files. This didn't work it resulted in an error of:

INFO     cosmos.hooks.subprocess:subprocess.py:94 02:05:05  Database Error in model movies_ratings_simplified (models/movies_ratings_simplified.sql)
INFO     cosmos.hooks.subprocess:subprocess.py:94 02:05:05    no such table: imdb.movies_ratings
INFO     cosmos.hooks.subprocess:subprocess.py:94 02:05:05    compiled Code at target/run/simple/models/movies_ratings_simplified.sql

@tatiana
Copy link
Collaborator

tatiana commented Nov 9, 2023

It looks like symlinking this file: https://github.com/astronomer/astronomer-cosmos/blob/main/dev/dags/dbt/simple/data/imdb.db is locking the database causing the Sqlite integration test to fail.

I think a workaround might be to ignore .db files. This didn't work it resulted in an error of:

INFO     cosmos.hooks.subprocess:subprocess.py:94 02:05:05  Database Error in model movies_ratings_simplified (models/movies_ratings_simplified.sql)
INFO     cosmos.hooks.subprocess:subprocess.py:94 02:05:05    no such table: imdb.movies_ratings
INFO     cosmos.hooks.subprocess:subprocess.py:94 02:05:05    compiled Code at target/run/simple/models/movies_ratings_simplified.sql

@jbandoro Perhaps the issue is with our test/code structure. I believe it is unlike someone who would have the database based into the dbt project folder in a real use case, what do you think? We could move it to a folder outside of the DBT project.

Copy link
Collaborator

@tatiana tatiana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbandoro this is looking very good, one comment regarding the issue and only one comment regarding the code. Great job!

cosmos/utils.py Outdated Show resolved Hide resolved
@tatiana tatiana added area:performance Related to performance, like memory usage, CPU usage, speed, etc area:execution Related to the execution environment/mode, like Docker, Kubernetes, Local, VirtualEnv, etc priority:high High priority issues are blocking or critical issues without a workaround and large impact labels Nov 9, 2023
@jbandoro
Copy link
Collaborator Author

jbandoro commented Nov 11, 2023

We're really close to getting this merged, @jbandoro ! Please, let us know if we can support you in any sense

Thanks @tatiana! I've been struggling to get the sqlite integration test to pass with this refactor. I tried moving the imdb.db file outside of the simple dbt project, and updating the references here where it sets the env var and here in the profiles.yml file but they keep failing locally with no such table: imdb.movies_ratings which suggests it can open the db file but the table does not exist.

I tried to move the sqlite db file on the main branch to see if it was related still to the symbolic linking and it results in the same error for me locally 😕.

If you have time to investigate I would be 🙏 otherwise I can look into it further. See comment below I figured out why it wasn't working.

@jbandoro
Copy link
Collaborator Author

jbandoro commented Nov 13, 2023

@tatiana I was able to fix the sqlite integration test after some troubleshooting. The issue was that the current copying of directories was actually fixing a problem with the simple dbt project. Not using cosmos, running a model in the simple dbt project will fail with errors like:

❯ dbt run --models movies_ratings_simplified --profiles-dir /Users/justin.bandoro/astronomer-cosmos/dev/dags/dbt/simple --profile simple --target dev
20:16:24  Running with dbt=1.4.0
20:16:24  Unable to do partial parsing because profile has changed
20:16:24  Unable to do partial parsing because env vars used in profiles.yml have changed
20:16:24  Found 2 models, 0 tests, 0 snapshots, 0 analyses, 290 macros, 0 operations, 0 seed files, 1 source, 1 exposure, 0 metrics
20:16:24  
20:16:24  Concurrency: 1 threads (target='dev')
20:16:24  
20:16:24  1 of 1 START sql table model main.movies_ratings_simplified .................... [RUN]
20:16:24  1 of 1 ERROR creating sql table model main.movies_ratings_simplified ........... [ERROR in 0.02s]
20:16:24  
20:16:24  Finished running 1 table model in 0 hours 0 minutes and 0.06 seconds (0.06s).
20:16:24  
20:16:24  Completed with 1 error and 0 warnings:
20:16:24  
20:16:24  Database Error in model movies_ratings_simplified (models/movies_ratings_simplified.sql)
20:16:24    no such table: imdb.movies_ratings
20:16:24    compiled Code at target/run/simple/models/movies_ratings_simplified.sql
20:16:24  
20:16:24  Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1

This is because in the sources.yml file here it's referencing imdb as the source when it should instead be main which is set in the profiles.yml file. sqlite-dbt setup requires a main schema to be set in schemas_and_paths.

This works currently because cosmos is copying the entire dbt directory, what this did was have 2 different directories:

      schema: 'main'
      schemas_and_paths:
        main: "{{ env_var('DBT_SQLITE_PATH', '.') }}/data/imdb.db"
      schema_directory: 'data'

The main schema which points to the original imdb.db location which is used as the main schema and the schema_directory which is the relative data directory in the copied directory that is used as the source where imdb schema is found. The directory copying along with the profiles setup caused an imdb schema to be created, when there should have ever only been 1 schema of main.

The update here is to change the sources to reference main instead of imdb.

Copy link
Collaborator

@tatiana tatiana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Brilliant work, @jbandoro !

Thank you so much for digging into the SQLite issue and solving it. we'll be releasing this as part of 1.3. As soon as the tests pass, I'll merge this change and we'll have a new 1.3 alpha release.

@tatiana tatiana merged commit 5d23758 into astronomer:main Nov 14, 2023
40 checks passed
@tatiana tatiana added this to the 1.2.4 milestone Nov 14, 2023
tatiana pushed a commit that referenced this pull request Nov 15, 2023
#660)

This PR refactors the `create_symlinks` function that was previously
used in load via dbt ls so that it can be used in
`DbtLocalBaseOperator.run_command` instead of copying the entire
directory.

Closes: #614
(cherry picked from commit 5d23758)
tatiana added a commit that referenced this pull request Nov 15, 2023
Bug fixes

* Store `compiled_sql` even when task fails by @agreenburg in #671
* Refactor `LoadMethod.LOCAL` to use symlinks instead of copying directory by @jbandoro in #660
* Fix 'Unable to find the dbt executable: dbt' error by @tatiana in #666
* Fix installing deps when using `profile_mapping` & `ExecutionMode.LOCAL` by @joppevos in #659

Others

* Docs fix: add execution config to MWAA code example by @ugmuka in #674
@tatiana tatiana mentioned this pull request Nov 15, 2023
tatiana added a commit that referenced this pull request Nov 15, 2023
Bug fixes

* Store `compiled_sql` even when task fails by @agreenburg in #671
* Refactor `LoadMethod.LOCAL` to use symlinks instead of copying directory by @jbandoro in #660
* Fix 'Unable to find the dbt executable: dbt' error by @tatiana in #666
* Fix installing deps when using `profile_mapping` & `ExecutionMode.LOCAL` by @joppevos in #659

Others

* Docs fix: add execution config to MWAA code example by @ugmuka in #674

(cherry picked from commit aa9b7bb)
@tatiana tatiana mentioned this pull request Nov 15, 2023
tatiana pushed a commit that referenced this pull request Nov 15, 2023
#660)

This PR refactors the `create_symlinks` function that was previously
used in load via dbt ls so that it can be used in
`DbtLocalBaseOperator.run_command` instead of copying the entire
directory.

Closes: #614
(cherry picked from commit 5d23758)
tatiana added a commit that referenced this pull request Nov 15, 2023
Bug fixes

* Store `compiled_sql` even when task fails by @agreenburg in #671
* Refactor `LoadMethod.LOCAL` to use symlinks instead of copying directory by @jbandoro in #660
* Fix 'Unable to find the dbt executable: dbt' error by @tatiana in #666
* Fix installing deps when using `profile_mapping` & `ExecutionMode.LOCAL` by @joppevos in #659

Others

* Docs fix: add execution config to MWAA code example by @ugmuka in #674
@tatiana
Copy link
Collaborator

tatiana commented Nov 16, 2023

Thanks for the contribution, @jbandoro , we released it as part of Cosmos 1.2.4 yesterday:
https://github.com/astronomer/astronomer-cosmos/releases/tag/astronomer-cosmos-v1.2.4

tatiana added a commit that referenced this pull request Nov 16, 2023
**Bug fixes**

* Store `compiled_sql` even when task fails by @agreenburg in #671
* Refactor `LoadMethod.LOCAL` to use symlinks instead of copying
directory by @jbandoro in #660
* Fix 'Unable to find the dbt executable: dbt' error by @tatiana in #666
* Fix installing deps when using `profile_mapping` &
`ExecutionMode.LOCAL` by @joppevos in #659

**Others**

* Docs fix: add execution config to MWAA code example by @ugmuka in #674

(cherry picked from commit aa9b7bb)
tatiana added a commit that referenced this pull request Dec 5, 2023
[Justin Bandoro](https://www.linkedin.com/in/justin-bandoro-592b14a7/)
(@jbandoro) is a Data Engineer at Kevala Inc. He's based in San
Francisco (USA) and has been an early adopter of Cosmos, using it
regularly at his company.

Not only has he been using Cosmos since the early stages, but he has
consistently improved Cosmos since January 2023:
![Screenshot 2023-12-04 at 16 28
29](https://github.com/astronomer/astronomer-cosmos/assets/272048/43197938-d1ab-431f-b101-b6026e5cd3ab)

Some of his contributions include new features, code quality,
documentation and overall improvements. Some examples:
* Speed up integration tests in 67% #732 
* Prevent override of dbt profile fields #702
* Add support for env vars in `RenderConfig` in #690 
* Use symbolic links to run local tasks, avoiding to copy potentially
huge dbt project folders in #660
* Improve documentation in #638
* Automated and improved the code complexity checks in #629
* Added `DbtDocsGCSOperator` in #616 
* Added support for Python 3.7 in #88 and #214

Additionally, he has been interacting with users in the #airflow-dbt
Slack channel in a very collaborative and supportive way.

We want to promote him as a Cosmos committer and maintainer for all
these, recognising his constant efforts and achievements towards our
community. Thank you very much, @jbandoro !
arojasb3 pushed a commit to arojasb3/astronomer-cosmos that referenced this pull request Jul 14, 2024
astronomer#660)

This PR refactors the `create_symlinks` function that was previously
used in load via dbt ls so that it can be used in
`DbtLocalBaseOperator.run_command` instead of copying the entire
directory.

Closes: astronomer#614
arojasb3 pushed a commit to arojasb3/astronomer-cosmos that referenced this pull request Jul 14, 2024
**Bug fixes**

* Store `compiled_sql` even when task fails by @agreenburg in astronomer#671
* Refactor `LoadMethod.LOCAL` to use symlinks instead of copying
directory by @jbandoro in astronomer#660
* Fix 'Unable to find the dbt executable: dbt' error by @tatiana in astronomer#666
* Fix installing deps when using `profile_mapping` &
`ExecutionMode.LOCAL` by @joppevos in astronomer#659

**Others**

* Docs fix: add execution config to MWAA code example by @ugmuka in astronomer#674

(cherry picked from commit aa9b7bb)
arojasb3 pushed a commit to arojasb3/astronomer-cosmos that referenced this pull request Jul 14, 2024
[Justin Bandoro](https://www.linkedin.com/in/justin-bandoro-592b14a7/)
(@jbandoro) is a Data Engineer at Kevala Inc. He's based in San
Francisco (USA) and has been an early adopter of Cosmos, using it
regularly at his company.

Not only has he been using Cosmos since the early stages, but he has
consistently improved Cosmos since January 2023:
![Screenshot 2023-12-04 at 16 28
29](https://github.com/astronomer/astronomer-cosmos/assets/272048/43197938-d1ab-431f-b101-b6026e5cd3ab)

Some of his contributions include new features, code quality,
documentation and overall improvements. Some examples:
* Speed up integration tests in 67% astronomer#732 
* Prevent override of dbt profile fields astronomer#702
* Add support for env vars in `RenderConfig` in astronomer#690 
* Use symbolic links to run local tasks, avoiding to copy potentially
huge dbt project folders in astronomer#660
* Improve documentation in astronomer#638
* Automated and improved the code complexity checks in astronomer#629
* Added `DbtDocsGCSOperator` in astronomer#616 
* Added support for Python 3.7 in astronomer#88 and astronomer#214

Additionally, he has been interacting with users in the #airflow-dbt
Slack channel in a very collaborative and supportive way.

We want to promote him as a Cosmos committer and maintainer for all
these, recognising his constant efforts and achievements towards our
community. Thank you very much, @jbandoro !
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:execution Related to the execution environment/mode, like Docker, Kubernetes, Local, VirtualEnv, etc area:performance Related to performance, like memory usage, CPU usage, speed, etc priority:high High priority issues are blocking or critical issues without a workaround and large impact
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor LoadMethod.LOCAL to use symlinks as opposed to copying the dbt source dir
4 participants