Additions to docs regarding the DBT_MANIFEST load mode (#757)
## Description

I thought there were a few aspects of the execution modes and parsing methods docs that could use more clarification.

- The parsing methods docs mention that only the `LOCAL` execution mode is supported for `DBT_LS`, but the reverse was not true (i.e. the execution modes docs made no mention of parsing methods), so I added notes about that.
- The GCC docs suggest using the `VIRTUALENV` execution mode, but make no mention of the fact that the `DBT_LS` parsing method is not supported in this execution mode. In this case, users should be using the `DBT_MANIFEST` load mode, but that means the docs are incomplete, since they don't include a `manifest_path=...` in the `ProjectConfig`.
- Note that there are also discussions in the Airflow Slack about issues users have had parsing the `DbtDag` in GCC that are fixable by using a pre-compiled `manifest.json`, e.g.
https://apache-airflow.slack.com/archives/C059CC42E9W/p1696435273519979
    - Also see #520 for more discussion.
- Generally speaking, when using the `DBT_MANIFEST` load method, the pattern is to run `dbt deps && dbt compile` as part of your deployment and upload the full dbt project, including these artifacts. This deployment approach may be obvious to veteran users of Airflow and/or dbt, but it isn't obvious to everyone, so I think adding a couple of sentences to `parsing-methods.rst` is beneficial (see the sketch right after this list).
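
For reference, a minimal sketch of what that pattern looks like in a DAG file (assuming the standard Cosmos `ProjectConfig` / `RenderConfig` / `LoadMode` API; the paths are illustrative):

```python
from cosmos import DbtDag, ProjectConfig, RenderConfig
from cosmos.constants import LoadMode

# Sketch: parse the DAG from a pre-compiled manifest instead of running `dbt ls`.
# Assumes `dbt deps && dbt compile` produced target/manifest.json at deploy time.
my_cosmos_dag = DbtDag(
    project_config=ProjectConfig(
        dbt_project_path="/usr/local/airflow/dags/my_dbt_project",
        manifest_path="/usr/local/airflow/dags/my_dbt_project/target/manifest.json",
    ),
    render_config=RenderConfig(load_method=LoadMode.DBT_MANIFEST),
    # profile_config, execution_config, etc. omitted for brevity
)
```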

## Related Issue(s)

Not explicitly related, but #520 discusses some issues encountered using
the default parsing method. (Specifically, running `dbt deps` from a
blank slate tends to slow everything down a lot.)

Part of my motivation for adding to the docs is to better advertise and better document this alternative method of parsing the dbt DAG.

## Breaking Change?

n/a

## Checklist

n/a
dwreeves authored Jan 4, 2024
1 parent d787616 commit 049a8c1
Showing 3 changed files with 13 additions and 3 deletions.
4 changes: 2 additions & 2 deletions docs/configuration/parsing-methods.rst
@@ -16,7 +16,7 @@ There are benefits and drawbacks to each method:
- ``dbt_manifest``: You have to generate the manifest file on your own. When using the manifest, Cosmos gets a complete set of metadata about your models. However, Cosmos uses its own selecting & excluding logic to determine which models to run, which may not be as robust as dbt's.
- ``dbt_ls``: Cosmos will generate the manifest file for you. This method uses dbt's metadata AND dbt's selecting/excluding logic. This is the most robust method. However, this requires the dbt executable to be installed on your machine (either on the host directly or in a virtual environment).
- ``dbt_ls_file`` (new in 1.3): Path to a file containing the ``dbt ls`` output. To use this method, run ``dbt ls`` using ``--output json`` and store the output in a file. ``RenderConfig.select`` and ``RenderConfig.exclude`` will not work using this method.
- - ``custom``: Cosmos will parse your project and model files for you. This means that Cosmos will not have access to dbt's metadata. However, this method does not require the dbt executable to be installed on your machine.
+ - ``custom``: Cosmos will parse your project and model files. This means that Cosmos will not have access to dbt's metadata. However, this method does not require the dbt executable to be installed on your machine, and does not require the user to provide any dbt artifacts.

If you're using the ``local`` mode, you should use the ``dbt_ls`` method.
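
As a rough sketch of the above (assuming the ``RenderConfig`` / ``LoadMode`` names used elsewhere in these docs), the parsing method can also be pinned explicitly instead of relying on automatic selection:

.. code-block:: python

    from cosmos import DbtDag, ProjectConfig, RenderConfig
    from cosmos.constants import LoadMode

    my_cosmos_dag = DbtDag(
        project_config=ProjectConfig(
            dbt_project_path="/usr/local/airflow/dags/my_dbt_project",  # illustrative path
        ),
        # LoadMode.DBT_LS requires a dbt executable wherever the DAG is parsed
        # (i.e. on the scheduler); use LoadMode.DBT_MANIFEST or LoadMode.CUSTOM
        # if dbt is not available there.
        render_config=RenderConfig(load_method=LoadMode.DBT_LS),
        # ...
    )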

@@ -60,7 +60,7 @@ To use this:

.. note::

-    This only works for the ``local`` execution mode.
+    This only works if a dbt command / executable is available to the scheduler.

If you don't have a ``manifest.json`` file, Cosmos will attempt to generate one from your dbt project. It does this by running ``dbt ls`` and parsing the output.

6 changes: 5 additions & 1 deletion docs/getting_started/execution-modes.rst
@@ -72,7 +72,10 @@ In this case, users are responsible for declaring which version of ``dbt`` they

Similar to the ``local`` execution mode, Cosmos converts Airflow Connections into a way ``dbt`` understands them by creating a ``dbt`` profile file (``profiles.yml``).

- A drawback with this approach is that it is slower than ``local`` because it creates a new Python virtual environment for each Cosmos dbt task run.
+ Some drawbacks of this approach:
+
+ - It is slower than ``local`` because it creates a new Python virtual environment for each Cosmos dbt task run.
+ - If dbt is unavailable in the Airflow scheduler, the default ``LoadMode.DBT_LS`` will not work. In this scenario, users must use a `parsing method <parsing-methods.html>`_ that does not rely on dbt, such as ``LoadMode.DBT_MANIFEST``.

Example of how to use:

@@ -91,6 +94,7 @@ The user has better environment isolation than when using ``local`` or ``virtual
The other challenge with the ``docker`` approach is if the Airflow worker is already running in Docker, which sometimes can lead to challenges running `Docker in Docker <https://devops.stackexchange.com/questions/676/why-is-docker-in-docker-considered-bad>`__.

This approach can be significantly slower than ``virtualenv`` since it may have to build the ``Docker`` container, which is slower than creating a Virtualenv with ``dbt-core``.
+ If dbt is unavailable in the Airflow scheduler, the default ``LoadMode.DBT_LS`` will not work. In this scenario, users must use a `parsing method <parsing-methods.html>`_ that does not rely on dbt, such as ``LoadMode.DBT_MANIFEST``.
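
As an illustrative sketch (paths are placeholders, assuming the ``ProjectConfig`` / ``RenderConfig`` / ``ExecutionConfig`` API shown in these docs), the two settings can be combined so that DAG parsing never needs a dbt executable:

.. code-block:: python

    from cosmos import DbtDag, ExecutionConfig, ProjectConfig, RenderConfig
    from cosmos.constants import ExecutionMode, LoadMode

    my_cosmos_dag = DbtDag(
        project_config=ProjectConfig(
            dbt_project_path="/usr/local/airflow/dags/my_dbt_project",
            # Pre-compiled at deploy time, e.g. via `dbt deps && dbt compile`
            manifest_path="/usr/local/airflow/dags/my_dbt_project/target/manifest.json",
        ),
        render_config=RenderConfig(load_method=LoadMode.DBT_MANIFEST),
        execution_config=ExecutionConfig(execution_mode=ExecutionMode.DOCKER),
        # image and other operator arguments omitted; see the docker guide below
        # ...
    )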

Check the step-by-step guide on using the ``docker`` execution mode at :ref:`docker`.

6 changes: 6 additions & 0 deletions docs/getting_started/gcc.rst
@@ -22,6 +22,8 @@ Make a new folder, ``dbt``, inside your local ``dags`` folder. Then, copy/paste

Note: your dbt projects can go anywhere that Airflow can read. By default, Cosmos looks in the ``/usr/local/airflow/dags/dbt`` directory, but you can change this by setting the ``dbt_project_dir`` argument when you create your DAG instance.

+ For more accurate parsing of your dbt project, you should pre-compile your dbt project's ``manifest.json`` (include ``dbt deps && dbt compile`` as part of your deployment process).

For example, if you wanted to put your dbt project in the ``/usr/local/airflow/dags/my_dbt_project`` directory, you would do:

.. code-block:: python
@@ -31,11 +33,15 @@ For example, if you wanted to put your dbt project in the ``/usr/local/airflow/d
 my_cosmos_dag = DbtDag(
     project_config=ProjectConfig(
         dbt_project_path="/usr/local/airflow/dags/my_dbt_project",
+        manifest_path="/usr/local/airflow/dags/my_dbt_project/target/manifest.json",
     ),
     # ...,
 )
+ .. note::
+    You can also omit ``manifest_path=...`` from the ``ProjectConfig``. Without a ``manifest_path``, Cosmos falls back to its ``custom`` parsing method by default, which may be less accurate at parsing a dbt project than providing a ``manifest.json``.

Create your DAG
---------------

