Update docs for Databricks Workflows (#187)
### Description
Update documentation with instructions for public preview support for dbt task type in Databricks Workflows.

Co-authored-by: Bilal Aslam <bilal.aslam@databricks.com>
bilalaslamseattle authored Sep 23, 2022
1 parent e7b1620 commit b06bdd2
Showing 2 changed files with 12 additions and 11 deletions.
23 changes: 12 additions & 11 deletions docs/databricks-workflows.md
@@ -2,19 +2,18 @@

Databricks Workflows is a highly reliable, managed orchestrator that lets you author and schedule DAGs of notebooks, Python scripts, and dbt projects as production jobs.

> The capability of running dbt in a Job is currently in private preview. You must be enrolled in the private preview to follow the steps in this guide. Features, capabilities and pricing may change at any time.
> The capability of running dbt in a Job is currently in public preview. You must be enrolled in the public preview to follow the steps in this guide. Features, capabilities and pricing may change at any time.
In this guide, you will learn how to update an existing dbt project to run as a job, retrieving dbt run artifacts using an API and debug common issues.
In this guide, you will learn how to update an existing dbt project to run as a job, retrieve dbt run artifacts using the Jobs API, and debug common issues.

# Overview
When you run a dbt project as a Databricks Job, the dbt Python process as well as the SQL generated by dbt run on the same Automated Cluster.

If you want to run the SQL on, say, a Databricks SQL endpoint or even another cloud data warehouse, you can customize the checked-in `profiles.yml` file appropriately (see below).
When you run a dbt project as a Databricks Job, the dbt CLI runs on a single-node Automated Cluster. The SQL generated by dbt runs on a serverless SQL warehouse.

# Prerequisites
- An existing dbt project version controlled in git
- Access to a Databricks workspace
- Ability to launch job clusters (using a policy or cluster create permissions) or access to an existing interactive cluster with `dbt-core` and `dbt-databricks` libraries installed or `CAN_MANAGE` permissions to install the `dbt-core` and `dbt-databricks` as cluster libraries. We recommend using DBR 10.4 or later versions for better SQL compatibility.
- Ability to launch job clusters (using a policy or cluster create permissions), access to an existing interactive cluster with the `dbt-core` and `dbt-databricks` libraries installed, or `CAN_MANAGE` permissions to install `dbt-core` and `dbt-databricks` as cluster libraries.
- Access to serverless SQL warehouses. See [documentation](https://docs.databricks.com/serverless-compute/index.html) to learn more about this feature and regional availability.
- [Files in Repos](https://docs.databricks.com/repos/index.html#enable-support-for-arbitrary-files-in-databricks-repos) must be enabled; it is supported only on Databricks Runtime (DBR) 8.4+ or DBR 11+, depending on the configuration. Make sure the cluster has an appropriate DBR version.
- Install and configure the [Databricks CLI](https://docs.databricks.com/dev-tools/cli/index.html)
- Install [jq](https://stedolan.github.io/jq/download/), a popular open source tool for parsing JSON from the command line (a minimal setup sketch for these last two prerequisites follows this list)
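A minimal sketch of those last two setup steps, assuming a Python environment and a macOS or Linux shell; adjust the install commands for your platform:

```bash
# Install the (legacy) Databricks CLI into a Python environment.
pip install databricks-cli

# Configure it against your workspace; this prompts for the workspace URL
# and a personal access token.
databricks configure --token

# Install jq -- use the command that matches your platform.
brew install jq            # macOS
# sudo apt-get install jq  # Debian/Ubuntu

# Quick sanity checks.
databricks workspace ls /
jq --version
```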
@@ -38,9 +37,10 @@
The dbt task only supports retrieving dbt projects from Git. Please follow [the do

![dbt-task-type](/docs/img/dbt-task-type.png)

10. By default, Databricks installs a recent version of `dbt-databricks` from PyPi, which will also install `dbt-spark` as well as `dbt-core`. You can customize this version if you wish.
11. You can customize the Automated Cluster if you wish by clicking _Edit_ in the Cluster dropdown.
12. Click _Save_
10. Under _SQL warehouse_, choose the serverless SQL warehouse where SQL generated by dbt will run. You can optionally choose a custom catalog and schema where tables and views will be created.
11. By default, Databricks installs a recent version of `dbt-databricks` from PyPI, which also installs `dbt-spark` and `dbt-core`. You can customize this version if you wish.
12. If you wish, you can customize the Automated Cluster by clicking _Edit_ in the dbt CLI cluster dropdown.
13. Click _Save_

# Run the job and view dbt output
You can now run your newly saved job and see its output.
@@ -80,15 +80,16 @@
$ tar -xvf artifact.tar.gz
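The full command sequence is collapsed in this diff view. As a rough, hedged sketch, retrieving dbt run artifacts with the Jobs API and `jq` might look like the following, assuming the run output exposes a `dbt_output.artifacts_link` pre-signed URL (check the Jobs API documentation for the exact field names):

```bash
# Hypothetical ID of a job run that contains a dbt task.
RUN_ID=12345

# Fetch the run output; for dbt tasks this is assumed to include a
# dbt_output.artifacts_link field with a pre-signed download URL.
ARTIFACTS_URL=$(databricks runs get-output --run-id "$RUN_ID" | jq -r '.dbt_output.artifacts_link')

# Download and unpack the artifacts (manifest.json, run_results.json, logs, ...).
curl -L -o artifact.tar.gz "$ARTIFACTS_URL"
tar -xvf artifact.tar.gz
```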

# Common issues
## Unable to connect to Databricks
- For now, you must provide a `profiles.yml` file in the root of the Git repository. Check that this file is present and properly named (for example, not `profile.yml`).
- If you do not use the automatically generated `profiles.yml`, check your Personal Access Token (PAT). It must not be expired.
- Consider adding `dbt debug` as the first command (see the sketch after this list). This may give you a clue about the failure.
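As an illustrative (not prescriptive) example, the dbt commands configured on the task could look like this, with `dbt debug` first so connection problems surface before any models are built:

```bash
dbt debug   # verify the generated profile and the connection before building anything
dbt deps    # install any packages declared in packages.yml
dbt run     # then build the project
```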

## dbt cannot find my `dbt_project.yml` file
If you checked out the Git repository before enabling the _Files in Repos_ feature, the cached checkout may be invalid. Push a dummy commit to your repository to force a fresh checkout.

# Connecting to different sources (custom profile)
By default the dbt task type will connect to the Automated Cluster dbt-core is running on without any configuration changes or need to check in any secrets. It does so by generating a default `profiles.yml` and telling dbt to use it. We have no restrictions on connection to any other dbt targets such as Databricks SQL, Amazon Redshift, Google BigQuery, Snowflak, or any other [supported adapter](https://docs.getdbt.com/docs/available-adapters). The automatically generated profile can be overridden by specifying an alternative profiles directory in the dbt command using `--profiles-dir <dir>`, where the path of the `<dir>` should be a relative path like `.` or `./my-directory`.
By default, the dbt task type connects to the serverless SQL warehouse specified in the task, with no configuration changes and no need to check in any secrets. It does so by generating a default `profiles.yml` and telling dbt to use it. There are no restrictions on connecting to other dbt targets such as Databricks SQL, Amazon Redshift, Google BigQuery, Snowflake, or any other [supported adapter](https://docs.getdbt.com/docs/available-adapters).

The automatically generated profile can be overridden by specifying an alternative profiles directory in the dbt command using `--profiles-dir <dir>`, where `<dir>` should be a relative path like `.` or `./my-directory`.
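For instance, assuming a custom `profiles.yml` is checked in under a hypothetical `profiles/` folder at the project root, the task's dbt commands could point at it like this:

```bash
dbt deps --profiles-dir ./profiles
dbt run --profiles-dir ./profiles
```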

If you'd like to connect to multiple outputs and include the current Automated Cluster as one of those, the following configuration can be used without exposing any secrets:
```yaml
# …example collapsed in this diff view…
```
Binary file modified docs/img/dbt-task-type.png
