# A Databricks Data Pipeline project template using Giter8
This is a Scala SBT project template intended to fast track data engineers in developing data pipelines using [Apache Spark][spark] and [Delta][delta] on the Databricks platform by providing a working Scala SBT project with the necessary dependencies.
This template provides:

- An SBT Scala project with the following dependencies:
  - `spark-sql` for developing your ETL pipeline
  - `delta-core` for creating Delta tables
  - `dbutils-api` for accessing Databricks Utilities
  - `scalatest` for unit tests
  - `scribe` for logging
- The `sbt-assembly` sbt plugin, in case you need to build an uber jar.
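To illustrate the kind of pipeline the template is meant to host, here is a sketch of an entry point using `spark-sql` and `delta-core`. The object name, input path, and table name are hypothetical examples, not part of the generated project:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical pipeline entry point; names and paths are examples only.
object PipelineMain {
  def main(args: Array[String]): Unit = {
    // e.g. "dev" passed as a job parameter (see the Workflow JSON below)
    val env = args.headOption.getOrElse("dev")

    val spark = SparkSession.builder()
      .appName(s"example-pipeline-$env")
      .getOrCreate()

    // Read raw data, apply a simple transformation, write a Delta table.
    val raw = spark.read.json("/mnt/raw/events")
    val cleaned = raw.where("event_type IS NOT NULL")

    cleaned.write
      .format("delta")
      .mode("overwrite")
      .saveAsTable("events_cleaned")

    spark.stop()
  }
}
```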
- Install SBT in case you don't have it installed on your local machine
- Create a project using the template:

  ```
  $ sbt new ganeshchand/databricks-data-pipeline-template.g8
  ```

  Note: you will be prompted to either provide parameter values or simply accept the defaults. Alternatively, you can provide the parameter values you wish to customize, as shown below:

  ```
  $ sbt new ganeshchand/databricks-data-pipeline-template.g8 --name=example --organization=com.databricks
  $ cd <YOUR_PROJECT_NAME>
  ```
- Run the tests:

  ```
  $ sbt test
  ```

  The first run will take about a minute, depending on your internet speed, because it downloads the project dependency jars from Maven.
- Open your project in your IDE
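Since the template ships with `scalatest`, a unit test in the generated project might look like the following sketch. The suite and the function under test are hypothetical, not part of the template:

```scala
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical function under test: a pure string transformation.
object Transformations {
  def normalize(s: String): String = s.trim.toLowerCase
}

// A minimal ScalaTest suite exercising it.
class TransformationsSpec extends AnyFunSuite {
  test("normalize trims whitespace and lowercases") {
    assert(Transformations.normalize("  Hello ") == "hello")
  }
}
```

Keeping transformations as pure functions like this lets you unit-test them without spinning up a SparkSession.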
- If you are a Databricks customer, you can create a Workflow using the jar, following the instructions here.

  To create the jar, run the following command:

  ```
  $ sbt assembly
  ```
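If the assembly step fails on duplicate files, a common pattern is to add a merge strategy in `build.sbt`. This is a sketch under two assumptions that are not taken from the template itself: the Spark version shown is an example, and marking `spark-sql` as `provided` assumes the Databricks cluster supplies Spark at runtime:

```scala
// build.sbt (sketch): Spark is provided by the Databricks runtime,
// so it should not be bundled into the uber jar.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.0" % "provided"

// Resolve duplicate-file conflicts when sbt-assembly merges dependency jars.
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", _*) => MergeStrategy.discard
  case _                        => MergeStrategy.first
}
```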
For reference, below is a JSON representation of a Databricks Workflow with a jar task:

```json
{
  "name": "scala-sbt-template-test",
  "tasks": [
    {
      "task_key": "app",
      "spark_jar_task": {
        "jar_uri": "",
        "main_class_name": "com.examples.databricks.PipelineMain",
        "parameters": ["dev"]
      },
      "job_cluster_key": "job_cluster",
      "libraries": [
        {
          "jar": "dbfs:/gc/jars/example-assembly-0.0.1.jar"
        }
      ]
    }
  ],
  "job_clusters": [
    {
      "job_cluster_key": "job_cluster",
      "new_cluster": {
        "spark_version": "14.3.x-scala2.12",
        "node_type_id": "m5d.large",
        "driver_node_type_id": "m5d.large",
        "data_security_mode": "SINGLE_USER",
        "runtime_engine": "STANDARD",
        "num_workers": 1
      }
    }
  ]
}
```
Scala 2.13 support was added starting with Apache Spark 3.2.0. However, Databricks only supports Scala 2.12, so the template defaults to Scala 2.12. If you wish to create a project with Scala 2.13, use the scala213 branch:

```
$ sbt new ganeshchand/databricks-data-pipeline-template.g8 -b scala213
```
- Fork this repo and clone it to your local machine
- Make changes to the template and make sure to clean the previous build:

  ```
  sbt clean compile
  ```

- Test the template on your local machine:

  ```
  sbt new file://databricks-data-pipeline-template.g8 --name=template-test --organization=com.example
  cd template-test
  sbt test
  ```

- Send the pull request

Note: if you need to use an absolute local path, you'll need to use `file://<absolute_path>`:

```
$ sbt new file:///temp/databricks-data-pipeline-template.g8 --name=test --organization=com.example
```