# A Databricks Data Pipeline project template using Giter8
This is a Scala SBT project template intended to fast track data engineers in developing data pipelines using [Apache Spark][spark] and [Delta][delta] on the Databricks platform by providing a working Scala SBT project with the necessary dependencies.
This template provides:

- An SBT Scala project with the following dependencies:
  - `spark-sql` for developing your ETL pipeline
  - `delta-core` for creating Delta tables
  - `dbutils-api` for accessing Databricks Utilities
  - `scalatest` for unit tests
  - `scribe` for logging
- The `sbt-assembly` sbt plugin, in case you need to build an uber jar.
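To illustrate the kind of pipeline the template is meant to host, here is a sketch of an entry point using `spark-sql` and `delta-core`. The object name, input path, and table name are hypothetical examples, not part of the generated project:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical pipeline entry point; names and paths are examples only.
object PipelineMain {
  def main(args: Array[String]): Unit = {
    // e.g. "dev" passed as a job parameter (see the Workflow JSON below)
    val env = args.headOption.getOrElse("dev")

    val spark = SparkSession.builder()
      .appName(s"example-pipeline-$env")
      .getOrCreate()

    // Read raw data, apply a simple transformation, write a Delta table.
    val raw = spark.read.json("/mnt/raw/events")
    val cleaned = raw.where("event_type IS NOT NULL")

    cleaned.write
      .format("delta")
      .mode("overwrite")
      .saveAsTable("events_cleaned")

    spark.stop()
  }
}
```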
- Install SBT in case you don't have it installed on your local machine
- Create a project using the template:

  ```
  $ sbt new ganeshchand/databricks-data-pipeline-template.g8
  ```

  Note: you will be prompted to either provide parameter values or simply accept the defaults. Alternatively, you can provide the parameter values you wish to customize, as shown below:

  ```
  $ sbt new ganeshchand/databricks-data-pipeline-template.g8 --name=example --organization=com.databricks
  $ cd <YOUR_PROJECT_NAME>
  ```
- Run the tests:

  ```
  $ sbt test
  ```

  The first run will take about a minute, depending on your internet speed, because it downloads the project dependency jars from Maven.
- Open your project in your IDE
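Since the template ships with `scalatest`, a unit test in the generated project might look like the following sketch. The suite and the function under test are hypothetical, not part of the template:

```scala
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical function under test: a pure string transformation.
object Transformations {
  def normalize(s: String): String = s.trim.toLowerCase
}

// A minimal ScalaTest suite exercising it.
class TransformationsSpec extends AnyFunSuite {
  test("normalize trims whitespace and lowercases") {
    assert(Transformations.normalize("  Hello ") == "hello")
  }
}
```

Keeping transformations as pure functions like this lets you unit-test them without spinning up a SparkSession.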
- If you are a Databricks customer, you can create a Workflow using the jar, following the instructions here.

  To create the jar, run the following command:

  ```
  $ sbt assembly
  ```
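If the assembly step fails on duplicate files, a common pattern is to add a merge strategy in `build.sbt`. This is a sketch under two assumptions that are not taken from the template itself: the Spark version shown is an example, and marking `spark-sql` as `provided` assumes the Databricks cluster supplies Spark at runtime:

```scala
// build.sbt (sketch): Spark is provided by the Databricks runtime,
// so it should not be bundled into the uber jar.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.0" % "provided"

// Resolve duplicate-file conflicts when sbt-assembly merges dependency jars.
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", _*) => MergeStrategy.discard
  case _                        => MergeStrategy.first
}
```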
For reference, below is a JSON representation of a Databricks Workflow with a jar task:

```json
{
  "name": "scala-sbt-template-test",
  "tasks": [
    {
      "task_key": "app",
      "spark_jar_task": {
        "jar_uri": "",
        "main_class_name": "com.examples.databricks.PipelineMain",
        "parameters": ["dev"]
      },
      "job_cluster_key": "job_cluster",
      "libraries": [
        {
          "jar": "dbfs:/gc/jars/example-assembly-0.0.1.jar"
        }
      ]
    }
  ],
  "job_clusters": [
    {
      "job_cluster_key": "job_cluster",
      "new_cluster": {
        "spark_version": "14.3.x-scala2.12",
        "node_type_id": "m5d.large",
        "driver_node_type_id": "m5d.large",
        "data_security_mode": "SINGLE_USER",
        "runtime_engine": "STANDARD",
        "num_workers": 1
      }
    }
  ]
}
```
Scala 2.13 support was added starting with Apache Spark 3.2.0. However, Databricks only supports Scala 2.12, so the template defaults to Scala 2.12. If you wish to create a project with Scala 2.13, use the scala213 branch:

```
$ sbt new ganeshchand/databricks-data-pipeline-template.g8 -b scala213
```
- Fork this repo and clone it to your local machine
- Make changes to the template and make sure to clean the previous build:

  ```
  sbt clean compile
  ```

- Test the template on your local machine:

  ```
  sbt new file://databricks-data-pipeline-template.g8 --name=template-test --organization=com.example
  cd template-test
  sbt test
  ```

- Send the pull request

Note: if you need to use an absolute local path, you'll need to use `file://<absolute_path>`:

```
$ sbt new file:///temp/databricks-data-pipeline-template.g8 --name=test --organization=com.example
```