- all-purpose clusters (notebook/job clusters) :: block by block code execution
- standard job clusters :: job execution (~3-4 times cheaper)
- path to root:
file:/databricks/driver/
- use your own IDE (e.g. Jupyterlab): https://databricks.com/blog/2019/12/03/jupyterlab-databricks-integration-bridge-local-and-remote-workflows.html
- Koalas: pandas API on Apache Spark
- secrets: credentials manager
- the comment/uncomment shortcut Cmd + / doesn't work with German / Swiss Mac keyboards.
-> German fix [credits: Franziska]: Option + select beginning of lines with mouse + #
-> Swiss fix: Ctrl + -
- choose a new job cluster (the all-purpose ones are more expensive and can also be used as notebook clusters).
- Saving logs on that cluster: advanced options -> logging -> destination -> DBFS
- give cluster access to AWS/Azure storage: options -> advanced
- If you are using a previously created MLflow experiment linked to the notebook, find the experiment name with:
from mlflow.tracking import MlflowClient
client = MlflowClient()
experiments = client.search_experiments() # returns a list of mlflow.entities.Experiment; on MLflow < 1.28 use client.list_experiments()
experiments
and then at the beginning of the notebook that runs in the job:
import mlflow
mlflow.set_experiment("expName")
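When the workspace holds many experiments, scanning the full list by eye gets tedious. A minimal sketch of a name filter, assuming each entry exposes a `name` attribute as `mlflow.entities.Experiment` does; the helper `matching_experiment_names` is hypothetical, and the `Exp` tuple is only a stand-in so the snippet runs without MLflow installed:

```python
from collections import namedtuple


def matching_experiment_names(experiments, fragment):
    """Return the names of experiments whose name contains `fragment` (case-insensitive)."""
    fragment = fragment.lower()
    return [e.name for e in experiments if fragment in e.name.lower()]


# Stand-in for mlflow.entities.Experiment, used only to make the sketch runnable here.
Exp = namedtuple("Exp", ["experiment_id", "name"])
demo = [
    Exp("1", "/Users/firstName.lastName@company.com/expName"),
    Exp("2", "/Shared/other"),
]
names = matching_experiment_names(demo, "expname")
print(names)
```

In a notebook, pass the list returned by the `MlflowClient` call above instead of `demo`, then hand the chosen name to `mlflow.set_experiment`.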
Sometimes the job cluster will fail without a good reason (e.g. ImportError: __import__ not found). Try changing the cluster's runtime.
Job clusters, unlike all-purpose clusters, don't have a "stop when idle" option, which means your job may run indefinitely.
bucket = "path/to/bucket"
mount = "folder_name"
# mount if not already mounted
m = dbutils.fs.mounts()
if not any(s.mountPoint == '/mnt/' + mount for s in m):
    dbutils.fs.mount("s3a://" + bucket, '/mnt/' + mount)
display(dbutils.fs.ls('/mnt/' + mount))
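The mount-if-missing check above can be wrapped in a small function so that several notebooks reuse it. This is only a sketch: the helper name `ensure_mounted` is hypothetical, and it assumes the object passed as `fs` behaves like `dbutils.fs`, i.e. `mounts()` returns entries with a `mountPoint` attribute and `mount(source, mount_point)` creates the mount.

```python
def ensure_mounted(fs, bucket, mount_name):
    """Mount s3a://<bucket> at /mnt/<mount_name> unless that mount point exists.

    `fs` is assumed to behave like `dbutils.fs`.
    Returns True if a new mount was created, False if it was already there.
    """
    mount_point = "/mnt/" + mount_name
    if any(m.mountPoint == mount_point for m in fs.mounts()):
        return False
    fs.mount("s3a://" + bucket, mount_point)
    return True


# In a Databricks notebook (dbutils is only defined there):
# ensure_mounted(dbutils.fs, "path/to/bucket", "folder_name")
```

Passing `dbutils.fs` in as an argument keeps the function importable and testable outside a cluster.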
dbutils.fs.cp("s3a://path/to/bucket", "file:/databricks/driver/", True)
- go to user settings under the user icon at the top right and set up a new token
- in terminal:
pip install databricks-cli
databricks configure --token
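- when prompted for the host, enter your workspace URL, e.g. https://westeurope.azuredatabricks.net/?o=5728379491119130
- when prompted for the token, paste the token created above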
- for help:
databricks fs -h
- to download a file:
databricks fs cp dbfs:/FileStore/... local/file/directory
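If the download has to happen from a Python script rather than by hand, the same CLI call can be assembled and run with `subprocess`. A rough sketch, assuming `databricks-cli` is installed and configured as described above; the helper names `dbfs_cp_command` and `dbfs_download` are hypothetical, and the only flag used is the `-r` (recursive) option of `databricks fs cp`:

```python
import subprocess


def dbfs_cp_command(source, target, recursive=False):
    """Build the argument list for `databricks fs cp`."""
    cmd = ["databricks", "fs", "cp"]
    if recursive:
        cmd.append("-r")
    cmd += [source, target]
    return cmd


def dbfs_download(source, target, recursive=False):
    """Run the copy; raises CalledProcessError if the CLI reports a failure."""
    subprocess.run(dbfs_cp_command(source, target, recursive), check=True)


# Only builds and prints the command; nothing is executed here.
print(dbfs_cp_command("dbfs:/FileStore/video2vec/script", "script", recursive=True))
```

Building the command as a list avoids shell quoting problems with paths that contain spaces.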
- right click in the workspace and click import (creating a .py file from scratch is not possible)
- write
%run /Users/firstName.lastName@company.com/path/to/external.py
or %run ./relative/path/to/external.py
in a notebook block.
go to main page -> import library -> pypi -> "package_name"
Whenever you restart the cluster that is attached to the notebook, you have to detach and re-attach the notebook to the cluster in the upper left corner.
Once you are done testing, it is easier to move to a standard cluster, so as to start scheduling and always have the packages installed.
Create a job -> associate a cluster with it -> add a dependent library that was imported with the steps above.
https://docs.databricks.com/notebooks/widgets.html
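The linked page documents notebook widgets, which are a common way to hand parameters to a notebook. A minimal sketch, assuming a `widgets` object that behaves like `dbutils.widgets` (its `text` and `get` calls); the helper `read_widget` and the widget name `run_date` are hypothetical:

```python
def read_widget(widgets, name, default=""):
    """Declare a text widget called `name` and return its current value.

    `widgets` is assumed to behave like `dbutils.widgets`.
    """
    widgets.text(name, default)
    return widgets.get(name)


# In a Databricks notebook (dbutils is only defined there):
# run_date = read_widget(dbutils.widgets, "run_date", "2020-01-01")
```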
import logging
logger = spark._jvm.org.apache.log4j
logging.getLogger("py4j").setLevel(logging.ERROR)
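Once py4j is quiet, a notebook usually still wants its own messages. A small sketch using only the standard `logging` module; the logger name `my_notebook` and the message format are assumptions, not something Databricks prescribes:

```python
import logging

# Quiet the chatty py4j bridge, as in the snippet above.
logging.getLogger("py4j").setLevel(logging.ERROR)

# Hypothetical logger name; pick whatever identifies the notebook or job.
log = logging.getLogger("my_notebook")
log.setLevel(logging.INFO)
if not log.handlers:  # avoid stacking duplicate handlers when the cell is re-run
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    log.addHandler(handler)

log.info("starting step 1")
```

The `if not log.handlers` guard matters in notebooks, where the same cell is often executed several times in one session.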
import urllib3
response = urllib3.PoolManager().request('GET', 'http://health.data.ny.gov/api/views/myeu-hzra/rows.csv')
csvfile = response.data.decode("utf-8")
dbutils.fs.put("dbfs:/babynames.csv", csvfile)
packages <- c("data.table", "profvis", ...)
package.check <- lapply(packages, FUN = function(x) {
  if (!require(x, character.only = TRUE)) install.packages(x)
  if (!(x %in% (.packages()))) library(x, character.only = TRUE)
})
# making sure that the cluster's default python is used to install keras and run models
# without this command, keras_model.compile(), install_keras() and install_tensorflow() conflict by using different python installations.
Sys.setenv(RETICULATE_PYTHON = system("which python", intern = T))
install.packages("tensorflow")
library(tensorflow)
install_tensorflow()
# version = "gpu"
install.packages("keras")
library(keras)
install_keras(tensorflow = "gpu")
### check
k = backend()
sess = k$get_session()
sess$list_devices()
Databricks does not offer integration of an entire GitHub repo. Instead, it works on a file-by-file basis. To work around that, keep using your favourite IDE, and split the script into several text files for readability (e.g. one text file per function / class definition), one solution is the following:
Have a central R notebook on Databricks that sources the R files in a central script directory. The workflow is then the following:
- modify script files locally in your favourite IDE
- in your shell (to install the CLI, see "download files from folder"):
databricks fs mkdirs dbfs:/FileStore/video2vec/script
databricks fs cp -r script dbfs:/FileStore/video2vec/script
- in databricks
setwd("/dbfs/FileStore/video2vec/script")
lapply(list.files(pattern = "[.][rR]$", recursive = FALSE), source)
Another option is an R package, but that would require the extra step of uninstalling and reinstalling the package after every modification. This workflow sounds a bit more tedious than usual, but it takes just a few seconds and is not much slower than full repo Git integration (until that comes to Databricks). Please do message me if you find a nicer workflow for R or Python.
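For Python scripts, the same "source every file in a directory" pattern can be approximated with the standard `runpy` module. This is a sketch under assumptions, not an established Databricks workflow: the helper `run_scripts` is hypothetical, files are executed in name order, and each script sees the names defined by the scripts before it.

```python
import runpy
from pathlib import Path


def run_scripts(directory):
    """Execute every .py file directly inside `directory`, in name order.

    Returns a dict of the (non-dunder) names the scripts defined.
    """
    defined = {}
    for path in sorted(Path(directory).glob("*.py")):
        # Seed each script with the names defined by the earlier scripts.
        result = runpy.run_path(str(path), init_globals=dict(defined))
        defined.update(
            {k: v for k, v in result.items() if not k.startswith("__")}
        )
    return defined


# In a Databricks notebook, after copying the scripts to DBFS as above:
# globals().update(run_scripts("/dbfs/FileStore/video2vec/script"))
```

Filtering out the dunder names keeps the notebook's own `__name__` and `__file__` untouched when the result is merged into `globals()`.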