subcategory: Storage
This resource will mount your cloud storage on `dbfs:/mnt/<name>`. It currently supports mounting AWS S3, Azure (Blob Storage, ADLS Gen1 & Gen2), and Google Cloud Storage. It is important to understand that this will start up the cluster if the cluster is terminated. The `terraform` read and refresh operations require a cluster and may take some time to validate the mount.
-> Note When `cluster_id` is not specified, the provider creates the smallest possible cluster in the default availability zone, with a name equal to or starting with `terraform-mount`, for the shortest possible amount of time. To avoid mount failures due to potential quota or capacity issues with this default cluster, we recommend specifying a cluster to use for mounting (see the sketch below).
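A minimal sketch of such a dedicated mounting cluster might look like this (the cluster name and sizing are illustrative assumptions, not requirements):

```hcl
data "databricks_node_type" "smallest" {
  local_disk = true
}

data "databricks_spark_version" "latest" {
}

# small, auto-terminating cluster dedicated to performing mounts
resource "databricks_cluster" "mounting" {
  cluster_name            = "terraform-mounting"
  spark_version           = data.databricks_spark_version.latest.id
  node_type_id            = data.databricks_node_type.smallest.id
  autotermination_minutes = 10
  num_workers             = 1
}
```

Each `databricks_mount` resource can then reference it with `cluster_id = databricks_cluster.mounting.id`.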
This resource provides two ways of mounting a storage account:
- Use a storage-specific configuration block - this covers most cases, as it fills in most of the necessary details. Currently the following configuration blocks are supported:
  - `s3` - to mount AWS S3
  - `gs` - to mount Google Cloud Storage
  - `abfs` - to mount ADLS Gen2 using the Azure Blob Filesystem (ABFS) driver
  - `adl` - to mount ADLS Gen1 using the Azure Data Lake (ADL) driver
  - `wasb` - to mount Azure Blob Storage using the Windows Azure Storage Blob (WASB) driver
- Use generic arguments - you are responsible for providing all of the parameters required to mount the specific storage. This is the most flexible option.
The following arguments are common to all mount types:

- `cluster_id` - (Optional, String) Cluster to use for mounting. If no cluster is specified, a new cluster will be created, and it will mount the bucket for all of the clusters in this workspace. If the cluster is not running, it will be started, so make sure to set auto-termination rules on it.
- `name` - (Optional, String) Name under which the mount will be accessible in `dbfs:/mnt/<MOUNT_NAME>`. If not specified, the provider will try to infer it from the resource type:
  - `bucket_name` for AWS S3 and Google Cloud Storage
  - `container_name` for ADLS Gen2 and Azure Blob Storage
  - `storage_resource_name` for ADLS Gen1
- `uri` - (Optional, String) the URI for accessing the specific storage (`s3a://...`, `abfss://...`, `gs://...`, etc.)
- `extra_configs` - (Optional, String map) configuration parameters that are necessary for mounting the specific storage
- `resource_id` - (Optional, String) resource ID for a given storage account. Could be used to fill defaults, such as storage account & container names, on Azure.
- `encryption_type` - (Optional, String) encryption type. Currently used only for AWS S3 mounts.
For example, ADLS Gen2 can be mounted with the generic `uri` and `extra_configs` arguments, referencing a service principal secret stored in a Databricks secret scope:

```hcl
locals {
tenant_id = "00000000-1111-2222-3333-444444444444"
client_id = "55555555-6666-7777-8888-999999999999"
secret_scope = "some-kv"
secret_key = "some-sp-secret"
container = "test"
storage_acc = "lrs"
}
resource "databricks_mount" "this" {
name = "tf-abfss"
uri = "abfss://${local.container}@${local.storage_acc}.dfs.core.windows.net"
extra_configs = {
"fs.azure.account.auth.type" : "OAuth",
"fs.azure.account.oauth.provider.type" : "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id" : local.client_id,
"fs.azure.account.oauth2.client.secret" : "{{secrets/${local.secret_scope}/${local.secret_key}}}",
"fs.azure.account.oauth2.client.endpoint" : "https://login.microsoftonline.com/${local.tenant_id}/oauth2/token",
"fs.azure.createRemoteFileSystemDuringInitialization" : "false",
}
}
```
-> Note You can't create an AAD passthrough mount using a service principal!

To mount ADLS Gen2 with Azure Active Directory credentials passthrough, we need to execute the mount commands on a cluster configured with AAD credentials passthrough and provide the necessary configuration parameters (see the documentation for more details).
provider "azurerm" {
features {}
}
variable "resource_group" {
type = string
description = "Resource group for Databricks Workspace"
}
variable "workspace_name" {
type = string
description = "Name of the Databricks Workspace"
}
data "azurerm_databricks_workspace" "this" {
name = var.workspace_name
resource_group_name = var.resource_group
}
# this works only with AAD token-based authentication!
provider "databricks" {
host = data.azurerm_databricks_workspace.this.workspace_url
}
data "databricks_node_type" "smallest" {
local_disk = true
}
data "databricks_spark_version" "latest" {
}
resource "databricks_cluster" "shared_passthrough" {
cluster_name = "Shared Passthrough for mount"
spark_version = data.databricks_spark_version.latest.id
node_type_id = data.databricks_node_type.smallest.id
autotermination_minutes = 10
num_workers = 1
spark_conf = {
"spark.databricks.cluster.profile" : "serverless",
"spark.databricks.repl.allowedLanguages" : "python,sql",
"spark.databricks.passthrough.enabled" : "true",
"spark.databricks.pyspark.enableProcessIsolation" : "true"
}
custom_tags = {
"ResourceClass" : "Serverless"
}
}
variable "storage_acc" {
type = string
description = "Name of the ADLS Gen2 storage container"
}
variable "container" {
type = string
description = "Name of container inside storage account"
}
resource "databricks_mount" "passthrough" {
name = "passthrough-test"
cluster_id = databricks_cluster.shared_passthrough.id
uri = "abfss://${var.container}@${var.storage_acc}.dfs.core.windows.net"
extra_configs = {
"fs.azure.account.auth.type" : "CustomAccessToken",
"fs.azure.account.custom.token.provider.class" : "{{sparkconf/spark.databricks.passthrough.adls.gen2.tokenProviderClassName}}",
}
}
```
This block allows specifying parameters for mounting of AWS S3. The following arguments are supported inside the `s3` block:
- `instance_profile` - (Optional) (String) ARN of a registered instance profile for data access. If it's not specified, then `cluster_id` should be provided, and the cluster should have an instance profile attached to it. If both `cluster_id` & `instance_profile` are specified, then `cluster_id` takes precedence.
- `bucket_name` - (Required) (String) S3 bucket name to be mounted.
```hcl
// now you can do `%fs ls /mnt/experiments` in notebooks
resource "databricks_mount" "this" {
name = "experiments"
s3 {
instance_profile = databricks_instance_profile.ds.id
bucket_name = aws_s3_bucket.this.bucket
}
}
```
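If a cluster with an instance profile already attached is available, `cluster_id` can be supplied instead of `instance_profile`. A minimal sketch, where the `databricks_cluster.shared` reference is an assumption for illustration:

```hcl
resource "databricks_mount" "experiments_via_cluster" {
  name       = "experiments-via-cluster"
  cluster_id = databricks_cluster.shared.id # this cluster must have an instance profile attached

  s3 {
    bucket_name = aws_s3_bucket.this.bucket
  }
}
```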
This block allows specifying parameters for mounting of ADLS Gen2. The following arguments are supported inside the `abfs` block:
- `client_id` - (Required) (String) This is the client_id (Application Object ID) for the enterprise application of the service principal.
- `tenant_id` - (Optional) (String) This is your Azure Directory tenant id. It is required for creating the mount. (Could be omitted if Azure authentication is used and we can extract `tenant_id` from it.)
- `client_secret_key` - (Required) (String) This is the secret key under which your service principal/enterprise app client secret is stored.
- `client_secret_scope` - (Required) (String) This is the secret scope in which your service principal/enterprise app client secret is stored.
- `container_name` - (Required) (String) ADLS Gen2 container name. (Could be omitted if `resource_id` is provided.)
- `storage_account_name` - (Required) (String) The name of the storage account in which the data is. (Could be omitted if `resource_id` is provided.)
- `directory` - (Computed) (String) An optional additional directory inside the container to mount; if provided, it must start with "/".
- `initialize_file_system` - (Required) (Bool) Whether or not to initialize the file system on first use.
In this example, we're using Azure authentication, so we can omit some parameters (`tenant_id`, `storage_account_name`, and `container_name`) that will be detected automatically.
resource "databricks_secret_scope" "terraform" {
name = "application"
initial_manage_principal = "users"
}
resource "databricks_secret" "service_principal_key" {
key = "service_principal_key"
string_value = "${var.ARM_CLIENT_SECRET}"
scope = databricks_secret_scope.terraform.name
}
resource "azurerm_storage_account" "this" {
name = "${var.prefix}datalake"
resource_group_name = var.resource_group_name
location = var.resource_group_location
account_tier = "Standard"
account_replication_type = "GRS"
account_kind = "StorageV2"
is_hns_enabled = true
}
resource "azurerm_role_assignment" "this" {
scope = azurerm_storage_account.this.id
role_definition_name = "Storage Blob Data Contributor"
principal_id = data.azurerm_client_config.current.object_id
}
resource "azurerm_storage_container" "this" {
name = "marketing"
storage_account_name = azurerm_storage_account.this.name
container_access_type = "private"
}
resource "databricks_mount" "marketing" {
name = "marketing"
resource_id = azurerm_storage_container.this.resource_manager_id
abfs {
client_id = data.azurerm_client_config.current.client_id
client_secret_scope = databricks_secret_scope.terraform.name
client_secret_key = databricks_secret.service_principal_key.key
initialize_file_system = true
}
}
```
This block allows specifying parameters for mounting of Google Cloud Storage. The following arguments are supported inside the `gs` block:
- `service_account` - (Optional) (String) Email of a registered Google Service Account for data access. If it's not specified, then `cluster_id` should be provided, and the cluster should have a Google service account attached to it.
- `bucket_name` - (Required) (String) GCS bucket name to be mounted.
resource "databricks_mount" "this_gs" {
name = "gs-mount"
gs {
service_account = "acc@company.iam.gserviceaccount.com"
bucket_name = "mybucket"
}
}
```
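Alternatively, when the cluster itself carries the Google service account, only `bucket_name` is needed inside the block. A minimal sketch, where the `databricks_cluster.with_gsa` reference is an assumption for illustration:

```hcl
resource "databricks_mount" "this_gs_via_cluster" {
  name       = "gs-mount-via-cluster"
  cluster_id = databricks_cluster.with_gsa.id # cluster created with a Google service account attached

  gs {
    bucket_name = "mybucket"
  }
}
```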
This block allows specifying parameters for mounting of ADLS Gen1. The following arguments are supported inside the `adl` block:
- `client_id` - (Required) (String) This is the client_id for the enterprise application of the service principal.
- `tenant_id` - (Optional) (String) This is your Azure Directory tenant id. It is required for creating the mount. (Could be omitted if Azure authentication is used and we can extract `tenant_id` from it.)
- `client_secret_key` - (Required) (String) This is the secret key under which your service principal/enterprise app client secret is stored.
- `client_secret_scope` - (Required) (String) This is the secret scope in which your service principal/enterprise app client secret is stored.
- `storage_resource_name` - (Required) (String) The name of the ADLS Gen1 storage resource in which the data is. This is what you are trying to mount. (Could be omitted if `resource_id` is provided.)
- `spark_conf_prefix` - (Optional) (String) This is the Spark configuration prefix for the ADLS Gen1 mount. The options are `fs.adl` and `dfs.adls`. Use `fs.adl` for clusters on runtime 6.0 and above, otherwise use `dfs.adls`. The default value is `fs.adl`.
- `directory` - (Computed) (String) An optional additional directory inside the storage to mount; if provided, it must start with "/".
resource "databricks_mount" "mount" {
name = "{var.RANDOM}"
adl {
storage_resource_name = "{env.TEST_STORAGE_ACCOUNT_NAME}"
tenant_id = data.azurerm_client_config.current.tenant_id
client_id = data.azurerm_client_config.current.client_id
client_secret_scope = databricks_secret_scope.terraform.name
client_secret_key = databricks_secret.service_principal_key.key
spark_conf_prefix = "fs.adl"
}
}
```
This block allows specifying parameters for mounting of Azure Blob Storage. The following arguments are supported inside the `wasb` block:
- `auth_type` - (Required) (String) This is the auth type for blob storage. This can either be SAS tokens (`SAS`) or account access keys (`ACCESS_KEY`).
- `token_secret_scope` - (Required) (String) This is the secret scope in which your auth type token is stored.
- `token_secret_key` - (Required) (String) This is the secret key under which your auth type token is stored.
- `container_name` - (Required) (String) The container in which the data is. This is what you are trying to mount. (Could be omitted if `resource_id` is provided.)
- `storage_account_name` - (Required) (String) The name of the storage account in which the data is. (Could be omitted if `resource_id` is provided.)
- `directory` - (Computed) (String) An optional additional directory inside the container to mount; if provided, it must start with "/".
resource "azurerm_storage_account" "blobaccount" {
name = "${var.prefix}blob"
resource_group_name = var.resource_group_name
location = var.resource_group_location
account_tier = "Standard"
account_replication_type = "LRS"
account_kind = "StorageV2"
}
resource "azurerm_storage_container" "marketing" {
name = "marketing"
storage_account_name = azurerm_storage_account.blobaccount.name
container_access_type = "private"
}
resource "databricks_secret_scope" "terraform" {
name = "application"
initial_manage_principal = "users"
}
resource "databricks_secret" "storage_key" {
key = "blob_storage_key"
string_value = azurerm_storage_account.blobaccount.primary_access_key
scope = databricks_secret_scope.terraform.name
}
resource "databricks_mount" "marketing" {
name = "marketing"
wasb {
container_name = azurerm_storage_container.marketing.name
storage_account_name = azurerm_storage_account.blobaccount.name
auth_type = "ACCESS_KEY"
token_secret_scope = databricks_secret_scope.terraform.name
token_secret_key = databricks_secret.storage_key.key
}
}
```
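A SAS-based mount would differ only in the auth type and the secret that is stored. A hedged sketch, where the `sas_token` variable is a hypothetical input holding a pre-generated SAS token for the container:

```hcl
# hypothetical variable holding a pre-generated SAS token for the container
variable "sas_token" {
  type      = string
  sensitive = true
}

resource "databricks_secret" "storage_sas" {
  key          = "blob_storage_sas"
  string_value = var.sas_token
  scope        = databricks_secret_scope.terraform.name
}

resource "databricks_mount" "marketing_sas" {
  name = "marketing-sas"
  wasb {
    container_name       = azurerm_storage_container.marketing.name
    storage_account_name = azurerm_storage_account.blobaccount.name
    auth_type            = "SAS"
    token_secret_scope   = databricks_secret_scope.terraform.name
    token_secret_key     = databricks_secret.storage_sas.key
  }
}
```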
Migration from the storage-specific mount resources is straightforward (see the example below):

- rename `mount_name` to `name`
- wrap storage-specific settings (`container_name`, ...) into the corresponding block (`adl`, `abfs`, `s3`, `wasb`)
- for S3 mounts, rename `s3_bucket_name` to `bucket_name`
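For instance, migrating a legacy S3 mount might look like the following sketch; the legacy `databricks_aws_s3_mount` resource name and schema shown here are reconstructed from the renames above and should be treated as an assumption:

```hcl
# before: legacy storage-specific resource (assumed shape)
resource "databricks_aws_s3_mount" "this" {
  mount_name       = "experiments"
  s3_bucket_name   = aws_s3_bucket.this.bucket
  instance_profile = databricks_instance_profile.ds.id
}

# after: generic databricks_mount with an s3 block
resource "databricks_mount" "this" {
  name = "experiments"
  s3 {
    bucket_name      = aws_s3_bucket.this.bucket
    instance_profile = databricks_instance_profile.ds.id
  }
}
```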
In addition to all arguments above, the following attributes are exported:

- `id` - mount name
- `source` - (String) HDFS-compatible URL
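The exported `source` attribute can be surfaced as an output, for example. A minimal sketch, assuming the `databricks_mount.this` resource from the examples above:

```hcl
output "mount_source" {
  description = "HDFS-compatible URL of the mounted storage"
  value       = databricks_mount.this.source
}
```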
-> Note Importing this resource is not currently supported.
The following resources are often used in the same context:
- End to end workspace management guide.
- databricks_aws_bucket_policy data to configure a simple access policy for AWS S3 buckets, so that Databricks can access data in them.
- databricks_cluster to create Databricks Clusters.
- databricks_dbfs_file data to get file content from Databricks File System (DBFS).
- databricks_dbfs_file_paths data to get the list of file names from Databricks File System (DBFS).
- databricks_dbfs_file to manage relatively small files on Databricks File System (DBFS).
- databricks_instance_profile to manage AWS EC2 instance profiles that users can use to launch databricks_cluster and access data, like databricks_mount.
- databricks_library to install a library on databricks_cluster.