databricks_cluster resource

This resource allows you to manage Databricks Clusters.

data "databricks_node_type" "smallest" {
  local_disk = true
}

data "databricks_spark_version" "latest_lts" {
  long_term_support = true
}

resource "databricks_cluster" "shared_autoscaling" {
  cluster_name            = "Shared Autoscaling"
  spark_version           = data.databricks_spark_version.latest_lts.id
  node_type_id            = data.databricks_node_type.smallest.id
  autotermination_minutes = 20
  autoscale {
    min_workers = 1
    max_workers = 50
  }
}

Argument Reference

The following example demonstrates how to create an autoscaling cluster with Delta Cache enabled:

data "databricks_node_type" "smallest" {
  local_disk = true
}

data "databricks_spark_version" "latest_lts" {
  long_term_support = true
}

resource "databricks_cluster" "shared_autoscaling" {
  cluster_name            = "Shared Autoscaling"
  spark_version           = data.databricks_spark_version.latest_lts.id
  node_type_id            = data.databricks_node_type.smallest.id
  autotermination_minutes = 20
  autoscale {
    min_workers = 1
    max_workers = 50
  }
  spark_conf = {
    "spark.databricks.io.cache.enabled" : true,
    "spark.databricks.io.cache.maxDiskUsage" : "50g",
    "spark.databricks.io.cache.maxMetaDataCache" : "1g"
  }
}

Fixed size or autoscaling cluster

When you create a Databricks cluster, you can either provide a num_workers for the fixed-size cluster or provide min_workers and/or max_workers for the cluster within the autoscale group. When you provide a fixed-size cluster, Databricks ensures that your cluster has the specified number of workers. When you provide a range for the number of workers, Databricks chooses the appropriate number of workers required to run your job - also known as "autoscaling." With autoscaling, Databricks dynamically reallocates workers to account for the characteristics of your job. Certain parts of your pipeline may be more computationally demanding than others, and Databricks automatically adds additional workers during these phases of your job (and removes them when they’re no longer needed).
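
For instance, here is a minimal sketch of a fixed-size cluster that reuses the data sources from the examples above (the worker count is illustrative):

resource "databricks_cluster" "fixed_size" {
  cluster_name            = "Fixed Size"
  spark_version           = data.databricks_spark_version.latest_lts.id
  node_type_id            = data.databricks_node_type.smallest.id
  autotermination_minutes = 20

  # a fixed number of workers instead of an autoscale block
  num_workers = 2
}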

The optional autoscale configuration block supports the following attributes: min_workers (the minimum number of workers the cluster can scale down to) and max_workers (the maximum number of workers the cluster can scale up to).

When using a Single Node cluster, num_workers needs to be 0. It can be set to 0 explicitly, or simply not specified, as it defaults to 0. When num_workers is 0, the provider checks for the presence of the required Spark configurations (spark.databricks.cluster.profile set to singleNode and spark.master set to local[*]) and also the custom_tags entry ResourceClass set to SingleNode.

The following example demonstrates how to create a single node cluster:

data "databricks_node_type" "smallest" {
  local_disk = true
}

data "databricks_spark_version" "latest_lts" {
  long_term_support = true
}

resource "databricks_cluster" "single_node" {
  cluster_name            = "Single Node"
  spark_version           = data.databricks_spark_version.latest_lts.id
  node_type_id            = data.databricks_node_type.smallest.id
  autotermination_minutes = 20

  spark_conf = {
    # Single-node
    "spark.databricks.cluster.profile" : "singleNode"
    "spark.master" : "local[*]"
  }

  custom_tags = {
    "ResourceClass" = "SingleNode"
  }
}

(Legacy) High-Concurrency clusters

To create a High-Concurrency cluster, the following settings should be provided: spark_conf entries that set spark.databricks.cluster.profile to serverless and spark.databricks.repl.allowedLanguages to a list of supported languages (for example, python,sql), plus a custom_tags entry with ResourceClass set to Serverless.

For example:

resource "databricks_cluster" "cluster_with_table_access_control" {
  cluster_name            = "Shared High-Concurrency"
  spark_version           = data.databricks_spark_version.latest_lts.id
  node_type_id            = data.databricks_node_type.smallest.id
  autotermination_minutes = 20

  spark_conf = {
    "spark.databricks.repl.allowedLanguages" : "python,sql",
    "spark.databricks.cluster.profile" : "serverless"
  }

  custom_tags = {
    "ResourceClass" = "Serverless"
  }
}

library Configuration Block

To install libraries, specify each library in a separate library configuration block. Each type of library has a slightly different syntax, and only one type of library can be set within a single block; otherwise, the plan will fail with an error.

Installing JAR artifacts on a cluster. The location can be DBFS or a mounted object store (S3, ADLS, ...).

library {
  jar = "dbfs:/FileStore/app-0.0.1.jar"
}

Installing Python EGG artifacts. The location can be DBFS or a mounted object store (S3, ADLS, ...).

library {
  egg = "dbfs:/FileStore/foo.egg"
}

Installing Python Wheel artifacts. The location can be DBFS or a mounted object store (S3, ADLS, ...).

library {
  whl = "dbfs:/FileStore/baz.whl"
}

Installing Python PyPI artifacts. You can optionally also specify the repo parameter for a custom PyPI mirror, which should be accessible without any authentication from the network that the cluster runs in.

library {
  pypi {
    package = "fbprophet==0.6"
    // repo can also be specified here
  }
}

Installing artifacts from a Maven repository. You can also optionally specify a repo parameter for a custom Maven-style repository, which should be accessible without any authentication. Maven libraries are resolved in the Databricks Control Plane, so the repo should be accessible from it. It can even be a properly configured Maven S3 wagon, AWS CodeArtifact, or Azure Artifacts.

library {
  maven {
    coordinates = "com.amazon.deequ:deequ:1.0.4"
    // exclusions is optional
    exclusions = ["org.apache.avro:avro"]
  }
}

Installing artifacts from CRAN. You can also optionally specify a repo parameter for a custom CRAN mirror.

library {
  cran {
    package = "rkeops"
  }
}

cluster_log_conf

Example of pushing all cluster logs to DBFS:

cluster_log_conf {
  dbfs {
    destination = "dbfs:/cluster-logs"
  }
}

Example of pushing all cluster logs to S3:

cluster_log_conf {
  s3 {
    destination = "s3://acmecorp-main/cluster-logs"
    region      = "us-east-1"
  }
}

There are a few more advanced attributes for S3 log delivery:
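
As a sketch, assuming the standard S3 storage attributes (enable_encryption, encryption_type, and canned_acl; the bucket name and values shown are illustrative):

cluster_log_conf {
  s3 {
    destination       = "s3://acmecorp-main/cluster-logs"
    region            = "us-east-1"
    # encrypt delivered log files server-side
    enable_encryption = true
    encryption_type   = "sse-s3"
    # grant the bucket owner full control over the delivered objects
    canned_acl        = "bucket-owner-full-control"
  }
}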

init_scripts

To run a particular init script on all clusters within the same workspace, both automated/job and interactive/all-purpose cluster types, please consider the databricks_global_init_script resource.

It is possible to specify up to 10 different cluster-scoped init scripts per cluster. Init scripts support DBFS, cloud storage locations, and workspace files.

Example of using a Databricks workspace file as init script:

init_scripts {
  workspace {
    destination = "/Users/user@domain/install-elk.sh"
  }
}

Example of using a file from Unity Catalog Volume as init script:

init_scripts {
  volumes {
    destination = "/Volumes/Catalog/default/init-scripts/init-script.sh"
  }
}

Example of taking init script from DBFS (deprecated):

init_scripts {
  dbfs {
    destination = "dbfs:/init-scripts/install-elk.sh"
  }
}

Example of taking init script from S3:

init_scripts {
  s3 {
    destination = "s3://acmecorp-main/init-scripts/install-elk.sh"
    region      = "us-east-1"
  }
}

Similarly, for an init script stored in GCS:

init_scripts {
  gcs {
    destination = "gs://init-scripts/install-elk.sh"
  }
}

Similarly, for an init script stored in ADLS:

init_scripts {
  abfss {
    destination = "abfss://container@storage.dfs.core.windows.net/install-elk.sh"
  }
}

Please note that you need to provide Spark Hadoop configuration (spark.hadoop.fs.azure...) to authenticate to ADLS so the cluster can access the init script.
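
For example, a sketch of authenticating to the storage account from the example above with a service principal via Spark Hadoop configuration (the application ID, secret scope, secret key, and tenant ID are placeholders):

spark_conf = {
  # OAuth (service principal) authentication for the "storage" account referenced by the abfss init script
  "spark.hadoop.fs.azure.account.auth.type.storage.dfs.core.windows.net" : "OAuth",
  "spark.hadoop.fs.azure.account.oauth.provider.type.storage.dfs.core.windows.net" : "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "spark.hadoop.fs.azure.account.oauth2.client.id.storage.dfs.core.windows.net" : "<application-id>",
  # secret reference resolved by Databricks at cluster start
  "spark.hadoop.fs.azure.account.oauth2.client.secret.storage.dfs.core.windows.net" : "{{secrets/<scope>/<service-credential-key>}}",
  "spark.hadoop.fs.azure.account.oauth2.client.endpoint.storage.dfs.core.windows.net" : "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}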

Clusters with custom Docker containers also allow a local file location for init scripts as follows:

init_scripts {
  file {
    destination = "file:/my/local/file.sh"
  }
}

aws_attributes

aws_attributes optional configuration block contains attributes related to clusters running on Amazon Web Services.

Here is an example of a shared autoscaling cluster with some of the AWS options set:

data "databricks_spark_version" "latest" {}
data "databricks_node_type" "smallest" {
  local_disk = true
}
resource "databricks_cluster" "this" {
  cluster_name            = "Shared Autoscaling"
  spark_version           = data.databricks_spark_version.latest.id
  node_type_id            = data.databricks_node_type.smallest.id
  autotermination_minutes = 20
  autoscale {
    min_workers = 1
    max_workers = 50
  }
  aws_attributes {
    availability           = "SPOT"
    zone_id                = "us-east-1"
    first_on_demand        = 1
    spot_bid_price_percent = 100
  }
}

The following options are available:

azure_attributes

azure_attributes optional configuration block contains attributes related to clusters running on Azure.

Here is an example of a shared autoscaling cluster with some of the Azure options set:

data "databricks_spark_version" "latest" {}
data "databricks_node_type" "smallest" {
  local_disk = true
}
resource "databricks_cluster" "this" {
  cluster_name            = "Shared Autoscaling"
  spark_version           = data.databricks_spark_version.latest.id
  node_type_id            = data.databricks_node_type.smallest.id
  autotermination_minutes = 20
  autoscale {
    min_workers = 1
    max_workers = 50
  }
  azure_attributes {
    availability       = "SPOT_WITH_FALLBACK_AZURE"
    first_on_demand    = 1
    spot_bid_max_price = 100
  }
}

The following options are available:

gcp_attributes

gcp_attributes optional configuration block contains attributes related to clusters running on GCP.

Here is an example of a shared autoscaling cluster with some of the GCP options set:

resource "databricks_cluster" "this" {
  cluster_name            = "Shared Autoscaling"
  spark_version           = data.databricks_spark_version.latest.id
  node_type_id            = data.databricks_node_type.smallest.id
  autotermination_minutes = 20
  autoscale {
    min_workers = 1
    max_workers = 50
  }
  gcp_attributes {
    availability = "PREEMPTIBLE_WITH_FALLBACK_GCP"
    zone_id      = "AUTO"
  }
}

The following options are available:

docker_image

Databricks Container Services lets you specify a Docker image when you create a cluster. You need to enable Container Services on the Admin Console / Advanced page in the user interface. By enabling this feature, you acknowledge and agree that your usage of this feature is subject to the applicable additional terms.

The docker_image configuration block has the following attributes: url (the URL of the Docker image) and an optional basic_auth block with username and password for the registry.

Example usage with azurerm_container_registry and docker_registry_image, which you can adapt to your specific use case:

resource "docker_registry_image" "this" {
  name = "${azurerm_container_registry.this.login_server}/sample:latest"
  build {
    # ...
  }
}

resource "databricks_cluster" "this" {
  # ...
  docker_image {
    url = docker_registry_image.this.name
    basic_auth {
      username = azurerm_container_registry.this.admin_username
      password = azurerm_container_registry.this.admin_password
    }
  }
}

cluster_mount_info blocks (experimental)

It's possible to mount NFS (Network File System) resources into the Spark containers inside the cluster. You can specify one or more cluster_mount_info blocks describing the mounts. Each block has the following attributes: a network_filesystem_info block (with server_address and mount_options), remote_mount_dir_path, and local_mount_dir_path.

For example, you can mount an Azure Data Lake Storage container using the following code:

locals {
  storage_account   = "ewfw3ggwegwg"
  storage_container = "test"
}

resource "databricks_cluster" "with_nfs" {
  # ...
  cluster_mount_info {
    network_filesystem_info {
      server_address = "${local.storage_account}.blob.core.windows.net"
      mount_options  = "sec=sys,vers=3,nolock,proto=tcp"
    }
    remote_mount_dir_path = "${local.storage_account}/${local.storage_container}"
    local_mount_dir_path  = "/mnt/nfs-test"
  }
}

workload_type block

It's possible to restrict which workloads may run on the given cluster - notebooks and/or jobs. This is done by defining a workload_type block that consists of a single clients block with the following boolean attributes: jobs (whether the cluster can run jobs) and notebooks (whether the cluster can run notebooks).

resource "databricks_cluster" "with_nfs" {
  # ...
  workload_type {
    clients {
      jobs      = false
      notebooks = true
    }
  }
}

Attribute Reference

In addition to all arguments above, the following attributes are exported:

Access Control
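
Cluster access can be managed with the databricks_permissions resource. A minimal sketch (the group name and permission level are illustrative):

resource "databricks_permissions" "cluster_usage" {
  cluster_id = databricks_cluster.shared_autoscaling.id

  # allow a group to restart and attach to the cluster
  access_control {
    group_name       = "data-engineers"
    permission_level = "CAN_RESTART"
  }
}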

Import

The cluster resource can be imported using its cluster ID:

terraform import databricks_cluster.this <cluster-id>

The following resources are often used in the same context: