Note If your workspace was enabled for Unity Catalog automatically, this guide does not apply to you. See this guide instead.
Note Except for the metastore, metastore assignment, and storage credential objects, Unity Catalog APIs are accessible via workspace-level APIs. This design may change in the future.
Databricks Unity Catalog brings fine-grained governance and security to Lakehouse data using a familiar, open interface. You can use Terraform to deploy the underlying cloud resources and Unity Catalog objects automatically, using a programmatic approach.
This guide creates a metastore without a storage root location or credential to maintain strict separation of storage across catalogs or environments.
This guide uses the following variables in configurations:
- databricks_client_id: The client_id is the application_id of a service principal that has account-level admin permissions on https://accounts.cloud.databricks.com.
- databricks_client_secret: The secret of the above service principal.
- databricks_account_id: The ID of your Databricks account. When you are logged in, it appears in the top right corner of the Databricks Account Console or Azure Databricks Account Console.
- databricks_workspace_url: The value of the workspace_url attribute from the databricks_mws_workspaces resource.

This guide is provided as-is, and you can use it as the basis for your custom Terraform module.
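You could supply these values through a terraform.tfvars file, for example. All values below are illustrative placeholders, not working credentials:
databricks_account_id    = "00000000-0000-0000-0000-000000000000"
databricks_client_id     = "11111111-2222-3333-4444-555555555555"
databricks_client_secret = "<service-principal-oauth-secret>"
databricks_workspace_url = "https://dbc-a1b2c3d4-e5f6.cloud.databricks.com"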
To get started with Unity Catalog, this guide takes you through the following high-level steps:
- Initialize the providers and define the required variables
- Add account-level users and groups
- Create a Unity Catalog metastore and assign it to workspaces
- Create a storage credential, the backing AWS IAM role and S3 bucket, and an external location
- Create a catalog and schemas, and grant privileges on them
- Create clusters that can access Unity Catalog
Initialize the provider with the mws alias to set up account-level resources. See provider authentication for more details.
terraform {
required_providers {
databricks = {
source = "databricks/databricks"
}
aws = {
source = "hashicorp/aws"
      version = "~> 4.0" // the aws_s3_bucket_versioning resource used below requires AWS provider v4+
}
}
}
provider "aws" {
region = var.region
}
// initialize provider in "MWS" mode for account-level resources
provider "databricks" {
alias = "mws"
host = "https://accounts.cloud.databricks.com"
account_id = var.databricks_account_id
client_id = var.databricks_client_id
client_secret = var.databricks_client_secret
}
// initialize provider at workspace level, to create UC resources
provider "databricks" {
alias = "workspace"
host = var.databricks_workspace_url
client_id = var.databricks_client_id
client_secret = var.databricks_client_secret
}
Define the required variables:
variable "databricks_client_id" {}
variable "databricks_client_secret" {}
variable "databricks_account_id" {}
variable "databricks_workspace_url" {}
variable "tags" {
default = {}
}
variable "region" {
default = "eu-west-1"
}
variable "databricks_workspace_ids" {
description = <<EOT
List of Databricks workspace IDs to be enabled with Unity Catalog.
Enter with square brackets and double quotes
e.g. ["111111111", "222222222"]
EOT
type = list(string)
}
variable "databricks_users" {
description = <<EOT
List of Databricks users to be added at account-level for Unity Catalog.
Enter with square brackets and double quotes
e.g ["first.last@domain.com", "second.last@domain.com"]
EOT
type = list(string)
}
variable "databricks_metastore_admins" {
description = <<EOT
List of Admins to be added at account-level for Unity Catalog.
Enter with square brackets and double quotes
e.g ["first.admin@domain.com", "second.admin@domain.com"]
EOT
type = list(string)
}
variable "unity_admin_group" {
description = "Name of the admin group. This group will be set as the owner of the Unity Catalog metastore"
type = string
}
//generate a random string as the prefix for AWS resources, to ensure uniqueness
resource "random_string" "naming" {
special = false
upper = false
length = 6
}
locals {
  prefix = "demo${random_string.naming.result}"
  tags   = var.tags // referenced as local.tags by the AWS resources below
}
A Unity Catalog databricks_metastore can be shared across multiple Databricks workspaces. To enable this, Databricks must have a consistent view of users and groups across all workspaces, and has introduced features within the account console to manage this. Users and groups that wish to use Unity Catalog must be created as account-level identities and assigned to the relevant workspaces. All users are added to the account users group by default.
resource "databricks_user" "unity_users" {
provider = databricks.mws
for_each = toset(concat(var.databricks_users, var.databricks_metastore_admins))
user_name = each.key
force = true
}
resource "databricks_group" "admin_group" {
provider = databricks.mws
display_name = var.unity_admin_group
}
resource "databricks_group_member" "admin_group_member" {
provider = databricks.mws
for_each = toset(var.databricks_metastore_admins)
group_id = databricks_group.admin_group.id
member_id = databricks_user.unity_users[each.value].id
}
resource "databricks_user_role" "metastore_admin" {
provider = databricks.mws
for_each = toset(var.databricks_metastore_admins)
user_id = databricks_user.unity_users[each.value].id
role = "account_admin"
}
A databricks_metastore is the top level container for data in Unity Catalog. You can only create a single metastore for each region in which your organization operates, and attach workspaces to the metastore. Each workspace will have the same view of the data you manage in Unity Catalog.
resource "databricks_metastore" "this" {
provider = databricks.mws
name = "primary"
owner = var.unity_admin_group
region = var.region
force_destroy = true
}
resource "databricks_metastore_assignment" "default_metastore" {
provider = databricks.mws
for_each = toset(var.databricks_workspace_ids)
workspace_id = each.key
metastore_id = databricks_metastore.this.id
default_catalog_name = "hive_metastore"
}
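If other Terraform configurations need to reference this metastore, you can expose its ID as an output; a minimal sketch (the output name is arbitrary):
output "metastore_id" {
  description = "ID of the Unity Catalog metastore shared across workspaces"
  value       = databricks_metastore.this.id
}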
Unity Catalog introduces two new objects to access and work with external cloud storage:
- databricks_storage_credential represents an authentication method for accessing cloud storage (here, an IAM role for Amazon S3). Storage credentials are access-controlled to determine which users can use them.
- databricks_external_location combines a cloud storage path with a storage credential that authorizes access to that path.
First, we need to create the storage credential in Databricks before creating the IAM role in AWS. This is because the external ID of the Databricks storage credential is required in the IAM role trust policy.
data "aws_caller_identity" "current" {}
resource "databricks_storage_credential" "external" {
provider = databricks.workspace
name = "${local.prefix}-external-access"
aws_iam_role {
role_arn = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:role/${local.prefix}-uc-access" //cannot reference aws_iam_role directly, as it will create circular dependency
}
comment = "Managed by TF"
}
resource "databricks_grants" "external_creds" {
provider = databricks.workspace
storage_credential = databricks_storage_credential.external.id
grant {
principal = "Data Engineers"
privileges = ["CREATE_TABLE"]
}
}
Then we can create the required objects in AWS:
resource "aws_s3_bucket" "external" {
bucket = "${local.prefix}-external"
// destroy all objects with bucket destroy
force_destroy = true
tags = merge(local.tags, {
Name = "${local.prefix}-external"
})
}
resource "aws_s3_bucket_versioning" "external_versioning" {
bucket = aws_s3_bucket.external.id
versioning_configuration {
status = "Disabled"
}
}
resource "aws_s3_bucket_public_access_block" "external" {
bucket = aws_s3_bucket.external.id
ignore_public_acls = true
depends_on = [aws_s3_bucket.external]
}
data "aws_iam_policy_document" "passrole_for_uc" {
statement {
effect = "Allow"
actions = ["sts:AssumeRole"]
principals {
identifiers = [databricks_storage_credential.external.aws_iam_role[0].unity_catalog_iam_arn]
type = "AWS"
}
condition {
test = "StringEquals"
variable = "sts:ExternalId"
values = [databricks_storage_credential.external.aws_iam_role[0].external_id]
}
}
statement {
sid = "ExplicitSelfRoleAssumption"
effect = "Allow"
actions = ["sts:AssumeRole"]
principals {
type = "AWS"
identifiers = ["arn:aws:iam::${data.aws_caller_identity.current.account_id}:root"]
}
condition {
test = "ArnLike"
variable = "aws:PrincipalArn"
values = ["arn:aws:iam::${data.aws_caller_identity.current.account_id}:role/${local.prefix}-uc-access"]
}
}
}
resource "aws_iam_policy" "external_data_access" {
// Terraform's "jsonencode" function converts a
// Terraform expression's result to valid JSON syntax.
policy = jsonencode({
Version = "2012-10-17"
Id = "${aws_s3_bucket.external.id}-access"
Statement = [
{
"Action" : [
"s3:GetObject",
"s3:GetObjectVersion",
"s3:PutObject",
"s3:PutObjectAcl",
"s3:DeleteObject",
"s3:ListBucket",
"s3:GetBucketLocation"
],
"Resource" : [
aws_s3_bucket.external.arn,
"${aws_s3_bucket.external.arn}/*"
],
"Effect" : "Allow"
},
{
"Action" : [
"sts:AssumeRole"
],
"Resource" : [
"arn:aws:iam::${data.aws_caller_identity.current.account_id}:role/${local.prefix}-uc-access"
],
"Effect" : "Allow"
},
]
})
tags = merge(local.tags, {
Name = "${local.prefix}-unity-catalog external access IAM policy"
})
}
resource "aws_iam_role" "external_data_access" {
name = "${local.prefix}-uc-access"
assume_role_policy = data.aws_iam_policy_document.passrole_for_uc.json
managed_policy_arns = [aws_iam_policy.external_data_access.arn]
tags = merge(local.tags, {
Name = "${local.prefix}-unity-catalog external access IAM role"
})
}
Then we can create the databricks_external_location in Unity Catalog.
resource "databricks_external_location" "some" {
provider = databricks.workspace
name = "external"
url = "s3://${aws_s3_bucket.external.id}/some"
credential_name = databricks_storage_credential.external.id
comment = "Managed by TF"
}
resource "databricks_grants" "some" {
provider = databricks.workspace
external_location = databricks_external_location.some.id
grant {
principal = "Data Engineers"
privileges = ["CREATE_EXTERNAL_TABLE", "READ_FILES"]
}
}
Each metastore exposes a 3-level namespace (catalog.schema.table) by which data can be organized; for example, a table created in the schema below is addressed as sandbox.things.<table>.
resource "databricks_catalog" "sandbox" {
provider = databricks.workspace
storage_root = "s3://${aws_s3_bucket.external.id}/some"
name = "sandbox"
comment = "this catalog is managed by terraform"
properties = {
purpose = "testing"
}
depends_on = [databricks_metastore_assignment.default_metastore]
}
resource "databricks_grants" "sandbox" {
provider = databricks.workspace
catalog = databricks_catalog.sandbox.name
grant {
principal = "Data Scientists"
privileges = ["USE_CATALOG", "CREATE"]
}
grant {
principal = "Data Engineers"
privileges = ["USE_CATALOG"]
}
}
resource "databricks_schema" "things" {
provider = databricks.workspace
catalog_name = databricks_catalog.sandbox.id
name = "things"
comment = "this database is managed by terraform"
properties = {
kind = "various"
}
}
resource "databricks_grants" "things" {
provider = databricks.workspace
schema = databricks_schema.things.id
grant {
principal = "Data Engineers"
privileges = ["USE_SCHEMA"]
}
}
To ensure the integrity of ACLs, Unity Catalog data can be accessed only through compute resources configured with strong isolation guarantees and other security features. A Unity Catalog databricks_cluster has a ‘Security Mode’ set to either User Isolation or Single User.
data "databricks_spark_version" "latest" {
provider = databricks.workspace
}
data "databricks_node_type" "smallest" {
provider = databricks.workspace
local_disk = true
}
resource "databricks_cluster" "unity_shared" {
provider = databricks.workspace
cluster_name = "Shared clusters"
spark_version = data.databricks_spark_version.latest.id
node_type_id = data.databricks_node_type.smallest.id
autotermination_minutes = 60
enable_elastic_disk = false
num_workers = 2
aws_attributes {
availability = "SPOT"
}
data_security_mode = "USER_ISOLATION"
depends_on = [
    databricks_metastore_assignment.default_metastore
]
}
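If an entire group, rather than individual users, should be able to use the shared cluster, you can grant it permissions with databricks_permissions, mirroring the per-user pattern shown below. This is a minimal sketch; "Data Engineers" is an assumed pre-existing group, not one created by this guide:
resource "databricks_permissions" "unity_shared" {
  provider   = databricks.workspace
  cluster_id = databricks_cluster.unity_shared.cluster_id
  access_control {
    group_name       = "Data Engineers" // assumed group; replace with a group in your workspace
    permission_level = "CAN_RESTART"
  }
}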
To create Single User clusters, we need one cluster per user; Terraform's for_each meta-attribute will help us achieve this. First we use the databricks_group and databricks_user data sources to get the list of user names that belong to a group.
data "databricks_group" "dev" {
provider = databricks.workspace
display_name = "dev-clusters"
}
data "databricks_user" "dev" {
provider = databricks.workspace
for_each = data.databricks_group.dev.members
user_id = each.key
}
Once we have a specific list of user resources, we can proceed with creating single-user clusters and granting permissions with for_each = data.databricks_user.dev to ensure it's done for each user:
resource "databricks_cluster" "dev" {
for_each = data.databricks_user.dev
provider = databricks.workspace
cluster_name = "${each.value.display_name} unity cluster"
spark_version = data.databricks_spark_version.latest.id
node_type_id = data.databricks_node_type.smallest.id
autotermination_minutes = 10
enable_elastic_disk = false
num_workers = 2
aws_attributes {
availability = "SPOT"
}
data_security_mode = "SINGLE_USER"
single_user_name = each.value.user_name
depends_on = [
    databricks_metastore_assignment.default_metastore
]
}
resource "databricks_permissions" "dev_restart" {
for_each = data.databricks_user.dev
provider = databricks.workspace
cluster_id = databricks_cluster.dev[each.key].cluster_id
access_control {
user_name = each.value.user_name
permission_level = "CAN_RESTART"
}
}