Manages a Cloud Dataproc cluster resource within GCP.
resource "google_dataproc_cluster" "simplecluster" {
name = "simplecluster"
region = "us-central1"
}
resource "google_service_account" "default" {
account_id = "service-account-id"
display_name = "Service Account"
}
resource "google_dataproc_cluster" "mycluster" {
name = "mycluster"
region = "us-central1"
graceful_decommission_timeout = "120s"
labels = {
foo = "bar"
}
cluster_config {
staging_bucket = "dataproc-staging-bucket"
master_config {
num_instances = 1
machine_type = "e2-medium"
disk_config {
boot_disk_type = "pd-ssd"
boot_disk_size_gb = 30
}
}
worker_config {
num_instances = 2
machine_type = "e2-medium"
min_cpu_platform = "Intel Skylake"
disk_config {
boot_disk_size_gb = 30
num_local_ssds = 1
}
}
preemptible_worker_config {
num_instances = 0
}
# Override or set some custom properties
software_config {
image_version = "2.0.35-debian10"
override_properties = {
"dataproc:dataproc.allow.zero.workers" = "true"
}
}
gce_cluster_config {
tags = ["foo", "bar"]
# Google recommends custom service accounts that have cloud-platform scope and permissions granted via IAM Roles.
service_account = google_service_account.default.email
service_account_scopes = [
"cloud-platform"
]
}
# You can define multiple initialization_action blocks
initialization_action {
script = "gs://dataproc-initialization-actions/stackdriver/stackdriver.sh"
timeout_sec = 500
}
}
}
resource "google_dataproc_cluster" "accelerated_cluster" {
name = "my-cluster-with-gpu"
region = "us-central1"
cluster_config {
gce_cluster_config {
zone = "us-central1-a"
}
master_config {
accelerators {
accelerator_type = "nvidia-tesla-k80"
accelerator_count = "1"
}
}
}
}
name
- (Required) The name of the cluster, unique within the project and zone.
project
- (Optional) The ID of the project in which the cluster will exist. If it is not provided, the provider project is used.
region
- (Optional) The region in which the cluster and associated nodes will be created. Defaults to global.
labels
- (Optional) The map of labels (key/value pairs) configured on the resource through Terraform and to be applied to instances in the cluster.
Note: This field is non-authoritative, and will only manage the labels present in your configuration. Please refer to the field effective_labels for all of the labels present on the resource (see the output sketch after this argument list).
terraform_labels
- (Computed) The combination of labels configured directly on the resource and default labels configured on the provider.
effective_labels
- (Computed) The map of labels (key/value pairs) applied to instances in the cluster. GCP generates some itself, including goog-dataproc-cluster-name, which is the name of the cluster.
virtual_cluster_config
- (Optional) Allows you to configure a virtual Dataproc on GKE cluster.
Structure defined below.
cluster_config
- (Optional) Allows you to configure various aspects of the cluster.
Structure defined below.
graceful_decommission_timeout
- (Optional) Allows graceful decommissioning when you change the number of worker nodes directly through a terraform apply.
Does not affect autoscaling decommissioning from an autoscaling policy.
Graceful decommissioning allows removing nodes from the cluster without interrupting jobs in progress.
The timeout specifies how long to wait for jobs in progress to finish before forcefully removing nodes (and potentially interrupting jobs).
The default timeout is 0 (for forceful decommission), and the maximum allowed timeout is 1 day (see the JSON representation of Duration).
Only supported on Dataproc image versions 1.2 and higher.
For more context see the docs.
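As a quick illustration of the label fields above, a minimal output sketch (assuming the mycluster resource from the example at the top of this page) that surfaces the GCP-populated labels via the computed effective_labels attribute:
output "dataproc_effective_labels" {
  # Includes GCP-generated labels such as goog-dataproc-cluster-name
  value = google_dataproc_cluster.mycluster.effective_labels
}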
The virtual_cluster_config block supports:
virtual_cluster_config {
  auxiliary_services_config { ... }
  kubernetes_cluster_config { ... }
}
staging_bucket
- (Optional) The Cloud Storage staging bucket used to stage files, such as Hadoop jars, between client machines and the cluster.
Note: If you don't explicitly specify a staging_bucket then GCP will auto create / assign one for you. However, you are not guaranteed an auto generated bucket which is solely dedicated to your cluster; it may be shared with other clusters in the same region/zone also choosing to use the auto generation option.
auxiliary_services_config
(Optional) Configuration of auxiliary services used by this cluster.
Structure defined below.
kubernetes_cluster_config
(Required) The configuration for running the Dataproc cluster on Kubernetes.
Structure defined below.
The auxiliary_services_config block supports:
virtual_cluster_config {
  auxiliary_services_config {
    metastore_config {
      dataproc_metastore_service = google_dataproc_metastore_service.metastore_service.id
    }
    spark_history_server_config {
      dataproc_cluster = google_dataproc_cluster.dataproc_cluster.id
    }
  }
}
metastore_config
(Optional) The Hive Metastore configuration for this workload.
dataproc_metastore_service
(Required) Resource name of an existing Dataproc Metastore service.
spark_history_server_config
(Optional) The Spark History Server configuration for the workload.
dataproc_cluster
(Optional) Resource name of an existing Dataproc Cluster to act as a Spark History Server for the workload.
The kubernetes_cluster_config block supports:
virtual_cluster_config {
  kubernetes_cluster_config {
    kubernetes_namespace = "foobar"

    kubernetes_software_config {
      component_version = {
        "SPARK" : "3.1-dataproc-7"
      }
      properties = {
        "spark:spark.eventLog.enabled" : "true"
      }
    }

    gke_cluster_config {
      gke_cluster_target = google_container_cluster.primary.id

      node_pool_target {
        node_pool = "dpgke"
        roles     = ["DEFAULT"]

        node_pool_config {
          autoscaling {
            min_node_count = 1
            max_node_count = 6
          }

          config {
            machine_type     = "n1-standard-4"
            preemptible      = true
            local_ssd_count  = 1
            min_cpu_platform = "Intel Sandy Bridge"
          }

          locations = ["us-central1-c"]
        }
      }
    }
  }
}
kubernetes_namespace
(Optional) A namespace within the Kubernetes cluster to deploy into.
If this namespace does not exist, it is created.
If it exists, Dataproc verifies that another Dataproc VirtualCluster is not installed into it.
If not specified, the name of the Dataproc Cluster is used.
kubernetes_software_config
(Required) The software configuration for this Dataproc cluster running on Kubernetes.
component_version
(Required) The components that should be installed in this Dataproc cluster. The key must be a string from the KubernetesComponent enumeration. Note: component_version[SPARK] is mandatory to set, or the creation of the cluster will fail.
properties
(Optional) The properties to set on daemon config files. Property keys are specified in prefix:property format,
for example spark:spark.kubernetes.container.image.
gke_cluster_config
(Required) The configuration for running the Dataproc cluster on GKE.
gke_cluster_target
(Optional) A target GKE cluster to deploy to. It must be in the same project and region as the Dataproc cluster (the GKE cluster can be zonal or regional).
node_pool_target
(Optional) GKE node pools where workloads will be scheduled. At least one node pool must be assigned the DEFAULT
GkeNodePoolTarget.Role. If a GkeNodePoolTarget is not specified, Dataproc constructs a DEFAULT
GkeNodePoolTarget.
Each role can be given to only one GkeNodePoolTarget. All node pools must have the same location settings.
node_pool
(Required) The target GKE node pool.
roles
(Required) The roles associated with the GKE node pool.
One of "DEFAULT", "CONTROLLER", "SPARK_DRIVER" or "SPARK_EXECUTOR".
node_pool_config
(Input only) The configuration for the GKE node pool.
If specified, Dataproc attempts to create a node pool with the specified shape.
If one with the same name already exists, it is verified against all specified fields.
If a field differs, the virtual cluster creation will fail.
autoscaling
(Optional) The autoscaler configuration for this node pool.
The autoscaler is enabled only when a valid configuration is present.
min_node_count
(Optional) The minimum number of nodes in the node pool. Must be >= 0 and <= maxNodeCount.
max_node_count
(Optional) The maximum number of nodes in the node pool. Must be >= minNodeCount, and must be > 0.
config
(Optional) The node pool configuration.
machine_type
(Optional) The name of a Compute Engine machine type.
local_ssd_count
(Optional) The number of local SSD disks to attach to the node,
which is limited by the maximum number of disks allowable per zone.
preemptible
(Optional) Whether the nodes are created as preemptible VM instances.
Preemptible nodes cannot be used in a node pool with the CONTROLLER role or in the DEFAULT node pool if the
CONTROLLER role is not assigned (the DEFAULT node pool will assume the CONTROLLER role).
min_cpu_platform
(Optional) Minimum CPU platform to be used by this instance.
The instance may be scheduled on the specified or a newer CPU platform.
Specify the friendly names of CPU platforms, such as "Intel Haswell" or "Intel Sandy Bridge".
spot
(Optional) Spot flag for enabling Spot VM, which is a rebrand of the existing preemptible flag.
locations
(Optional) The list of Compute Engine zones where node pool nodes associated
with a Dataproc on GKE virtual cluster will be located.
The cluster_config block supports:
cluster_config {
  gce_cluster_config { ... }
  master_config { ... }
  worker_config { ... }
  preemptible_worker_config { ... }
  software_config { ... }
  # You can define multiple initialization_action blocks
  initialization_action { ... }
  encryption_config { ... }
  endpoint_config { ... }
  metastore_config { ... }
}
staging_bucket
- (Optional) The Cloud Storage staging bucket used to stage files, such as Hadoop jars, between client machines and the cluster.
Note: If you don't explicitly specify a staging_bucket then GCP will auto create / assign one for you. However, you are not guaranteed an auto generated bucket which is solely dedicated to your cluster; it may be shared with other clusters in the same region/zone also choosing to use the auto generation option. A sketch of wiring dedicated buckets appears after this argument list.
temp_bucket
- (Optional) The Cloud Storage temp bucket used to store ephemeral cluster and jobs data, such as Spark and MapReduce history files.
Note: If you don't explicitly specify a temp_bucket then GCP will auto create / assign one for you.
gce_cluster_config
(Optional) Common config settings for resources of Google Compute Engine cluster
instances, applicable to all instances in the cluster. Structure defined below.
master_config
(Optional) The Google Compute Engine config settings for the master instances
in a cluster. Structure defined below.
worker_config
(Optional) The Google Compute Engine config settings for the worker instances
in a cluster. Structure defined below.
preemptible_worker_config
(Optional) The Google Compute Engine config settings for the additional
instances in a cluster. Structure defined below.
Note: preemptible_worker_config is an alias for the API's secondaryWorkerConfig. The name doesn't necessarily mean it is preemptible and is named as such for legacy/compatibility reasons.
software_config
(Optional) The config settings for software inside the cluster.
Structure defined below.
security_config
(Optional) Security related configuration. Structure defined below.
autoscaling_config
(Optional) The autoscaling policy config associated with the cluster.
Note that once set, if autoscaling_config is the only field set in cluster_config, it can only be removed by setting policy_uri = "", rather than removing the whole block (see the detach sketch after this argument list).
Structure defined below.
initialization_action
(Optional) Commands to execute on each node after config is completed.
You can specify multiple versions of these. Structure defined below.
encryption_config
(Optional) The Customer managed encryption keys settings for the cluster.
Structure defined below.
lifecycle_config
(Optional) The settings for auto deletion cluster schedule.
Structure defined below.
endpoint_config
(Optional) The config settings for port access on the cluster.
Structure defined below.
dataproc_metric_config
(Optional) The Dataproc OSS metric collection configuration for the cluster.
Structure defined below.
auxiliary_node_groups
(Optional) A Dataproc NodeGroup resource is a group of Dataproc cluster nodes that execute an assigned role.
Structure defined below.
metastore_config
(Optional) The config setting for metastore service with the cluster.
Structure defined below.
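If you would rather control the staging and temp buckets than rely on auto-created ones, a minimal sketch (the bucket names and location here are hypothetical) is to create the buckets yourself and pass their names:
resource "google_storage_bucket" "dataproc_staging" {
  name     = "my-dataproc-staging-bucket" # hypothetical name
  location = "US"
}

resource "google_storage_bucket" "dataproc_temp" {
  name     = "my-dataproc-temp-bucket" # hypothetical name
  location = "US"
}

resource "google_dataproc_cluster" "bucketed" {
  name   = "bucketed-cluster"
  region = "us-central1"
  cluster_config {
    staging_bucket = google_storage_bucket.dataproc_staging.name
    temp_bucket    = google_storage_bucket.dataproc_temp.name
  }
}
And, as referenced under autoscaling_config above, a hedged sketch of detaching a previously attached policy by clearing policy_uri rather than removing the block:
cluster_config {
  autoscaling_config {
    # An empty policy_uri detaches the policy; deleting the block
    # alone would not clear it when it is the only field set.
    policy_uri = ""
  }
}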
The cluster_config.gce_cluster_config block supports:
cluster_config {
  gce_cluster_config {
    zone = "us-central1-a"

    # One of the below to hook into a custom network / subnetwork
    network    = google_compute_network.dataproc_network.name
    subnetwork = google_compute_subnetwork.dataproc_subnetwork.name

    tags = ["foo", "bar"]
  }
}
zone
- (Optional, Computed) The GCP zone where your data is stored and used (i.e. where the master and the worker nodes will be created). If region is set to 'global' (default), then zone is mandatory; otherwise GCP is able to make use of Auto Zone Placement to determine this automatically for you.
Note: This setting additionally determines and restricts which computing resources are available for use with other configs such as cluster_config.master_config.machine_type and cluster_config.worker_config.machine_type.
network
- (Optional, Computed) The name or self_link of the Google Compute Engine network the cluster will be part of. Conflicts with subnetwork. If neither is specified, this defaults to the "default" network.
subnetwork
- (Optional) The name or self_link of the Google Compute Engine subnetwork the cluster will be part of. Conflicts with network.
service_account
- (Optional) The service account to be used by the Node VMs.
If not specified, the "default" service account is used.
service_account_scopes
- (Optional, Computed) The set of Google API scopes to be made available on all of the node VMs under the service_account specified. Both OAuth2 URLs and gcloud short names are supported. To allow full access to all Cloud APIs, use the cloud-platform scope. See a complete list of scopes here.
tags
- (Optional) The list of instance tags applied to instances in the cluster.
Tags are used to identify valid sources or targets for network firewalls.
internal_ip_only
- (Optional) By default, clusters are not restricted to internal IP addresses, and will have ephemeral external IP addresses assigned to each instance. If set to true, all instances in the cluster will only have internal IP addresses. Note: Private Google Access (also known as privateIpGoogleAccess) must be enabled on the subnetwork that the cluster will be launched in; see the sketch at the end of this section.
metadata
- (Optional) A map of the Compute Engine metadata entries to add to all instances
(see Project and instance metadata).
reservation_affinity
- (Optional) Reservation Affinity for consuming zonal reservation.
consume_reservation_type
- (Optional) Corresponds to the type of reservation consumption.
key
- (Optional) Corresponds to the label key of reservation resource.
values
- (Optional) Corresponds to the label values of reservation resource.
node_group_affinity
- (Optional) Node Group Affinity for sole-tenant clusters.
node_group_uri
- (Required) The URI of a sole-tenant node group resource that the cluster will be created on.
shielded_instance_config
(Optional) Shielded Instance Config for clusters using Compute Engine Shielded VMs.
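Before the Shielded VM settings, a hedged sketch of the reservation_affinity block described above; the consumption type and reservation key follow Compute Engine's reservation-affinity conventions, and the reservation name is hypothetical:
cluster_config {
  gce_cluster_config {
    reservation_affinity {
      consume_reservation_type = "SPECIFIC_RESERVATION" # assumed accepted value
      key                      = "compute.googleapis.com/reservation-name"
      values                   = ["my-reservation"] # hypothetical reservation
    }
  }
}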
The cluster_config.gce_cluster_config.shielded_instance_config block supports:
cluster_config {
  gce_cluster_config {
    shielded_instance_config {
      enable_secure_boot          = true
      enable_vtpm                 = true
      enable_integrity_monitoring = true
    }
  }
}
enable_secure_boot
- (Optional) Defines whether instances have Secure Boot enabled.
enable_vtpm
- (Optional) Defines whether instances have the vTPM enabled.
enable_integrity_monitoring
- (Optional) Defines whether instances have integrity monitoring enabled.
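As referenced under internal_ip_only above, a minimal sketch of an internal-IP-only cluster; the network resources are hypothetical, and the subnetwork enables Private Google Access as required:
resource "google_compute_subnetwork" "private" {
  name                     = "dataproc-private" # hypothetical name
  ip_cidr_range            = "10.0.0.0/16"
  region                   = "us-central1"
  network                  = google_compute_network.dataproc_network.id
  private_ip_google_access = true # required for internal_ip_only clusters
}

resource "google_dataproc_cluster" "internal_only" {
  name   = "internal-only-cluster"
  region = "us-central1"
  cluster_config {
    gce_cluster_config {
      subnetwork       = google_compute_subnetwork.private.name
      internal_ip_only = true
    }
  }
}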
The cluster_config.master_config block supports:
cluster_config {
  master_config {
    num_instances    = 1
    machine_type     = "e2-medium"
    min_cpu_platform = "Intel Skylake"
    disk_config {
      boot_disk_type    = "pd-ssd"
      boot_disk_size_gb = 30
      num_local_ssds    = 1
    }
  }
}
num_instances
- (Optional, Computed) Specifies the number of master nodes to create.
If not specified, GCP will default to a predetermined computed value (currently 1).
machine_type
- (Optional, Computed) The name of a Google Compute Engine machine type to create for the master. If not specified, GCP will default to a predetermined computed value (currently n1-standard-4).
min_cpu_platform
- (Optional, Computed) The name of a minimum generation of CPU family
for the master. If not specified, GCP will default to a predetermined computed value
for each zone. See the guide
for details about which CPU families are available (and defaulted) for each zone.
image_uri
(Optional) The URI for the image to use for the master instances. See the guide for more information.
disk_config
(Optional) Disk Config
boot_disk_type
- (Optional) The disk type of the primary disk attached to each node. One of "pd-ssd" or "pd-standard". Defaults to "pd-standard".
boot_disk_size_gb
- (Optional, Computed) Size of the primary disk attached to each node, specified
in GB. The primary disk contains the boot volume and system libraries, and the
smallest allowed disk size is 10GB. GCP will default to a predetermined
computed value if not set (currently 500GB). Note: If SSDs are not
attached, it also contains the HDFS data blocks and Hadoop working directories.
num_local_ssds
- (Optional) The number of local SSD disks that will be attached to each master cluster node. Defaults to 0.
accelerators
(Optional) The Compute Engine accelerator (GPU) configuration for these instances. Can be specified multiple times.
accelerator_type
- (Required) The short name of the accelerator type to expose to this instance. For example, nvidia-tesla-k80.
accelerator_count
- (Required) The number of the accelerator cards of this type exposed to this instance. Often restricted to one of 1, 2, 4, or 8.
The cluster_config.worker_config block supports:
cluster_config {
  worker_config {
    num_instances     = 3
    machine_type      = "e2-medium"
    min_cpu_platform  = "Intel Skylake"
    min_num_instances = 2
    disk_config {
      boot_disk_type    = "pd-standard"
      boot_disk_size_gb = 30
      num_local_ssds    = 1
    }
  }
}
num_instances
- (Optional, Computed) Specifies the number of worker nodes to create. If not specified, GCP will default to a predetermined computed value (currently 2).
There is currently a beta feature which allows you to run a Single Node Cluster. In order to take advantage of this you need to set
"dataproc:dataproc.allow.zero.workers" = "true"
in cluster_config.software_config.properties (see the single-node sketch after this argument list).
machine_type
- (Optional, Computed) The name of a Google Compute Engine machine type to create for the worker nodes. If not specified, GCP will default to a predetermined computed value (currently n1-standard-4).
min_cpu_platform
- (Optional, Computed) The name of a minimum generation of CPU family for the worker nodes. If not specified, GCP will default to a predetermined computed value for each zone. See the guide for details about which CPU families are available (and defaulted) for each zone.
disk_config
(Optional) Disk Config
boot_disk_type
- (Optional) The disk type of the primary disk attached to each node. One of "pd-ssd" or "pd-standard". Defaults to "pd-standard".
boot_disk_size_gb
- (Optional, Computed) Size of the primary disk attached to each worker node, specified
in GB. The smallest allowed disk size is 10GB. GCP will default to a predetermined
computed value if not set (currently 500GB). Note: If SSDs are not
attached, it also contains the HDFS data blocks and Hadoop working directories.
num_local_ssds
- (Optional) The number of local SSD disks that will be attached to each worker cluster node. Defaults to 0.
image_uri
(Optional) The URI for the image to use for this worker. See the guide
for more information.
min_num_instances
(Optional) The minimum number of primary worker instances to create. If min_num_instances is set, cluster creation will succeed if the number of primary workers created is at least equal to the min_num_instances number.
accelerators
(Optional) The Compute Engine accelerator configuration for these instances. Can be specified multiple times.
accelerator_type
- (Required) The short name of the accelerator type to expose to this instance. For example, nvidia-tesla-k80.
accelerator_count
- (Required) The number of the accelerator cards of this type exposed to this instance. Often restricted to one of 1, 2, 4, or 8.
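As referenced under num_instances above, a minimal single-node cluster sketch using the override_properties mechanism shown elsewhere on this page:
resource "google_dataproc_cluster" "single_node" {
  name   = "single-node-cluster" # hypothetical name
  region = "us-central1"

  cluster_config {
    software_config {
      override_properties = {
        # With this property set, no worker nodes are created and all
        # processing runs on the single master node.
        "dataproc:dataproc.allow.zero.workers" = "true"
      }
    }
  }
}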
The cluster_config.preemptible_worker_config block supports:
cluster_config {
  preemptible_worker_config {
    num_instances = 1
    disk_config {
      boot_disk_type    = "pd-standard"
      boot_disk_size_gb = 30
      num_local_ssds    = 1
    }
    instance_flexibility_policy {
      instance_selection_list {
        machine_types = ["n2-standard-2", "n1-standard-2"]
        rank          = 1
      }
      instance_selection_list {
        machine_types = ["n2d-standard-2"]
        rank          = 3
      }
    }
  }
}
Note: Unlike worker_config, you cannot set the machine_type value directly. This will be set for you based on whatever was set for the worker_config.machine_type value.
num_instances
- (Optional) Specifies the number of preemptible nodes to create.
Defaults to 0.
preemptibility
- (Optional) Specifies the preemptibility of the secondary workers. The default value is PREEMPTIBLE.
Accepted values are PREEMPTIBLE, NON_PREEMPTIBLE, and SPOT (see the sketch after this argument list).
disk_config
(Optional) Disk Config
boot_disk_type
- (Optional) The disk type of the primary disk attached to each preemptible worker node. One of "pd-ssd" or "pd-standard". Defaults to "pd-standard".
boot_disk_size_gb
- (Optional, Computed) Size of the primary disk attached to each preemptible worker node, specified
in GB. The smallest allowed disk size is 10GB. GCP will default to a predetermined
computed value if not set (currently 500GB). Note: If SSDs are not
attached, it also contains the HDFS data blocks and Hadoop working directories.
num_local_ssds
- (Optional) The number of local SSD disks that will be attached to each preemptible worker node. Defaults to 0.
instance_flexibility_policy
(Optional) Instance flexibility Policy allowing a mixture of VM shapes and provisioning models.
instance_selection_list
- (Optional) List of instance selection options that the group will use when creating new VMs.
machine_types
- (Optional) Full machine-type names, e.g. "n1-standard-16".
rank
- (Optional) Preference of this instance selection. A lower number means higher preference. Dataproc will first try to create a VM based on the machine type with the highest-priority rank and fall back to the next rank based on availability. Machine types and instance selections with the same priority have the same preference.
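As referenced under preemptibility above, a minimal sketch of Spot secondary workers (the instance count is illustrative):
cluster_config {
  preemptible_worker_config {
    # SPOT requests Spot VMs rather than legacy preemptible VMs
    preemptibility = "SPOT"
    num_instances  = 2
  }
}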
The cluster_config.software_config block supports:
cluster_config {
  # Override or set some custom properties
  software_config {
    image_version = "2.0.35-debian10"
    override_properties = {
      "dataproc:dataproc.allow.zero.workers" = "true"
    }
  }
}
image_version
- (Optional, Computed) The Cloud Dataproc image version to use for the cluster - this controls the sets of software versions installed onto the nodes when you create clusters. If not specified, defaults to the latest version. For a list of valid versions see Cloud Dataproc versions.
override_properties
- (Optional) A map of override and additional properties (key/value pairs) used to modify various aspects of the common configuration files used when creating a cluster. For a list of valid properties please see Cluster properties.
optional_components
- (Optional) The set of optional components to activate on the cluster. See Available Optional Components, and the sketch below.
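For instance, a hedged sketch enabling a couple of optional components; the component names are illustrative, and which values are valid depends on the image version:
cluster_config {
  software_config {
    optional_components = ["JUPYTER", "ZOOKEEPER"] # example component names
  }
}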
The cluster_config.security_config block supports:
cluster_config {
  security_config {
    kerberos_config {
      kms_key_uri                 = "projects/projectId/locations/locationId/keyRings/keyRingId/cryptoKeys/keyId"
      root_principal_password_uri = "bucketId/o/objectId"
    }
  }
}
kerberos_config
(Required) Kerberos Configuration
cross_realm_trust_admin_server
- (Optional) The admin server (IP or hostname) for the
remote trusted realm in a cross realm trust relationship.
cross_realm_trust_kdc
- (Optional) The KDC (IP or hostname) for the
remote trusted realm in a cross realm trust relationship.
cross_realm_trust_realm
- (Optional) The remote realm the Dataproc on-cluster KDC will
trust, should the user enable cross realm trust.
cross_realm_trust_shared_password_uri
- (Optional) The Cloud Storage URI of a KMS
encrypted file containing the shared password between the on-cluster Kerberos realm
and the remote trusted realm, in a cross realm trust relationship.
enable_kerberos
- (Optional) Flag to indicate whether to Kerberize the cluster.
kdc_db_key_uri
- (Optional) The Cloud Storage URI of a KMS encrypted file containing
the master key of the KDC database.
key_password_uri
- (Optional) The Cloud Storage URI of a KMS encrypted file containing
the password to the user provided key. For the self-signed certificate, this password
is generated by Dataproc.
keystore_uri
- (Optional) The Cloud Storage URI of the keystore file used for SSL encryption.
If not provided, Dataproc will provide a self-signed certificate.
keystore_password_uri
- (Optional) The Cloud Storage URI of a KMS encrypted file containing
the password to the user provided keystore. For the self-signed certificate, the password
is generated by Dataproc.
kms_key_uri
- (Required) The URI of the KMS key used to encrypt various sensitive files.
realm
- (Optional) The name of the on-cluster Kerberos realm. If not specified, the
uppercased domain of hostnames will be the realm.
root_principal_password_uri
- (Required) The Cloud Storage URI of a KMS encrypted file
containing the root principal password.
tgt_lifetime_hours
- (Optional) The lifetime of the ticket granting ticket, in hours.
truststore_password_uri
- (Optional) The Cloud Storage URI of a KMS encrypted file
containing the password to the user provided truststore. For the self-signed
certificate, this password is generated by Dataproc.
truststore_uri
- (Optional) The Cloud Storage URI of the truststore file used for
SSL encryption. If not provided, Dataproc will provide a self-signed certificate.
The cluster_config.autoscaling_config block supports:
cluster_config {
  autoscaling_config {
    policy_uri = "projects/projectId/locations/region/autoscalingPolicies/policyId"
  }
}
policy_uri
- (Required) The autoscaling policy used by the cluster. Only resource names including project ID and location (region) are valid. Examples:
https://www.googleapis.com/compute/v1/projects/[projectId]/locations/[dataproc_region]/autoscalingPolicies/[policy_id]
projects/[projectId]/locations/[dataproc_region]/autoscalingPolicies/[policy_id]
Note that the policy must be in the same project and Cloud Dataproc region.
The initialization_action block (Optional) can be specified multiple times and supports:
cluster_config {
  # You can define multiple initialization_action blocks
  initialization_action {
    script      = "gs://dataproc-initialization-actions/stackdriver/stackdriver.sh"
    timeout_sec = 500
  }
}
script
- (Required) The script to be executed during initialization of the cluster. The script must be a GCS file with a gs:// prefix.
timeout_sec
- (Optional, Computed) The maximum duration (in seconds) which script is allowed to take to execute its action. GCP will default to a predetermined computed value if not set (currently 300).
The encryption_config block supports:
cluster_config {
  encryption_config {
    kms_key_name = "projects/projectId/locations/region/keyRings/keyRingName/cryptoKeys/keyName"
  }
}
kms_key_name
- (Required) The Cloud KMS key name to use for PD disk encryption for all instances in the cluster.
The dataproc_metric_config block supports:
dataproc_metric_config {
  metrics {
    metric_source    = "HDFS"
    metric_overrides = ["yarn:ResourceManager:QueueMetrics:AppsCompleted"]
  }
}
metrics
- (Required) Metrics sources to enable.
metric_source
- (Required) A source for the collection of Dataproc OSS metrics (see available OSS metrics).
metric_overrides
- (Optional) One or more available OSS metrics (https://cloud.google.com/dataproc/docs/guides/monitoring#available_oss_metrics) to collect for the metric source.
The auxiliary_node_groups block supports:
auxiliary_node_groups {
  node_group {
    roles = ["DRIVER"]
    node_group_config {
      num_instances    = 2
      machine_type     = "n1-standard-2"
      min_cpu_platform = "AMD Rome"
      disk_config {
        boot_disk_size_gb = 35
        boot_disk_type    = "pd-standard"
        num_local_ssds    = 1
      }
      accelerators {
        accelerator_count = 1
        accelerator_type  = "nvidia-tesla-t4"
      }
    }
  }
}
node_group
- (Required) Node group configuration.
roles
- (Required) Node group roles.
One of "DRIVER"
.
name
- (Optional) The Node group resource name.
node_group_config
- (Optional) The node group instance group configuration.
num_instances
- (Optional, Computed) Specifies the number of nodes to create in the node group. Must be greater than 0; a node group must have at least 1 instance.
machine_type
- (Optional, Computed) The name of a Google Compute Engine machine type to create for the node group. If not specified, GCP will default to a predetermined computed value (currently n1-standard-4).
min_cpu_platform
- (Optional, Computed) The name of a minimum generation of CPU family
for the node group. If not specified, GCP will default to a predetermined computed value
for each zone. See the guide
for details about which CPU families are available (and defaulted) for each zone.
disk_config
(Optional) Disk Config
boot_disk_type
- (Optional) The disk type of the primary disk attached to each node. One of "pd-ssd" or "pd-standard". Defaults to "pd-standard".
boot_disk_size_gb
- (Optional, Computed) Size of the primary disk attached to each node, specified
in GB. The primary disk contains the boot volume and system libraries, and the
smallest allowed disk size is 10GB. GCP will default to a predetermined
computed value if not set (currently 500GB). Note: If SSDs are not
attached, it also contains the HDFS data blocks and Hadoop working directories.
num_local_ssds
- (Optional) The number of local SSD disks that will be attached to each node in the node group. Defaults to 0.
accelerators
(Optional) The Compute Engine accelerator (GPU) configuration for these instances. Can be specified
multiple times.
accelerator_type
- (Required) The short name of the accelerator type to expose to this instance. For example, nvidia-tesla-k80.
accelerator_count
- (Required) The number of the accelerator cards of this type exposed to this instance. Often restricted to one of 1, 2, 4, or 8.
The lifecycle_config block supports:
cluster_config {
  lifecycle_config {
    idle_delete_ttl  = "10m"
    auto_delete_time = "2120-01-01T12:00:00.01Z"
  }
}
idle_delete_ttl
- (Optional) The duration to keep the cluster alive while idling
(no jobs running). After this TTL, the cluster will be deleted. Valid range: [10m, 14d].
auto_delete_time
- (Optional) The time when cluster will be auto-deleted.
A timestamp in RFC3339 UTC "Zulu" format, accurate to nanoseconds.
Example: "2014-10-02T15:01:23.045123456Z".
The endpoint_config block (Optional, Computed, Beta) supports:
cluster_config {
  endpoint_config {
    enable_http_port_access = true
  }
}
enable_http_port_access
- (Optional) The flag to enable http access to specific ports on the cluster from external sources (aka Component Gateway). Defaults to false.
The metastore_config block (Optional, Computed, Beta) supports:
cluster_config {
  metastore_config {
    dataproc_metastore_service = "projects/projectId/locations/region/services/serviceName"
  }
}
dataproc_metastore_service
- (Required) Resource name of an existing Dataproc Metastore service. Only resource names including project ID and location (region) are valid. Example:
projects/[projectId]/locations/[dataproc_region]/services/[service-name]
In addition to the arguments listed above, the following computed attributes are exported:
cluster_config.0.master_config.0.instance_names
- List of master instance names which
have been assigned to the cluster.
cluster_config.0.worker_config.0.instance_names
- List of worker instance names which have been assigned
to the cluster.
cluster_config.0.preemptible_worker_config.0.instance_names
- List of preemptible instance names which have been assigned
to the cluster.
cluster_config.0.bucket
- The name of the Cloud Storage bucket ultimately used to house the staging data for the cluster. If staging_bucket is specified, it will contain this value, otherwise it will be the auto generated name.
cluster_config.0.software_config.0.properties
- A list of the properties used to set the daemon config files. This will include any values supplied by the user via cluster_config.software_config.override_properties.
cluster_config.0.lifecycle_config.0.idle_start_time
- Time when the cluster became idle
(most recent job finished) and became eligible for deletion due to idleness.
cluster_config.0.endpoint_config.0.http_ports
- The map of port descriptions to URLs. Will only be populated if enable_http_port_access is true.
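As an illustration, hedged output sketches reading a few of these computed attributes (assuming the mycluster resource from the example at the top of this page):
# The bucket actually used to house staging data
output "staging_bucket_used" {
  value = google_dataproc_cluster.mycluster.cluster_config[0].bucket
}

# Component Gateway URLs; only populated when enable_http_port_access is true
output "http_ports" {
  value = google_dataproc_cluster.mycluster.cluster_config[0].endpoint_config[0].http_ports
}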
This resource does not support import.
This resource provides the following Timeouts configuration options:
create - Default is 45 minutes.
update - Default is 45 minutes.
delete - Default is 45 minutes.
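A minimal sketch of overriding these defaults with a timeouts block (the 60m values are illustrative):
resource "google_dataproc_cluster" "slow_cluster" {
  name   = "slow-cluster" # hypothetical name
  region = "us-central1"

  timeouts {
    create = "60m"
    delete = "60m"
  }
}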