A Dataplex task represents the work that you want Dataplex to do on a schedule. It encapsulates code, parameters, and the schedule.
To get more information about Task, see the Task API documentation and the official Dataplex how-to guides.
data "google_project" "project" {
}
resource "google_dataplex_lake" "example" {
name = "tf-test-lake%{random_suffix}"
location = "us-central1"
project = "my-project-name"
}
resource "google_dataplex_task" "example" {
task_id = "tf-test-task%{random_suffix}"
location = "us-central1"
lake = google_dataplex_lake.example.name
description = "Test Task Basic"
display_name = "task-basic"
labels = { "count": "3" }
trigger_spec {
type = "RECURRING"
disabled = false
max_retries = 3
start_time = "2023-10-02T15:01:23Z"
schedule = "1 * * * *"
}
execution_spec {
service_account = "${data.google_project.project.number}-compute@developer.gserviceaccount.com"
project = "my-project-name"
max_job_execution_lifetime = "100s"
kms_key = "234jn2kjn42k3n423"
}
spark {
python_script_file = "gs://dataproc-examples/pyspark/hello-world/hello-world.py"
}
project = "my-project-name"
}
# Example: Dataplex Spark task with an on-demand trigger

# VPC network
resource "google_compute_network" "default" {
  name                    = "tf-test-workstation-cluster%{random_suffix}"
  auto_create_subnetworks = true
}

data "google_project" "project" {
}

resource "google_dataplex_lake" "example_spark" {
  name     = "tf-test-lake%{random_suffix}"
  location = "us-central1"
  project  = "my-project-name"
}

resource "google_dataplex_task" "example_spark" {
  task_id  = "tf-test-task%{random_suffix}"
  location = "us-central1"
  lake     = google_dataplex_lake.example_spark.name

  trigger_spec {
    type = "ON_DEMAND"
  }

  description = "task-spark-terraform"

  execution_spec {
    service_account = "${data.google_project.project.number}-compute@developer.gserviceaccount.com"
    args = {
      TASK_ARGS = "--output_location,gs://spark-job/task-result, --output_format, json"
    }
  }

  spark {
    infrastructure_spec {
      batch {
        executors_count     = 2
        max_executors_count = 100
      }
      container_image {
        image           = "test-image"
        java_jars       = ["test-java-jars.jar"]
        python_packages = ["gs://bucket-name/my/path/to/lib.tar.gz"]
        properties      = { "name" : "wrench", "mass" : "1.3kg", "count" : "3" }
      }
      vpc_network {
        network_tags = ["test-network-tag"]
        sub_network  = google_compute_network.default.id
      }
    }

    file_uris    = ["gs://terraform-test/test.csv"]
    archive_uris = ["gs://terraform-test/test.csv"]
    sql_script   = "show databases"
  }

  project = "my-project-name"
}
# Example: Dataplex notebook task with a recurring trigger

# VPC network
resource "google_compute_network" "default" {
  name                    = "tf-test-workstation-cluster%{random_suffix}"
  auto_create_subnetworks = true
}

data "google_project" "project" {
}

resource "google_dataplex_lake" "example_notebook" {
  name     = "tf-test-lake%{random_suffix}"
  location = "us-central1"
  project  = "my-project-name"
}

resource "google_dataplex_task" "example_notebook" {
  task_id  = "tf-test-task%{random_suffix}"
  location = "us-central1"
  lake     = google_dataplex_lake.example_notebook.name

  trigger_spec {
    type     = "RECURRING"
    schedule = "1 * * * *"
  }

  execution_spec {
    service_account = "${data.google_project.project.number}-compute@developer.gserviceaccount.com"
    args = {
      TASK_ARGS = "--output_location,gs://spark-job-jars-anrajitha/task-result, --output_format, json"
    }
  }

  notebook {
    notebook = "gs://terraform-test/test-notebook.ipynb"

    infrastructure_spec {
      batch {
        executors_count     = 2
        max_executors_count = 100
      }
      container_image {
        image           = "test-image"
        java_jars       = ["test-java-jars.jar"]
        python_packages = ["gs://bucket-name/my/path/to/lib.tar.gz"]
        properties      = { "name" : "wrench", "mass" : "1.3kg", "count" : "3" }
      }
      vpc_network {
        network_tags = ["test-network-tag"]
        network      = google_compute_network.default.id
      }
    }

    file_uris    = ["gs://terraform-test/test.csv"]
    archive_uris = ["gs://terraform-test/test.csv"]
  }

  project = "my-project-name"
}
The following arguments are supported:
trigger_spec - (Required) Configuration for when and how often the task runs. Structure is documented below.
execution_spec - (Required) Configuration for how the task is executed, such as the service account and project. Structure is documented below.
The trigger_spec block supports:
type - (Required) Trigger type of the user-specified task. Possible values are: ON_DEMAND, RECURRING.
start_time - (Optional) The first run of the task will be after this time. If not specified, the task will run shortly after being submitted if ON_DEMAND, and based on the schedule if RECURRING.
disabled - (Optional) Prevent the task from executing. This does not cancel already running tasks. It is intended to temporarily disable RECURRING tasks.
max_retries - (Optional) Number of retry attempts before aborting. Set to zero to never attempt to retry a failed task.
schedule - (Optional) Cron schedule (https://en.wikipedia.org/wiki/Cron) for running tasks periodically. To explicitly set a timezone, prefix the cron expression with 'CRON_TZ=${IANA_TIME_ZONE}' or 'TZ=${IANA_TIME_ZONE}', where ${IANA_TIME_ZONE} is a valid string from the IANA time zone database. For example, CRON_TZ=America/New_York 1 * * * * or TZ=America/New_York 1 * * * *. This field is required for RECURRING tasks.
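As a minimal sketch, a recurring trigger pinned to a specific time zone inside a google_dataplex_task resource might look like the following (the cron expression is illustrative):

trigger_spec {
  type = "RECURRING"
  # The CRON_TZ prefix pins the cron expression to an IANA time zone.
  schedule = "CRON_TZ=America/New_York 15 * * * *"
}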
The execution_spec block supports:
args - (Optional) The arguments to pass to the task. The args can use placeholders of the format ${placeholder} as part of a key/value string. These will be interpolated before passing the args to the driver. Currently supported placeholders: ${taskId}, ${job_time}. To pass positional args, set the key as TASK_ARGS. The value should be a comma-separated string of all the positional arguments. To use a delimiter other than a comma, refer to https://cloud.google.com/sdk/gcloud/reference/topic/escaping. If other keys are present in the args, TASK_ARGS is passed as the last argument. An object containing a list of 'key': value pairs. Example: { 'name': 'wrench', 'mass': '1.3kg', 'count': '3' }.
service_account - (Required) Service account to use to execute a task. If not provided, the default Compute service account for the project is used.
project - (Optional) The project in which jobs are run. By default, the project containing the Lake is used. If a project is provided, the ExecutionSpec.service_account must belong to this project.
max_job_execution_lifetime - (Optional) The maximum duration after which the job execution is expired. A duration in seconds with up to nine fractional digits, ending with 's'. Example: '3.5s'.
kms_key - (Optional) The Cloud KMS key to use for encryption, of the form: projects/{project_number}/locations/{locationId}/keyRings/{key-ring-name}/cryptoKeys/{key-name}.
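A sketch of passing positional arguments through TASK_ARGS (the bucket path and flag names are illustrative). Note that a literal ${...} placeholder must be written as $${...} in Terraform so that Dataplex, rather than Terraform, interpolates it:

execution_spec {
  service_account = "${data.google_project.project.number}-compute@developer.gserviceaccount.com"
  args = {
    # Comma-separated positional args; $${taskId} escapes the literal
    # ${taskId} placeholder so it reaches Dataplex unmodified.
    TASK_ARGS = "--run_id,$${taskId},--output_location,gs://example-bucket/task-result"
  }
}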
description - (Optional) User-provided description of the task.
display_name - (Optional) User friendly display name.
labels - (Optional) User-defined labels for the task.
Note: This field is non-authoritative, and will only manage the labels present in your configuration. Please refer to the field effective_labels for all of the labels present on the resource.
spark - (Optional) Configuration for running a custom Spark task. Structure is documented below.
notebook - (Optional) Configuration for running a scheduled notebook task. Structure is documented below.
location - (Optional) The location in which the task will be created.
lake - (Optional) The lake in which the task will be created.
task_id - (Optional) The task ID of the task.
project - (Optional) The ID of the project in which the resource belongs. If it is not provided, the provider project is used.
The spark block supports:
file_uris - (Optional) Cloud Storage URIs of files to be placed in the working directory of each executor.
archive_uris - (Optional) Cloud Storage URIs of archives to be extracted into the working directory of each executor. Supported file types: .jar, .tar, .tar.gz, .tgz, and .zip.
infrastructure_spec - (Optional) Infrastructure specification for the execution. Structure is documented below.
main_jar_file_uri - (Optional) The Cloud Storage URI of the jar file that contains the main class. The execution args are passed in as a sequence of named process arguments (--key=value).
main_class - (Optional) The name of the driver's main class. The jar file that contains the class must be in the default CLASSPATH or specified in jar_file_uris. The execution args are passed in as a sequence of named process arguments (--key=value).
python_script_file - (Optional) The Cloud Storage URI of the main Python file to use as the driver. Must be a .py file. The execution args are passed in as a sequence of named process arguments (--key=value).
sql_script_file - (Optional) A reference to a query file. This can be the Cloud Storage URI of the query file or the path to a SqlScript Content. The execution args are used to declare a set of script variables (set key='value';).
sql_script - (Optional) The query text. The execution args are used to declare a set of script variables (set key='value';).
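A sketch of a JAR-based Spark driver (the bucket and file names are hypothetical); the driver fields (main_jar_file_uri, main_class, python_script_file, sql_script_file, sql_script) are alternatives, so typically only one of them is set per task:

spark {
  # Hypothetical jar containing the driver's main class.
  main_jar_file_uri = "gs://example-bucket/jobs/my-spark-job.jar"
  # Files staged into each executor's working directory.
  file_uris = ["gs://example-bucket/config/settings.conf"]
}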
The infrastructure_spec block supports:
batch - (Optional) Compute resources needed for a Task when using Dataproc Serverless. Structure is documented below.
container_image - (Optional) Container image runtime configuration. Structure is documented below.
vpc_network - (Optional) VPC network configuration. Structure is documented below.
The batch block supports:
executors_count - (Optional) Total number of job executors. Executor count should be between 2 and 100. [Default=2]
max_executors_count - (Optional) Max configurable executors. If maxExecutorsCount > executorsCount, then auto-scaling is enabled. Max executor count should be between 2 and 1000. [Default=1000]
The container_image block supports:
image - (Optional) Container image to use.
java_jars - (Optional) A list of Java JARs to add to the classpath. Valid input includes Cloud Storage URIs to Jar binaries, for example gs://bucket-name/my/path/to/file.jar.
python_packages - (Optional) A list of Python packages to be installed. Valid formats include Cloud Storage URI to a PIP installable library, for example gs://bucket-name/my/path/to/lib.tar.gz.
properties - (Optional) Override to common configuration of open source components installed on the Dataproc cluster. The properties to set on daemon config files. Property keys are specified in prefix:property format, for example core:hadoop.tmp.dir. For more information, see Cluster properties.
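A container_image sketch using the prefix:property convention (the image, jar, and property values are illustrative):

container_image {
  image     = "us-docker.pkg.dev/example-project/images/custom-spark:latest"
  java_jars = ["gs://example-bucket/libs/helper.jar"]
  properties = {
    # prefix:property format; here a Spark property under the "spark" prefix.
    "spark:spark.executor.memory" = "4g"
  }
}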
The vpc_network block supports:
network_tags - (Optional) List of network tags to apply to the job.
network - (Optional) The Cloud VPC network in which the job is run. By default, the Cloud VPC network named Default within the project is used.
sub_network - (Optional) The Cloud VPC sub-network in which the job is run.
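A vpc_network sketch that attaches the job to the network defined in the examples above (the network tag is illustrative); configurations generally set either network or sub_network, not both:

vpc_network {
  network_tags = ["dataplex-task"]
  network      = google_compute_network.default.id
}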
The notebook block supports:
notebook - (Required) Path to input notebook. This can be the Cloud Storage URI of the notebook file or the path to a Notebook Content. The execution args are accessible as environment variables (TASK_key=value).
infrastructure_spec - (Optional) Infrastructure specification for the execution. Structure is documented below.
file_uris - (Optional) Cloud Storage URIs of files to be placed in the working directory of each executor.
archive_uris - (Optional) Cloud Storage URIs of archives to be extracted into the working directory of each executor. Supported file types: .jar, .tar, .tar.gz, .tgz, and .zip.
The infrastructure_spec block supports:
batch - (Optional) Compute resources needed for a Task when using Dataproc Serverless. Structure is documented below.
container_image - (Optional) Container image runtime configuration. Structure is documented below.
vpc_network - (Optional) VPC network configuration. Structure is documented below.
The batch block supports:
executors_count - (Optional) Total number of job executors. Executor count should be between 2 and 100. [Default=2]
max_executors_count - (Optional) Max configurable executors. If maxExecutorsCount > executorsCount, then auto-scaling is enabled. Max executor count should be between 2 and 1000. [Default=1000]
The container_image block supports:
image - (Optional) Container image to use.
java_jars - (Optional) A list of Java JARs to add to the classpath. Valid input includes Cloud Storage URIs to Jar binaries, for example gs://bucket-name/my/path/to/file.jar.
python_packages - (Optional) A list of Python packages to be installed. Valid formats include Cloud Storage URI to a PIP installable library, for example gs://bucket-name/my/path/to/lib.tar.gz.
properties - (Optional) Override to common configuration of open source components installed on the Dataproc cluster. The properties to set on daemon config files. Property keys are specified in prefix:property format, for example core:hadoop.tmp.dir. For more information, see Cluster properties.
The vpc_network block supports:
network_tags - (Optional) List of network tags to apply to the job.
network - (Optional) The Cloud VPC network in which the job is run. By default, the Cloud VPC network named Default within the project is used.
sub_network - (Optional) The Cloud VPC sub-network in which the job is run.
In addition to the arguments listed above, the following computed attributes are exported:
id - An identifier for the resource with format projects/{{project}}/locations/{{location}}/lakes/{{lake}}/tasks/{{task_id}}.
name - The relative resource name of the task, of the form: projects/{project_number}/locations/{locationId}/lakes/{lakeId}/tasks/{name}.
uid - System generated globally unique ID for the task. This ID will be different if the task is deleted and re-created with the same name.
create_time - The time when the task was created.
update_time - The time when the task was last updated.
state - Current state of the task.
execution_status - Status of the most recent task executions. Structure is documented below.
terraform_labels - The combination of labels configured directly on the resource and default labels configured on the provider.
effective_labels - All of the labels (key/value pairs) present on the resource in GCP, including the labels configured through Terraform, other clients, and services.
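These computed attributes can be referenced elsewhere in a configuration; for example, minimal outputs exposing the identifier and state of the basic example above:

output "dataplex_task_id" {
  value = google_dataplex_task.example.id
}

output "dataplex_task_state" {
  value = google_dataplex_task.example.state
}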
The execution_status block contains:
update_time - (Output) Last update time of the status.
latest_job - (Output) The latest job execution. Structure is documented below.
The latest_job block contains:
name - (Output) The relative resource name of the job, of the form: projects/{project_number}/locations/{locationId}/lakes/{lakeId}/tasks/{taskId}/jobs/{jobId}.
uid - (Output) System generated globally unique ID for the job.
start_time - (Output) The time when the job was started.
end_time - (Output) The time when the job ended.
state - (Output) Execution state for the job.
retry_count - (Output) The number of times the job has been retried (excluding the initial attempt).
service - (Output) The underlying service running a job.
service_job - (Output) The full resource name for the job run under a particular service.
message - (Output) Additional information about the current state.
This resource provides the following Timeouts configuration options:
create - Default is 5 minutes.
update - Default is 5 minutes.
delete - Default is 5 minutes.
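If the defaults are too short for your environment, a timeouts block can be added inside the google_dataplex_task resource; a sketch with illustrative values:

timeouts {
  create = "10m"
  update = "10m"
  delete = "10m"
}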
Task can be imported using any of these accepted formats:
projects/{{project}}/locations/{{location}}/lakes/{{lake}}/tasks/{{task_id}}
{{project}}/{{location}}/{{lake}}/{{task_id}}
{{location}}/{{lake}}/{{task_id}}
In Terraform v1.5.0 and later, use an import block to import Task using one of the formats above. For example:
import {
  id = "projects/{{project}}/locations/{{location}}/lakes/{{lake}}/tasks/{{task_id}}"
  to = google_dataplex_task.default
}
When using the terraform import command, Task can be imported using one of the formats above. For example:
$ terraform import google_dataplex_task.default projects/{{project}}/locations/{{location}}/lakes/{{lake}}/tasks/{{task_id}}
$ terraform import google_dataplex_task.default {{project}}/{{location}}/{{lake}}/{{task_id}}
$ terraform import google_dataplex_task.default {{location}}/{{lake}}/{{task_id}}
This resource supports User Project Overrides.
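With a user project override, requests can be billed against a quota project you designate on the provider; a minimal provider sketch (the billing project name is hypothetical):

provider "google" {
  user_project_override = true
  billing_project       = "my-billing-project"
}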