airflow.contrib.hooks.gcp_dataproc_hook

This module contains a Google Cloud Dataproc hook.

Module Contents

airflow.contrib.hooks.gcp_dataproc_hook.UUID_LENGTH = 9[source]
class airflow.contrib.hooks.gcp_dataproc_hook.DataprocJobStatus[source]

Helper class with Dataproc jobs statuses.

ERROR = ERROR[source]
CANCELLED = CANCELLED[source]
DONE = DONE[source]
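As a rough sketch, these constants can be used to detect when a job has reached a terminal state. The helper below is illustrative and not part of the hook; only the three status strings come from this module:

```python
# Mirror of DataprocJobStatus; the real class lives in
# airflow.contrib.hooks.gcp_dataproc_hook.
class DataprocJobStatus:
    ERROR = "ERROR"
    CANCELLED = "CANCELLED"
    DONE = "DONE"

# Terminal states of a Dataproc job, per the status.state field.
TERMINAL_STATES = {
    DataprocJobStatus.ERROR,
    DataprocJobStatus.CANCELLED,
    DataprocJobStatus.DONE,
}

def is_finished(job):
    """Return True when a Dataproc job dict has reached a terminal state."""
    return job.get("status", {}).get("state") in TERMINAL_STATES

print(is_finished({"status": {"state": "RUNNING"}}))  # False
print(is_finished({"status": {"state": "DONE"}}))     # True
```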
class airflow.contrib.hooks.gcp_dataproc_hook._DataProcJob(dataproc_api:Any, project_id:str, job:Dict, region:str='global', job_error_states:Iterable[str]=None, num_retries:int=None)[source]

Bases: airflow.utils.log.logging_mixin.LoggingMixin

wait_for_done(self)[source]

Waits for the Dataproc job to complete.

Returns

True if the job is done.

Return type

bool
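A simplified sketch of the polling loop behind wait_for_done, with a stubbed get_job callable standing in for the Dataproc API. The function name, the absence of sleeping between polls, and the return convention are assumptions for illustration, not the hook's actual internals:

```python
import itertools

def wait_for_done(get_job):
    """Poll get_job() until the job reaches a terminal state.

    Returns True if the job finished as DONE, False if it ended in
    ERROR or CANCELLED. get_job stands in for the Dataproc API call;
    the real hook sleeps between polls.
    """
    terminal = {"DONE", "ERROR", "CANCELLED"}
    while True:
        job = get_job()
        state = job["status"]["state"]
        if state in terminal:
            return state == "DONE"

# Stub API: the job is PENDING, then RUNNING, then DONE forever.
states = itertools.chain(["PENDING", "RUNNING"], itertools.repeat("DONE"))
fake_get = lambda: {"status": {"state": next(states)}}
print(wait_for_done(fake_get))  # True
```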

raise_error(self, message=None)[source]

Raises an error when the Dataproc job resulted in an error.

Parameters

message – Custom message for the error.

Raises

Exception

get(self)[source]

Returns Dataproc job.

class airflow.contrib.hooks.gcp_dataproc_hook._DataProcJobBuilder(project_id:str, task_id:str, cluster_name:str, job_type:str, properties:Dict[str, str])[source]
add_labels(self, labels)[source]

Set labels for the Dataproc job.

Parameters

labels (dict) – Labels for the job.

add_variables(self, variables:List[str])[source]

Set variables for the Dataproc job.

Parameters

variables (List[str]) – Variables for the job query.

add_args(self, args:List[str])[source]

Set args for the Dataproc job.

Parameters

args (List[str]) – Args for the job query.

add_query(self, query:List[str])[source]

Set queries for the Dataproc job.

Parameters

query (List[str]) – The queries for the job.

add_query_uri(self, query_uri:str)[source]

Set the query URI for the Dataproc job.

Parameters

query_uri (str) – URI for the job query.

add_jar_file_uris(self, jars:List[str])[source]

Set jar file URIs for the Dataproc job.

Parameters

jars (List[str]) – List of jar file URIs.

add_archive_uris(self, archives:List[str])[source]

Set archive URIs for the Dataproc job.

Parameters

archives (List[str]) – List of archive URIs.

add_file_uris(self, files:List[str])[source]

Set file URIs for the Dataproc job.

Parameters

files (List[str]) – List of file URIs.

add_python_file_uris(self, pyfiles:List[str])[source]

Set Python file URIs for the Dataproc job.

Parameters

pyfiles (List[str]) – List of Python file URIs.

set_main(self, main_jar:Optional[str], main_class:Optional[str])[source]

Set the main jar URI or main class for the Dataproc job.

Parameters
  • main_jar (str) – URI for the main file.

  • main_class (str) – Name of the main class.

Raises

Exception

set_python_main(self, main:str)[source]

Set the main Python file URI for the Dataproc job.

Parameters

main (str) – URI for the python main file.

set_job_name(self, name:str)[source]

Set the Dataproc job name.

Parameters

name (str) – Job name.

build(self)[source]

Returns Dataproc job.

Returns

Dataproc job

Return type

dict
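The dict returned by build() follows the shape of a Dataproc submitJob request. A hand-built equivalent for a hypothetical PySpark job is sketched below; field names follow the Dataproc v1beta2 jobs resource, and the short UUID suffix the real builder appends to the job ID is omitted here:

```python
def build_pyspark_job(project_id, task_id, cluster_name,
                      main, args=None, properties=None):
    """Hand-built sketch of the dict _DataProcJobBuilder.build() produces.

    Field names follow the Dataproc v1beta2 jobs resource; the UUID
    suffix the real builder adds to jobId is omitted, and the exact
    payload the builder emits may differ.
    """
    return {
        "job": {
            "reference": {"projectId": project_id, "jobId": task_id},
            "placement": {"clusterName": cluster_name},
            "pysparkJob": {
                "mainPythonFileUri": main,
                "args": args or [],
                "properties": properties or {},
            },
        }
    }

# Hypothetical bucket and script names, for illustration only.
job = build_pyspark_job("my-project", "my_task", "my-cluster",
                        "gs://my-bucket/wordcount.py",
                        args=["--input", "gs://my-bucket/in"])
print(job["job"]["placement"]["clusterName"])  # my-cluster
```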

class airflow.contrib.hooks.gcp_dataproc_hook._DataProcOperation(dataproc_api:Any, operation:Dict, num_retries:int)[source]

Bases: airflow.utils.log.logging_mixin.LoggingMixin

Continuously polls Dataproc Operation until it completes.

wait_for_done(self)[source]

Waits for the Dataproc operation to complete.

Returns

True if the operation is done.

Return type

bool

get(self)[source]

Returns Dataproc operation.

Returns

Dataproc operation

_check_done(self)[source]
_raise_error(self)[source]
class airflow.contrib.hooks.gcp_dataproc_hook.DataProcHook(gcp_conn_id:str='google_cloud_default', delegate_to:str=None, api_version:str='v1beta2')[source]

Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook

Hook for Google Cloud Dataproc APIs.

All the methods in the hook where project_id is used must be called with keyword arguments rather than positional.

Parameters
  • gcp_conn_id (str) – The connection ID to use when fetching connection info.

  • delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.

  • api_version (str) – The version of the Google Cloud Dataproc API.

get_conn(self)[source]

Returns a Google Cloud Dataproc service object.

get_cluster(self, project_id:str, region:str, cluster_name:str)[source]

Returns a Google Cloud Dataproc cluster.

Parameters
  • project_id (str) – The ID of the Google Cloud project the cluster belongs to.

  • region (str) – The region of the Dataproc cluster.

  • cluster_name (str) – The name of the Dataproc cluster.

Returns

Dataproc cluster

Return type

dict

submit(self, project_id:str, job:Dict, region:str='global', job_error_states:Iterable[str]=None)[source]

Submits a Google Cloud Dataproc job.

Parameters
  • project_id (str) – The ID of the Google Cloud project the job belongs to.

  • job (dict) – The job to be submitted.

  • region (str) – The region of the Dataproc cluster.

  • job_error_states (List[str]) – Job states that should be considered error states.

Raises

Exception
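The job_error_states parameter determines which terminal states submit treats as failures. A sketch of that check follows; the helper name and the default set of {"ERROR"} are assumptions about the hook's behaviour, for illustration only:

```python
def raise_if_failed(job, job_error_states=None):
    """Raise if the job's terminal state is in job_error_states.

    Mirrors the error handling described for submit(); the default
    {"ERROR"} is an assumption, and CANCELLED can be added to treat
    cancellation as a failure too.
    """
    error_states = set(job_error_states or ["ERROR"])
    state = job["status"]["state"]
    if state in error_states:
        raise Exception("Job {} failed with state {}".format(
            job["reference"]["jobId"], state))

# A DONE job passes silently.
raise_if_failed({"reference": {"jobId": "job-1"},
                 "status": {"state": "DONE"}})

# A CANCELLED job raises only when CANCELLED is listed as an error state.
bad = {"reference": {"jobId": "job-2"}, "status": {"state": "CANCELLED"}}
try:
    raise_if_failed(bad, job_error_states=["ERROR", "CANCELLED"])
except Exception as exc:
    print(exc)  # Job job-2 failed with state CANCELLED
```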

create_job_template(self, task_id:str, cluster_name:str, job_type:str, properties:Dict[str, str])[source]

Creates a Google Cloud Dataproc job template.

Parameters
  • task_id (str) – The ID of the task.

  • cluster_name (str) – Dataproc cluster name.

  • job_type (str) – Type of Dataproc job.

  • properties (dict) – Additional properties of the job.

Returns

Dataproc Job

wait(self, operation:Dict)[source]

Waits for a Google Cloud Dataproc operation to complete.

cancel(self, project_id:str, job_id:str, region:str='global')[source]

Cancels a Google Cloud Dataproc job.

Parameters
  • project_id (str) – Name of the project the job belongs to

  • job_id (str) – Identifier of the job to cancel

  • region (str) – Region used for the job

Returns

A JSON dictionary representing the canceled job