mrjob.emr - run on EMR

Job Runner

class mrjob.emr.EMRJobRunner(**kwargs)

Runs an MRJob on Amazon Elastic MapReduce. Invoked when you run your job with -r emr.

EMRJobRunner runs your job in an EMR job flow, which is basically a temporary Hadoop cluster. Normally, it creates a job flow just for your job; it’s also possible to run your job in a specific job flow by setting emr_job_flow_id or to automatically choose a waiting job flow, creating one if none exists, by setting pool_emr_job_flows.
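For instance, job flow reuse can be requested through mrjob's configuration file. A minimal sketch (the option names are the runner options named above; the job flow ID is a placeholder):

```yaml
# mrjob.conf -- sketch; values are illustrative
runners:
  emr:
    # reuse a waiting pooled job flow, creating one if none exists
    pool_emr_job_flows: true
    # or pin the job to one specific existing job flow instead:
    # emr_job_flow_id: j-JOBFLOWID   # placeholder ID
```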

Input, support, and jar files can be either local or on S3; use s3://... URLs to refer to files on S3.

This class has some useful utilities for talking directly to S3 and EMR, so you may find it useful to instantiate it without a script:

from mrjob.emr import EMRJobRunner

# instantiate a runner with no script, just for its utilities
emr_conn = EMRJobRunner().make_emr_conn()
# describe_jobflows() is boto's API call; see also describe_all_job_flows()
job_flows = emr_conn.describe_jobflows()
...

EMR Utilities

EMRJobRunner.make_emr_conn()

Create a connection to EMR.

Returns: a boto.emr.connection.EmrConnection, wrapped in a mrjob.retry.RetryWrapper

mrjob.emr.describe_all_job_flows(emr_conn, states=None, jobflow_ids=None, created_after=None, created_before=None)

Iteratively call EmrConnection.describe_job_flows() until we really get all the available job flow information. Currently, the EMR API makes about two months of data available.

This is a way of getting around the limits of the API, both on the number of job flows returned in a single call and on how far back in time we can go.

Parameters:
  • states (list) – A list of job flow states to select
  • jobflow_ids (list) – A list of job flow IDs to select
  • created_after (datetime) – Lower bound on job flow creation time
  • created_before (datetime) – Upper bound on job flow creation time
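The iteration strategy can be sketched as follows. The fake connection below is a stand-in for a real EmrConnection, and its two-flows-per-call limit mimics the API's page size; the helper name describe_all_flows is ours:

```python
from datetime import datetime

# stand-in for boto's EmrConnection -- returns at most 2 flows per call,
# newest first, mimicking the EMR API's page limit
class FakeEmrConn(object):
    def __init__(self, flows):
        self.flows = sorted(flows, key=lambda f: f['created'], reverse=True)

    def describe_job_flows(self, created_before=None):
        matching = [f for f in self.flows
                    if created_before is None or f['created'] < created_before]
        return matching[:2]

def describe_all_flows(conn):
    """Page backwards through creation time until no new flows appear."""
    all_flows, created_before = [], None
    while True:
        batch = conn.describe_job_flows(created_before=created_before)
        if not batch:
            break
        all_flows.extend(batch)
        # next request: only flows created strictly before the oldest seen
        created_before = min(f['created'] for f in batch)
    return all_flows

flows = [{'id': 'j-%d' % i, 'created': datetime(2012, 1, i + 1)}
         for i in range(5)]
print(len(describe_all_flows(FakeEmrConn(flows))))  # all 5, despite the page limit
```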

S3 Utilities

mrjob.fs.s3.s3_key_to_uri(s3_key)

Convert a boto Key object into an s3:// URI.
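The conversion itself is simple: a boto Key knows its bucket and its own name, and the URI is just the two joined. A minimal sketch, with namedtuples standing in for real boto Bucket and Key objects:

```python
from collections import namedtuple

# stand-ins for boto's Bucket and Key objects
Bucket = namedtuple('Bucket', ['name'])
Key = namedtuple('Key', ['bucket', 'name'])

def s3_key_to_uri(s3_key):
    """Join bucket name and key name into an s3:// URI (a sketch of
    what mrjob.fs.s3.s3_key_to_uri does)."""
    return 's3://%s/%s' % (s3_key.bucket.name, s3_key.name)

key = Key(bucket=Bucket(name='walrus'), name='tmp/output/part-00000')
print(s3_key_to_uri(key))  # s3://walrus/tmp/output/part-00000
```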

class mrjob.emr.S3Filesystem(aws_access_key_id, aws_secret_access_key, s3_endpoint)

Filesystem for Amazon S3 URIs. Typically you will get one of these via EMRJobRunner().fs, composed with SSHFilesystem and LocalFilesystem.

S3Filesystem.make_s3_conn()

Create a connection to S3.

Returns: a boto.s3.connection.S3Connection, wrapped in a mrjob.retry.RetryWrapper

S3Filesystem.get_s3_key(uri, s3_conn=None)

Get the boto Key object matching the given S3 uri, or return None if that key doesn’t exist.

uri is an S3 URI: s3://foo/bar

You may optionally pass in an existing S3 connection through s3_conn.
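Before boto can look a key up, the URI must be split into a bucket name and a key name. A sketch of that parsing step (the function name parse_s3_uri is ours for illustration, though mrjob ships a similar helper in mrjob.parse):

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2, matching mrjob's era

def parse_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key)."""
    parts = urlparse(uri)
    if parts.scheme != 's3':
        raise ValueError('not an S3 URI: %r' % uri)
    # netloc is the bucket; the path keeps a leading slash we don't want
    return parts.netloc, parts.path.lstrip('/')

print(parse_s3_uri('s3://foo/bar'))  # ('foo', 'bar')
```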

S3Filesystem.get_s3_keys(uri, s3_conn=None)

Get a stream of boto Key objects for each key inside the given dir on S3.

uri is an S3 URI: s3://foo/bar

You may optionally pass in an existing S3 connection through s3_conn.
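Conceptually, this is a prefix listing: S3 has no real directories, so "keys inside a dir" means "keys whose names start with the dir's path followed by a slash". A sketch with plain strings standing in for boto Key objects:

```python
def keys_under(all_key_names, dir_path):
    """Yield key names that live 'inside' dir_path (S3 prefix match)."""
    prefix = dir_path.rstrip('/') + '/'
    for name in all_key_names:
        if name.startswith(prefix):
            yield name

names = ['tmp/output/part-00000', 'tmp/output/part-00001', 'logs/run.log']
print(list(keys_under(names, 'tmp/output')))
# ['tmp/output/part-00000', 'tmp/output/part-00001']
```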

S3Filesystem.get_s3_folder_keys(uri, s3_conn=None)

Deprecated since version 0.4.0.

Background: EMR used to fake directories on S3 by creating special *_$folder$ keys in S3. That is no longer true, so this method is deprecated.

For example, if your job output s3://walrus/tmp/output/part-00000, EMR would also create these keys:

  • s3://walrus/tmp_$folder$
  • s3://walrus/tmp/output_$folder$

If you want to grant another Amazon user access to your files so they can use them in S3, you must grant read access on the actual keys, plus any *_$folder$ keys that “contain” your keys; otherwise EMR will error out with a permissions error.

This gets all the *_$folder$ keys associated with the given URI, as boto Key objects.

This does not support globbing.

You may optionally pass in an existing S3 connection through s3_conn.

S3Filesystem.make_s3_key(uri, s3_conn=None)

Create the given S3 key, and return the corresponding boto Key object.

uri is an S3 URI: s3://foo/bar

You may optionally pass in an existing S3 connection through s3_conn.

Need help?

Join the mailing list by visiting the Google group page or sending an email to mrjob+subscribe@googlegroups.com.