Runs an MRJob on Amazon Elastic MapReduce. Invoked when you run your job with -r emr.
EMRJobRunner runs your job in an EMR job flow, which is basically a temporary Hadoop cluster. Normally it creates a job flow just for your job; you can also run your job in a specific existing job flow by setting emr_job_flow_id, or automatically choose a waiting job flow (creating one if none exists) by setting pool_emr_job_flows.
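As a minimal sketch, assuming both options can be passed as keyword arguments when constructing the runner directly (the job flow ID is a placeholder):

from mrjob.emr import EMRJobRunner

# Reuse one specific, pre-existing job flow
# ('j-JOBFLOWID' is a placeholder ID)
runner = EMRJobRunner(emr_job_flow_id='j-JOBFLOWID')

# Or join a waiting pooled job flow, creating one if none
# is available
runner = EMRJobRunner(pool_emr_job_flows=True)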
Input, support, and jar files can be either local or on S3; use s3://... URLs to refer to files on S3.
This class has some useful utilities for talking directly to S3 and EMR, so you may find it useful to instantiate it without a script:
from mrjob.emr import EMRJobRunner

# make_emr_conn() returns a boto EmrConnection, wrapped so that
# transient API errors are retried automatically
emr_conn = EMRJobRunner().make_emr_conn()
job_flows = emr_conn.describe_jobflows()
...
Create a connection to EMR.
Returns: a boto.emr.connection.EmrConnection, wrapped in a mrjob.retry.RetryWrapper
Iteratively call EmrConnection.describe_jobflows() until we really get all the available job flow information. Currently, two months of data are available through the EMR API.
This works around two limits of the API: the number of job flows returned per call, and how far back in time we can go.
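A short sketch, assuming describe_all_job_flows() is importable from mrjob.emr and takes the EMR connection as its first argument (the jobflowid and state attributes come from boto's job flow objects):

from mrjob.emr import EMRJobRunner, describe_all_job_flows

emr_conn = EMRJobRunner().make_emr_conn()
# Pages through describe_jobflows() calls until no new job flows
# come back, then returns the combined list
for job_flow in describe_all_job_flows(emr_conn):
    print('%s %s' % (job_flow.jobflowid, job_flow.state))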
Convert a boto Key object into an s3:// URI.
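For example, assuming the helper is exposed as mrjob.emr.s3_key_to_uri() (bucket and key names are placeholders):

import boto

from mrjob.emr import s3_key_to_uri

bucket = boto.connect_s3().get_bucket('walrus')
key = bucket.get_key('tmp/foo')
print(s3_key_to_uri(key))  # -> s3://walrus/tmp/foo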
Filesystem for Amazon S3 URIs. Typically you will get one of these via EMRJobRunner().fs, composed with SSHFilesystem and LocalFilesystem.
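For instance, to list everything under an S3 prefix through the composite filesystem (a sketch; ls() and the example bucket are assumptions):

from mrjob.emr import EMRJobRunner

fs = EMRJobRunner().fs
# The composite filesystem dispatches on URI scheme, so s3://
# paths end up handled by S3Filesystem
for uri in fs.ls('s3://walrus/tmp/'):
    print(uri)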
Create a connection to S3.
Returns: a boto.s3.connection.S3Connection, wrapped in a mrjob.retry.RetryWrapper
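A brief sketch, assuming make_s3_conn() is reachable through EMRJobRunner().fs:

from mrjob.emr import EMRJobRunner

s3_conn = EMRJobRunner().fs.make_s3_conn()
# Behaves like a plain boto S3Connection, with retries on
# transient errors
bucket = s3_conn.get_bucket('walrus')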
Get the boto Key object matching the given S3 URI, or return None if that key doesn’t exist.
uri is an S3 URI: s3://foo/bar
You may optionally pass in an existing S3 connection through s3_conn.
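For example (bucket and key names are placeholders):

from mrjob.emr import EMRJobRunner

fs = EMRJobRunner().fs
key = fs.get_s3_key('s3://walrus/tmp/foo')
if key is None:
    print('no such key')
else:
    # boto Key objects can fetch their own contents
    print(key.get_contents_as_string())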
Get a stream of boto Key objects for each key inside the given dir on S3.
uri is an S3 URI: s3://foo/bar
You may optionally pass in an existing S3 connection through s3_conn.
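For example, to stream every key under a job's output directory (the URI is a placeholder):

from mrjob.emr import EMRJobRunner

fs = EMRJobRunner().fs
# Stream every key under the output "directory"
for key in fs.get_s3_keys('s3://walrus/tmp/output/'):
    print('%s (%d bytes)' % (key.name, key.size))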
Deprecated since version 0.4.0.
Background: EMR used to fake directories on S3 by creating special *_$folder$ keys in S3. That is no longer true, so this method is deprecated.
For example, if your job outputs s3://walrus/tmp/output/part-00000, EMR will also create these keys:
s3://walrus/tmp_$folder$
s3://walrus/tmp/output_$folder$
If you want to grant another Amazon user access to your files so they can use them in S3, you must grant read access on the actual keys, plus any *_$folder$ keys that “contain” your keys; otherwise EMR will error out with a permissions error.
This gets all the *_$folder$ keys associated with the given URI, as boto Key objects.
This does not support globbing.
You may optionally pass in an existing S3 connection through s3_conn.
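Putting this together, a hedged sketch that grants another user read access on a key plus the *_$folder$ keys that contain it, using boto's add_email_grant() ACL helper (the email address is a placeholder):

from mrjob.emr import EMRJobRunner

fs = EMRJobRunner().fs
uri = 's3://walrus/tmp/output/part-00000'

# Grant read on the key itself, plus every *_$folder$ key
# that "contains" it
keys = [fs.get_s3_key(uri)]
keys.extend(fs.get_s3_folder_keys(uri))
for key in keys:
    key.add_email_grant('READ', 'user@example.com')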
Create the given S3 key, and return the corresponding boto Key object.
uri is an S3 URI: s3://foo/bar
You may optionally pass in an existing S3 connection through s3_conn.
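For example, to create a key and upload data to it (names are placeholders):

from mrjob.emr import EMRJobRunner

fs = EMRJobRunner().fs
key = fs.make_s3_key('s3://walrus/tmp/new-file')
# The returned boto Key can be written to immediately
key.set_contents_from_string('hello, world\n')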