EMR runner options

All options from the “Options available to all runners” and “Hadoop-related options” sections are also available to the emr runner.

Amazon credentials

See Configuring AWS credentials and Configuring SSH credentials for specific instructions about setting these options.

aws_access_key_id (--aws-access-key-id) : string

Default: None

Your “username” for Amazon Web Services.

aws_secret_access_key (--aws-secret-access-key) : string

Default: None

Your “password” on AWS.

ec2_key_pair (--ec2-key-pair) : string

Default: None

Name of the SSH key pair you set up for EMR.

ec2_key_pair_file (--ec2-key-pair-file) : path

Default: None

Path to the file containing the SSH key for EMR.
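For example, a minimal mrjob.conf covering all four of these options might look like the sketch below (the credentials, key pair name, and file path are placeholders; see Configuring AWS credentials and Configuring SSH credentials for the recommended setup):

    runners:
      emr:
        aws_access_key_id: YOURACCESSKEYID
        aws_secret_access_key: YOURSECRETACCESSKEY
        ec2_key_pair: EMR
        ec2_key_pair_file: ~/.ssh/EMR.pem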

Job flow creation and configuration

additional_emr_info (--additional-emr-info) : special

Default: None

Special parameters to select additional features, mostly to support beta EMR features. Pass a JSON string on the command line or use data structures in the config file (which is itself basically JSON).
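For example, in mrjob.conf you can write the data structure directly (the betaFeature key below is purely illustrative, not a real EMR parameter):

    runners:
      emr:
        additional_emr_info:
          betaFeature: true

On the command line, the equivalent would be --additional-emr-info '{"betaFeature": true}'.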

ami_version (--ami-version) : string

Default: 'latest'

EMR AMI version to use. This controls which Hadoop version(s) are available and which version of Python is installed, among other things; see the AWS docs on specifying the AMI version for details.

aws_availability_zone (--aws-availability-zone) : string

Default: AWS default

Availability zone to run the job in.

aws_region (--aws-region) : string

Default: infer from scratch bucket region

Region to connect to S3 and EMR on (e.g. us-west-1). If you want to use separate regions for S3 and EMR, set emr_endpoint and s3_endpoint.

emr_endpoint (--emr-endpoint) : string

Default: infer from aws_region

Optional host to connect to when communicating with EMR (e.g. us-west-1.elasticmapreduce.amazonaws.com).
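For example, to run everything in us-west-1 (the endpoint hostnames are the ones mentioned here and under s3_endpoint; normally you can omit them and let them be inferred from aws_region):

    runners:
      emr:
        aws_region: us-west-1
        # usually unnecessary; inferred from aws_region
        emr_endpoint: us-west-1.elasticmapreduce.amazonaws.com
        s3_endpoint: s3-us-west-1.amazonaws.com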

hadoop_streaming_jar_on_emr (--hadoop-streaming-jar-on-emr) : string

Default: AWS default

Like hadoop_streaming_jar, except that it points to a path on the EMR instance, rather than to a local file or one on S3. Rarely necessary to set this by hand.

max_hours_idle (--max-hours-idle) : float

Default: None

If we create a persistent job flow, have it automatically terminate itself after it’s been idle this many hours AND we’re within mins_to_end_of_hour of an EC2 billing hour.

New in version 0.4.1.

mins_to_end_of_hour (--mins-to-end-of-hour) : float

Default: 5.0

If max_hours_idle is set, controls how close to the end of an EC2 billing hour the job flow can automatically terminate itself.

New in version 0.4.1.
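For example, a sketch of a persistent job flow that shuts itself down after an hour of idleness, only terminating within ten minutes of the end of a billing hour (both numbers are arbitrary examples):

    runners:
      emr:
        max_hours_idle: 1
        mins_to_end_of_hour: 10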

visible_to_all_users (--visible-to-all-users) : boolean

Default: False

If True, EMR job flows will be visible to all IAM users. If False, the job flow will only be visible to the IAM user that created it.

New in version 0.4.1.

Bootstrapping

These options apply at bootstrap time, before the Hadoop cluster has started. Bootstrap time is a good time to install Debian packages or compile and install another Python binary.

bootstrap (--bootstrap) : string list

Default: []

A list of lines of shell script to run once on each node in your job flow, at bootstrap time.

This option is complex and powerful; the best way to get started is to read the EMR Bootstrapping Cookbook.

Passing expressions like path#name will cause path to be automatically uploaded to the task’s working directory with the filename name, marked as executable, and interpolated into the script by its absolute path on the machine running the script. path may also be a URI, and ~ and environment variables within path will be resolved based on the local environment. name is optional. For details of parsing, see parse_setup_cmd().

Unlike with setup, archives are not supported (unpack them yourself).

Remember to put sudo before commands requiring root privileges!
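A sketch of what this can look like in mrjob.conf (the Debian package name and tarball path are placeholders; see the EMR Bootstrapping Cookbook for tested recipes):

    runners:
      emr:
        bootstrap:
        - sudo apt-get install -y python-dev
        - sudo pip install path/to/tarballs/*.tar.gz#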

bootstrap_actions (--bootstrap-actions) : string list

Default: []

A list of raw bootstrap actions (essentially scripts) to run prior to any of the other bootstrap steps. Any arguments should be separated from the command by spaces (we use shlex.split()). If the action is on the local filesystem, we’ll automatically upload it to S3.

This has little advantage over bootstrap; it is included in order to give direct access to the EMR API.
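A sketch in mrjob.conf (the script location and its arguments are placeholders):

    runners:
      emr:
        bootstrap_actions:
        - s3://yourbucket/bootstrap-actions/install-stuff.sh --verbose

Everything after the first space is split off with shlex.split() and passed to the action as arguments.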

bootstrap_cmds (--bootstrap-cmd) : string list

Default: []

Deprecated since version 0.4.2.

A list of commands to run at bootstrap time. Basically bootstrap without automatic file uploading/interpolation. Can also take commands as lists of arguments.

bootstrap_files (--bootstrap-file) : path list

Default: []

Deprecated since version 0.4.2.

Files to download to the bootstrap working directory before running bootstrap commands. Use the bootstrap option’s file auto-upload/interpolation feature instead.

bootstrap_python_packages (--bootstrap-python-package) : path list

Default: []

Deprecated since version 0.4.2.

Paths of Python module tarballs to install on EMR. Pass pip install path/to/tarballs/*.tar.gz# to bootstrap instead.

bootstrap_scripts (--bootstrap-script) : path list

Default: []

Deprecated since version 0.4.2.

Scripts to upload and then run at bootstrap time. Pass path/to/script# args to bootstrap instead.

Monitoring the job flow

check_emr_status_every (--check-emr-status-every) : float

Default: 30

How often to check on the status of EMR jobs, in seconds. If you set this too low, AWS will throttle you.

enable_emr_debugging (--enable-emr-debugging) : boolean

Default: False

Store Hadoop logs in SimpleDB.

Number and type of instances

ec2_instance_type (--ec2-instance-type) : string

Default: 'm1.small'

What sort of EC2 instance(s) to use on the nodes that actually run tasks (see http://aws.amazon.com/ec2/instance-types/). When you run multiple instances (see num_ec2_instances), the master node is just coordinating the other nodes, so usually the default instance type (m1.small) is fine, and using larger instances is wasteful.

ec2_core_instance_type (--ec2-core-instance-type) : string

Default: 'm1.small'

Like ec2_instance_type, but only for the core (also known as “slave”) Hadoop nodes; these nodes run tasks and host HDFS. Usually you just want to use ec2_instance_type.

ec2_core_instance_bid_price (--ec2-core-instance-bid-price) : string

Default: None

When specified and not “0”, this creates the core Hadoop nodes as spot instances at this bid price. You usually only want to set bid price for task instances.

ec2_master_instance_type (--ec2-master-instance-type) : string

Default: 'm1.small'

Like ec2_instance_type, but only for the master Hadoop node. This node hosts the task tracker and HDFS, and runs tasks if there are no other nodes. Usually you just want to use ec2_instance_type.

ec2_master_instance_bid_price (--ec2-master-instance-bid-price) : string

Default: None

When specified and not “0”, this creates the master Hadoop node as a spot instance at this bid price. You usually only want to set bid price for task instances unless the master instance is your only instance.

ec2_slave_instance_type (--ec2-slave-instance-type) : string

Default: value of ec2_core_instance_type

An alias for ec2_core_instance_type, for consistency with the EMR API.

ec2_task_instance_type (--ec2-task-instance-type) : string

Default: value of ec2_core_instance_type

Like ec2_instance_type, but only for the task Hadoop nodes; these nodes run tasks but do not host HDFS. Usually you just want to use ec2_instance_type.

ec2_task_instance_bid_price (--ec2-task-instance-bid-price) : string

Default: None

When specified and not “0”, this creates the task Hadoop nodes as spot instances at this bid price. (You usually only want to set bid price for task instances.)

num_ec2_core_instances (--num-ec2-core-instances) : integer

Default: 0

Number of core (or “slave”) instances to start up. These run your job and host HDFS. Incompatible with num_ec2_instances. This is in addition to the single master instance.

num_ec2_instances (--num-ec2-instances) : integer

Default: 1

Total number of instances to start up; basically the number of core instances you want, plus 1 (there is always one master instance). Incompatible with num_ec2_core_instances and num_ec2_task_instances.

num_ec2_task_instances (--num-ec2-task-instances) : integer

Default: 0

Number of task instances to start up. These run your job but do not host HDFS. Incompatible with num_ec2_instances. If you use this, you must set num_ec2_core_instances; EMR does not allow you to run task instances without core instances (because there’s nowhere to host HDFS).
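Putting the instance options together, a hypothetical configuration might look like the sketch below (the instance types, counts, and bid price are arbitrary examples, not recommendations):

    runners:
      emr:
        ec2_master_instance_type: m1.small
        ec2_core_instance_type: c1.medium
        num_ec2_core_instances: 4
        ec2_task_instance_type: c1.medium
        num_ec2_task_instances: 8
        ec2_task_instance_bid_price: '0.05'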

Choosing/creating a job flow to join

emr_job_flow_id (--emr-job-flow-id) : string

Default: automatically create a job flow and use it

The ID of a persistent EMR job flow to run jobs in. It’s fine for other jobs to be using the job flow; we give our job’s steps a unique ID.

emr_job_flow_pool_name (--emr-job-flow-pool-name) : string

Default: 'default'

Specify a pool name to join. Does not imply pool_emr_job_flows.

pool_emr_job_flows (--pool-emr-job-flows) : boolean

Default: False

Try to run the job on a WAITING pooled job flow with the same bootstrap configuration. Prefer the one with the most compute units. Use S3 to “lock” the job flow and ensure that the job is not scheduled behind another job. If no suitable job flow is WAITING, create a new pooled job flow.

Warning

Do not run this without either setting max_hours_idle or putting mrjob.tools.emr.terminate_idle_job_flows in your crontab; job flows left idle can quickly become expensive!

pool_wait_minutes (--pool-wait-minutes) : integer

Default: 0

If pooling is enabled and no job flow is available, retry finding a job flow every 30 seconds until this many minutes have passed, then start a new job flow instead of joining one.
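A sketch of a pooling setup (the pool name and wait time are placeholders; max_hours_idle is included per the warning above):

    runners:
      emr:
        pool_emr_job_flows: true
        emr_job_flow_pool_name: my-pool
        pool_wait_minutes: 10
        max_hours_idle: 1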

S3 paths and options

s3_endpoint (--s3-endpoint) : string

Default: infer from aws_region

Host to connect to when communicating with S3 (e.g. s3-us-west-1.amazonaws.com).

s3_log_uri (--s3-log-uri) : string

Default: append logs to s3_scratch_uri

Where on S3 to put logs, for example s3://yourbucket/logs/. Logs for your job flow will go into a subdirectory (in this example, s3://yourbucket/logs/j-JOBFLOWID/).

s3_scratch_uri (--s3-scratch-uri) : string

Default: tmp/mrjob in the first bucket belonging to you

S3 directory (URI ending in /) to use as scratch space, e.g. s3://yourbucket/tmp/.

s3_sync_wait_time (--s3-sync-wait-time) : float

Default: 5.0

How long to wait for S3 to reach eventual consistency. This is typically less than a second (zero in U.S. West), but the default is 5.0 to be safe.
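For example (the bucket name is a placeholder):

    runners:
      emr:
        s3_scratch_uri: s3://yourbucket/tmp/
        s3_log_uri: s3://yourbucket/logs/
        s3_sync_wait_time: 5.0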

SSH access and tunneling

ssh_bin (--ssh-bin) : command

Default: 'ssh'

Path to the ssh binary; may include switches (e.g. 'ssh -v' or ['ssh', '-v']).

ssh_bind_ports (--ssh-bind-ports) : special

Default: [40001, ..., 40840]

A list of ports that are safe to listen on. The command line syntax looks like 2000[:2001][,2003,2005:2008,etc], where commas separate ranges and colons separate range endpoints.

ssh_tunnel_to_job_tracker (--ssh-tunnel-to-job-tracker) : boolean

Default: False

If True, create an ssh tunnel to the job tracker and listen on a randomly chosen port. This requires you to set ec2_key_pair and ec2_key_pair_file. See Configuring SSH credentials for detailed instructions.

ssh_tunnel_is_open (--ssh-tunnel-is-open) : boolean

Default: False

If True, any host can connect to the job tracker through the SSH tunnel you open. Mostly useful if your browser is running on a different machine from your job runner.
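A sketch of enabling the tunnel, assuming ec2_key_pair and ec2_key_pair_file are already set (as in the credentials example above); set ssh_tunnel_is_open only if your browser runs on a different machine from the one launching the job:

    runners:
      emr:
        ssh_tunnel_to_job_tracker: true
        ssh_tunnel_is_open: true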