All options from “Options available to all runners” and “Hadoop-related options” are available to the emr runner.
See Configuring AWS credentials and Configuring SSH credentials for specific instructions about setting these options.
Default: None
Your “username” for Amazon Web Services.
Default: None
Your “password” on AWS.
Default: None
Name of the SSH key you set up for EMR.
Default: None
Path to the file containing the SSH key for EMR.
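For example, the four credential options above can be set together in mrjob.conf. A minimal sketch (the access-key option names aws_access_key_id and aws_secret_access_key are assumptions not spelled out in this section; ec2_key_pair and ec2_key_pair_file are the options described above):

    runners:
      emr:
        aws_access_key_id: AKIAXXXXXXXXXXXXXXXX       # assumed option name; your AWS "username"
        aws_secret_access_key: yoursecretkeygoeshere  # assumed option name; your AWS "password"
        ec2_key_pair: EMR                             # name of the SSH key set up for EMR
        ec2_key_pair_file: ~/.ssh/EMR.pem             # path to the matching private key file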
Default: None
Special parameters to select additional features, mostly to support beta EMR features. Pass a JSON string on the command line or use data structures in the config file (which is itself basically JSON).
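As a sketch, such parameters can be set in the config file as a data structure (both the option name emr_api_params and the parameter key below are assumptions used only for illustration):

    runners:
      emr:
        emr_api_params:               # assumed option name
          Some.Beta.Feature: 'true'   # hypothetical EMR API parameter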
Default: 'latest'
EMR AMI version to use. This controls which Hadoop version(s) are available and which version of Python is installed, among other things; see the AWS docs on specifying the AMI version for details.
Default: AWS default
Availability zone to run the job in.
Default: infer from scratch bucket region
Region to connect to S3 and EMR in (e.g. us-west-1). If you want to use separate regions for S3 and EMR, set emr_endpoint and s3_endpoint.
Default: infer from aws_region
Optional host to connect to when communicating with EMR (e.g. us-west-1.elasticmapreduce.amazonaws.com).
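A sketch of region and endpoint overrides in mrjob.conf, using the option names referenced in this section (aws_region, emr_endpoint, s3_endpoint) and the example hostnames given here:

    runners:
      emr:
        aws_region: us-west-1
        emr_endpoint: us-west-1.elasticmapreduce.amazonaws.com
        s3_endpoint: s3-us-west-1.amazonaws.com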
Default: AWS default
Like hadoop_streaming_jar, except that it points to a path on the EMR instance, rather than to a local file or one on S3. Rarely necessary to set this by hand.
Default: None
If we create a persistent job flow, have it automatically terminate itself after it’s been idle this many hours AND we’re within mins_to_end_of_hour of an EC2 billing hour.
New in version 0.4.1.
Default: 5.0
If max_hours_idle is set, controls how close to the end of an EC2 billing hour the job flow can automatically terminate itself.
New in version 0.4.1.
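For example, a sketch that lets a persistent job flow terminate itself after roughly an hour of inactivity, using the two options named above:

    runners:
      emr:
        max_hours_idle: 1         # idle for an hour or more...
        mins_to_end_of_hour: 10   # ...and within 10 minutes of the end of an EC2 billing hour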
Default: False
If True, EMR job flows will be visible to all IAM users. If False, the job flow will only be visible to the IAM user that created it.
New in version 0.4.1.
These options apply at bootstrap time, before the Hadoop cluster has started. Bootstrap time is a good time to install Debian packages or compile and install another Python binary.
Default: []
A list of lines of shell script to run once on each node in your job flow, at bootstrap time.
This option is complex and powerful; the best way to get started is to read the EMR Bootstrapping Cookbook.
Passing expressions like path#name will cause path to be automatically uploaded to the task’s working directory with the filename name, marked as executable, and interpolated into the script by its absolute path on the machine running the script. path may also be a URI, and ~ and environment variables within path will be resolved based on the local environment. name is optional. For details of parsing, see parse_setup_cmd().
Unlike with setup, archives are not supported (unpack them yourself).
Remember to put sudo before commands requiring root privileges!
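A short mrjob.conf sketch of the bootstrap option, showing a plain shell line and a path#name expression (the package and the install-deps.sh script are illustrative assumptions, not requirements):

    runners:
      emr:
        bootstrap:
        - sudo apt-get install -y python-dev   # plain shell line; note the sudo
        - install-deps.sh#   # local script uploaded, made executable, and run by its absolute path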
Default: []
A list of raw bootstrap actions (essentially scripts) to run prior to any of the other bootstrap steps. Any arguments should be separated from the command by spaces (we use shlex.split()). If the action is on the local filesystem, we’ll automatically upload it to S3.
This has little advantage over bootstrap; it is included in order to give direct access to the EMR API.
Default: []
Deprecated since version 0.4.2.
A list of commands to run at bootstrap time. Basically bootstrap without automatic file uploading/interpolation. Can also take commands as lists of arguments.
Default: []
Deprecated since version 0.4.2.
Files to download to the bootstrap working directory before running bootstrap commands. Use the bootstrap option’s file auto-upload/interpolation feature instead.
Default: 30
How often (in seconds) to check on the status of EMR jobs. If you set this too low, AWS will throttle you.
Default: False
Store Hadoop logs in SimpleDB.
Default: 'm1.small'
What sort of EC2 instance(s) to use on the nodes that actually run tasks (see http://aws.amazon.com/ec2/instance-types/). When you run multiple instances (see num_ec2_instances), the master node is just coordinating the other nodes, so usually the default instance type (m1.small) is fine, and using larger instances is wasteful.
Default: 'm1.small'
like ec2_instance_type, but only for the core (also known as “slave”) Hadoop nodes; these nodes run tasks and host HDFS. Usually you just want to use ec2_instance_type.
Default: None
When specified and not “0”, this creates the core Hadoop nodes as spot instances at this bid price. You usually only want to set bid price for task instances.
Default: 'm1.small'
like ec2_instance_type, but only for the master Hadoop node. This node hosts the task tracker and HDFS, and runs tasks if there are no other nodes. Usually you just want to use ec2_instance_type.
Default: None
When specified and not “0”, this creates the master Hadoop node as a spot instance at this bid price. You usually only want to set bid price for task instances unless the master instance is your only instance.
Default: value of ec2_core_instance_type
An alias for ec2_core_instance_type, for consistency with the EMR API.
Default: value of ec2_core_instance_type
like ec2_instance_type, but only for the task Hadoop nodes; these nodes run tasks but do not host HDFS. Usually you just want to use ec2_instance_type.
Default: None
When specified and not “0”, this creates the task Hadoop nodes as spot instances at this bid price. (You usually only want to set bid price for task instances.)
Default: 0
Number of core (or “slave”) instances to start up. These run your job and host HDFS. Incompatible with num_ec2_instances. This is in addition to the single master instance.
Default: 1
Total number of instances to start up; basically the number of core instances you want, plus 1 (there is always one master instance). Incompatible with num_ec2_core_instances and num_ec2_task_instances.
Default: 0
Number of task instances to start up. These run your job but do not host HDFS. Incompatible with num_ec2_instances. If you use this, you must set num_ec2_core_instances; EMR does not allow you to run task instances without core instances (because there’s nowhere to host HDFS).
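Putting the instance options together, a sketch of a mid-sized cluster (ec2_instance_type, num_ec2_core_instances, and num_ec2_task_instances appear above; the bid-price option name ec2_task_instance_bid_price is an assumption):

    runners:
      emr:
        ec2_instance_type: m1.large          # used for the nodes that actually run tasks
        num_ec2_core_instances: 4            # in addition to the single master instance
        num_ec2_task_instances: 8            # requires num_ec2_core_instances to be set
        ec2_task_instance_bid_price: '0.50'  # assumed option name; run task nodes as spot instances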
Default: automatically create a job flow and use it
The ID of a persistent EMR job flow to run jobs in. It’s fine for other jobs to be using the job flow; we give our job’s steps a unique ID.
Default: 'default'
Specify a pool name to join. Does not imply pool_emr_job_flows.
Default: False
Try to run the job on a WAITING pooled job flow with the same bootstrap configuration. Prefer the one with the most compute units. Use S3 to “lock” the job flow and ensure that the job is not scheduled behind another job. If no suitable job flow is WAITING, create a new pooled job flow.
Warning
Do not run this without either setting max_hours_idle or putting mrjob.tools.emr.terminate.idle_job_flows in your crontab; job flows left idle can quickly become expensive!
Default: 0
If pooling is enabled and no job flow is available, retry finding a job flow every 30 seconds until this many minutes have passed, then start a new job flow instead of joining one.
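For example, a pooling setup might look like this (pool_emr_job_flows and max_hours_idle are named above; the pool-name and pool-wait option names, emr_job_flow_pool_name and pool_wait_minutes, are assumptions):

    runners:
      emr:
        pool_emr_job_flows: true
        emr_job_flow_pool_name: nightly-reports   # assumed option name
        pool_wait_minutes: 10                     # assumed option name
        max_hours_idle: 1                         # so idle pooled job flows shut themselves down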
Default: infer from aws_region
Host to connect to when communicating with S3 (e.g. s3-us-west-1.amazonaws.com).
Default: append logs to s3_scratch_uri
Where on S3 to put logs, for example s3://yourbucket/logs/. Logs for your job flow will go into a subdirectory (e.g. s3://yourbucket/logs/j-JOBFLOWID/).
Default: tmp/mrjob in the first bucket belonging to you
S3 directory (URI ending in /) to use as scratch space, e.g. s3://yourbucket/tmp/.
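A sketch combining the two S3 locations above (s3_scratch_uri is referenced in this section; the log option name s3_log_uri is an assumption):

    runners:
      emr:
        s3_scratch_uri: s3://yourbucket/tmp/
        s3_log_uri: s3://yourbucket/logs/   # assumed option name; logs land in a j-JOBFLOWID/ subdirectory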
Default: 5.0
How long to wait for S3 to reach eventual consistency. This is typically less than a second (zero in U.S. West), but the default is 5.0 to be safe.
Default: 'ssh'
Path to the ssh binary; may include switches (e.g. 'ssh -v' or ['ssh', '-v']).
Default: [40001, ..., 40840]
A list of ports that are safe to listen on. The command line syntax looks like 2000[:2001][,2003,2005:2008,etc], where commas separate ranges and colons separate range endpoints.
Default: False
If True, create an ssh tunnel to the job tracker and listen on a randomly chosen port. This requires you to set ec2_key_pair and ec2_key_pair_file. See Configuring SSH credentials for detailed instructions.
Default: False
If True, any host can connect to the job tracker through the SSH tunnel you open. Mostly useful if your browser is running on a different machine from your job runner.
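A sketch enabling the SSH tunnel described above (ec2_key_pair and ec2_key_pair_file are described earlier in this section; the tunnel option names ssh_tunnel_to_job_tracker and ssh_tunnel_is_open are assumptions):

    runners:
      emr:
        ec2_key_pair: EMR
        ec2_key_pair_file: ~/.ssh/EMR.pem
        ssh_tunnel_to_job_tracker: true   # assumed option name
        ssh_tunnel_is_open: false         # assumed option name; only this machine can use the tunnel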