Configuration quick reference

Setting configuration options

You can set an option by:

  • Passing it on the command line with the switch version (like --some-option)

  • Passing it as a keyword argument to the runner constructor, if you are creating the runner programmatically

  • Putting it in one of the included config files under a runner name, like this:

    runners:
        local:
            python_bin: python2.6  # only used in local runner
        emr:
            python_bin: python2.5  # only used in Elastic MapReduce runner
    

    See Config file format and location for information on where to put config files.
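These sources layer on top of each other: values from config files have the lowest precedence and are overridden by keyword arguments to the runner constructor (which is also how command line switches reach the runner). A simplified sketch of that layering; resolve_options here is a hypothetical helper for illustration, not mrjob's actual code:

```python
# Later sources override earlier ones, like a chain of dict updates.
def resolve_options(config_file_opts, constructor_kwargs):
    opts = {}
    for source in (config_file_opts, constructor_kwargs):
        opts.update(source)
    return opts

resolved = resolve_options(
    {"python_bin": "python2.5"},  # from mrjob.conf
    {"python_bin": "python2.6"},  # runner kwarg / command line switch
)
print(resolved["python_bin"])  # -> python2.6
```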

Options that can’t be set from mrjob.conf (all runners)

Some options don't make sense in a config file. These can only be specified as keyword arguments to the constructor of MRJobRunner, as command line options, or sometimes by overriding an attribute or method of your MRJob subclass.

Runner kwargs or command line

Config       Command line                 Default                Type
conf_paths   -c, --conf-path, --no-conf   see find_mrjob_conf()  path list
no_output    --no-output                  False                  boolean
output_dir   --output-dir                 (automatic)            string
partitioner  --partitioner                None                   string

Other options for all runners

These options can be passed to any runner without an error, though some runners may ignore some options. See the text after the table for specifics.

Config                Command line                             Default                                      Type
base_tmp_dir          --base-tmp-dir                           value of tempfile.gettempdir()               path
bootstrap             --bootstrap                              []                                           string list
bootstrap_mrjob       --bootstrap-mrjob, --no-bootstrap-mrjob  True                                         boolean
cleanup               --cleanup                                'ALL'                                        string
cleanup_on_failure    --cleanup-on-failure                     'NONE'                                       string
cmdenv                --cmdenv                                 {}                                           environment variable dict
hadoop_extra_args     --hadoop-extra-arg                       []                                           string list
hadoop_streaming_jar  --hadoop-streaming-jar                   automatic                                    string
interpreter           --interpreter                            value of python_bin ('python')               string
jobconf               --jobconf                                {}                                           dict
label                 --label                                  script's module name, or no_script           string
owner                 --owner                                  getpass.getuser(), or no_user if that fails  string
python_archives       --python-archive                         []                                           path list
python_bin            --python-bin                             'python'                                     command
setup                 --setup                                  []                                           string list
setup_cmds            --setup-cmd                              []                                           string list
setup_scripts         --setup-script                           []                                           path list
sh_bin                --sh-bin                                 sh -e (/bin/sh -e on EMR)                    command
steps_python_bin      --steps-python-bin                       current Python interpreter                   command
upload_archives       --archive                                []                                           path list
upload_files          --file                                   []                                           path list
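The dict- and list-valued options above are set in mrjob.conf just like scalar ones. For example (the option values below are purely illustrative):

    runners:
        emr:
            cmdenv:
                TZ: America/Los_Angeles
            jobconf:
                mapred.reduce.tasks: 1
            upload_files:
            - data/lookup_table.sqlite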

LocalMRJobRunner takes no additional options, but:

  • bootstrap_mrjob is False by default
  • cmdenv values use the local system's path separator instead of always using : (so ; on Windows, no change elsewhere)
  • python_bin defaults to the current Python interpreter

In addition, it ignores hadoop_input_format, hadoop_output_format, hadoop_streaming_jar, and jobconf.
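The path-separator behavior described above corresponds to the standard library's os.pathsep; join_cmdenv_paths is a hypothetical helper for illustration, not part of mrjob:

```python
import os

# Path-style cmdenv values are joined with the platform separator
# (';' on Windows, ':' elsewhere) rather than always using ':'.
def join_cmdenv_paths(paths):
    return os.pathsep.join(paths)

print(join_cmdenv_paths(["/usr/bin", "/usr/local/bin"]))
```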

InlineMRJobRunner works like LocalMRJobRunner, except that it also ignores bootstrap_mrjob, cmdenv, python_bin, setup_cmds, setup_scripts, steps_python_bin, upload_archives, and upload_files.

Additional options for EMRJobRunner

Config                         Command line                       Default                                         Type
additional_emr_info            --additional-emr-info              None                                            special
ami_version                    --ami-version                      'latest'                                        string
aws_access_key_id              --aws-access-key-id                None                                            string
aws_availability_zone          --aws-availability-zone            AWS default                                     string
aws_region                     --aws-region                       infer from scratch bucket region                string
aws_secret_access_key          --aws-secret-access-key            None                                            string
bootstrap_actions              --bootstrap-actions                []                                              string list
bootstrap_cmds                 --bootstrap-cmd                    []                                              string list
bootstrap_files                --bootstrap-file                   []                                              path list
bootstrap_python_packages      --bootstrap-python-package         []                                              path list
bootstrap_scripts              --bootstrap-script                 []                                              path list
check_emr_status_every         --check-emr-status-every           30                                              integer
ec2_core_instance_bid_price    --ec2-core-instance-bid-price      None                                            string
ec2_core_instance_type         --ec2-core-instance-type           'm1.small'                                      string
ec2_instance_type              --ec2-instance-type                'm1.small'                                      string
ec2_key_pair                   --ec2-key-pair                     None                                            string
ec2_key_pair_file              --ec2-key-pair-file                None                                            path
ec2_master_instance_bid_price  --ec2-master-instance-bid-price    None                                            string
ec2_master_instance_type       --ec2-master-instance-type         'm1.small'                                      string
ec2_slave_instance_type        --ec2-slave-instance-type          value of ec2_core_instance_type                 string
ec2_task_instance_bid_price    --ec2-task-instance-bid-price      None                                            string
ec2_task_instance_type         --ec2-task-instance-type           value of ec2_core_instance_type                 string
emr_endpoint                   --emr-endpoint                     infer from aws_region                           string
emr_job_flow_id                --emr-job-flow-id                  automatically create a job flow and use it      string
emr_job_flow_pool_name         --emr-job-flow-pool-name           'default'                                       string
enable_emr_debugging           --enable-emr-debugging             False                                           boolean
hadoop_streaming_jar_on_emr    --hadoop-streaming-jar-on-emr      AWS default                                     string
hadoop_version                 --hadoop-version                   inferred from environment/AWS                   string
max_hours_idle                 --max-hours-idle                   None                                            float
mins_to_end_of_hour            --mins-to-end-of-hour              5.0                                             float
num_ec2_core_instances         --num-ec2-core-instances           0                                               integer
num_ec2_instances              --num-ec2-instances                1                                               integer
num_ec2_task_instances         --num-ec2-task-instances           0                                               integer
pool_emr_job_flows             --pool-emr-job-flows               False                                           boolean
pool_wait_minutes              --pool-wait-minutes                0                                               integer
s3_endpoint                    --s3-endpoint                      infer from aws_region                           string
s3_log_uri                     --s3-log-uri                       append logs to s3_scratch_uri                   string
s3_scratch_uri                 --s3-scratch-uri                   tmp/mrjob in the first bucket belonging to you  string
s3_sync_wait_time              --s3-sync-wait-time                5.0                                             float
ssh_bin                        --ssh-bin                          'ssh'                                           command
ssh_bind_ports                 --ssh-bind-ports                   [40001, ..., 40840]                             special
ssh_tunnel_is_open             --ssh-tunnel-is-open               False                                           boolean
ssh_tunnel_to_job_tracker      --ssh-tunnel-to-job-tracker        False                                           boolean
visible_to_all_users           --visible-to-all-users             False                                           boolean
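As with the other runners, these options can go under the emr key in your config file. For example (region, instance count, and pooling values below are purely illustrative):

    runners:
        emr:
            aws_region: us-west-1
            ec2_instance_type: m1.small
            num_ec2_instances: 4
            pool_emr_job_flows: true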