mrjob v0.4.2 documentation

Hadoop-related options

Since mrjob is geared toward Hadoop, it has a few Hadoop-specific options. However, because of differences between the runners, the Hadoop platform, and Elastic MapReduce, not every option is available in every runner.

Options available to local, hadoop, and emr runners

These options are both used by Hadoop and simulated by the local runner to some degree.

hadoop_version (--hadoop-version) : string

Default: inferred from environment/AWS

Set the version of Hadoop to use on EMR or simulate in the local runner. If using EMR, consider setting ami_version instead; only AMI version 1.0 supports multiple versions of Hadoop anyway. If ami_version is not set, we’ll default to Hadoop 0.20 for backwards compatibility with mrjob v0.3.0.

jobconf (--jobconf) : dict

Default: {}

-jobconf args to pass to hadoop streaming. This should be a map from property name to value. Equivalent to passing ['-jobconf', 'KEY1=VALUE1', '-jobconf', 'KEY2=VALUE2', ...] to hadoop_extra_args.
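The equivalence described above can be sketched in a few lines of Python. This is only an illustration of the mapping, not mrjob's own code: the helper name jobconf_to_extra_args and the property names in the example are assumptions made for this sketch.

```python
def jobconf_to_extra_args(jobconf):
    """Expand a jobconf dict into the equivalent hadoop streaming arguments.

    Hypothetical helper for illustration; mrjob does this for you.
    """
    args = []
    # Keys are sorted only so that this example has a deterministic order.
    for key in sorted(jobconf):
        args.extend(['-jobconf', '%s=%s' % (key, jobconf[key])])
    return args


# Example property names are placeholders; use whatever your cluster expects.
example_args = jobconf_to_extra_args({
    'mapred.reduce.tasks': 4,
    'mapred.map.tasks': 8,
})
print(example_args)
```

In practice you would set jobconf directly and let mrjob build the arguments; the sketch only shows what the resulting argument list looks like.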

Options available to hadoop and emr runners

hadoop_extra_args (--hadoop-extra-arg) : string list

Default: []

Extra arguments to pass to hadoop streaming. This option is called extra_args when passed as a keyword argument to MRJobRunner.

hadoop_streaming_jar (--hadoop-streaming-jar) : string

Default: automatic

Path to a custom hadoop streaming jar. This is optional for the hadoop runner, which will search for it in HADOOP_HOME. The emr runner can take a path either local to your machine or on S3.

label (--label) : string

Default: script’s module name, or no_script

Description of this job, used as part of its name.

owner (--owner) : string

Default: getpass.getuser(), or no_user if that fails

Who is running this job. Used solely to set the job name.
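The defaults for label and owner can be approximated as follows. This is a minimal sketch under stated assumptions: the function names default_label and default_owner are hypothetical, and deriving the module name by stripping the file extension from the script path is a guess at how "script's module name" is computed.

```python
import getpass
import os


def default_owner():
    """Return the current user name, or 'no_user' if it cannot be determined."""
    try:
        return getpass.getuser()
    except Exception:
        # getpass.getuser() can fail when no user name is available
        # (for example, in some sandboxed environments).
        return 'no_user'


def default_label(script_path=None):
    """Return the script's module name, or 'no_script' if there is no script."""
    if script_path:
        # Assumption: the module name is the file name without its extension.
        return os.path.basename(script_path).split('.')[0]
    return 'no_script'


print(default_label('jobs/mr_word_count.py'))
print(default_label())
```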

partitioner (--partitioner) : string

Default: None

Optional name of a Hadoop partitioner class, e.g. 'org.apache.hadoop.mapred.lib.HashPartitioner'. Hadoop Streaming will use this to determine how mapper output should be sorted and distributed to reducers. You can also set this option on your job class with the PARTITIONER attribute or the partitioner() method.
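To make the role of a partitioner concrete, here is a rough Python sketch of hash partitioning: each mapper output key is hashed, and the hash modulo the number of reducers picks the reducer. The function name hash_partition is hypothetical, and zlib.crc32 is only a stand-in hash chosen because it is stable across runs; it is not the hash function Hadoop uses.

```python
import zlib


def hash_partition(key, num_reducers):
    """Pick a reducer index for a key (illustrative, not Hadoop's own hash)."""
    # Mask to a non-negative value, then take the remainder.
    return (zlib.crc32(key.encode('utf-8')) & 0x7fffffff) % num_reducers


mapper_output = [('apple', 1), ('banana', 1), ('apple', 1), ('cherry', 1)]

# Group key/value pairs by the reducer they would be sent to.
buckets = {}
for key, value in mapper_output:
    buckets.setdefault(hash_partition(key, 3), []).append((key, value))

for reducer_index in sorted(buckets):
    print(reducer_index, buckets[reducer_index])
```

The property that matters is that every occurrence of a key lands on the same reducer, which is what lets a reducer see all values for that key.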

Options available to hadoop runner only

check_input_paths (--check-input-paths, --no-check-input-paths) : boolean

Default: True

Option to skip the input path check. With --no-check-input-paths, input paths to the runner will be passed straight through, without checking if they exist.

New in version 0.4.1.

hadoop_bin (--hadoop-bin) : command

Default: hadoop_home plus bin/hadoop

Name/path of your hadoop program (may include arguments).

hadoop_home (--hadoop-home) : path

Default: HADOOP_HOME

Alternative to setting the HADOOP_HOME environment variable.
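The way hadoop_bin and hadoop_home relate can be sketched as below. This is an illustration of the documented defaults, not mrjob's implementation: resolve_hadoop_bin is a hypothetical helper, and splitting the command with shlex is an assumption about how a command that "may include arguments" would be turned into an argument list.

```python
import os
import shlex


def resolve_hadoop_bin(hadoop_bin=None, hadoop_home=None, environ=None):
    """Return the hadoop command as a list of arguments (hypothetical helper)."""
    environ = os.environ if environ is None else environ

    if hadoop_bin:
        # An explicit hadoop_bin wins, and may carry its own arguments.
        return shlex.split(hadoop_bin)

    # Otherwise fall back to hadoop_home, then the HADOOP_HOME variable.
    home = hadoop_home or environ.get('HADOOP_HOME')
    if not home:
        raise ValueError('set hadoop_bin, hadoop_home, or HADOOP_HOME')
    return [os.path.join(home, 'bin', 'hadoop')]


# The paths below are placeholders, not real installation locations.
print(resolve_hadoop_bin(hadoop_home='/opt/hadoop'))
print(resolve_hadoop_bin(hadoop_bin='hadoop --config /etc/hadoop/conf'))
```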

hdfs_scratch_dir (--hdfs-scratch-dir) : path

Default: tmp/

Scratch space on HDFS. This path does not need to be fully qualified with hdfs:// URIs because it’s understood that it has to be on HDFS.

© 2009-2013 Yelp and Contributors. Created using Sphinx 1.2b1 with the better theme.