Since mrjob is geared toward Hadoop, there are a few Hadoop-specific options. However, because the runners differ in how they interact with the Hadoop platform and Elastic MapReduce, not every option is available on every runner.
These options are both used by Hadoop and simulated by the local runner to some degree.
Default: inferred from environment/AWS
Set the version of Hadoop to use on EMR or simulate in the local runner. If using EMR, consider setting ami_version instead; only AMI version 1.0 supports multiple versions of Hadoop. If ami_version is not set, we’ll default to Hadoop 0.20 for backwards compatibility with mrjob v0.3.0.
Default: {}
-jobconf arguments to pass to Hadoop Streaming. This should be a map from property name to value. Equivalent to passing ['-jobconf', 'KEY1=VALUE1', '-jobconf', 'KEY2=VALUE2', ...] to hadoop_extra_args.
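For example, jobconf properties can be set in mrjob.conf (a sketch; the property names and values shown here are illustrative, not defaults):

```yaml
runners:
  hadoop:
    jobconf:
      mapred.map.tasks: 20
      mapred.reduce.tasks: 10
```

This has the same effect as passing ['-jobconf', 'mapred.map.tasks=20', '-jobconf', 'mapred.reduce.tasks=10'] through hadoop_extra_args.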
Default: []
Extra arguments to pass to Hadoop Streaming. This option is called extra_args when passed as a keyword argument to MRJobRunner.
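As a sketch, extra streaming arguments can be listed in mrjob.conf (the '-verbose' switch here is just an example of a Hadoop Streaming flag):

```yaml
runners:
  hadoop:
    hadoop_extra_args:
      - '-verbose'
```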
Default: automatic
Path to a custom hadoop streaming jar. This is optional for the hadoop runner, which will search for it in HADOOP_HOME. The emr runner can take a path either local to your machine or on S3.
Default: script’s module name, or no_script
Description of this job, used as part of its name.
Default: getpass.getuser(), or no_user if that fails
Who is running this job. Used solely to set the job name.
Default: None
Optional name of a Hadoop partitioner class, e.g. 'org.apache.hadoop.mapred.lib.HashPartitioner'. Hadoop Streaming will use this to determine how mapper output should be sorted and distributed to reducers. You can also set this option on your job class with the PARTITIONER attribute or the partitioner() method.
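For instance, the partitioner can be set in mrjob.conf (a sketch, using the HashPartitioner class mentioned above):

```yaml
runners:
  hadoop:
    partitioner: org.apache.hadoop.mapred.lib.HashPartitioner
```

Setting the PARTITIONER attribute on your job class is an alternative to configuring this per-runner.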
Default: True
Option to skip the input path check. With --no-check-input-paths, input paths to the runner will be passed straight through, without checking if they exist.
New in version 0.4.1.
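A sketch of disabling the check from mrjob.conf rather than the command line (assuming the option name check_input_paths, matching the --no-check-input-paths switch):

```yaml
runners:
  hadoop:
    check_input_paths: false
```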
Default: hadoop_home plus bin/hadoop
Name/path of your hadoop program (may include arguments).
Default: HADOOP_HOME
Alternative to setting the HADOOP_HOME environment variable.
Default: tmp/
Scratch space on HDFS. This path does not need to be fully qualified with an hdfs:// URI, since it is understood to be on HDFS.
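A sketch of overriding the scratch directory in mrjob.conf (assuming the option name hdfs_scratch_dir; the path shown is illustrative):

```yaml
runners:
  hadoop:
    hdfs_scratch_dir: tmp/mrjob
```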