We look for mrjob.conf in these locations:
You can specify one or more configuration files with the --conf-path flag. See Options available to all runners for more information.
The point of mrjob.conf is to let you set up things you want every job to have access to so that you don’t have to think about it. For example:
mrjob.conf is just a YAML- or JSON-encoded dictionary containing default values to pass in to the constructors of the various runner classes. Here’s a minimal mrjob.conf:
runners:
    emr:
        cmdenv:
            TZ: America/Los_Angeles
Now whenever you run mr_your_script.py -r emr, EMRJobRunner will automatically set TZ to America/Los_Angeles in your job’s environment when it runs on EMR.
If you don’t have the yaml module installed, you can use JSON in your mrjob.conf instead (JSON is a subset of YAML, so it’ll still work once you install yaml). Here’s how you’d render the above example in JSON:
{
    "runners": {
        "emr": {
            "cmdenv": {
                "TZ": "America/Los_Angeles"
            }
        }
    }
}
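As a rough illustration of why either encoding works, the sketch below parses the JSON rendering with yaml when it is installed and falls back to json otherwise. The helper name `parse_conf_text` is hypothetical and is not part of mrjob.

```python
import json

CONF_TEXT = """
{
    "runners": {
        "emr": {
            "cmdenv": {
                "TZ": "America/Los_Angeles"
            }
        }
    }
}
"""


def parse_conf_text(text):
    """Parse config text with yaml if it is installed, else with json.

    Hypothetical helper for illustration only.
    """
    try:
        import yaml  # optional dependency (PyYAML)
    except ImportError:
        return json.loads(text)
    return yaml.safe_load(text)


conf = parse_conf_text(CONF_TEXT)
emr_cmdenv = conf["runners"]["emr"]["cmdenv"]
print(emr_cmdenv["TZ"])  # prints America/Los_Angeles
```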
Options specified on the command line take precedence over mrjob.conf. Usually this means simply overriding the option in mrjob.conf. However, because cmdenv contains environment variables, its values are combined per variable instead. For example, if your mrjob.conf contained:
runners:
    emr:
        cmdenv:
            PATH: /usr/local/bin
            TZ: America/Los_Angeles
and you ran your job as:
mr_your_script.py -r emr --cmdenv TZ=Europe/Paris --cmdenv PATH=/usr/sbin
We’d automatically handle the PATH variables and your job’s environment would be:
{'TZ': 'Europe/Paris', 'PATH': '/usr/sbin:/usr/local/bin'}
What’s going on here is that cmdenv is associated with combine_envs(). Each option is associated with a combiner function that combines its values in an appropriate way.
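The merge above can be approximated in a few lines. This is a minimal sketch of the described behavior, not mrjob's own combine_envs(); the name `combine_envs_sketch` is hypothetical, and it always joins with a colon.

```python
def combine_envs_sketch(*envs):
    """Merge environment dicts; later dicts have higher priority.

    Keys ending in PATH are joined with ':' (later value first);
    every other key is simply overwritten by the later value.
    Illustrative only, not mrjob's implementation.
    """
    result = {}
    for env in envs:
        for key, value in (env or {}).items():
            if key.endswith("PATH") and key in result:
                result[key] = value + ":" + result[key]
            else:
                result[key] = value
    return result


conf_env = {"PATH": "/usr/local/bin", "TZ": "America/Los_Angeles"}
cli_env = {"TZ": "Europe/Paris", "PATH": "/usr/sbin"}
merged = combine_envs_sketch(conf_env, cli_env)
# merged == {'PATH': '/usr/sbin:/usr/local/bin', 'TZ': 'Europe/Paris'}
```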
Combiner functions can also do useful things like expanding environment variables and globs in paths. For example, you could set:
runners:
    local:
        upload_files: &upload_files
            - $DATA_DIR/*.db
    hadoop:
        upload_files: *upload_files
    emr:
        upload_files: *upload_files
and every time you ran a job, every .db file in $DATA_DIR would automatically be uploaded into your job’s current working directory.
Also, if you specified additional files to upload with --file, those files would be uploaded in addition to the .db files, rather than instead of them.
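To make the expansion concrete, here is a small standard-library sketch of a path-list combiner that expands environment variables and globs, then concatenates the lists. The function name `expand_path_list` is hypothetical, and keeping an unmatched pattern as-is is an assumption rather than documented mrjob behavior.

```python
import glob
import os
import tempfile


def expand_path_list(*path_lists):
    """Concatenate path lists, expanding ~, $VARS and globs in each entry."""
    results = []
    for paths in path_lists:
        for path in paths or []:
            expanded = os.path.expanduser(os.path.expandvars(path))
            matches = sorted(glob.glob(expanded))
            # Assumption: keep the expanded path when no file matches
            results.extend(matches or [expanded])
    return results


# Demo: a temporary directory stands in for $DATA_DIR
with tempfile.TemporaryDirectory() as data_dir:
    for name in ("a.db", "b.db", "notes.txt"):
        open(os.path.join(data_dir, name), "w").close()
    os.environ["DATA_DIR"] = data_dir
    # First list plays the role of mrjob.conf, second the extra --file
    uploads = expand_path_list(["$DATA_DIR/*.db"], ["extra.txt"])
    names = [os.path.basename(p) for p in uploads]
```

Here `names` contains the two .db files followed by extra.txt, mirroring how extra files add to the configured list instead of replacing it.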
See Configuration quick reference for the entire dizzying array of configurable options.
The same option may be specified multiple times and be one of several data types. For example, the AWS region may be specified in mrjob.conf, in the arguments to EMRJobRunner, and on the command line. These are the rules used to determine what value to use at runtime.
A value specified “later” is one specified at a higher priority. For example, a value in mrjob.conf is specified “earlier” than a value passed on the command line.
When there are multiple values, they are “combined with” a combiner function. The combiner function for each data type is listed in its description.
When an option that takes a single value (rather than a list or dictionary) is specified more than once, the last non-None value is used.
The values of these options are specified as lists. When specified more than once, the lists are concatenated together.
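The two rules so far, last non-None value and list concatenation, can be sketched as follows. The function names are hypothetical stand-ins rather than mrjob's combiners, and arguments are passed from lowest to highest priority.

```python
def last_non_none(*values):
    """Return the last value that is not None, or None if all are None."""
    for value in reversed(values):
        if value is not None:
            return value
    return None


def concat_lists(*lists):
    """Concatenate lists in priority order, skipping None entries."""
    result = []
    for seq in lists:
        if seq is not None:
            result.extend(seq)
    return result


# Order: mrjob.conf, runner constructor argument, command line
region = last_non_none("us-west-1", None, "eu-west-1")  # 'eu-west-1'
files = concat_lists(["a.db"], None, ["b.db"])  # ['a.db', 'b.db']
```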
The values of these options are specified as dictionaries. When specified more than once, each has custom behavior described below.
Values specified later override values specified earlier, except for keys ending in PATH, whose values are concatenated and separated by a colon (:) rather than overwritten. The later value comes first.
For example, this config:
runners: {emr: {cmdenv: {PATH: "/usr/bin"}}}
when run with this command:
python my_job.py -r emr --cmdenv PATH=/usr/local/bin
will result in the following value of PATH in cmdenv:
/usr/local/bin:/usr/bin
The one exception to this behavior is in the local runner, which uses the local system separator (on Windows ;, on everything else still :) instead of always using :.
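The separator choice can be pictured with the sketch below. Both helper names are hypothetical, and it assumes that only the runner selected with the alias local uses the host's separator.

```python
import os


def path_separator(runner_alias):
    """Pick the separator used to join PATH-like values.

    Assumption: only the runner named 'local' uses the host's separator.
    """
    return os.pathsep if runner_alias == "local" else ":"


def join_path_values(earlier, later, runner_alias):
    """Join two PATH-like values with the later (higher priority) one first."""
    return later + path_separator(runner_alias) + earlier


emr_path = join_path_values("/usr/bin", "/usr/local/bin", "emr")
local_path = join_path_values("/usr/bin", "/usr/local/bin", "local")
# emr_path is always '/usr/local/bin:/usr/bin';
# local_path uses ';' on Windows and ':' elsewhere.
```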
If you have several standard configurations, you may want to have several config files “inherit” from a base config file. For example, you may have one set of AWS credentials, but two code bases and default instance sizes. To accomplish this, use the include option:
~/.mrjob.very-large.conf:
include: ~/.mrjob.base.conf
runners:
    emr:
        num_ec2_core_instances: 20
        ec2_core_instance_type: m1.xlarge
~/.mrjob.very-small.conf:
include: $HOME/.mrjob.base.conf
runners:
    emr:
        num_ec2_core_instances: 2
        ec2_core_instance_type: m1.small
~/.mrjob.base.conf:
runners:
    emr:
        aws_access_key_id: HADOOPHADOOPBOBADOOP
        aws_region: us-west-1
        aws_secret_access_key: MEMIMOMADOOPBANANAFANAFOFADOOPHADOOP
Options that are lists, commands, dictionaries, etc. combine the same way they do between the config files and the command line (with combiner functions).
You can use $ENVIRONMENT_VARIABLES and ~/file_in_your_home_dir inside include.
You can inherit from multiple config files by passing include a list instead of a string. Files later in the list take precedence over files earlier in the list. To continue the above examples, this config:
~/.mrjob.everything.conf:
include:
    - ~/.mrjob.very-small.conf
    - ~/.mrjob.very-large.conf
will be equivalent to this one:
~/.mrjob.everything-2.conf:
runners:
    emr:
        aws_access_key_id: HADOOPHADOOPBOBADOOP
        aws_region: us-west-1
        aws_secret_access_key: MEMIMOMADOOPBANANAFANAFOFADOOPHADOOP
        num_ec2_core_instances: 20
        ec2_core_instance_type: m1.xlarge
In this case, ~/.mrjob.very-large.conf has taken precedence over ~/.mrjob.very-small.conf.
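The precedence rule for include lists can be sketched with in-memory dictionaries standing in for the config files. Everything here is illustrative: `merge_opts` and `resolve_includes` are hypothetical names, the files are keyed by short labels instead of paths, credentials are left out, and the merge simply overwrites values instead of applying per-option combiners. It also assumes the including file overrides the files it includes and does no cycle detection.

```python
def merge_opts(base, override):
    """Merge two option dicts; values from `override` win.

    Simplification: nested dicts are merged recursively and everything
    else is overwritten, which ignores list and PATH combining.
    """
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_opts(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_includes(name, files):
    """Return the config `name` with all of its includes merged in."""
    conf = dict(files[name])
    includes = conf.pop("include", [])
    if isinstance(includes, str):
        includes = [includes]
    result = {}
    for included in includes:  # later entries take precedence
        result = merge_opts(result, resolve_includes(included, files))
    # Assumption: the including file wins over the files it includes
    return merge_opts(result, conf)


FILES = {
    "base": {"runners": {"emr": {"aws_region": "us-west-1"}}},
    "very-small": {
        "include": "base",
        "runners": {"emr": {"num_ec2_core_instances": 2,
                            "ec2_core_instance_type": "m1.small"}},
    },
    "very-large": {
        "include": "base",
        "runners": {"emr": {"num_ec2_core_instances": 20,
                            "ec2_core_instance_type": "m1.xlarge"}},
    },
    "everything": {"include": ["very-small", "very-large"]},
}

everything = resolve_includes("everything", FILES)
```

With these inputs, `everything["runners"]["emr"]` ends up with 20 core instances of type m1.xlarge, because very-large appears later in the include list.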