Many jobs have significant external dependencies, both libraries and other source code.
Combining shell syntax with Hadoop’s DistributedCache notation, mrjob’s setup option gives you a powerful, dynamic alternative to pre-installing your dependencies on every node of your Hadoop cluster.
All our mrjob.conf examples below are for the hadoop runner, but these work equally well with the emr runner. Also, if you are using EMR, take a look at the EMR Bootstrapping Cookbook.
First you need to make a tarball of your source tree. Make sure that the root of your source tree is at the root of the tarball’s file listing (e.g. the module foo.bar appears as foo/bar.py and not your-src-dir/foo/bar.py).
For reference, here is a command line that will put an entire source directory into a tarball:
tar -C your-src-code -f your-src-code.tar.gz -z -c .
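To double-check the layout, list the tarball’s contents; entries should look like ./foo/bar.py rather than ./your-src-dir/foo/bar.py:
tar -tzf your-src-code.tar.gz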
Then, run your job with:
--setup 'export PYTHONPATH=$PYTHONPATH:your-src-code.tar.gz#/'
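For example, assuming a job script named mr_your_job.py and an input file input.txt (both hypothetical names), a full invocation might look like:
python mr_your_job.py -r hadoop \
    --setup 'export PYTHONPATH=$PYTHONPATH:your-src-code.tar.gz#/' \
    input.txt > output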
If every job you run is going to want to use your-src-code.tar.gz, you can do this in your mrjob.conf:
runners:
  hadoop:
    setup:
    - export PYTHONPATH=$PYTHONPATH:your-src-code.tar.gz#/
If your source tree needs to be built before it can be used (for example, by running make), you can chain setup commands:
--setup 'cd your-src-code.tar.gz#/' --setup 'make'
or, in mrjob.conf:
runners:
  hadoop:
    setup:
    - cd your-src-code.tar.gz#/
    - make
If Hadoop runs multiple tasks on the same node, your source dir will be shared between them. This is not a problem; mrjob automatically adds locking around setup commands to ensure that multiple copies of your setup script don’t run simultaneously.
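The effect is much like serializing the build step yourself with flock; here is a minimal sketch of the idea (illustrative only, not mrjob’s actual wrapper, and the lock file path is made up):
# run the build under an exclusive lock so concurrent tasks on the same node don't race
flock /tmp/my-job-setup.lock -c 'make'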
Best practice for one or a few files is to use file options, which upload the file and pass its working-directory path to your job; see add_file_option().
You can also use the upload_files option to upload file(s) into a task’s working directory (or upload_archives for tarballs and other archives).
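For example, to ship a (hypothetical) wordlist to every task’s working directory via mrjob.conf:
runners:
  hadoop:
    upload_files:
    - data/wordlist.txt  # hypothetical local path; appears as wordlist.txt in the working dir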
If you’re a setup purist, you can also do something like this:
--setup 'true your-file#desired-name'
since true ignores its arguments and does nothing; referencing your-file#desired-name is enough to make mrjob upload the file into the task’s working directory under the desired name.
What if you can’t install the libraries you need on your Hadoop cluster?
You could do something like this in your mrjob.conf:
runners:
  hadoop:
    setup:
    - virtualenv venv
    - . venv/bin/activate
    - pip install mr3po simplejson
However, the locking that protected make now becomes a liability: each task on a node builds its own virtualenv, yet each must wait for the previous task’s setup to finish before it can start its own.
The solution is to share the virtualenv between all tasks on the same machine, something like this:
runners:
  hadoop:
    setup:
    - VENV=/tmp/$mapreduce_job_id
    - if [ ! -e $VENV ]; then virtualenv $VENV; fi
    - . $VENV/bin/activate
    - pip install mr3po simplejson
With older versions of Hadoop (0.20 and earlier, and the 1.x series), you’d want to use $mapred_job_id.
If you have a lot of dependencies, best practice is to make a pip requirements file and use the -r switch:
--setup 'pip install -r path/to/requirements.txt#'
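or, equivalently, in mrjob.conf:
runners:
  hadoop:
    setup:
    - pip install -r path/to/requirements.txt#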
Note that pip can also install from tarballs (which is useful for custom-built packages):
--setup 'pip install $MY_PYTHON_PKGS/*.tar.gz#'