Bootstrapping allows you to configure EMR machines to your needs.
First you need to install pip:
--bootstrap 'sudo apt-get install -y python-pip'
Then install the packages you want:
--bootstrap 'sudo pip install boto mr3po'
Or, equivalently, in mrjob.conf:
runners:
  emr:
    bootstrap:
    - sudo apt-get install -y python-pip
    - sudo pip install boto mr3po
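The correspondence between the two forms can be sketched in a few lines of Python. This is only an illustration: bootstrap_args is a hypothetical helper, not part of mrjob, and it assumes each entry under bootstrap: maps to one --bootstrap switch, in the same order, as in the examples above.

```python
import shlex


def bootstrap_args(commands):
    """Turn a list of bootstrap commands into --bootstrap switches.

    Hypothetical helper (not part of mrjob): emits one switch per
    command, preserving the order of the list in mrjob.conf.
    """
    args = []
    for command in commands:
        args.extend(["--bootstrap", command])
    return args


commands = [
    "sudo apt-get install -y python-pip",
    "sudo pip install boto mr3po",
]

# shlex.join quotes each command so the shell passes it as one argument.
print(shlex.join(bootstrap_args(commands)))
```

Keeping the commands in one list makes it easy to see that the order on the command line and the order in the config file are the same.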
mrjob relies on simplejson for rapid encoding and decoding of data.
To use the latest (fastest) version, do:
--bootstrap 'sudo pip install --upgrade simplejson'
If you have a lot of dependencies, best practice is to make a pip requirements file and use the -r switch:
--bootstrap 'sudo pip install -r path/to/requirements.txt#'
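A requirements file is plain text with one requirement per line. The sketch below shows one way to generate its contents and the matching bootstrap command. Both function names are hypothetical, the package list is just the one used earlier in this section, and the trailing # is copied from the example above rather than added by pip.

```python
def requirements_text(packages):
    """Return requirements-file contents: one requirement per line."""
    return "".join(package + "\n" for package in packages)


def requirements_bootstrap(path):
    """Build the bootstrap command that installs from a requirements file.

    The trailing '#' mirrors the form used in the example above.
    """
    return "sudo pip install -r " + path + "#"


# Illustrative package list; substitute your own dependencies.
print(requirements_text(["boto", "mr3po", "simplejson"]), end="")
print(requirements_bootstrap("path/to/requirements.txt"))
```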
Note that pip can also install from tarballs (which is useful for custom-built packages):
--bootstrap 'sudo pip install $MY_PYTHON_PKGS/*.tar.gz#'
Just as we used apt-get to install pip above, you can use it to install any package from the Debian archive. For example, to install Python 3:
--bootstrap 'sudo apt-get install -y python3'
If you have particular .deb files you want to install, do:
--bootstrap 'sudo dpkg -i path/to/packages/*.deb#'
To upgrade Python on EMR, you will probably have to build it from source (Debian packages tend to lag the current versions of software, and EMR AMIs tend to lag the current version of Debian).
First, download the latest version of the Python source from python.org.
Then add this to your mrjob.conf:
runners:
  emr:
    bootstrap:
    - tar xfz path/to/Python-x.y.z.tgz#
    - cd Python-x.y.z
    - ./configure && make && sudo make install
bootstrap_mrjob runs last, so mrjob will get bootstrapped into your newly upgraded version of Python. If you use other bootstrap commands to install/upgrade Python libraries, you should also run them after upgrading Python.
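The ordering rule can be sketched as follows. This is a minimal illustration, not mrjob code: ordered_bootstrap and render_conf are hypothetical helpers, the renderer does no YAML quoting (it assumes each command is safe as a plain YAML scalar), and the simplejson upgrade stands in for whatever library installs you use.

```python
def ordered_bootstrap(python_build, library_installs):
    """Place the Python build steps before any library installs."""
    return list(python_build) + list(library_installs)


def render_conf(commands):
    """Render commands as an mrjob.conf bootstrap section.

    No YAML escaping is done; commands are assumed to be plain-scalar safe.
    """
    lines = ["runners:", "  emr:", "    bootstrap:"]
    lines += ["    - " + command for command in commands]
    return "\n".join(lines)


python_build = [
    "tar xfz path/to/Python-x.y.z.tgz#",
    "cd Python-x.y.z",
    "./configure && make && sudo make install",
]
library_installs = ["sudo pip install --upgrade simplejson"]

print(render_conf(ordered_bootstrap(python_build, library_installs)))
```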
You can use bootstrap and setup together.
Generally, you want to use bootstrap for things that are part of your general production environment, and setup for things that are specific to your particular job. This makes things work as expected if you are pooling job flows.