For a complete list of changes, see CHANGES.txt
JarSteps, once experimental, are now fully integrated into multi-step jobs, and work with both the Hadoop and EMR runners. You can now use powerful Java libraries such as Mahout in your MRJobs. For more information, see Jar steps.
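For example, a job can now mix a jar step with an ordinary streaming step. The sketch below is only illustrative: the jar name and path are hypothetical, and JarStep's exact argument names have varied between mrjob versions, so check the Jar steps documentation for your release.

from mrjob.job import MRJob
from mrjob.step import JarStep

class MRJarThenStream(MRJob):

    def steps(self):
        return [
            # run a library jar first (name and path are made up)
            JarStep('run-analysis', 'lib/my-analysis.jar'),
            # then feed its output into a normal streaming step
            self.mr(mapper=self.mapper_get_words,
                    reducer=self.reducer_count_words),
        ]

    def mapper_get_words(self, _, line):
        for word in line.split():
            yield word, 1

    def reducer_count_words(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    MRJarThenStream.run()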
Many options for setting up your task's environment (--python-archive, --setup-cmd and --setup-script) have been replaced by a single, more powerful --setup option. See the Job Environment Setup Cookbook for examples.
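For example, one --setup switch can upload an archive of your source code and put it on your job's PYTHONPATH. The archive name below is made up, and the trailing #/ is the archive-interpolation syntax described in the cookbook, so treat this as a sketch:

python mr_your_job.py -r emr input.txt \
    --setup 'export PYTHONPATH=$PYTHONPATH:your_lib.tar.gz#/'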
Similarly, many options for bootstrapping nodes on EMR (--bootstrap-cmd, --bootstrap-file, --bootstrap-python-package and --bootstrap-script) have been replaced by a single --bootstrap option. See the EMR Bootstrapping Cookbook.
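As a rough sketch, --bootstrap takes arbitrary shell commands (plus file and archive interpolations), so an mrjob.conf like the following installs a package on every node as it comes up; the package name here is just an example:

runners:
  emr:
    bootstrap:
    - sudo apt-get install -y python-dev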
This release also contains many bugfixes, including problems with boto 2.10.0+, bz2 decompression, and Python 2.5.
The SORT_VALUES option enables secondary sort, ensuring that your reducer(s) receive values in sorted order. This allows you to do things with reducers that would otherwise require holding all of the values in memory at once.
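A minimal sketch (the tab-separated input format is made up): with SORT_VALUES set, a reducer can rely on receiving each user's events in timestamp order instead of buffering and sorting them itself.

from mrjob.job import MRJob

class MREventsByUser(MRJob):

    SORT_VALUES = True  # enable secondary sort on the encoded values

    def mapper(self, _, line):
        user, timestamp, event = line.split('\t')
        # leading with the timestamp means values reach the reducer
        # already sorted by time for each user
        yield user, (timestamp, event)

    def reducer(self, user, timestamped_events):
        # no need to hold everything in memory and sort it here
        for timestamp, event in timestamped_events:
            yield user, [timestamp, event]

if __name__ == '__main__':
    MREventsByUser.run()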
The max_hours_idle option allows you to spin up EMR job flows that will terminate themselves after being idle for a certain amount of time, in a way that optimizes EMR/EC2’s full-hour billing model.
For development (not production), we now recommend always using job flow pooling, with max_hours_idle enabled. Update your mrjob.conf like this:
runners:
  emr:
    max_hours_idle: 0.25
    pool_emr_job_flows: true
Warning
If you enable pooling without max_hours_idle (or cronning terminate_idle_job_flows), pooled job flows will stay active forever, costing you money!
You can now use --no-check-input-paths with the Hadoop runner to allow jobs to run even if hadoop fs -ls can’t see their input files (see check_input_paths).
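For example (a minimal mrjob.conf sketch), you can turn the check off for the Hadoop runner like this:

runners:
  hadoop:
    check_input_paths: false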
Two bits of straggling deprecated functionality were removed:
This version also contains numerous bugfixes and natural extensions of existing functionality; many more things will now Just Work (see CHANGES.txt).
The default runner is now inline instead of local. This change will speed up debugging for many users. Use local if you need to simulate more features of Hadoop.
The EMR tools can now be accessed more easily via the mrjob command. Learn more here.
Job steps are much richer now:
If you press Ctrl+C at the command line, your job will be cleanly terminated, provided you give it a few moments to shut down. If you're running on EMR, this should prevent most accidental runaway jobs. More info
mrjob v0.4 requires boto 2.2.
We removed all deprecated functionality from v0.2:
We love contributions, so we wrote some guidelines to help you help us. See you on Github!
The pool_wait_minutes (--pool-wait-minutes) option lets your job wait a specified number of minutes for a suitable job flow to become available before starting a new one. Reference: Configuration quick reference
The JOB and JOB_FLOW cleanup options tell mrjob to clean up the job and/or the job flow on failure (including Ctrl+C). See CLEANUP_CHOICES for more information.
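For example, assuming you want a failed (or interrupted) run to shut its job flow down, a minimal mrjob.conf might look like this:

runners:
  emr:
    # terminate the job flow if the job fails or is interrupted
    cleanup_on_failure: JOB_FLOW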
You can now include one config file from another.
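For instance (the paths below are hypothetical), a personal config can pull in a shared base config and then override a few options:

# ~/.mrjob.conf
include: /etc/mrjob-base.conf
runners:
  emr:
    ec2_key_pair: my-key-pair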
The EMR instance type/number options have changed to support spot instances:
There is also a new ami_version option to change the AMI your job flow uses for its nodes.
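As a sketch, the new options let you bid on spot instances for the core (slave) nodes and pin the AMI. The instance type, bid price, and AMI version below are only examples, and the authoritative option names are in the reference mentioned below:

runners:
  emr:
    num_ec2_core_instances: 4
    ec2_core_instance_type: m1.large
    # bid for spot instances rather than paying the on-demand rate
    ec2_core_instance_bid_price: '0.08'
    ami_version: '2.0'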
For more information, see mrjob.emr.EMRJobRunner.__init__().
The new report_long_jobs tool alerts on jobs that have been running for longer than a configurable number of hours.
Support for Combiners
You can now use combiners in your job. Like mapper() and reducer(), you can redefine combiner() in your subclass to add a single combiner step to run after your mapper but before your reducer. (MRWordFreqCount does this to improve performance.) combiner_init() and combiner_final() are similar to their mapper and reducer equivalents.
You can also add combiners to custom steps by adding keyword arguments to your call to steps().
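A word-frequency count along the lines of MRWordFreqCount shows the idea; the combiner pre-sums counts on each mapper so less data crosses the network. A minimal sketch:

from mrjob.job import MRJob

class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        for word in line.split():
            yield word.lower(), 1

    def combiner(self, word, counts):
        # runs after the mapper on each task, pre-summing local counts
        yield word, sum(counts)

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordFreqCount.run()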
More info: One-step jobs, Multi-step jobs
*_init(), *_final() for mappers, reducers, combiners
Mappers, reducers, and combiners have *_init() and *_final() methods that are run before and after the input is run through the main function (e.g. mapper_init() and mapper_final()).
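For example (a sketch of in-mapper buffering), mapper_init() can set up per-task state and mapper_final() can flush it once all input has been read:

from mrjob.job import MRJob

class MRBufferedSum(MRJob):

    def mapper_init(self):
        # runs once per mapper task, before any input
        self.totals = {}

    def mapper(self, _, line):
        key, value = line.split('\t')
        self.totals[key] = self.totals.get(key, 0) + int(value)

    def mapper_final(self):
        # runs once per mapper task, after all input; emit the buffered sums
        for key, total in self.totals.items():
            yield key, total

    def reducer(self, key, totals):
        yield key, sum(totals)

if __name__ == '__main__':
    MRBufferedSum.run()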
More info: One-step jobs, Multi-step jobs
Custom Option Parsers
It is now possible to define your own option types and actions using a custom OptionParser subclass.
More info: Custom option types
Job Flow Pooling
EMR jobs can pull job flows out of a “pool” of similarly configured job flows. This can make it easier to use a small set of job flows across multiple automated jobs, save time and money while debugging, and generally make your life simpler.
More info: Pooling Job Flows
SSH Log Fetching
mrjob attempts to fetch counters and error logs for EMR jobs via SSH before trying to use S3. This method is faster, more reliable, and works with persistent job flows.
More info: Configuring SSH credentials
New EMR Tool: fetch_logs
If you want to fetch the counters or error logs for a job after the fact, you can use the new fetch_logs tool.
More info: mrjob.tools.emr.fetch_logs
New EMR Tool: mrboss
If you want to run a command on all nodes and inspect the output, perhaps to see what processes are running, you can use the new mrboss tool.
More info: mrjob.tools.emr.mrboss
Configuration
The search path order for mrjob.conf has changed. The new order is:
- The location specified by MRJOB_CONF
- ~/.mrjob.conf
- ~/.mrjob (deprecated)
- mrjob.conf in any directory in PYTHONPATH (deprecated)
- /etc/mrjob.conf
If your mrjob.conf path is deprecated, use this table to fix it:
Old Location               New Location
~/.mrjob                   ~/.mrjob.conf
somewhere in PYTHONPATH    specify in MRJOB_CONF

More info: mrjob.conf
Defining Jobs (MRJob)
Mapper, combiner, and reducer methods no longer need to contain a yield statement if they emit no data.
The --hadoop-*-format switches are deprecated. Instead, set your job’s Hadoop formats with HADOOP_INPUT_FORMAT/HADOOP_OUTPUT_FORMAT or hadoop_input_format()/hadoop_output_format(). Hadoop formats can no longer be set from mrjob.conf.
In addition to --jobconf, you can now set jobconf values with the JOBCONF attribute or the jobconf() method. To read jobconf values back, use mrjob.compat.jobconf_from_env(), which ensures that the correct name is used depending on which version of Hadoop is active.
You can now set the Hadoop partitioner class with --partitioner, the PARTITIONER attribute, or the partitioner() method.
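Putting those together, here is a sketch of a job that sets jobconf values and a partitioner as class attributes and reads a jobconf variable back at run time. The Hadoop class name and property values are only examples:

from mrjob.compat import jobconf_from_env
from mrjob.job import MRJob

class MRConfiguredJob(MRJob):

    PARTITIONER = 'org.apache.hadoop.mapred.lib.HashPartitioner'
    JOBCONF = {'mapreduce.job.reduces': '4'}

    def mapper(self, _, line):
        # translated to the right property name for the running Hadoop version
        input_file = jobconf_from_env('mapreduce.map.input.file')
        yield input_file, 1

    def reducer(self, path, ones):
        yield path, sum(ones)

if __name__ == '__main__':
    MRConfiguredJob.run()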
More info: Hadoop configuration
Protocols
Protocols can now be anything with read() and write() methods. Unlike in previous versions of mrjob, these can be instance methods rather than class methods; you should use instance methods when defining your own protocols.
The --*protocol switches and DEFAULT_*PROTOCOL are deprecated. Instead, use the *_PROTOCOL attributes or redefine the *_protocol() methods.
Protocols now cache the decoded values of keys. Informal testing shows up to 30% speed improvements.
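A minimal custom protocol sketch: any object with read() and write() will do, and here both are instance methods. Hooking it up via a *_PROTOCOL attribute is shown in a comment; the details are covered under Protocols.

import json

class JSONPairProtocol(object):

    def read(self, line):
        # decode a tab-separated pair of JSON values into (key, value)
        raw_key, raw_value = line.split('\t', 1)
        return json.loads(raw_key), json.loads(raw_value)

    def write(self, key, value):
        # encode (key, value) back into a tab-separated line
        return '%s\t%s' % (json.dumps(key), json.dumps(value))

# on your MRJob subclass, e.g.:
#     OUTPUT_PROTOCOL = JSONPairProtocol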
More info: Protocols
Running Jobs
All Modes
All runners are Hadoop-version aware and use the correct jobconf and combiner invocation styles. This change should decrease the number of warnings in Hadoop 0.20 environments.
All *_bin configuration options (hadoop_bin, python_bin, and ssh_bin) take lists instead of strings so you can add arguments (like ['python', '-v']). More info: Configuration quick reference
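For example (the hadoop path below is hypothetical), an mrjob.conf might now read:

runners:
  hadoop:
    python_bin: ['python', '-v']
    hadoop_bin: ['/usr/local/hadoop/bin/hadoop']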
Cleanup options have been split into cleanup and cleanup_on_failure. There are more granular values for both of these options.
Most limitations have been lifted from passthrough options, including the former inability to use custom types and actions. More info: Custom option types
The job_name_prefix option is gone (was deprecated).
All URIs are passed through to Hadoop where possible. This should relax some requirements about what URIs you can use.
Steps with no mapper use cat instead of going through a no-op mapper.
Compressed files can be streamed with the cat() method.
EMR Mode
The default Hadoop version on EMR is now 0.20 (was 0.18).
The ec2_instance_type option only sets the instance type for slave nodes when there are multiple EC2 instances. This is because the master node can usually remain small without affecting the performance of the job.
Inline Mode
Inline mode now supports the cmdenv option.
Local Mode
Local mode now runs 2 mappers and 2 reducers in parallel by default.
There is preliminary support for simulating some jobconf variables. The current list of supported variables is:
- mapreduce.job.cache.archives
- mapreduce.job.cache.files
- mapreduce.job.cache.local.archives
- mapreduce.job.cache.local.files
- mapreduce.job.id
- mapreduce.job.local.dir
- mapreduce.map.input.file
- mapreduce.map.input.length
- mapreduce.map.input.start
- mapreduce.task.attempt.id
- mapreduce.task.id
- mapreduce.task.ismap
- mapreduce.task.output.dir
- mapreduce.task.partition
Other Stuff
boto 2.0+ is now required.
The Debian packaging has been removed from the repository.