mrjob is the easiest route to writing Python programs that run on Hadoop. With mrjob, you can test your code locally without installing Hadoop, or run it on a cluster of your choice.
Additionally, mrjob has extensive integration with Amazon Elastic MapReduce. Once you’re set up, it’s as easy to run your job in the cloud as it is to run it on your laptop.
Here are a number of features of mrjob that make writing MapReduce jobs easier:
If you don’t want to be a Hadoop expert but need the computing power of MapReduce, mrjob might be just the thing for you.
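To make the MapReduce model concrete, here is a minimal word-count sketch in plain Python. It does not use mrjob at all: `mapper` and `reducer` are written as generator functions in roughly the shape a job's map and reduce steps take, and `run_local` is a hypothetical stand-in for the grouping work that Hadoop (or a local runner) would normally do between the two steps. Treat it as an illustration of the idea, not as mrjob's API.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[\w']+")


def mapper(_, line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def reducer(word, counts):
    """Reduce step: add up all the counts collected for one word."""
    yield word, sum(counts)


def run_local(lines):
    """Hypothetical in-memory driver (an assumption, not part of mrjob).

    It plays the role of the framework: feed lines to the mapper, group
    the emitted values by key, then hand each group to the reducer.
    """
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(None, line):
            grouped[key].append(value)

    results = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, grouped[key]):
            results[out_key] = out_value
    return results


if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog"]
    for word, count in run_local(sample).items():
        print(word, count)
```

The point of the split is that the mapper and reducer only describe what happens to one line or one key. Everything else (reading input, grouping, distributing work) is left to whatever runs the job, which is why the same two functions can in principle be exercised on a laptop or on a cluster.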
The comparison here is with any other library that helps Hadoop and Python interface with each other, such as Dumbo or Pydoop.
The flip side of mrjob’s ease of use is that it doesn’t give you the same level of access to Hadoop APIs that Dumbo and Pydoop do; a great deal has been simplified away. That hasn’t stopped several companies, including Yelp, from using it for day-to-day heavy lifting. For common (and many uncommon) cases, the abstractions help rather than hinder.
Other libraries can be faster if you use typedbytes. There have been several attempts at integrating typedbytes with mrjob, and support may land eventually, but it doesn’t exist yet.