Configuring your AWS credentials allows mrjob to run your jobs on Elastic MapReduce and use S3.
Now you can either set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, or set aws_access_key_id and aws_secret_access_key in your mrjob.conf file like this:
runners:
  emr:
    aws_access_key_id: <your key ID>
    aws_secret_access_key: <your secret>
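If you go the environment-variable route instead, a quick sketch (placeholder values, bash-style shell) looks like this:

> export AWS_ACCESS_KEY_ID=<your key ID>
> export AWS_SECRET_ACCESS_KEY=<your secret>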
Configuring your SSH credentials lets mrjob open an SSH tunnel to your jobs’ master nodes to view live progress, see the job tracker in your browser, and fetch error logs quickly.
* Make sure the Region dropdown (upper left) matches the region you want to run jobs in (usually "US East").
* Click on Key Pairs (lower left).
* Click on Create Key Pair (center).
* Name your key pair EMR (any name will work, but that's what we're using in this example).
* Save EMR.pem wherever you like (~/.ssh is a good place).
* Run chmod og-rwx /path/to/EMR.pem so that ssh will be happy.
* Add the following entries to your mrjob.conf:
runners:
  emr:
    ec2_key_pair: EMR
    ec2_key_pair_file: /path/to/EMR.pem # ~/ and $ENV_VARS allowed here
    ssh_tunnel_to_job_tracker: true
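Putting it all together, a minimal mrjob.conf for EMR that combines the credentials from above with the SSH settings might look like this (a sketch using the example values from this page):

runners:
  emr:
    aws_access_key_id: <your key ID>
    aws_secret_access_key: <your secret>
    ec2_key_pair: EMR
    ec2_key_pair_file: /path/to/EMR.pem # ~/ and $ENV_VARS allowed here
    ssh_tunnel_to_job_tracker: true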
Running a job on EMR is just like running it locally or on your own Hadoop cluster, except that you pass -r emr on the command line.
The output of this command should be identical to the output shown in Fundamentals, but it will take much longer:
> python word_count.py -r emr README.txt
"chars" 3654
"lines" 123
"words" 417
If you’d rather have your output go to somewhere deterministic on S3, which you probably do, use --output-dir:
> python word_count.py -r emr README.rst \
> --output-dir=s3://my-bucket/wc_out/
Also, since you know where your output is on S3, you probably don't want it streamed back to your local machine. For that, use --no-output:
> python word_count.py -r emr README.rst \
> --output-dir=s3://my-bucket/wc_out/ \
> --no-output
There are many other ins and outs of effectively using mrjob with EMR. See Advanced EMR usage for some of the ins, but the outs are left as an exercise for the reader. This is a strictly no-outs body of documentation!
When you create a job flow on EMR, you’ll have the option of specifying a number and type of EC2 instances, which are basically virtual machines. Each instance type has different memory, CPU, I/O and network characteristics, and costs a different amount of money. See Instance Types and Pricing for details.
Instances perform one of three roles:

* Master: schedules tasks (there is always exactly one master).
* Core: runs tasks and hosts HDFS.
* Task: runs tasks, but does not host HDFS.
There’s a special case where your job flow only has a single master instance, in which case the master instance schedules tasks, runs them, and hosts HDFS.
By default, mrjob runs a single m1.small, which is a cheap but not very powerful instance type. This can be quite adequate for testing your code on a small subset of your data, but otherwise gives little advantage over running the job locally. To get more performance out of your job, you can add more instances, use more powerful instances, or both.
There are several things to consider when tuning your instance settings. The basic way to control the type and number of instances is with the ec2_instance_type and num_ec2_instances options, on the command line like this:
--ec2-instance-type c1.medium --num-ec2-instances 5
or in mrjob.conf, like this:
runners:
  emr:
    ec2_instance_type: c1.medium
    num_ec2_instances: 5
In most cases, your master instance type doesn't need to be larger than m1.small to schedule tasks, so ec2_instance_type only applies to instances that actually run tasks. (In this example, you'd get one m1.small master instance and four c1.medium core instances.) You will need a larger master instance if you have a very large number of input files; in this case, use the ec2_master_instance_type option.
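For example, here's a sketch of a mrjob.conf that keeps the c1.medium task-running instances from above but gives the master more headroom (the m1.large value is purely illustrative):

runners:
  emr:
    ec2_instance_type: c1.medium
    num_ec2_instances: 5
    ec2_master_instance_type: m1.large # bigger master for lots of input files; illustrative value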
If you want to run task instances, you must instead specify the number of core and task instances directly, using the num_ec2_core_instances and num_ec2_task_instances options. There are also ec2_core_instance_type and ec2_task_instance_type options if you want to set the instance type for each role separately.
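Here's a sketch of what that might look like in mrjob.conf (the counts and instance types are placeholders, not recommendations):

runners:
  emr:
    ec2_core_instance_type: c1.medium
    num_ec2_core_instances: 4
    ec2_task_instance_type: m1.small
    num_ec2_task_instances: 2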