EMR job flow management tools

Each tool can be invoked two ways: from the mrjob command, or by running the Python module directly. Both ways are given in each example.
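The correspondence between the two invocation styles can be sketched in a few lines of Python. This is only an illustration: the table below is copied from the usage lines in this document, and TOOL_MODULES and module_argv are hypothetical names, not part of mrjob.

```python
# Subcommand-to-module pairs copied from the usage lines in this document.
# TOOL_MODULES and module_argv are illustrative names, not mrjob APIs.
TOOL_MODULES = {
    "audit-emr-usage": "mrjob.tools.emr.audit_usage",
    "create-job-flow": "mrjob.tools.emr.create_job_flow",
    "fetch-logs": "mrjob.tools.emr.fetch_logs",
    "report-long-jobs": "mrjob.tools.emr.report_long_jobs",
    "s3-tmpwatch": "mrjob.tools.emr.s3_tmpwatch",
    "terminate-job-flow": "mrjob.tools.emr.terminate_job_flow",
    "terminate-idle-job-flows": "mrjob.tools.emr.terminate_idle_job_flows",
}


def module_argv(subcommand, args=()):
    """Return the `python -m ...` argument list equivalent to
    `mrjob SUBCOMMAND ARGS`."""
    return ["python", "-m", TOOL_MODULES[subcommand]] + list(args)


print(" ".join(module_argv("s3-tmpwatch", ["-q", "30d", "s3://your-bucket/tmp/"])))
```

Note that the subcommand is usually the module name with hyphens in place of underscores; audit-emr-usage is the exception.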

audit_usage

Audit EMR usage, sorted by job flow name and user. By default the report goes back as far as EMR supports; use --max-days-ago to limit it.

Usage:

mrjob audit-emr-usage > report
python -m mrjob.tools.emr.audit_usage > report

Options:

-h, --help            show this help message and exit
-v, --verbose         print more messages to stderr
-q, --quiet           Don't log status messages; just print the report.
-c CONF_PATH, --conf-path=CONF_PATH
                      Path to alternate mrjob.conf file to read from
--no-conf             Don't load mrjob.conf even if it's available
--max-days-ago=MAX_DAYS_AGO
                      Max number of days ago to look at jobs. By default, we
                      go back as far as EMR supports (currently about 2
                      months)

create_job_flow

Create a persistent EMR job flow, using bootstrap scripts and other configs from mrjob.conf, and print the job flow ID to stdout.

Usage:

mrjob create-job-flow
python -m mrjob.tools.emr.create_job_flow

WARNING: do not run this without having mrjob.tools.emr.terminate_idle_job_flows in your crontab; job flows left idle can quickly become expensive!

fetch_logs

List, display, and parse Hadoop logs associated with EMR job flows. Useful for debugging failed jobs for which mrjob did not display a useful error message or for inspecting jobs whose output has been lost.

Usage:

mrjob fetch-logs -[l|L|a|A|--counters] [-s STEP_NUM] JOB_FLOW_ID
python -m mrjob.tools.emr.fetch_logs -[l|L|a|A|--counters] [-s STEP_NUM] JOB_FLOW_ID

Options:

-a, --cat             Cat log files MRJob finds relevant
-A, --cat-all         Cat all log files to JOB_FLOW_ID/
-c CONF_PATH, --conf-path=CONF_PATH
                      Path to alternate mrjob.conf file to read from
--counters            Show counters from the job flow
--ec2-key-pair-file=EC2_KEY_PAIR_FILE
                      Path to file containing SSH key for EMR
-h, --help            show this help message and exit
-l, --list            List log files MRJob finds relevant
-L, --list-all        List all log files
--no-conf             Don't load mrjob.conf even if it's available
-q, --quiet           Don't print anything to stderr
-s STEP_NUM, --step-num=STEP_NUM
                      Limit results to a single step. To be used with --list
                      and --cat.
-v, --verbose         print more messages to stderr

mrboss

Run a command on the master and all slaves, and store the resulting stdout and stderr in OUTPUT_DIR.

Usage:

python -m mrjob.tools.emr.mrboss JOB_FLOW_ID [options] "command string"

Options:

-c CONF_PATH, --conf-path=CONF_PATH
                      Path to alternate mrjob.conf file to read from
--ec2-key-pair-file=EC2_KEY_PAIR_FILE
                      Path to file containing SSH key for EMR
-h, --help            show this help message and exit
--no-conf             Don't load mrjob.conf even if it's available
-o, --output-dir      Specify an output directory (default: JOB_FLOW_ID)
-q, --quiet           Don't print anything to stderr
-v, --verbose         print more messages to stderr

report_long_jobs

Report jobs running for more than a certain number of hours (by default, 24.0). This can help catch buggy jobs and Hadoop/EMR operational issues.

Suggested usage: run this as a daily cron job with the -q option:

0 0 * * * mrjob report-long-jobs -q
0 0 * * * python -m mrjob.tools.emr.report_long_jobs -q

Options:

-h, --help            show this help message and exit
-v, --verbose         print more messages to stderr
-q, --quiet           Don't log status messages; just print the report.
-c CONF_PATH, --conf-path=CONF_PATH
                      Path to alternate mrjob.conf file to read from
--no-conf             Don't load mrjob.conf even if it's available
--min-hours=MIN_HOURS
                      Minimum number of hours a job can run before we report
                      it. Default: 24.0
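The --min-hours threshold amounts to a simple filter on running time. The following is a minimal sketch of that idea, not the tool's actual implementation; find_long_jobs and the (name, start_time) input shape are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def find_long_jobs(jobs, now, min_hours=24.0):
    """Yield (name, hours_running) for jobs running longer than min_hours.

    `jobs` is an iterable of (name, start_time) pairs. This data shape is
    hypothetical; the real tool reads job flow information from EMR.
    """
    threshold = timedelta(hours=min_hours)
    for name, started in jobs:
        running = now - started
        if running > threshold:
            yield name, running.total_seconds() / 3600.0


now = datetime(2013, 1, 3, 0, 0)
jobs = [
    ("nightly_report", datetime(2013, 1, 1, 0, 0)),  # running for 48 hours
    ("quick_count", datetime(2013, 1, 2, 12, 0)),    # running for 12 hours
]
for name, hours in find_long_jobs(jobs, now):
    print("%s has been running for %.1f hours" % (name, hours))
```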

s3_tmpwatch

Delete all files under a given URI that are older than a specified time. The time argument sets the threshold: a file is removed if it has not been accessed for that long. The argument is a number with an optional single-character suffix giving the unit: m for minutes, h for hours, d for days. Without a suffix, time is in hours.
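The time format described above can be illustrated with a small parser. This is a sketch of the stated rule only; parse_time_arg is a hypothetical helper rather than mrjob's own code, and accepting decimal values is an assumption.

```python
import re
from datetime import timedelta

# Hypothetical helper, not mrjob's own parser. It mirrors the rule described
# above: m = minutes, h = hours, d = days, no suffix = hours.
_UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def parse_time_arg(text):
    """Convert a string such as '30d', '90m', or '12' into a timedelta."""
    # Assumption: decimal amounts such as '1.5h' are allowed.
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([mhd]?)", text.strip())
    if match is None:
        raise ValueError("unrecognized time argument: %r" % text)
    amount, suffix = match.groups()
    unit = _UNITS.get(suffix, "hours")  # no suffix means hours
    return timedelta(**{unit: float(amount)})


print(parse_time_arg("30d"))
```

With a cutoff computed as the current time minus this timedelta, any file last touched before the cutoff would be a candidate for deletion.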

Suggested usage: run this as a cron job with the -q option:

0 0 * * * mrjob s3-tmpwatch -q 30d s3://your-bucket/tmp/
0 0 * * * python -m mrjob.tools.emr.s3_tmpwatch -q 30d s3://your-bucket/tmp/

Usage:

mrjob s3-tmpwatch [options] <time-untouched> <URIs>
python -m mrjob.tools.emr.s3_tmpwatch [options] <time-untouched> <URIs>

Options:

-h, --help            show this help message and exit
-v, --verbose         Print more messages
-q, --quiet           Report only fatal errors.
-c CONF_PATH, --conf-path=CONF_PATH
                      Path to alternate mrjob.conf file to read from
--no-conf             Don't load mrjob.conf even if it's available
-t, --test            Don't actually delete any files; just log that we
                      would

terminate_job_flow

Terminate an existing EMR job flow.

Usage:

mrjob terminate-job-flow [options] j-JOBFLOWID
python -m mrjob.tools.emr.terminate_job_flow [options] j-JOBFLOWID

Options:

-h, --help            show this help message and exit
-v, --verbose         print more messages to stderr
-q, --quiet           don't print anything
-c CONF_PATH, --conf-path=CONF_PATH
                      Path to alternate mrjob.conf file to read from
--no-conf             Don't load mrjob.conf even if it's available

terminate_idle_job_flows

Terminate idle EMR job flows that meet the criteria passed in on the command line (or, by default, job flows that have been idle for one hour).

Suggested usage: run this as a cron job with the -q option:

*/30 * * * * mrjob terminate-idle-job-flows -q
*/30 * * * * python -m mrjob.tools.emr.terminate_idle_job_flows -q

Options:

-h, --help            show this help message and exit
-v, --verbose         Print more messages
-q, --quiet           Don't print anything to stderr; just print IDs of
                      terminated job flows and idle time information to
                      stdout. Use twice to print absolutely nothing.
-c CONF_PATH, --conf-path=CONF_PATH
                      Path to alternate mrjob.conf file to read from
--no-conf             Don't load mrjob.conf even if it's available
--max-hours-idle=MAX_HOURS_IDLE
                      Max number of hours a job flow can go without
                      bootstrapping, running a step, or having a new step
                      created. This will fire even if there are pending
                      steps which EMR has failed to start. Make sure you set
                      this higher than the amount of time your jobs can take
                      to start instances and bootstrap.
--mins-to-end-of-hour=MINS_TO_END_OF_HOUR
                      Terminate job flows that are within this many minutes
                      of the end of a full hour since the job started
                      running AND have no pending steps.
--unpooled-only       Only terminate un-pooled job flows
--pooled-only         Only terminate pooled job flows
--pool-name=POOL_NAME
                      Only terminate job flows in the given named pool.
--dry-run             Don't actually kill idle jobs; just log that we would
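The two timing criteria, --max-hours-idle and --mins-to-end-of-hour, can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: the function names and parameters are hypothetical, the job flow is assumed to be idle already, and combining the two criteria with "or" is an assumption.

```python
from datetime import datetime


def mins_to_end_of_hour(started, now):
    """Minutes remaining until the next full hour since `started`."""
    elapsed = (now - started).total_seconds()
    return (3600 - elapsed % 3600) / 60.0


def should_terminate(started, now, idle_hours, has_pending_steps,
                     max_hours_idle=None, mins_to_end_of_hour_limit=None):
    """Decide whether an idle job flow meets either criterion.

    Hypothetical helper: treating the criteria as alternatives (logical
    "or") is an assumption made for this sketch.
    """
    # --max-hours-idle fires even if there are pending steps.
    if max_hours_idle is not None and idle_hours > max_hours_idle:
        return True
    # --mins-to-end-of-hour applies only when no steps are pending.
    if (mins_to_end_of_hour_limit is not None
            and not has_pending_steps
            and mins_to_end_of_hour(started, now) <= mins_to_end_of_hour_limit):
        return True
    return False


started = datetime(2013, 1, 1, 12, 0)
now = datetime(2013, 1, 1, 12, 50)
print(mins_to_end_of_hour(started, now))
```

In this example the job flow started 50 minutes ago, so 10 minutes remain in its current hour; with a limit of 15 minutes and no pending steps it would qualify for termination.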