Each tool can be invoked in two ways: via the mrjob command, or by running its Python module directly. Both forms are shown in each example.
Audit EMR usage over the past 2 weeks, sorted by job flow name and user.
Usage:
mrjob audit-emr-usage > report
python -m mrjob.tools.emr.audit_usage > report
Options:
-h, --help show this help message and exit
-v, --verbose print more messages to stderr
-q, --quiet Don't log status messages; just print the report.
-c CONF_PATH, --conf-path=CONF_PATH
Path to alternate mrjob.conf file to read from
--no-conf Don't load mrjob.conf even if it's available
--max-days-ago=MAX_DAYS_AGO
Max number of days ago to look at jobs. By default, we
go back as far as EMR supports (currently about 2
months)
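Example: to audit only the past week and write the report to a file (the report file name is just a placeholder):
mrjob audit-emr-usage --max-days-ago=7 > weekly-report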
Create a persistent EMR job flow, using bootstrap scripts and other configs from mrjob.conf, and print the job flow ID to stdout.
Usage:
mrjob create-job-flow
python -m mrjob.tools.emr.create_job_flow
WARNING: do not run this without having mrjob.tools.emr.terminate_idle_job_flows in your crontab; job flows left idle can quickly become expensive!
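Example: capture the job flow ID and reuse the job flow for a run (a sketch; your_mr_job.py and input.txt are placeholders, and --emr-job-flow-id is mrjob's option for running a job on an existing job flow):
JOB_FLOW_ID=$(mrjob create-job-flow)
python your_mr_job.py -r emr --emr-job-flow-id=$JOB_FLOW_ID input.txt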
List, display, and parse Hadoop logs associated with EMR job flows. Useful for debugging failed jobs for which mrjob did not display a useful error message or for inspecting jobs whose output has been lost.
Usage:
mrjob fetch-logs -[l|L|a|A|--counters] [-s STEP_NUM] JOB_FLOW_ID
python -m mrjob.tools.emr.fetch_logs -[l|L|a|A|--counters] [-s STEP_NUM] JOB_FLOW_ID
Options:
-a, --cat Cat log files MRJob finds relevant
-A, --cat-all Cat all log files to JOB_FLOW_ID/
-c CONF_PATH, --conf-path=CONF_PATH
Path to alternate mrjob.conf file to read from
--counters Show counters from the job flow
--ec2-key-pair-file=EC2_KEY_PAIR_FILE
Path to file containing SSH key for EMR
-h, --help show this help message and exit
-l, --list List log files MRJob finds relevant
-L, --list-all List all log files
--no-conf Don't load mrjob.conf even if it's available
-q, --quiet Don't print anything to stderr
-s STEP_NUM, --step-num=STEP_NUM
Limit results to a single step. To be used with --list
and --cat.
-v, --verbose print more messages to stderr
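Example: a typical debugging session for a job that failed on its second step (j-JOBFLOWID is a placeholder for a real job flow ID):
mrjob fetch-logs -l j-JOBFLOWID          # list the logs mrjob considers relevant
mrjob fetch-logs -a -s 2 j-JOBFLOWID     # cat relevant logs for step 2 only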
Run a command on the master and all slaves of an EMR job flow, and store each node's stdout and stderr in OUTPUT_DIR.
Usage:
mrjob boss JOB_FLOW_ID [options] "command string"
python -m mrjob.tools.emr.mrboss JOB_FLOW_ID [options] "command string"
Options:
-c CONF_PATH, --conf-path=CONF_PATH
Path to alternate mrjob.conf file to read from
--ec2-key-pair-file=EC2_KEY_PAIR_FILE
Path to file containing SSH key for EMR
-h, --help show this help message and exit
--no-conf Don't load mrjob.conf even if it's available
-o OUTPUT_DIR, --output-dir=OUTPUT_DIR
Specify an output directory (default: JOB_FLOW_ID)
-q, --quiet Don't print anything to stderr
-v, --verbose print more messages to stderr
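Example: check disk usage on every node, storing results under disk_usage/ (j-JOBFLOWID and the directory name are placeholders):
mrjob boss j-JOBFLOWID -o disk_usage "df -h"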
Report jobs that have been running for more than a certain number of hours (24.0 by default). This can help catch buggy jobs and Hadoop/EMR operational issues.
Suggested usage: run this as a daily cron job with the -q option:
0 0 * * * mrjob report-long-jobs -q
0 0 * * * python -m mrjob.tools.emr.report_long_jobs -q
Options:
-h, --help show this help message and exit
-v, --verbose print more messages to stderr
-q, --quiet Don't log status messages; just print the report.
-c CONF_PATH, --conf-path=CONF_PATH
Path to alternate mrjob.conf file to read from
--no-conf Don't load mrjob.conf even if it's available
--min-hours=MIN_HOURS
Minimum number of hours a job can run before we report
it. Default: 24.0
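You can also run it by hand with a lower threshold; for example, to flag anything that has been running longer than six hours:
mrjob report-long-jobs --min-hours=6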
Delete all files in a given URI that are older than a specified time. The time argument sets the threshold for removal: any file that has not been accessed for at least that long is deleted. It is a number with an optional single-character suffix giving the units: m for minutes, h for hours, d for days. If no suffix is given, the time is in hours.
Suggested usage: run this as a cron job with the -q option:
0 0 * * * mrjob s3-tmpwatch -q 30d s3://your-bucket/tmp/
0 0 * * * python -m mrjob.tools.emr.s3_tmpwatch -q 30d s3://your-bucket/tmp/
Usage:
mrjob s3-tmpwatch [options] <time-untouched> <URIs>
python -m mrjob.tools.emr.s3_tmpwatch [options] <time-untouched> <URIs>
Options:
-h, --help show this help message and exit
-v, --verbose Print more messages
-q, --quiet Report only fatal errors.
-c CONF_PATH, --conf-path=CONF_PATH
Path to alternate mrjob.conf file to read from
--no-conf Don't load mrjob.conf even if it's available
-t, --test Don't actually delete any files; just log that we
would
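Before installing the cron job, it is worth doing a dry run with -t to see what would be deleted (the bucket and prefix are placeholders):
mrjob s3-tmpwatch -t 30d s3://your-bucket/tmp/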
Terminate an existing EMR job flow.
Usage:
mrjob terminate-job-flow [options] j-JOBFLOWID
python -m mrjob.tools.emr.terminate_job_flow [options] j-JOBFLOWID
Options:
-h, --help show this help message and exit
-v, --verbose print more messages to stderr
-q, --quiet don't print anything
-c CONF_PATH, --conf-path=CONF_PATH
Path to alternate mrjob.conf file to read from
--no-conf Don't load mrjob.conf even if it's available
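Example: if you captured the job flow ID when creating a persistent job flow (as in the create-job-flow sketch above), terminate it once your jobs are done:
mrjob terminate-job-flow $JOB_FLOW_ID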
Terminate idle EMR job flows that meet the criteria passed in on the command line (or, by default, job flows that have been idle for one hour).
Suggested usage: run this as a cron job with the -q option:
*/30 * * * * mrjob terminate-idle-job-flows -q
*/30 * * * * python -m mrjob.tools.emr.terminate_idle_job_flows -q
Options:
-h, --help show this help message and exit
-v, --verbose Print more messages
-q, --quiet Don't print anything to stderr; just print IDs of
terminated job flows and idle time information to
stdout. Use twice to print absolutely nothing.
-c CONF_PATH, --conf-path=CONF_PATH
Path to alternate mrjob.conf file to read from
--no-conf Don't load mrjob.conf even if it's available
--max-hours-idle=MAX_HOURS_IDLE
Max number of hours a job flow can go without
bootstrapping, running a step, or having a new step
created. This will fire even if there are pending
steps which EMR has failed to start. Make sure you set
this higher than the amount of time your jobs can take
to start instances and bootstrap.
--mins-to-end-of-hour=MINS_TO_END_OF_HOUR
Terminate job flows that are within this many minutes
of the end of a full hour since the job started
running AND have no pending steps.
--unpooled-only Only terminate un-pooled job flows
--pooled-only Only terminate pooled job flows
--pool-name=POOL_NAME
Only terminate job flows in the given named pool.
--dry-run Don't actually kill idle jobs; just log that we would
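Example: a stricter cron entry that only reaps pooled job flows idle for more than two hours (a variation on the suggested entry above):
*/30 * * * * mrjob terminate-idle-job-flows -q --pooled-only --max-hours-idle=2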