The mrjob command

The mrjob command has two purposes:

  1. To provide easy access to EMR tools
  2. To eventually let you run Hadoop Streaming jobs written in languages other than Python

EMR tools

mrjob audit-emr-usage [options]

Audit EMR usage over the past 2 weeks, sorted by job flow name and user.

Alias for mrjob.tools.emr.audit_usage.

mrjob create-job-flow [options]

Create a persistent EMR job flow, using bootstrap scripts and other configs from mrjob.conf, and print the job flow ID to stdout.

Alias for mrjob.tools.emr.create_job_flow.

mrjob fetch-logs (job flow ID) [options]

List, display, and parse Hadoop logs associated with EMR job flows. Useful for debugging failed jobs for which mrjob did not display a useful error message or for inspecting jobs whose output has been lost.

Alias for mrjob.tools.emr.fetch_logs.

mrjob report-long-jobs [options]

Report jobs running for more than a certain number of hours (by default, 24.0). This can help catch buggy jobs and Hadoop/EMR operational issues.

Alias for mrjob.tools.emr.report_long_jobs.
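The threshold check described above can be sketched in a few lines. This is only an illustration of the idea, not the tool's implementation: the `long_running` function and the list of `(name, start_time)` pairs are hypothetical, and the real tool reads job information from EMR.

```python
from datetime import datetime, timedelta

DEFAULT_MIN_HOURS = 24.0  # default threshold stated in the text above


def long_running(jobs, now, min_hours=DEFAULT_MIN_HOURS):
    """Return (name, hours_running) for jobs running longer than min_hours.

    `jobs` is a hypothetical list of (name, start_time) pairs, used here
    only to illustrate the comparison against the threshold.
    """
    report = []
    for name, started in jobs:
        hours = (now - started).total_seconds() / 3600.0
        if hours > min_hours:
            report.append((name, hours))
    return report


# Made-up example data: one short job and one that has run for 30 hours.
now = datetime(2013, 1, 3, 12, 0)
jobs = [
    ("wordcount", now - timedelta(hours=2)),
    ("stuck_join", now - timedelta(hours=30)),
]
print(long_running(jobs, now))
```

With the default threshold only the 30-hour job is reported; passing a smaller `min_hours` would report both.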

mrjob s3-tmpwatch [options]

Delete all files under a given S3 URI that are older than a specified time.

Alias for mrjob.tools.emr.s3_tmpwatch.

mrjob terminate-idle-job-flows [options]

Terminate idle EMR job flows that meet the criteria passed in on the command line (or, by default, job flows that have been idle for one hour).

Alias for mrjob.tools.emr.terminate_idle_job_flows.

mrjob terminate-job-flow (job flow ID)

Terminate an existing EMR job flow.

Alias for mrjob.tools.emr.terminate_job_flow.
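Each EMR subcommand above is an alias for a module under mrjob.tools.emr. The sketch below collects that mapping from the "Alias for" lines and builds a command line that runs the module directly. It assumes the modules can be run with `python -m`; the `module_argv` helper and the `j-JOBFLOWID` placeholder are hypothetical and not part of mrjob.

```python
import sys

# Subcommand -> module, copied from the "Alias for ..." lines above.
EMR_TOOL_MODULES = {
    "audit-emr-usage": "mrjob.tools.emr.audit_usage",
    "create-job-flow": "mrjob.tools.emr.create_job_flow",
    "fetch-logs": "mrjob.tools.emr.fetch_logs",
    "report-long-jobs": "mrjob.tools.emr.report_long_jobs",
    "s3-tmpwatch": "mrjob.tools.emr.s3_tmpwatch",
    "terminate-idle-job-flows": "mrjob.tools.emr.terminate_idle_job_flows",
    "terminate-job-flow": "mrjob.tools.emr.terminate_job_flow",
}


def module_argv(subcommand, args=()):
    """Build an argv that runs the aliased module with `python -m`.

    Assumption: the tool modules are runnable as scripts this way.
    """
    return [sys.executable, "-m", EMR_TOOL_MODULES[subcommand]] + list(args)


# "j-JOBFLOWID" is a placeholder, not a real job flow ID.
print(module_argv("terminate-job-flow", ["j-JOBFLOWID"]))
```

Note that the mapping is not purely mechanical (audit-emr-usage maps to audit_usage), which is why the sketch uses an explicit table rather than replacing hyphens with underscores.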

Running jobs

mrjob run (path to script or executable) [options]

Run a job. Takes the same options as invoking a Python job directly. See Options available to all runners, Hadoop-related options, and EMR runner options. For a job written in Python, this command is interchangeable with calling python my_job.py [options].
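To make the equivalence concrete, the sketch below builds both command lines for the same job and options. The `job_commands` helper and the `input.txt` file name are hypothetical; `-r emr` is used as an example runner option, and my_job.py is the script name from the text above.

```python
def job_commands(script, options=()):
    """Return the two ways of invoking the same Python job.

    The first goes through the mrjob command; the second calls the
    script directly. Both receive the same options.
    """
    options = list(options)
    via_mrjob = ["mrjob", "run", script] + options
    direct = ["python", script] + options
    return via_mrjob, direct


# "input.txt" is a made-up input path used only for illustration.
via_mrjob, direct = job_commands("my_job.py", ["-r", "emr", "input.txt"])
print(" ".join(via_mrjob))
print(" ".join(direct))
```

The two command lines differ only in their prefix; everything after the script path is passed through unchanged.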
