Abstract base class for all runners
All runners accept a common set of keyword arguments to their constructor.
Run the job, and block until it finishes.
Raise an exception if there are any problems.
Stream raw lines from the job’s output. You can parse these using the read() method of the appropriate HadoopStreamingProtocol class.
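A complete run might look like the sketch below. The method names run() and stream_output() are taken from the descriptions above; MRWordCount and the my_job module are hypothetical stand-ins for your own MRJob subclass, and parse_output_line() is MRJob's convenience wrapper that applies the job's output protocol to a raw line (if your version lacks it, call the protocol's read() method directly, as noted above).

from my_job import MRWordCount  # hypothetical module defining an MRJob subclass

mr_job = MRWordCount(args=['input.txt'])
with mr_job.make_runner() as runner:
    runner.run()  # blocks until the job finishes, raising on failure
    for line in runner.stream_output():
        # apply the job's output protocol to each raw output line
        key, value = mr_job.parse_output_line(line)
        ...  # do something with key and value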
Clean up running jobs, scratch dirs, and logs, subject to the cleanup option passed to the constructor.
If you create your runner in a with block, cleanup() will be called automatically:
with mr_job.make_runner() as runner:
...
# cleanup() called automatically here
Parameters: mode – override the cleanup option passed into the constructor. Should be a list of strings from CLEANUP_CHOICES (see the sketch after the list of options below).
cleanup options:
'ALL': delete local scratch, remote scratch, and logs; stop job flow if on EMR and the job is not done when cleanup is run.
'LOCAL_SCRATCH': delete local scratch only
'LOGS': delete logs only
'NONE': delete nothing
'REMOTE_SCRATCH': delete remote scratch only
'SCRATCH': delete local and remote scratch, but not logs
'JOB': stop job if on EMR and the job is not done when cleanup runs
'IF_SUCCESSFUL' (deprecated): same as ALL. Not supported for cleanup_on_failure.
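For instance, a run that keeps its logs around for debugging but discards scratch space could clean up explicitly as sketched below; the mode values come from the list above.

# keep logs for debugging; remove local and remote scratch space
runner.cleanup(mode=['LOCAL_SCRATCH', 'REMOTE_SCRATCH'])

# equivalent, using the combined 'SCRATCH' choice
runner.cleanup(mode=['SCRATCH'])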
Get counters associated with this run in this form:
[{'group name': {'counter1': 1, 'counter2': 2}},
{'group name': ...}]
The list contains an entry for every step of the current job, ignoring earlier steps in the same job flow.
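A sketch of walking this structure, assuming the accessor is named counters() as in mrjob's runner API; the group and counter names printed in the comment are illustrative only.

for step_num, groups in enumerate(runner.counters()):
    for group, counter_dict in groups.items():
        for counter, amount in counter_dict.items():
            # e.g. step 0, 'Map-Reduce Framework', 'Reduce input records', 42
            ...  # do something with step_num, group, counter, amount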
Return the version number of the Hadoop environment as a string if Hadoop is being used or simulated. Return None if not applicable.
EMRJobRunner infers this from the job flow. HadoopJobRunner gets this from the output of hadoop version. LocalMRJobRunner has an additional hadoop_version option to specify which version it simulates, defaulting to 0.20. InlineMRJobRunner does not simulate Hadoop at all.
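A sketch of branching on the result, assuming the accessor is named get_hadoop_version() as in the mrjob runner API:

version = runner.get_hadoop_version()
if version is None:
    ...  # inline runner: no Hadoop, real or simulated
elif version.startswith('0.18'):
    ...  # adjust behavior for older Hadoop releases
else:
    ...  # 0.20+ behavior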
Filesystem object for the local filesystem. For backwards compatibility, filesystem methods can also be called directly on MRJobRunner until mrjob 0.5, but this forwarding is deprecated.
Some simple filesystem operations that are common across the local filesystem, S3, HDFS, and remote machines via SSH. Different runners provide functionality for different filesystems via their fs attribute. The hadoop and emr runners provide support for multiple protocols using CompositeFilesystem.
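A sketch of using the fs attribute directly after a run. get_output_dir() is assumed to be the runner method that returns the job's output directory; the same calls work against the local filesystem, HDFS, or S3, depending on which runner make_runner() produced.

with mr_job.make_runner() as runner:
    runner.run()
    for uri in runner.fs.ls(runner.get_output_dir()):
        for line in runner.fs.cat(uri):
            ...  # each output line, decompressed if necessary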
cat all files matching path_glob, decompressing if necessary
Get the total size of files matching path_glob
Corresponds roughly to: hadoop fs -dus path_glob
Recursively list all files in the given path.
We don’t return directories for compatibility with S3 (which has no concept of them)
Corresponds roughly to: hadoop fs -lsr path_glob
Generate the md5 sum of the file at path
Create the given dir and its subdirs (if they don’t already exist).
Corresponds roughly to: hadoop fs -mkdir path
Does the given path exist?
Corresponds roughly to: hadoop fs -test -e path_glob
Recursively delete the given file/directory, if it exists
Corresponds roughly to: hadoop fs -rmr path_glob
Make an empty file in the given location. Raises an error if a non-zero length file already exists in that location.
Corresponds to: hadoop fs -touchz path
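A combined sketch of these operations against a runner's fs; runner is any runner obtained via make_runner(), and every path below is invented for illustration.

fs = runner.fs
fs.mkdir('/tmp/mrjob-demo')                   # hadoop fs -mkdir
fs.touchz('/tmp/mrjob-demo/_SUCCESS')         # hadoop fs -touchz
if fs.path_exists('/tmp/mrjob-demo/*'):       # hadoop fs -test -e
    total_bytes = fs.du('/tmp/mrjob-demo/*')  # total size of matching files
    files = list(fs.ls('/tmp/mrjob-demo/*'))  # files only, never directories
    checksum = fs.md5sum('/tmp/mrjob-demo/_SUCCESS')
for line in fs.cat('/tmp/mrjob-demo/*'):      # decompressed line by line
    ...
fs.rm('/tmp/mrjob-demo')                      # hadoop fs -rmr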