mrjob.util - general utility functions

Utility functions for MRJob that have no external dependencies.

mrjob.util.args_for_opt_dest_subset(option_parser, args, dests=None)

For the given OptionParser and list of command line arguments args, yield values in args that correspond to option destinations in the set of strings dests. If dests is None, return args as parsed by OptionParser.

mrjob.util.bash_wrap(cmd_str)

Escape single quotes in a shell command string and wrap it with bash -c '<string>'.

This low-tech replacement works because we control the surrounding string, and a single quote is the only character that needs escaping inside a single-quoted string.
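The quoting trick can be sketched as follows. This is a minimal illustration, not mrjob's actual source; the name bash_wrap_sketch is hypothetical. Each single quote is replaced by '\'' (close the quoted string, emit an escaped quote, reopen the string).

```python
import shlex


def bash_wrap_sketch(cmd_str):
    """Wrap cmd_str in bash -c '...', escaping embedded single quotes."""
    # ' becomes '\'' : end the quoted string, add a literal quote, reopen.
    escaped = cmd_str.replace("'", "'\\''")
    return "bash -c '%s'" % escaped


wrapped = bash_wrap_sketch("echo 'hello world'")
# Parsing the result with POSIX shell rules recovers the original command
# as the single argument following -c.
print(shlex.split(wrapped))
```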

mrjob.util.buffer_iterator_to_line_iterator(iterator)

boto’s file iterator splits by buffer size instead of by newline. This wrapper reassembles those chunks into lines.

Warning

This may append a newline to your last chunk of data. In v0.5.0 it will not, for better compatibility with file objects.
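One way to reassemble arbitrary chunks into lines is sketched below. It is an illustration under assumptions, not mrjob's code: the name lines_from_chunks is hypothetical, and this version never appends a newline to the final piece (the behaviour described for v0.5.0).

```python
def lines_from_chunks(chunks):
    """Yield newline-terminated lines from an iterable of byte chunks."""
    buffered = b''
    for chunk in chunks:
        buffered += chunk
        pieces = buffered.split(b'\n')
        # The last piece has no newline yet; hold it until more data arrives.
        buffered = pieces.pop()
        for piece in pieces:
            yield piece + b'\n'
    if buffered:
        # Trailing data without a final newline is passed through unchanged.
        yield buffered


result = list(lines_from_chunks([b'a\nb', b'c\n', b'd']))
print(result)
```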

mrjob.util.bunzip2_stream(fileobj, bufsize=1024)

Decompress bzip2-compressed data on the fly.

Parameters:
  • fileobj – object supporting read()
  • bufsize – number of bytes to read from fileobj at a time.

Warning

This yields lines for backwards compatibility only; in v0.5.0 it will yield arbitrary chunks of data as part of supporting non-line-based protocols (see Issue #715). If you want lines, wrap this in buffer_iterator_to_line_iterator().

mrjob.util.cmd_line(args)

Build a command line that works in a shell.
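A rough equivalent using only the standard library is shown below. It assumes Python 3's shlex.quote() and is not necessarily how mrjob quotes arguments; cmd_line_sketch is a hypothetical name.

```python
import shlex


def cmd_line_sketch(args):
    """Join args into one string that a POSIX shell splits back into args."""
    return ' '.join(shlex.quote(str(arg)) for arg in args)


args = ['grep', "it's", 'two words', '$HOME']
line = cmd_line_sketch(args)
print(line)
```

Splitting the result with shlex.split() gives back the original argument list, which is a convenient way to check the quoting.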

mrjob.util.expand_path(path)

Resolve ~ (home dir) and environment variables in path.

If path is None, return None.
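The behaviour described above can be approximated with os.path. This is a sketch; the order in which variables and ~ are expanded is an assumption, and both expand_path_sketch and the GR_DEMO_DIR variable are made up for the example.

```python
import os


def expand_path_sketch(path):
    """Expand environment variables and ~ in path; pass None through."""
    if path is None:
        return None
    return os.path.expanduser(os.path.expandvars(path))


os.environ['GR_DEMO_DIR'] = '/tmp/demo'  # hypothetical variable for the demo
print(expand_path_sketch('$GR_DEMO_DIR/input.txt'))
print(expand_path_sketch(None))
```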

mrjob.util.extract_dir_for_tar(archive_path, compression='gz')

Deprecated since version 0.4.0.

Get the name of the directory the tar at archive_path extracts into.

Parameters:
  • archive_path (str) – path to archive file
  • compression (str) – Compression type to use. This can be one of '', 'bz2', or 'gz'.

mrjob.util.file_ext(path)

Return the file extension, including the leading '.'. Multi-part extensions are kept whole:

>>> file_ext('foo.tar.gz')
'.tar.gz'
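A sketch that reproduces the doctest above by taking everything from the first '.' in the file name. Returning an empty string when there is no '.' is an assumption, and file_ext_sketch is a hypothetical name.

```python
import os


def file_ext_sketch(path):
    """Return everything from the first '.' in the file name onward."""
    filename = os.path.basename(path)
    dot_index = filename.find('.')
    if dot_index == -1:
        return ''  # assumption: no extension yields an empty string
    return filename[dot_index:]


print(file_ext_sketch('foo.tar.gz'))
print(file_ext_sketch('/data.d/README'))
```

Only the base name is inspected, so a '.' in a parent directory does not count as an extension.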

mrjob.util.gunzip_stream(fileobj, bufsize=1024)

Decompress gzipped data on the fly.

Parameters:
  • fileobj – object supporting read()
  • bufsize – number of bytes to read from fileobj at a time. The default is the same as in gzip.

Warning

This yields decompressed chunks; it does not split on lines. To get lines, wrap this in buffer_iterator_to_line_iterator().
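Streaming decompression of this kind can be sketched with zlib. This is an illustration, not mrjob's implementation; gunzip_chunks is a hypothetical name, and the sketch handles a single gzip member only.

```python
import gzip
import io
import zlib


def gunzip_chunks(fileobj, bufsize=1024):
    """Yield decompressed chunks, reading bufsize bytes at a time."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    while True:
        data = fileobj.read(bufsize)
        if not data:
            break
        chunk = decompressor.decompress(data)
        if chunk:
            yield chunk
    tail = decompressor.flush()
    if tail:
        yield tail


payload = b'line one\nline two\n' * 500
compressed = io.BytesIO(gzip.compress(payload))
restored = b''.join(gunzip_chunks(compressed, bufsize=64))
print(len(restored))
```

The chunk boundaries depend on bufsize and on the compressed data, which is why the output has to be re-split if lines are needed.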

mrjob.util.hash_object(obj)

Generate a hash (currently md5) of the repr of the object.
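In outline, this amounts to hashing repr(obj). The sketch below assumes a hex digest is returned and UTF-8 encoding of the repr; hash_object_sketch is a hypothetical name.

```python
import hashlib


def hash_object_sketch(obj):
    """Return the md5 hex digest of repr(obj)."""
    return hashlib.md5(repr(obj).encode('utf-8')).hexdigest()


first = hash_object_sketch(['a', 1])
second = hash_object_sketch(['a', 1])
other = hash_object_sketch([1, 'a'])
print(first == second, first == other)
```

Because the hash depends on repr(), objects whose repr includes a memory address or an unstable ordering will not hash consistently across runs.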

mrjob.util.is_ironpython = False

Deprecated since version 0.4.

mrjob.util.log_to_null(name=None)

Set up a null handler for the given logger, to suppress “no handlers could be found” warnings.

mrjob.util.log_to_stream(name=None, stream=None, format=None, level=None, debug=False)

Set up logging.

Parameters:
  • name (str) – name of the logger, or None for the root logger
  • stream (file object) – stream to log to (default is sys.stderr)
  • format (str) – log message format (default is '%(message)s')
  • level – log level to use
  • debug (bool) – quick way of setting the log level: if true, use logging.DEBUG, otherwise use logging.INFO
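How these parameters map onto the standard logging module can be sketched as follows. This is an approximation of the documented behaviour rather than mrjob's source; log_to_stream_sketch and the logger name are hypothetical.

```python
import io
import logging
import sys


def log_to_stream_sketch(name=None, stream=None, format=None,
                         level=None, debug=False):
    """Attach a StreamHandler to the named logger using the given settings."""
    if level is None:
        level = logging.DEBUG if debug else logging.INFO
    if format is None:
        format = '%(message)s'
    if stream is None:
        stream = sys.stderr

    handler = logging.StreamHandler(stream)
    handler.setLevel(level)
    handler.setFormatter(logging.Formatter(format))

    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.addHandler(handler)


buf = io.StringIO()
log_to_stream_sketch(name='gr_demo_stream', stream=buf)
demo_log = logging.getLogger('gr_demo_stream')
demo_log.debug('hidden at the default INFO level')
demo_log.info('visible')
print(buf.getvalue())
```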

mrjob.util.parse_and_save_options(option_parser, args)

DEPRECATED. To be removed in v0.5.

Duplicate behavior of OptionParser, but capture the strings required to reproduce the same values. Ref. optparse.py lines 1414-1548 (python 2.6.5)

mrjob.util.populate_option_groups_with_options(assignments, indexed_options)

Given a dictionary mapping OptionGroup and OptionParser objects to a list of strings representing option dests, populate the objects with options from indexed_options (generated by scrape_options_and_index_by_dest()) in alphabetical order by long option name. This function primarily exists to serve scrape_options_into_new_groups().

Parameters:
  • assignments (dict of the form {my_option_parser: ('verbose', 'help', ...), my_option_group: (...)}) – specification of which parsers/groups should get which options
  • indexed_options (dict generated by util.scrape_options_and_index_by_dest()) – options to use when populating the parsers/groups

mrjob.util.read_file(path, fileobj=None, yields_lines=True, cleanup=None)

Yields lines from a file, possibly decompressing it based on file extension.

Currently we handle compressed files with the extensions .gz and .bz2.

Parameters:
  • path (string) – file path. Need not be a path on the local filesystem (URIs are okay) as long as you specify fileobj too.
  • fileobj – file object to read from. Need not be seekable. If this is omitted, we open(path).
  • yields_lines – Does iterating over fileobj yield lines (like file objects are supposed to)? If not, set this to False (useful for boto.s3.Key)
  • cleanup – Optional callback to call with no arguments when EOF is reached or an exception is thrown.
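A simplified sketch of extension-based decompression with a cleanup callback. It leaves out the fileobj and yields_lines parameters and opens the path itself; read_file_sketch is a hypothetical name, not the library function.

```python
import bz2
import gzip
import os
import tempfile


def read_file_sketch(path, cleanup=None):
    """Yield lines from path, decompressing .gz and .bz2 by extension."""
    if path.endswith('.gz'):
        opener = gzip.open
    elif path.endswith('.bz2'):
        opener = bz2.open
    else:
        opener = open
    try:
        with opener(path, 'rb') as f:
            for line in f:
                yield line
    finally:
        # Runs at EOF, on error, or when the generator is closed early.
        if cleanup:
            cleanup()


tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, 'data.txt.gz')
with gzip.open(gz_path, 'wb') as f:
    f.write(b'alpha\nbeta\n')

events = []
lines = list(read_file_sketch(gz_path, cleanup=lambda: events.append('done')))
print(lines, events)
```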

mrjob.util.read_input(path, stdin=None)

Stream input the way Hadoop would.

  • Resolve globs (foo_*.gz).
  • Decompress .gz and .bz2 files.
  • If path is '-', read from stdin.
  • If path is a directory, recursively read its contents.

You can redefine stdin for ease of testing. stdin can actually be any iterable that yields lines (e.g. a list).
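The dispatch on path can be sketched like this. To stay short, the sketch skips decompression and reads files as raw bytes; read_input_sketch is a hypothetical name, and the sorted traversal order is an assumption.

```python
import glob
import os
import sys
import tempfile


def read_input_sketch(path, stdin=None):
    """Yield lines from stdin, a glob, a directory tree, or a single file."""
    if path == '-':
        yield from (sys.stdin if stdin is None else stdin)
        return

    matches = sorted(glob.glob(path))
    if not matches:
        raise IOError(2, 'No such file or directory: %r' % path)

    for match in matches:
        if os.path.isdir(match):
            # Recurse into directories, visiting entries in sorted order.
            for name in sorted(os.listdir(match)):
                yield from read_input_sketch(os.path.join(match, name))
        else:
            with open(match, 'rb') as f:
                yield from f


# stdin can be any iterable of lines, which makes testing easy.
from_stdin = list(read_input_sketch('-', stdin=['a\n', 'b\n']))

root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, 'sub'))
with open(os.path.join(root, 'a.txt'), 'wb') as f:
    f.write(b'x\n')
with open(os.path.join(root, 'sub', 'b.txt'), 'wb') as f:
    f.write(b'y\n')
from_dir = list(read_input_sketch(root))
print(from_stdin, from_dir)
```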

mrjob.util.safeeval(expr, globals=None, locals=None)

Like eval, but with nearly everything in the environment blanked out, so that it’s difficult to cause mischief.

globals and locals are optional dictionaries mapping names to values for those names (just like in eval()).
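The general idea is to call eval() with the builtins removed. The sketch below is an illustration only, with the hypothetical name safeeval_sketch; as the description says, this makes mischief difficult rather than impossible, so it should not be treated as a sandbox for untrusted input.

```python
def safeeval_sketch(expr, globals=None, locals=None):
    """Evaluate expr with no builtins available, plus any names passed in."""
    # An empty __builtins__ mapping hides open(), __import__(), and so on.
    safe_globals = {'__builtins__': {}}
    if globals:
        safe_globals.update(globals)
    return eval(expr, safe_globals, locals)


literal = safeeval_sketch("{'a': [1, 2.5, None]}")
computed = safeeval_sketch('x + 1', {'x': 41})

try:
    safeeval_sketch("open('/etc/passwd')")
    blocked = False
except NameError:
    blocked = True
print(literal, computed, blocked)
```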

mrjob.util.save_current_environment(*args, **kwds)

Context manager that saves os.environ and loads it back again after execution.

mrjob.util.save_cwd(*args, **kwds)

Context manager that saves the current working directory, and chdir’s back to it after execution.
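Both context managers follow the same save-then-restore pattern, which can be sketched with contextlib. The names save_environ_sketch and save_cwd_sketch and the GR_DEMO_FLAG variable are hypothetical; this is not the library's code.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def save_environ_sketch():
    """Restore os.environ to its original contents on exit."""
    original = os.environ.copy()
    try:
        yield
    finally:
        os.environ.clear()
        os.environ.update(original)


@contextmanager
def save_cwd_sketch():
    """Change back to the original working directory on exit."""
    original_cwd = os.getcwd()
    try:
        yield
    finally:
        os.chdir(original_cwd)


start_dir = os.getcwd()
with save_environ_sketch(), save_cwd_sketch():
    os.environ['GR_DEMO_FLAG'] = '1'
    os.chdir(tempfile.gettempdir())
print(os.getcwd() == start_dir, 'GR_DEMO_FLAG' in os.environ)
```

Putting the restore step in a finally clause means the original state comes back even if the body raises.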

mrjob.util.scrape_options_and_index_by_dest(*parsers_and_groups)

Scrapes optparse options from OptionParser and OptionGroup objects and builds a dictionary of dest_var: [option1, option2, ...]. This function primarily exists to serve scrape_options_into_new_groups().

An example return value: {'verbose': [<verbose_on_option>, <verbose_off_option>], 'files': [<file_append_option>]}

Parameters:
  • parsers_and_groups (OptionParser or OptionGroup) – parsers and groups to scrape option objects from

Returns: dict of the form {dest_var: [option1, option2, ...], ...}

mrjob.util.scrape_options_into_new_groups(source_groups, assignments)

Puts options from the OptionParser and OptionGroup objects in source_groups into the parsers and groups that are the keys of assignments, according to the lists of dests that are the values of assignments.

Parameters:
  • source_groups (list of OptionParser and OptionGroup objects) – parsers/groups to scrape options from
  • assignments (dict with keys that are OptionParser and OptionGroup objects and values that are lists of strings) – map empty parsers/groups to lists of destination names that they should contain options for
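The regrouping can be illustrated with plain optparse objects. This is a sketch of the idea, not the library's implementation: regroup_options_sketch is a hypothetical helper, options without a dest (such as --help) are skipped by assumption, and the alphabetical ordering is omitted.

```python
from optparse import OptionGroup, OptionParser


def regroup_options_sketch(source_groups, assignments):
    """Copy options from source_groups into the keys of assignments."""
    # Index every option by its dest, e.g. {'verbose': [<option>, ...]}.
    by_dest = {}
    for source in source_groups:
        for option in source.option_list:
            if option.dest is not None:
                by_dest.setdefault(option.dest, []).append(option)

    # Hand each target parser/group the options for the dests it asked for.
    for target, dests in assignments.items():
        for dest in dests:
            for option in by_dest.get(dest, []):
                target.add_option(option)


old_parser = OptionParser()
old_parser.add_option('-v', '--verbose', action='store_true', dest='verbose')
old_parser.add_option('--file', action='append', dest='files')

new_parser = OptionParser()
input_group = OptionGroup(new_parser, 'Input options')
new_parser.add_option_group(input_group)

regroup_options_sketch(
    [old_parser], {new_parser: ['verbose'], input_group: ['files']})

opts, leftover = new_parser.parse_args(['-v', '--file', 'a.txt'])
print(opts.verbose, opts.files)
```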

mrjob.util.shlex_split(s)

Wrapper around shlex.split() that converts its input to str on Python versions before 2.7.3, the release in which shlex gained unicode support.

mrjob.util.strip_microseconds(delta)

Return the given datetime.timedelta, without microseconds.

Useful for printing datetime.timedelta objects.
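A timedelta stores days, seconds, and microseconds separately, so dropping the last field is enough. The helper name below is hypothetical.

```python
from datetime import timedelta


def strip_microseconds_sketch(delta):
    """Return a copy of delta with the microseconds field dropped."""
    return timedelta(delta.days, delta.seconds)


elapsed = timedelta(hours=1, seconds=5, microseconds=250000)
print(elapsed)                             # includes the fractional part
print(strip_microseconds_sketch(elapsed))  # whole seconds only
```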

mrjob.util.tar_and_gzip(dir, out_path, filter=None, prefix='')

Tar and gzip the given dir to a tarball at out_path.

If we encounter symlinks, include the actual file, not the symlink.

Parameters:
  • dir (str) – dir to tar up
  • out_path (str) – where to write the tarball to
  • filter – if defined, a function that takes paths (relative to dir) and returns True if we should keep them
  • prefix (str) – subdirectory inside the tarball to put everything into (e.g. 'mrjob')
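The parameters above map fairly directly onto the tarfile module, as in this sketch. It is an approximation with the hypothetical name tar_and_gzip_sketch; dereference=True stores the files that symlinks point to, but os.walk() as used here does not descend into symlinked directories.

```python
import os
import tarfile
import tempfile


def tar_and_gzip_sketch(dir, out_path, filter=None, prefix=''):
    """Write the files under dir to a gzipped tarball at out_path."""
    with tarfile.open(out_path, mode='w:gz', dereference=True) as tar_gz:
        for dirpath, dirnames, filenames in os.walk(dir):
            for filename in filenames:
                path = os.path.join(dirpath, filename)
                rel_path = os.path.relpath(path, dir)
                # filter sees paths relative to dir and decides what to keep.
                if filter is None or filter(rel_path):
                    tar_gz.add(path,
                               arcname=os.path.join(prefix, rel_path),
                               recursive=False)


src_dir = tempfile.mkdtemp()
for name in ('job.py', 'job.pyc'):
    with open(os.path.join(src_dir, name), 'w') as f:
        f.write('# demo\n')

out_path = os.path.join(tempfile.mkdtemp(), 'demo.tar.gz')
tar_and_gzip_sketch(src_dir, out_path,
                    filter=lambda p: not p.endswith('.pyc'),
                    prefix='mrjob')

with tarfile.open(out_path, 'r:gz') as tarball:
    names = tarball.getnames()
print(names)
```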

mrjob.util.unarchive(archive_path, dest)

Extract the contents of a tar or zip file at archive_path into the directory dest.

Parameters:
  • archive_path (str) – path to archive file
  • dest (str) – path to directory where archive will be extracted

dest will be created if it doesn’t already exist.

tar files can be gzip compressed, bzip2 compressed, or uncompressed. Files within zip files can be deflated or stored.
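A sketch of the tar/zip dispatch using the standard library's own format detection. The name unarchive_sketch is hypothetical and the error raised for unknown formats is an assumption; note that extractall() should only be used on archives you trust.

```python
import os
import tarfile
import tempfile
import zipfile


def unarchive_sketch(archive_path, dest):
    """Extract a tar (plain, gzip, bzip2) or zip archive into dest."""
    if not os.path.exists(dest):
        os.makedirs(dest)
    if tarfile.is_tarfile(archive_path):
        # Mode 'r' lets tarfile detect the compression transparently.
        with tarfile.open(archive_path, 'r') as archive:
            archive.extractall(dest)
    elif zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path, 'r') as archive:
            archive.extractall(dest)
    else:
        raise IOError('Unknown archive type: %s' % archive_path)


work_dir = tempfile.mkdtemp()
zip_path = os.path.join(work_dir, 'demo.zip')
with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as archive:
    archive.writestr('hello.txt', 'hello\n')

dest = os.path.join(work_dir, 'extracted')
unarchive_sketch(zip_path, dest)
with open(os.path.join(dest, 'hello.txt')) as f:
    extracted_text = f.read()
print(extracted_text)
```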
