Utilities for setting up the environment jobs run in by uploading files and running setup scripts.
The general idea is to use Hadoop DistributedCache-like syntax to find and parse expressions like /path/to/file#name_in_working_dir into “path dictionaries” like {'type': 'file', 'path': '/path/to/file', 'name': 'name_in_working_dir'}.
You can then pass these into a WorkingDirManager to keep track of which files need to be uploaded, catch name collisions, and assign names to unnamed paths (e.g. /path/to/file#). Note that WorkingDirManager.name() can take a path dictionary as keyword arguments.
If you need to upload files from the local filesystem to a place where Hadoop can see them (HDFS or S3), we provide UploadDirManager.
Path dictionaries are meant to be immutable; all state is handled by manager classes.
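Here is a minimal sketch of that workflow, assuming these utilities are importable from mrjob.setup with the signatures described below:

>>> from mrjob.setup import WorkingDirManager, parse_legacy_hash_path
>>> path_dict = parse_legacy_hash_path('file', '/path/to/file#name_in_working_dir')
>>> # path_dict is {'type': 'file', 'path': '/path/to/file', 'name': 'name_in_working_dir'}
>>> wd = WorkingDirManager()
>>> wd.add(**path_dict)   # track the file, catching any name collision
>>> wd.name(**path_dict)  # the name it will have in the working dir
'name_in_working_dir'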
Manage the working dir for the master bootstrap script. Identical to WorkingDirManager except that it doesn’t support archives.
Represents a directory on HDFS or S3 where we want to upload local files for consumption by Hadoop.
UploadDirManager tries to give files the same name as their filename in the path (for ease of debugging), but handles collisions gracefully.
UploadDirManager assumes URIs do not need to be uploaded, and so does not store them; uri() maps URIs to themselves.
Returns: the URI assigned to the path
Get a map from path to URI for all paths that were added, so we can figure out which files we need to upload.
Get the URI for the given path. If path is a URI, just return it.
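A short sketch of how these methods fit together (the prefix and paths here are hypothetical, and the exact names assigned may differ):

>>> from mrjob.setup import UploadDirManager
>>> up = UploadDirManager('hdfs:///tmp/mrjob/files/')
>>> up.add('/local/data/wordlist.txt')  # returns e.g. 'hdfs:///tmp/mrjob/files/wordlist.txt'
>>> up.uri('s3://bucket/already-there.txt')  # URIs map to themselves
's3://bucket/already-there.txt'
>>> up.path_to_uri()  # e.g. {'/local/data/wordlist.txt': 'hdfs:///tmp/mrjob/files/wordlist.txt'}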
Represents the working directory of Hadoop tasks (or bootstrap commands on EMR).
To support Hadoop’s distributed cache, paths can be for ordinary files, or for archives (which are automatically uncompressed into a directory by Hadoop).
When adding a file, you may optionally assign it a name; if you don’t, we’ll lazily assign it a name as needed. Name collisions are not allowed, so being lazy makes it easier to avoid unintended collisions.
If you wish, you may assign multiple names to the same file, or add a path as both a file and an archive (though not mapped to the same name).
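A rough illustration (the auto-assigned name shown in the comment is plausible rather than guaranteed):

>>> from mrjob.setup import WorkingDirManager
>>> wd = WorkingDirManager()
>>> wd.add('archive', '/path/to/src.tar.gz')           # as an archive, no name yet
>>> wd.add('file', '/path/to/src.tar.gz', name='src')  # same path, as a plain file
>>> wd.name('file', '/path/to/src.tar.gz', name='src')
'src'
>>> wd.name('archive', '/path/to/src.tar.gz')  # lazily auto-named, e.g. 'src.tar.gz'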
Add a path as either a file or an archive, optionally assigning it a name.
Get the name for a path previously added to this WorkingDirManager, assigning one as needed.
This is primarily for getting the name of auto-named files. If the file was added with an assigned name, you must include it (and we’ll just return name).
We won’t ever give an auto-name that’s the same as an assigned name (even for the same path and type).
Get a map from name (in the setup directory) to path for all known files/archives, so we can build -file and -archive options to Hadoop (or fake them in a bootstrap script).
Parameters: type – either 'archive' or 'file'
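For example, this mapping could be turned into Hadoop streaming arguments; a sketch (the -file path#name form here is illustrative only):

>>> from mrjob.setup import WorkingDirManager
>>> wd = WorkingDirManager()
>>> wd.add('file', 's3://bucket/wordlist.txt', name='wordlist.txt')
>>> for name, path in sorted(wd.name_to_path('file').items()):
...     print('-file %s#%s' % (path, name))
-file s3://bucket/wordlist.txt#wordlist.txt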
Get a set of all paths tracked by this WorkingDirManager.
Come up with a unique name for path.
If the proposed name is taken, we add a number to the end of the filename, keeping the extension the same. For example:
>>> name_uniquely('foo.tar.gz', set(['foo.tar.gz']))
'foo-1.tar.gz'
Parse hash paths from old setup/bootstrap options.
This is similar to parsing hash paths out of shell commands (see parse_setup_cmd()) except that we pass in path type explicitly, and we don’t always require the # character.
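A brief sketch, assuming the signature parse_legacy_hash_path(type, path), where path may or may not contain a #name part:

>>> from mrjob.setup import parse_legacy_hash_path
>>> d = parse_legacy_hash_path('file', '/path/to/file#name_in_working_dir')
>>> # d is {'type': 'file', 'path': '/path/to/file', 'name': 'name_in_working_dir'}
>>> parse_legacy_hash_path('file', '/path/to/file')  # no '#': the name is left unset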
Parse a setup/bootstrap command, finding and pulling out Hadoop Distributed Cache-style paths (“hash paths”).
Parameters: cmd (string) – shell command to parse
Returns: a list containing dictionaries (parsed hash paths) and strings (parts of the original command, left unparsed)
Hash paths look like path#name, where path is either a local path or a URI pointing to something we want to upload to Hadoop/EMR, and name is the name we want it to have when we upload it; name is optional (no name means to pick a unique one).
If name is followed by a trailing slash, that indicates path is an archive (e.g. a tarball), and should be unarchived into a directory on the remote system. The trailing slash will also be kept as part of the original command.
Parsed hash paths are dictionaries with the keys path, name, and type (either 'file' or 'archive').
Most of the time, this function will just do what you expect. Rules for finding hash paths:
If you really want to include forbidden characters, you may use backslash escape sequences in path and name. (We can’t guarantee Hadoop/EMR will accept them, though!) Also, remember that shell syntax allows you to concatenate adjacent strings, so like""this is read as likethis.
Environment variables and ~ (home dir) in path will be resolved (use backslash escapes to stop this). We don’t resolve name because it doesn’t make sense. Environment variables and ~ elsewhere in the command are considered to be part of the script and will be resolved on the remote system.
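For example, a rough sketch of the output (exactly how the unparsed text is split into strings may differ):

>>> from mrjob.setup import parse_setup_cmd
>>> tokens = parse_setup_cmd('cd src-tree.tar.gz#/; make')
>>> # tokens is roughly:
>>> # ['cd ', {'type': 'archive', 'path': 'src-tree.tar.gz', 'name': None}, '/; make']
>>> # the trailing '/' marks src-tree.tar.gz as an archive and stays in the command text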