App Engine Python SDK  v1.6.9 rev.445
The Python runtime is available as an experimental Preview feature.
google.appengine.ext.mapreduce.input_readers._GoogleCloudStorageInputReader Class Reference
Inheritance diagram for google.appengine.ext.mapreduce.input_readers._GoogleCloudStorageInputReader:
  google.appengine.ext.mapreduce.input_readers.InputReader
  google.appengine.ext.mapreduce.json_util.JsonMixin
  google.appengine.ext.mapreduce.input_readers._GoogleCloudStorageRecordInputReader

Public Member Functions

def __init__
 
def validate
 
def split_input
 
def from_json
 
def to_json
 
def next
 
def __str__
 
- Public Member Functions inherited from google.appengine.ext.mapreduce.input_readers.InputReader
def __iter__
 
def next
 
def from_json
 
def to_json
 
def split_input
 
def validate
 
- Public Member Functions inherited from google.appengine.ext.mapreduce.json_util.JsonMixin
def to_json_str
 
def from_json_str
 

Static Public Attributes

string BUCKET_NAME_PARAM = "bucket_name"
 
string OBJECT_NAMES_PARAM = "objects"
 
string BUFFER_SIZE_PARAM = "buffer_size"
 
string DELIMITER_PARAM = "delimiter"
 
- Static Public Attributes inherited from google.appengine.ext.mapreduce.input_readers.InputReader
 expand_parameters = False
 
string NAMESPACE_PARAM = "namespace"
 
string NAMESPACES_PARAM = "namespaces"
 

Detailed Description

Input reader from Google Cloud Storage using the cloudstorage library.

This class is expected to be subclassed with a reader that understands
user-level records.

Required configuration in the mapper_spec.input_reader dictionary.
  BUCKET_NAME_PARAM: name of the bucket to use (with no extra delimiters or
    suffixes such as directories).
  OBJECT_NAMES_PARAM: a list of object names or prefixes. All objects must be
    in the BUCKET_NAME_PARAM bucket. If the name ends with a * it will be
    treated as prefix and all objects with matching names will be read.
    Entries should not start with a slash unless that is part of the object's
    name. An example list could be:
    ["my-1st-input-file", "directory/my-2nd-file", "some/other/dir/input-*"]
    To retrieve all files "*" will match every object in the bucket. If a file
    is listed twice or is covered by multiple prefixes it will be read twice;
    there is no deduplication.
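
The matching rules above can be illustrated with a minimal sketch. It assumes an in-memory list in place of a real bucket listing; expand_object_names and BUCKET_OBJECTS are hypothetical names used only for this illustration and are not part of the mapreduce or cloudstorage libraries.

```python
# Hypothetical in-memory stand-in for a bucket listing (not the cloudstorage API).
BUCKET_OBJECTS = [
    "my-1st-input-file",
    "directory/my-2nd-file",
    "some/other/dir/input-001",
    "some/other/dir/input-002",
]


def expand_object_names(patterns, bucket_objects):
    """Expand exact names and trailing-* prefixes, without deduplication."""
    expanded = []
    for pattern in patterns:
        if pattern.endswith("*"):
            # A trailing * marks a prefix: take every object that starts with it.
            prefix = pattern[:-1]
            expanded.extend(n for n in bucket_objects if n.startswith(prefix))
        else:
            # Anything else is taken as a literal object name.
            expanded.append(pattern)
    return expanded


# Keys mirror the BUCKET_NAME_PARAM and OBJECT_NAMES_PARAM string values.
input_reader_params = {
    "bucket_name": "my-bucket",
    "objects": ["my-1st-input-file", "some/other/dir/input-*", "*"],
}
selected = expand_object_names(input_reader_params["objects"], BUCKET_OBJECTS)
```

Because "my-1st-input-file" is named explicitly and is also covered by "*", it appears twice in selected, which mirrors the no-deduplication behaviour described above.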

Optional configuration in the mapper_spec.input_reader dictionary.
  BUFFER_SIZE_PARAM: the size of the read buffer for each file handle.
  DELIMITER_PARAM: if specified, turn on the shallow splitting mode.
    The delimiter is used as a path separator to designate directory
    hierarchy. Matching of prefixes from OBJECT_NAMES_PARAM
    will stop at the first directory instead of matching
    all files under the directory. This allows MR to process a bucket with
    hundreds of thousands of files.
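
A rough sketch of what "stop at the first directory" means for prefix matching. This is an illustration over an in-memory list, assuming delimiter-style listing semantics; shallow_match is a hypothetical helper and is not the reader's actual implementation.

```python
def shallow_match(prefix, delimiter, bucket_objects):
    """Match a prefix, collapsing everything past the next delimiter."""
    results = []
    for name in sorted(bucket_objects):
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        cut = rest.find(delimiter)
        if cut == -1:
            # No further delimiter: this is a file directly under the prefix.
            entry = name
        else:
            # Stop at the first "directory" instead of descending into it.
            entry = prefix + rest[:cut + len(delimiter)]
        if entry not in results:
            results.append(entry)
    return results


objects = ["logs/2014/a", "logs/2014/b", "logs/2015/a", "logs/readme"]
shallow = shallow_match("logs/", "/", objects)
```

Here the four objects collapse to two directory entries and one file, so the number of entries handled at split time grows with the number of directories rather than the number of files.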

Constructor & Destructor Documentation

def google.appengine.ext.mapreduce.input_readers._GoogleCloudStorageInputReader.__init__(self, filenames, index=0, buffer_size=None, _account_id=None, delimiter=None)
Initialize a GoogleCloudStorageInputReader instance.

Args:
  filenames: A list of Google Cloud Storage filenames of the form
    '/bucket/objectname'.
  index: Index of the next filename to read.
  buffer_size: The size of the read buffer, None to use default.
  _account_id: Internal use only. See cloudstorage documentation.
  delimiter: Delimiter used as path separator. See class doc for details.
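
As a small illustration of the '/bucket/objectname' form, the sketch below joins a bucket name with object names. That the configured bucket and object names are combined this way before reaching the constructor is an assumption; to_gcs_filenames is a hypothetical helper, not a library function.

```python
def to_gcs_filenames(bucket_name, object_names):
    """Join a bucket and object names into '/bucket/objectname' strings."""
    return ["/%s/%s" % (bucket_name, name) for name in object_names]


filenames = to_gcs_filenames(
    "my-bucket", ["my-1st-input-file", "directory/my-2nd-file"])
```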

Member Function Documentation

def google.appengine.ext.mapreduce.input_readers._GoogleCloudStorageInputReader.next(self)
Returns the next input from this input reader, a block of bytes.

Nonexistent files will be logged and skipped. The file might have been
removed after input splitting.

Returns:
  The next input from this input reader in the form of a cloudstorage
  ReadBuffer that supports a File-like interface (read, readline, seek,
  tell, and close). An error may be raised if the file cannot be opened.

Raises:
  StopIteration: The list of files has been exhausted.
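
The contract described above (skip missing files, return a file-like object, raise StopIteration at the end) can be sketched with a toy reader. ToyReader and its dict-backed store are hypothetical stand-ins for illustration; the real class opens files through the cloudstorage library, which is not used here.

```python
import io
import logging


class ToyReader(object):
    """Hypothetical stand-in that mimics the documented next() contract."""

    def __init__(self, filenames, store, index=0):
        self._filenames = filenames
        self._store = store  # dict of filename -> bytes, replaces Cloud Storage
        self._index = index  # index of the next filename to read

    def __iter__(self):
        return self

    def next(self):
        while self._index < len(self._filenames):
            filename = self._filenames[self._index]
            self._index += 1
            if filename not in self._store:
                # The file may have been removed after input splitting.
                logging.warning("File %s not found. Skipping.", filename)
                continue
            # BytesIO offers read, readline, seek, tell and close.
            return io.BytesIO(self._store[filename])
        raise StopIteration()

    __next__ = next  # Python 3 spelling of the same method


store = {"/bucket/a": b"alpha\n", "/bucket/c": b"gamma\n"}
reader = ToyReader(["/bucket/a", "/bucket/b", "/bucket/c"], store)
contents = [handle.read() for handle in reader]
```

The missing "/bucket/b" is logged and skipped, so contents holds only the bytes of the two files that exist.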

def google.appengine.ext.mapreduce.input_readers._GoogleCloudStorageInputReader.split_input(cls, mapper_spec)
Returns a list of input readers.

An equal number of input files are assigned to each shard (+/- 1). If there
are fewer files than shards, fewer than the requested number of shards will
be used. Input files are currently never split (although for some formats
could be and may be split in a future implementation).

Args:
  mapper_spec: an instance of model.MapperSpec.

Returns:
  A list of InputReaders. None when no input data can be found.
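
One way to get the distribution described above is a round-robin slice over the file list. This is a minimal sketch under that assumption; assign_files_to_shards is a hypothetical helper that returns plain lists of filenames rather than InputReader instances.

```python
def assign_files_to_shards(filenames, shard_count):
    """Distribute whole files across shards so that sizes differ by at most 1."""
    if not filenames:
        # Mirrors the documented "None when no input data can be found".
        return None
    # Never create more shards than there are files.
    shard_count = min(shard_count, len(filenames))
    # Round-robin slicing: shard i takes files i, i + n, i + 2n, ...
    return [filenames[i::shard_count] for i in range(shard_count)]


shards = assign_files_to_shards(["a", "b", "c", "d", "e"], 3)
```

Files are never split in this sketch: each one lands whole in exactly one shard.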

def google.appengine.ext.mapreduce.input_readers._GoogleCloudStorageInputReader.validate(cls, mapper_spec)
Validate mapper specification.

Args:
  mapper_spec: an instance of model.MapperSpec

Raises:
  BadReaderParamsError: if the specification is invalid for any reason such
    as missing the bucket name or providing an invalid bucket name.
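
A hedged sketch of the kind of checks such a validation step might perform on the reader parameters. The BadReaderParamsError class below is a local stand-in, validate_reader_params is a hypothetical helper, and the bucket-name pattern is a simplified assumption for this example, not the rule the library applies.

```python
import re


class BadReaderParamsError(Exception):
    """Local stand-in for the error named in the text."""


# Assumed rule for this sketch only: 3 to 63 characters drawn from lowercase
# letters, digits, '.', '_' and '-', starting and ending with a letter or digit.
_BUCKET_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$")


def validate_reader_params(params):
    """Raise BadReaderParamsError if the reader parameters look invalid."""
    if "bucket_name" not in params:
        raise BadReaderParamsError("bucket_name is required")
    bucket = params["bucket_name"]
    if not _BUCKET_NAME_RE.match(bucket):
        # Rejects, for example, a bucket name with a directory suffix.
        raise BadReaderParamsError("invalid bucket name: %r" % (bucket,))
    if not isinstance(params.get("objects"), list):
        raise BadReaderParamsError("objects must be a list of names or prefixes")


def is_valid(params):
    """Convenience wrapper that reports validity as a boolean."""
    try:
        validate_reader_params(params)
    except BadReaderParamsError:
        return False
    return True
```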

The documentation for this class was generated from the following file: