App Engine Python SDK  v1.6.9 rev.445
The Python runtime is available as an experimental Preview feature.
google.appengine.ext.mapreduce.file_formats.FileFormat Class Reference
Inheritance diagram for google.appengine.ext.mapreduce.file_formats.FileFormat:
google.appengine.ext.mapreduce.file_formats._BinaryFormat
google.appengine.ext.mapreduce.file_formats._TextFormat
google.appengine.ext.mapreduce.file_formats._ZipFormat
google.appengine.ext.mapreduce.file_formats._Base64Format
google.appengine.ext.mapreduce.file_formats._CSVFormat
google.appengine.ext.mapreduce.file_formats._LinesFormat

Public Member Functions

def __init__
 
def get_current_file
 
def get_index
 
def increment_index
 
def get_cache
 
def default_instance
 
def __repr__
 
def __str__
 
def checkpoint
 
def to_json
 
def from_json
 
def can_split
 
def split
 
def __iter__
 
def preprocess
 
def next
 
def get_next
 

Static Public Attributes

set ARGUMENTS = set()
 
string NAME = '_file'
 

Detailed Description

FileFormat can operate/iterate on files of a specific format.

Life cycle of FileFormat:
  1. A FileFormat is created in one of two ways: file_format_root.split
     creates it from scratch, and FileFormatRoot.from_json creates it
     from a serialized JSON str. Either way, it is associated with a
     FileFormatRoot. It should never be instantiated directly.
  2. Root acts as a coordinator among FileFormats. Root initializes
     its many fields so that FileFormat knows how to iterate over its inputs.
  3. Its next() method is used to iterate.
  4. It keeps iterating until either root calls its to_json() or root
     sends it a StopIteration.

How to define a new format:
  1. Subclass this.
  2. Override NAME and ARGUMENTS. file_format_parser._FileFormatParser
     uses them to validate that a format string contains only legal
     names and arguments.
  3. Optionally override preprocess(). See method doc.
  4. Override get_next(). Used by next() to fetch the next content to
     return. See method.
  5. Optionally override split() if this format supports it. See method.
  6. Write unit tests. The tricky logic (to/from_json, advancing the
     current input file) is shared, so as long as you respect
     get_next()'s pre- and postconditions, the tests are very simple.
  7. Register your format at FORMATS.

Attributes:
  ARGUMENTS: a set of acceptable arguments to this format. Used for parsing
      this format.
  NAME: the name of this format. Used for parsing this format.
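
The steps for defining a new format can be sketched roughly as below. This is a self-contained illustration that does not import the SDK: SketchFormat is a hypothetical stand-in for FileFormat, FixedWidthFormat and its width argument are invented for the example, and SKETCH_FORMATS stands in for the FORMATS registry. The stand-in also simplifies iteration to a single file and stops at EOFError, whereas in the real library the root coordinates advancing between files.

```python
import io


class SketchFormat(object):
    """Hypothetical stand-in for FileFormat; not the SDK class."""

    NAME = '_file'
    ARGUMENTS = set()

    def __init__(self, current_file, **kwargs):
        # Reject arguments the format does not declare in ARGUMENTS.
        unexpected = set(kwargs) - self.ARGUMENTS
        if unexpected:
            raise ValueError('Unexpected arguments: %s' % sorted(unexpected))
        self._kwargs = kwargs
        self._file = current_file

    def get_current_file(self):
        return self._file

    def get_next(self):
        raise NotImplementedError

    def __iter__(self):
        return self

    def __next__(self):
        # Simplification: one file only, so EOFError ends the iteration.
        try:
            return self.get_next()
        except EOFError:
            raise StopIteration

    next = __next__  # Python 2 spelling of the iterator method


class FixedWidthFormat(SketchFormat):
    """Hypothetical format that yields fixed-size records."""

    NAME = 'fixed'
    ARGUMENTS = set(['width'])

    def get_next(self):
        width = int(self._kwargs.get('width', 4))
        record = self.get_current_file().read(width)
        if not record:
            raise EOFError()  # nothing read: current file is exhausted
        return record


# Hypothetical registry keyed by NAME, standing in for FORMATS.
SKETCH_FORMATS = {FixedWidthFormat.NAME: FixedWidthFormat}
```

Because the subclass only overrides NAME, ARGUMENTS, and get_next(), a unit test can feed it an in-memory file and compare the records it yields, which matches the point in step 6 that tests stay simple when the shared logic lives in the base class.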

Constructor & Destructor Documentation

def google.appengine.ext.mapreduce.file_formats.FileFormat.__init__ (   self,
  index,
  index_range = None,
  kwargs 
)
Initialize.

Args:
  index: the index of the subfile to read from the current file.
  index_range: a tuple [start_index, end_index) that, if defined, bounds
      index. When index reaches end_index, the current file is consumed.
  kwargs: kwargs for a specific FileFormat. Which arguments are accepted
      and what they mean depend on each subclass's interpretation.

Raises:
  ValueError: if some argument is not expected by the format.

Member Function Documentation

def google.appengine.ext.mapreduce.file_formats.FileFormat.can_split (   cls)
Indicates whether this format supports splitting within a file boundary.

Returns:
  True if a FileFormat allows its inputs to be split into
  different shards.
def google.appengine.ext.mapreduce.file_formats.FileFormat.checkpoint (   self)
Save _index before updating it to support potential rollback.
def google.appengine.ext.mapreduce.file_formats.FileFormat.default_instance (   cls,
  kwargs 
)
Create a default instance of FileFormat.

Used by parser to create default instances.

Args:
  kwargs: kwargs parser parsed from user input.

Returns:
  A default instance of FileFormat.
def google.appengine.ext.mapreduce.file_formats.FileFormat.from_json (   cls,
  json 
)
Deserialize from json compatible structure.
def google.appengine.ext.mapreduce.file_formats.FileFormat.get_cache (   self)
Get cache to store expensive objects.

Some formats need expensive initialization to even start iteration.
They can store the initialized objects into the cache and try to retrieve
the objects from the cache at later iterations.

For example, a zip format needs to create a ZipFile object to iterate over
the zipfile. It can avoid doing so on every "next" call by storing the
ZipFile into cache.

The cache does not guarantee persistence. It is cleared whenever the
format is pickled. It is also intentionally cleared after the file
currently being iterated is entirely consumed.

Returns:
  A dict to store temporary objects.
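
The caching behavior described above can be sketched as follows. This is a minimal, self-contained illustration, not the SDK's code: CachingReader, its opens counter, and the 'zip_file' cache key are hypothetical names, and dropping the cache in __getstate__ is only one assumed way to get "cleared at pickles".

```python
import io
import zipfile


class CachingReader(object):
    """Hypothetical holder of a transient cache; not the SDK class."""

    def __init__(self, payload):
        self._payload = payload  # raw bytes of a zip archive
        self._cache = {}
        self.opens = 0  # counts how often the expensive object is built

    def get_cache(self):
        """Return a dict for temporary, non-persistent objects."""
        return self._cache

    def _zip(self):
        # Build the ZipFile once and reuse it on later calls.
        cache = self.get_cache()
        if 'zip_file' not in cache:
            cache['zip_file'] = zipfile.ZipFile(io.BytesIO(self._payload))
            self.opens += 1
        return cache['zip_file']

    def names(self):
        return self._zip().namelist()

    def __getstate__(self):
        # Assumption: the cache is dropped from the pickled state, so
        # nothing stored in it survives serialization.
        state = self.__dict__.copy()
        state['_cache'] = {}
        return state
```

Calling names() repeatedly builds the ZipFile only once. A copy restored with pickle.loads(pickle.dumps(reader)) would start with an empty cache and rebuild the ZipFile on first use, which is why code using the cache must always be prepared for a miss.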
def google.appengine.ext.mapreduce.file_formats.FileFormat.get_current_file (   self)
Get the current file to iterate upon.

Returns:
  A Python file object. The file is already positioned where the last
  iteration left off. If a read raises EOF, the file is exhausted.
def google.appengine.ext.mapreduce.file_formats.FileFormat.get_index (   self)
Get index.

If the format is an archive format, get_index() tells the format which
subfile of the current file it should process. This value is maintained
across pickles and resets to 0 when a new file starts.

Returns:
  index of the subfile to process from current file.
def google.appengine.ext.mapreduce.file_formats.FileFormat.get_next (   self)
Finds the next content to return.

Expected steps of any implementation:
  1. Call get_current_file() to get the file to iterate on.
  2. If nothing is read, raise EOFError. Otherwise, process the
     contents read in any way. _kwargs is guaranteed to be a dict
     containing all arguments and values specified by the user.
  3. If the format is an archive format, use get_index() to
     see which subfile to read. Call increment_index() once the
     current subfile is finished. These two methods make sure
     the index is maintained during (de)serialization.
  4. Return the processed contents either as a file-like object or
     a Python str. NO UNICODE.

Returns:
  The str or file-like object, if there is anything to return.

Raises:
  EOFError if no content is found to return.
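
Step 3 above, the archive case, might look roughly like this. The sketch is self-contained and does not use the SDK: ZipMemberReader is a hypothetical class, and it reopens the archive on every call for brevity where a real format would likely reuse a cached ZipFile. It is written for Python 3, where bytes plays the role of the Python 2 str that the contract requires.

```python
import io
import zipfile


class ZipMemberReader(object):
    """Hypothetical archive reader tracking a member index; not the SDK class."""

    def __init__(self, current_file, index=0):
        self._file = current_file
        self._index = index

    def get_current_file(self):
        return self._file

    def get_index(self):
        return self._index

    def increment_index(self):
        self._index += 1

    def get_next(self):
        # Step 1: fetch the file to iterate on.
        archive = zipfile.ZipFile(self.get_current_file())
        members = archive.infolist()
        # Step 2: nothing left to read means the file is exhausted.
        if self.get_index() >= len(members):
            raise EOFError()
        # Step 3: read the subfile selected by the index, then advance.
        content = archive.read(members[self.get_index()])
        self.increment_index()
        # Step 4: return raw bytes, never decoded text.
        return content
```

Since the position lives entirely in the index, a reader constructed with index=1 resumes at the second member, which is the property that lets the index survive (de)serialization.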
def google.appengine.ext.mapreduce.file_formats.FileFormat.increment_index (   self)
Increment index.

Increment the index value after the current subfile of the current
file has been fully processed.
def google.appengine.ext.mapreduce.file_formats.FileFormat.next (   self)
Returns a file-like object containing next content.

Returns:
  A file-like object containing next content.

Raises:
  ValueError: if the content is of a non-str type.
def google.appengine.ext.mapreduce.file_formats.FileFormat.preprocess (   self,
  file_object 
)
Does preprocessing on the file-like object and returns another one.

Normally a FileFormat reads directly from the file returned by
get_current_file(). But some formats need to preprocess that file entirely
before iteration can start (e.g. text formats need to decode first).

Args:
  file_object: read from this object and process its content.

Returns:
  a file-like object containing processed contents. This file object will
  be returned by get_current_file() instead. If the returned object
  is newly created, close the old one.
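
As an illustration of the contract above, a text format's preprocessing step could be sketched like this. The function name preprocess_text, the encoding parameter, and the choice to re-encode as UTF-8 are assumptions made for the example, not the SDK's documented behavior.

```python
import io


def preprocess_text(file_object, encoding='latin-1'):
    """Hypothetical preprocess(): decode the whole file, re-encode as UTF-8."""
    decoded = file_object.read().decode(encoding)
    # A new file object is returned, so the old one is closed here.
    file_object.close()
    return io.BytesIO(decoded.encode('utf-8'))
```

The returned object is what later reads would see in place of the original file, so the decoding cost is paid once per file rather than once per record.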
def google.appengine.ext.mapreduce.file_formats.FileFormat.split (   cls,
  desired_size,
  start_index,
  input_file,
  cache 
)
Splits a single chunk of desired_size from file.

FileFormatRoot uses this method to ask FileFormat how to split
one file of this format.

This method takes an opened file and a start_index. If the file
size is bigger than desired_size, the method determines a chunk of the
file whose size is close to desired_size. The chunk is indicated by
[start_index, end_index). If the file is smaller than desired_size,
the chunk will include the rest of the input_file.

This method also indicates how many bytes are consumed by this chunk
by returning size_left to the caller.

Args:
  desired_size: desired number of bytes for this split. Positive int.
  start_index: the index to start this split. The index is not necessarily
an offset. In zipfile, for example, it's the index of the member file
in the archive. Non negative int.
  input_file: opened Files API file to split. Do not close this file.
  cache: a dict to cache any object over multiple calls if needed.

Returns:
  A tuple of (size_left, end_index). If end_index equals
  start_index, the file is fully split.
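
One plausible reading of this contract, for a zip archive where the index counts member files, is sketched below. The function split_zip and the 'infolist' cache key are hypothetical, and treating size_left as desired_size minus the sizes of the members taken (so it can go negative) is an assumption, not something the text above states.

```python
import io
import zipfile


def split_zip(desired_size, start_index, input_file, cache):
    """Hypothetical split(): take members until desired_size is used up."""
    # Reuse the member list across calls; input_file is left open.
    if 'infolist' not in cache:
        cache['infolist'] = zipfile.ZipFile(input_file).infolist()
    infolist = cache['infolist']

    size_left = desired_size
    end_index = start_index
    while size_left > 0 and end_index < len(infolist):
        size_left -= infolist[end_index].file_size
        end_index += 1
    return size_left, end_index
```

A caller would invoke this repeatedly, passing each returned end_index as the next start_index, and stop once a call returns end_index equal to start_index.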
def google.appengine.ext.mapreduce.file_formats.FileFormat.to_json (   self)
Serialize states to a json compatible structure.

The documentation for this class was generated from the following file: