App Engine Python SDK v1.6.9 rev.445

The Python runtime is available as an experimental Preview feature.
Public Member Functions

def __init__
def get_current_file
def get_index
def increment_index
def get_cache
def default_instance
def __repr__
def __str__
def checkpoint
def to_json
def from_json
def can_split
def split
def __iter__
def preprocess
def next
def get_next

Static Public Attributes

tuple ARGUMENTS = set()
string NAME = '_file'
FileFormat can operate/iterate on files of a specific format.

Life cycle of FileFormat:
1. A FileFormat is created in one of two ways: file_format_root.split creates a FileFormat from scratch, and FileFormatRoot.from_json creates a FileFormat from a serialized JSON str. Either way, it is associated with a FileFormatRoot. It should never be instantiated directly.
2. Root acts as a coordinator among FileFormats. Root initializes its many fields so that FileFormat knows how to iterate over its inputs.
3. Its next() method is used to iterate.
4. It keeps iterating until either root calls its to_json() or root sends it a StopIteration.

How to define a new format:
1. Subclass this.
2. Override NAME and ARGUMENTS. file_format_parser._FileFormatParser uses them to validate that a format string contains only legal names and arguments.
3. Optionally override preprocess(). See method doc.
4. Override get_next(). Used by next() to fetch the next content to return. See method.
5. Optionally override split() if this format supports it. See method.
6. Write unit tests. Tricky logic (to/from_json, advancing the current input file) is shared. Thus, as long as you respect get_next()'s pre/post conditions, tests are very simple.
7. Register your format at FORMATS.

Attributes:
  ARGUMENTS: a set of acceptable arguments to this format. Used for parsing this format.
  NAME: the name of this format. Used for parsing this format.
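As a rough illustration of steps 1, 2, 4, and 7, the sketch below defines a line-oriented format. It does not import the SDK: `FileFormatStandIn`, `LinesFormat`, the `strip` argument, and the dict-shaped `FORMATS` registry are all hypothetical stand-ins, and the real base class does considerably more (root coordination, serialization, advancing between files). The stand-in is constructed directly only so the example can run on its own; a real FileFormat should never be instantiated directly.

```python
import io


class FileFormatStandIn(object):
    """Hypothetical stand-in for FileFormat; NOT the SDK class."""

    NAME = '_file'
    ARGUMENTS = set()

    def __init__(self, current_file, **kwargs):
        # Mirror the documented contract: reject any argument that the
        # format does not declare in ARGUMENTS.
        for key in kwargs:
            if key not in self.ARGUMENTS:
                raise ValueError('Illegal argument: %s' % key)
        self._kwargs = kwargs
        self._file = current_file

    def get_current_file(self):
        return self._file

    def get_next(self):
        raise NotImplementedError()


class LinesFormat(FileFormatStandIn):
    """Hypothetical format that returns one line per get_next() call."""

    NAME = 'lines'
    ARGUMENTS = set(['strip'])

    def get_next(self):
        line = self.get_current_file().readline()
        if not line:
            # Nothing was read, so signal exhaustion as documented.
            raise EOFError()
        if self._kwargs.get('strip') == 'true':
            line = line.rstrip(b'\n')
        return line


# Assumed registry shape: a dict keyed by NAME.
FORMATS = {LinesFormat.NAME: LinesFormat}
```

With these stand-ins, `FORMATS['lines'](io.BytesIO(b'a\nb\n'), strip='true')` yields one stripped line per `get_next()` call and raises EOFError once the input is exhausted.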
def google.appengine.ext.mapreduce.file_formats.FileFormat.__init__(self, index, index_range=None, kwargs)

Initialize.

Args:
  index: the index of the subfile to read from the current file.
  index_range: a tuple describing the half-open range [start_index, end_index) that, if defined, bounds index. When index is end_index, the current file is consumed.
  kwargs: kwargs for a specific FileFormat. Which arguments are accepted and their semantics depend on each subclass's interpretation.

Raises:
  ValueError: if some argument is not expected by the format.
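The relationship between index and index_range can be sketched as follows. `IndexStateSketch` and `is_current_file_consumed` are hypothetical names used only for illustration; the SDK's internal bookkeeping may differ.

```python
class IndexStateSketch(object):
    """Hypothetical model of the index bookkeeping; not SDK code."""

    def __init__(self, index, index_range=None):
        self._index = index
        self._index_range = index_range  # (start_index, end_index) or None

    def get_index(self):
        return self._index

    def increment_index(self):
        self._index += 1

    def is_current_file_consumed(self):
        # Without a range, the index alone cannot tell us the file is done.
        if self._index_range is None:
            return False
        # The range is half-open, so reaching end_index means consumed.
        return self._index >= self._index_range[1]
```

Starting at index 1 with index_range (1, 3), two calls to `increment_index()` bring the index to end_index, at which point the current file counts as consumed.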
def google.appengine.ext.mapreduce.file_formats.FileFormat.can_split(cls)

Indicates whether this format supports splitting within a file boundary.

Returns:
  True if a FileFormat allows its inputs to be split into different shards.
def google.appengine.ext.mapreduce.file_formats.FileFormat.checkpoint(self)
Save _index before updating it to support potential rollback.
def google.appengine.ext.mapreduce.file_formats.FileFormat.default_instance(cls, kwargs)

Create a default instance of FileFormat. Used by the parser to create default instances.

Args:
  kwargs: kwargs the parser parsed from user input.

Returns:
  A default instance of FileFormat.
def google.appengine.ext.mapreduce.file_formats.FileFormat.from_json(cls, json)

Deserialize from a JSON-compatible structure.
def google.appengine.ext.mapreduce.file_formats.FileFormat.get_cache(self)

Get cache to store expensive objects.

Some formats need expensive initialization to even start iteration. They can store the initialized objects in the cache and try to retrieve them from the cache at later iterations. For example, a zip format needs to create a ZipFile object to iterate over the zipfile. It can avoid doing so on every "next" call by storing the ZipFile in the cache.

The cache does not guarantee persistence. It is cleared when the format is pickled. It is also intentionally cleared after the currently iterated file is entirely consumed.

Returns:
  A dict to store temporary objects.
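Because the cache may be cleared at any time, code that uses it has to tolerate a missing entry. A minimal sketch of that pattern, using the standard `zipfile` module, is shown here; the helper name `get_zip_file` and the `'zip_file'` key are assumptions made for illustration, and a plain dict stands in for the dict returned by get_cache().

```python
import io
import zipfile


def get_zip_file(cache, file_object):
    """Return a ZipFile for file_object, reusing the cached one if present."""
    zip_file = cache.get('zip_file')
    if zip_file is None:
        # First call, or the cache was cleared: pay the setup cost again.
        zip_file = zipfile.ZipFile(file_object)
        cache['zip_file'] = zip_file
    return zip_file


# Build a small in-memory archive to exercise the helper.
archive = io.BytesIO()
with zipfile.ZipFile(archive, 'w') as writer:
    writer.writestr('a.txt', b'alpha')
archive.seek(0)

cache = {}  # stands in for the dict returned by get_cache()
first = get_zip_file(cache, archive)
second = get_zip_file(cache, archive)  # served from the cache
cache.clear()  # simulate the cache being cleared, e.g. at pickling
third = get_zip_file(cache, archive)   # rebuilt from the file
```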
def google.appengine.ext.mapreduce.file_formats.FileFormat.get_current_file(self)

Get the current file to iterate upon.

Returns:
  A Python file object. The file is already positioned where the last iteration left off. If read raises EOF, that means the file is exhausted.
def google.appengine.ext.mapreduce.file_formats.FileFormat.get_index(self)

Get index.

If the format is an archive format, get_index() tells the format which subfile of the current file it should process. This value is maintained across pickles and resets to 0 when a new file starts.

Returns:
  Index of the subfile to process from the current file.
def google.appengine.ext.mapreduce.file_formats.FileFormat.get_next(self)

Finds the next content to return.

Expected steps of any implementation:
1. Call get_current_file() to get the file to iterate on.
2. If nothing is read, raise EOFError. Otherwise, process the contents read in any way. _kwargs is guaranteed to be a dict containing all arguments and values specified by the user.
3. If the format is an archive format, use get_index() to see which subfile to read. Call increment_index() once the current subfile is finished. These two methods make sure the index is maintained during (de)serialization.
4. Return the processed contents either as a file-like object or a Python str. NO UNICODE.

Returns:
  The str or file-like object, if there is anything to return.

Raises:
  EOFError: if no content is found to return.
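The steps above can be sketched for a zip archive as follows. `ZipFormatSketch` and `make_archive` are hypothetical; the class re-implements `get_current_file()`, `get_index()`, and `increment_index()` as trivial stand-ins for what the real base class provides, and it reopens the ZipFile on every call for brevity rather than using the cache. Under Python 3 the returned value is `bytes`, which corresponds to the non-unicode `str` the docstring asks for.

```python
import io
import zipfile


class ZipFormatSketch(object):
    """Hypothetical archive format following the documented get_next() steps."""

    def __init__(self, current_file):
        self._file = current_file
        self._index = 0

    # Stand-ins for methods the real base class provides.
    def get_current_file(self):
        return self._file

    def get_index(self):
        return self._index

    def increment_index(self):
        self._index += 1

    def get_next(self):
        # Step 1: get the file to iterate on.
        zip_file = zipfile.ZipFile(self.get_current_file())
        members = zip_file.infolist()
        # Step 3: the index says which subfile to read.
        index = self.get_index()
        # Step 2: nothing left to read means EOFError.
        if index >= len(members):
            raise EOFError()
        content = zip_file.read(members[index])
        # The subfile is finished, so advance to the next one.
        self.increment_index()
        # Step 4: return a byte string, never unicode.
        return content


def make_archive(members):
    """Build an in-memory zip from (name, data) pairs; illustration only."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w') as writer:
        for name, data in members:
            writer.writestr(name, data)
    buf.seek(0)
    return buf
```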
def google.appengine.ext.mapreduce.file_formats.FileFormat.increment_index(self)

Increment index.

Increment the index value after processing of the current subfile of the current file has finished.
def google.appengine.ext.mapreduce.file_formats.FileFormat.next(self)

Returns a file-like object containing next content.

Returns:
  A file-like object containing next content.

Raises:
  ValueError: if content is of a non-str type.
def google.appengine.ext.mapreduce.file_formats.FileFormat.preprocess(self, file_object)

Does preprocessing on the file-like object and returns another one.

Normally a FileFormat directly reads from the file returned by get_current_file(). But some formats need to preprocess that file entirely before iteration can start (e.g. text formats need to decode first).

Args:
  file_object: read from this object and process its content.

Returns:
  A file-like object containing processed contents. This file object will be returned by get_current_file() instead. If the returned object is newly created, close the old one.
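A minimal sketch of such a preprocessing step for a text format is given here as a standalone function rather than a method. The `encoding` parameter, its default, and the choice to re-encode as UTF-8 are assumptions for illustration, not the SDK's behavior.

```python
import io


def preprocess(file_object, encoding='latin-1'):
    """Decode the whole file and return a new file-like object.

    Hypothetical sketch: the result is re-encoded as UTF-8 bytes so that
    later reads still return byte strings.
    """
    raw = file_object.read()
    text = raw.decode(encoding)
    # A new object is returned, so the old one is closed as documented.
    file_object.close()
    return io.BytesIO(text.encode('utf-8'))


old_file = io.BytesIO(b'caf\xe9')  # 'café' encoded as Latin-1
new_file = preprocess(old_file)
```

After the call, `old_file` is closed and `new_file` holds the same text as UTF-8 bytes.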
def google.appengine.ext.mapreduce.file_formats.FileFormat.split(cls, desired_size, start_index, input_file, cache)

Splits a single chunk of desired_size from file.

FileFormatRoot uses this method to ask FileFormat how to split one file of this format.

This method takes an opened file and a start_index. If the file size is bigger than desired_size, the method determines a chunk of the file whose size is close to desired_size. The chunk is indicated by [start_index, end_index). If the file is smaller than desired_size, the chunk will include the rest of the input_file.

This method also indicates how many bytes are consumed by this chunk by returning size_left to the caller.

Args:
  desired_size: desired number of bytes for this split. Positive int.
  start_index: the index to start this split. The index is not necessarily an offset. In zipfile, for example, it's the index of the member file in the archive. Non-negative int.
  input_file: opened Files API file to split. Do not close this file.
  cache: a dict to cache any object over multiple calls if needed.

Returns:
  A tuple of (size_left, end_index). If end_index equals start_index, the file is fully split.
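One way to read this contract for a zip archive is sketched below as a plain function. It assumes that member files are indivisible, that size_left is desired_size minus the uncompressed bytes consumed (so it can go negative when the last member overshoots), and that the member list may be kept in the cache under an `'infolist'` key. The name `split_zip` and all three assumptions are illustrative, not confirmed SDK behavior.

```python
import io
import zipfile


def split_zip(desired_size, start_index, input_file, cache):
    """Hypothetical split for a zip archive; returns (size_left, end_index)."""
    infolist = cache.get('infolist')
    if infolist is None:
        # Reading the member list is the expensive part, so cache it.
        infolist = zipfile.ZipFile(input_file).infolist()
        cache['infolist'] = infolist

    size_left = desired_size
    end_index = start_index
    # Take whole members until the desired size is reached or none remain.
    while size_left > 0 and end_index < len(infolist):
        size_left -= infolist[end_index].file_size
        end_index += 1
    return size_left, end_index


# In-memory archive with members of 3, 4, and 5 bytes.
archive = io.BytesIO()
with zipfile.ZipFile(archive, 'w') as writer:
    writer.writestr('a', b'abc')
    writer.writestr('b', b'defg')
    writer.writestr('c', b'hijkl')
archive.seek(0)

cache = {}
first = split_zip(5, 0, archive, cache)   # covers members 0 and 1
second = split_zip(5, 2, archive, cache)  # covers member 2
third = split_zip(5, 3, archive, cache)   # nothing left to split
```

In the last call end_index equals start_index, which matches the documented signal that the file is fully split.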
def google.appengine.ext.mapreduce.file_formats.FileFormat.to_json(self)

Serialize states to a JSON-compatible structure.
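A round trip through to_json() and from_json() might look like the sketch below. `StateSketch` and the field names `index`, `index_range`, and `kwargs` are hypothetical, chosen to mirror the constructor arguments; the structure the SDK actually serializes is not specified here.

```python
import json


class StateSketch(object):
    """Hypothetical holder of the state a format would serialize."""

    def __init__(self, index, index_range=None, **kwargs):
        self._index = index
        self._index_range = index_range
        self._kwargs = kwargs

    def to_json(self):
        # Only plain dicts, lists, ints, and strings: JSON compatible.
        return {'index': self._index,
                'index_range': self._index_range,
                'kwargs': self._kwargs}

    @classmethod
    def from_json(cls, state):
        index_range = state['index_range']
        if index_range is not None:
            # JSON has no tuples, so the range comes back as a list.
            index_range = tuple(index_range)
        return cls(state['index'], index_range, **state['kwargs'])


original = StateSketch(2, (0, 5), encoding='utf-8')
wire = json.dumps(original.to_json())
restored = StateSketch.from_json(json.loads(wire))
```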