Protocols serialize the output of tasks to raw bytes for Hadoop to distribute to the next task or to write as final output, and deserialize each task's input from those bytes. For more information, see Protocols and Writing custom protocols.
Encode (key, value) as two JSON-encoded values separated by a tab.
Note that JSON has some limitations: dictionary keys must be strings, and there's no distinction between lists and tuples (tuples are written and read back as lists).
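A minimal sketch of this encoding using only the standard `json` module (the helper names below are hypothetical; in mrjob itself this behavior lives in `mrjob.protocol`):

```python
import json

def write_json_pair(key, value):
    """Encode (key, value) as two tab-separated JSON documents."""
    return json.dumps(key) + '\t' + json.dumps(value)

def read_json_pair(line):
    """Split on the first tab and decode each side as JSON."""
    raw_key, raw_value = line.split('\t', 1)
    return json.loads(raw_key), json.loads(raw_value)

# JSON's limitations in action: tuples come back as lists.
line = write_json_pair('k', (1, 2))
# read_json_pair(line) -> ('k', [1, 2])
```

Neither `json.dumps` output nor the tab separator can be confused with each other, because JSON escapes any tab inside a string as `\t`.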
Encode value as a JSON and discard key (key is read in as None).
Encode (key, value) as two string-escaped pickles separated by a tab.
We string-escape the pickles to avoid having to deal with stray \t and \n characters, which would confuse Hadoop Streaming.
Ugly, but should work for any type.
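A sketch of the string-escaping idea on Python 3, where the old `string_escape` codec no longer exists; the `latin-1`/`unicode_escape` round trip below is one way to escape arbitrary pickle bytes, and the helper names are hypothetical (not mrjob's actual implementation):

```python
import codecs
import pickle

def escape_pickle(obj):
    """Pickle obj, then escape the raw bytes so the result contains no
    literal tab or newline (Hadoop Streaming treats those as field and
    record separators)."""
    raw = pickle.dumps(obj)
    # latin-1 maps every byte value to a code point, so decoding never fails
    return raw.decode('latin-1').encode('unicode_escape')

def unescape_pickle(escaped):
    """Reverse escape_pickle: unescape back to the raw bytes, then unpickle."""
    raw = codecs.decode(escaped.decode('ascii'), 'unicode_escape').encode('latin-1')
    return pickle.loads(raw)
```

The escaped form is pure ASCII with `\t` and `\n` written out as two-character escapes, which is exactly the property needed to keep Hadoop Streaming from splitting a record in the wrong place.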
Encode value as a string-escaped pickle and discard key (key is read in as None).
Encode (key, value) as key and value separated by a tab (key and value should be bytestrings).
If key or value is None, don’t include a tab. When decoding a line with no tab in it, value will be None.
When reading a line with multiple tabs, we split on the first one.
Your key should probably not be None or have tab characters in it, but we don’t check.
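The tab-handling rules above can be sketched in a few lines operating on bytestrings (helper names hypothetical):

```python
def write_raw(key, value):
    """Join key and value with a tab; if either is None, omit the tab."""
    if key is None:
        return value
    if value is None:
        return key
    return key + b'\t' + value

def read_raw(line):
    """Split on the first tab; a line with no tab decodes to (line, None)."""
    parts = line.split(b'\t', 1)
    if len(parts) == 1:
        return parts[0], None
    return parts[0], parts[1]
```

Note the asymmetry: `write_raw(None, b'v')` produces a tab-less line, which reads back as `(b'v', None)`, not `(None, b'v')`.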
Read in a line as (None, line). Write out (key, value) as value; value must be a bytestring.
The default way for a job to read its initial input.
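This default is trivial but worth seeing spelled out, since it explains why the first mapper in a job always receives `None` as its key (helper names hypothetical):

```python
def read_line(line):
    """Each raw input line becomes (None, line); there is no key."""
    return None, line

def write_line(key, value):
    """Only the value is written out; the key is dropped."""
    return value
```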
Encode (key, value) as two reprs separated by a tab.
This only works for basic types (we use mrjob.util.safeeval()).
Encode value as a repr and discard key (key is read in as None).
This only works for basic types (we use mrjob.util.safeeval()).
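A sketch of the repr round trip used by these two protocols; here `ast.literal_eval` stands in for `mrjob.util.safeeval()`, since both restrict evaluation to basic literal types (helper names hypothetical):

```python
import ast

def write_repr_pair(key, value):
    """Encode (key, value) as two tab-separated reprs."""
    return repr(key) + '\t' + repr(value)

def read_repr_pair(line):
    """Decode with ast.literal_eval, a stand-in for mrjob.util.safeeval():
    only literals (numbers, strings, tuples, lists, dicts, ...) parse."""
    raw_key, raw_value = line.split('\t', 1)
    return ast.literal_eval(raw_key), ast.literal_eval(raw_value)

# Unlike JSON, repr preserves tuples and non-string dict keys:
# read_repr_pair(write_repr_pair((1, 2), {3: 'x'})) -> ((1, 2), {3: 'x'})
```

Tabs inside strings are safe here too: `repr('\t')` emits the two-character escape `\t`, never a literal tab.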