mrjob.parse - log parsing

Utilities for parsing errors, counters, and status messages.

mrjob.parse.counter_unescape(escaped_string)

Fix names of counters and groups emitted by Hadoop 0.20+ logs, which use escape sequences for more characters than most decoders know about (e.g. “(”).

Parameters: escaped_string (str) – string from a counter log line
mrjob.parse.find_hadoop_java_stack_trace(lines)

Scan a log file or other iterable for a Java stack trace from Hadoop, and return it as a list of lines.

In logs from EMR, we find Java stack traces in task-attempts/*/syslog.

Sample stack trace:

2010-07-27 18:25:48,397 WARN org.apache.hadoop.mapred.TaskTracker (main): Error running child
java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:270)
        at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:332)
        at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:147)
        at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:238)
        at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:255)
        at org.apache.hadoop.mapred.Merger.writeFile(Merger.java:86)
        at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:377)
        at org.apache.hadoop.mapred.Merger.merge(Merger.java:58)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:277)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2216)

(We omit the “Error running child” line from the results)
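
A hedged usage sketch (the local syslog path is an assumption about where the EMR task-attempt logs were downloaded):

from mrjob.parse import find_hadoop_java_stack_trace

# hypothetical local copy of an EMR task-attempt syslog
with open('task-attempts/attempt_201007271825_0001_r_000000_0/syslog') as f:
    stack_trace = find_hadoop_java_stack_trace(f)

if stack_trace is not None:
    print(''.join(stack_trace))  # lines come straight from the log file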

mrjob.parse.find_input_uri_for_mapper(lines)

Scan a log file or other iterable for the path of an input file for the first mapper on Hadoop. Just returns the path, or None if no match.

In logs from EMR, we find these lines in task-attempts/*/syslog.

Matching log lines look like:

2010-07-27 17:54:54,344 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3://yourbucket/logs/2010/07/23/log2-00077.gz' for reading
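
A minimal, hedged sketch (the syslog path is made up):

from mrjob.parse import find_input_uri_for_mapper

with open('task-attempts/attempt_201007271754_0001_m_000000_0/syslog') as f:
    input_uri = find_input_uri_for_mapper(f)
# input_uri is e.g. the s3:// path shown above, or None if no match
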
mrjob.parse.find_interesting_hadoop_streaming_error(lines)

Scan a log file or other iterable for a Hadoop Streaming error other than “Job not Successful!”. Return the error as a string, or None if nothing was found.

In logs from EMR, we find these errors in steps/*/syslog.

Example line:

2010-07-27 19:53:35,451 ERROR org.apache.hadoop.streaming.StreamJob (main): Error launching job , Output path already exists : Output directory s3://yourbucket/logs/2010/07/23/ already exists and is not empty
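
A hedged sketch (the step log path is an assumption about the layout of downloaded EMR logs):

from mrjob.parse import find_interesting_hadoop_streaming_error

with open('steps/1/syslog') as f:
    error_msg = find_interesting_hadoop_streaming_error(f)
# error_msg is the error message as a string, or None
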
mrjob.parse.find_job_log_multiline_error(lines)

Scan a log file for an arbitrary multi-line error. Return it as a list of lines, or None if nothing was found.

Here is an example error:

MapAttempt TASK_TYPE="MAP" TASKID="task_201106280040_0001_m_000218" TASK_ATTEMPT_ID="attempt_201106280040_0001_m_000218_5" TASK_STATUS="FAILED" FINISH_TIME="1309246900665" HOSTNAME="/default-rack/ip-10-166-239-133.us-west-1.compute.internal" ERROR="Error initializing attempt_201106280040_0001_m_000218_5:
java.io.IOException: Cannot run program "bash": java.io.IOException: error=12, Cannot allocate memory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
    at org.apache.hadoop.util.Shell.run(Shell.java:134)
    at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:296)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
    at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:648)
    at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:1320)
    at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:956)
    at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1357)
    at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2361)
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
    at java.lang.ProcessImpl.start(ProcessImpl.java:65)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
    ... 10 more
"

The first line returned includes only the text after ERROR=", and the final line containing just " is discarded.

These errors are parsed from jobs/*.jar.
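
A hedged sketch (the job log filename is an assumption):

from mrjob.parse import find_job_log_multiline_error

with open('jobs/job_201106280040_0001.jar') as f:
    error_lines = find_job_log_multiline_error(f)

if error_lines is not None:
    for line in error_lines:
        print(line.rstrip('\r\n'))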

mrjob.parse.find_python_traceback(lines)

Scan a log file or other iterable for a Python traceback, and return it as a list of lines.

In logs from EMR, we find Python tracebacks in task-attempts/*/stderr.
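
A hedged sketch along the same lines (the stderr path is made up):

from mrjob.parse import find_python_traceback

with open('task-attempts/attempt_201007271825_0001_m_000000_0/stderr') as f:
    tb_lines = find_python_traceback(f)
# tb_lines is a list of traceback lines, or None if no traceback was found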

mrjob.parse.find_timeout_error(lines)

Scan a log file or other iterable for a timeout error from Hadoop. Return the number of seconds the job ran for before timing out, or None if nothing found.

In logs from EMR, we find timeout errors in jobs/*.jar.

Example line:

Task TASKID="task_201010202309_0001_m_000153" TASK_TYPE="MAP" TASK_STATUS="FAILED" FINISH_TIME="1287618918658" ERROR="Task attempt_201010202309_0001_m_000153_3 failed to report status for 602 seconds. Killing!"
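
A hedged sketch (the job log path is an assumption):

from mrjob.parse import find_timeout_error

with open('jobs/job_201010202309_0001.jar') as f:
    seconds = find_timeout_error(f)

if seconds is not None:
    print('task timed out after %d seconds' % seconds)  # e.g. 602 for the line above
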
mrjob.parse.is_s3_uri(uri)

Return True if uri can be parsed into an S3 URI, False otherwise.

mrjob.parse.is_uri(uri)

Return True if uri is any sort of URI.
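
Illustrative examples (the truth values assume the usual scheme-based checks):

>>> is_uri('s3://walrus/tmp/')
True
>>> is_uri('/tmp/output')
False
>>> is_s3_uri('hdfs://namenode/tmp/output')
False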

mrjob.parse.parse_hadoop_counters_from_line(line, hadoop_version=None)

Parse Hadoop counter values from a log line.

The counter log line format changed significantly between Hadoop 0.18 and 0.20, so this function switches between parsers for them.

Parameters:
  • line (str) – log line containing counter data
  • hadoop_version (str) – Hadoop version that produced the line, used to pick the appropriate parser (optional)
Returns: (counter_dict, step_num), or (None, None) if the line contains no counter data
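
A hedged sketch of scanning a job log for counters (the path, the version string, and the way results are stored are assumptions for illustration):

from mrjob.parse import parse_hadoop_counters_from_line

counters_by_step = {}
with open('jobs/job_201010202309_0001.jar') as f:  # hypothetical job log
    for line in f:
        counters, step_num = parse_hadoop_counters_from_line(line, hadoop_version='0.20')
        if counters:
            counters_by_step[step_num] = counters  # keep the most recent counters per step
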
mrjob.parse.parse_key_value_list(kv_string_list, error_fmt, error_func)

Parse a list of strings like KEY=VALUE into a dictionary.

Parameters:
  • kv_string_list ([str]) – list of strings of the form KEY=VALUE
  • error_fmt (str) – Format string accepting one %s argument which is the malformed (i.e. not KEY=VALUE) string
  • error_func (function(str)) – Function to call when a malformed string is encountered.
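
A hedged usage sketch (the jobconf strings and the warn() callback are made up for illustration):

from mrjob.parse import parse_key_value_list

kv_strings = ['mapred.map.tasks=4', 'mapred.reduce.tasks=2', 'not-a-pair']

def warn(message):
    print(message)

jobconf = parse_key_value_list(
    kv_strings, 'Unable to parse %s; should be of the form KEY=VALUE', warn)
# expected: jobconf == {'mapred.map.tasks': '4', 'mapred.reduce.tasks': '2'}
# and warn() called once for 'not-a-pair'
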
mrjob.parse.parse_mr_job_stderr(stderr, counters=None)

Parse counters and status messages out of MRJob output.

Parameters: stderr – a filehandle, a list of lines, or a str containing data

Returns a dictionary with the keys counters, statuses, other:

  • counters: counters so far, as a map from counter group to counter name to amount
  • statuses: a list of status messages encountered
  • other: lines that aren’t either counters or status messages
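
A hedged sketch using the reporter:counter:/reporter:status: lines that MRJob writes to stderr (the sample lines and the expected results in the comments are illustrative):

from mrjob.parse import parse_mr_job_stderr

stderr_lines = [
    'reporter:counter:wordcount,lines,8\n',
    'reporter:status:processing chunk 3\n',
    'some stray debugging output\n',
]

result = parse_mr_job_stderr(stderr_lines)
# result['counters'] -> {'wordcount': {'lines': 8}}
# result['statuses'] -> ['processing chunk 3']
# result['other']    -> ['some stray debugging output\n']
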
mrjob.parse.parse_port_range_list(range_list_str)

Parse a port range list of the form (start[:end])(,(start[:end]))*
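
An illustrative example, assuming each start:end range is inclusive and a flat list of port numbers is returned:

>>> parse_port_range_list('4000:4002,4040')
[4000, 4001, 4002, 4040]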

mrjob.parse.parse_s3_uri(uri)

Parse an S3 URI into (bucket, key)

>>> parse_s3_uri('s3://walrus/tmp/')
('walrus', 'tmp/')

If uri is not an S3 URI, raise a ValueError
