Hadoop Cookbook

Increasing the task timeout

Warning

Some EMR AMIs do not appear to support setting parameters such as the timeout with jobconf at run time. On those AMIs, you must use Bootstrap-time configuration instead.

If your mappers or reducers take a long time to process a single step, you may want to increase how long Hadoop waits for a task to read input, write output, or report status before killing it as timed out. You can do this with jobconf and the version-appropriate Hadoop configuration property, whose value is given in milliseconds. For example, this configuration will set the timeout to one hour:

runners:
  hadoop: # this will work for both hadoop and emr
    jobconf:
      # Hadoop 0.18
      mapred.task.timeout: 3600000
      # Hadoop 0.21+
      mapreduce.task.timeout: 3600000

mrjob will convert your jobconf options between Hadoop versions if necessary. In this example, either jobconf line could be removed and the timeout would still be changed when using either version of Hadoop.
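The renaming described above can be pictured as a lookup between old and new property names. The following is a minimal sketch, not mrjob's actual implementation: the `OLD_TO_NEW` table and the `translate_jobconf` function are hypothetical names, and the table holds only the one pair of property names shown in this section.

```python
# Hypothetical illustration only; mrjob's real translation logic is not shown here.

# Old-style name -> new-style (Hadoop 0.21+) name, taken from the example above.
OLD_TO_NEW = {
    "mapred.task.timeout": "mapreduce.task.timeout",
}
NEW_TO_OLD = {new: old for old, new in OLD_TO_NEW.items()}


def translate_jobconf(jobconf, use_new_names):
    """Return a copy of jobconf with known keys renamed for the target style.

    Keys that are not in the table are passed through unchanged.
    """
    table = OLD_TO_NEW if use_new_names else NEW_TO_OLD
    return {table.get(key, key): value for key, value in jobconf.items()}


if __name__ == "__main__":
    one_hour_ms = 60 * 60 * 1000  # the timeout value is given in milliseconds
    conf = {"mapred.task.timeout": one_hour_ms}
    print(translate_jobconf(conf, use_new_names=True))
    print(translate_jobconf(conf, use_new_names=False))
```

Because unknown keys pass through untouched, a sketch like this only helps for properties that appear in its table. Whether a given property is covered by mrjob's own conversion is something to check against the mrjob version in use.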

Writing compressed output

To save space, you can have Hadoop automatically save your job’s output as compressed files. This works the same way as changing the task timeout: set the appropriate configuration properties with jobconf. This example uses the Hadoop 0.21+ property names:

runners:
  hadoop: # this will work for both hadoop and emr
    jobconf:
      # "true" must be a string argument, not a boolean! (#323)
      mapreduce.output.fileoutputformat.compress: "true"
      mapreduce.output.fileoutputformat.compress.codec: org.apache.hadoop.io.compress.GzipCodec
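
Files written with GzipCodec are ordinary gzip files, so they can be read back with standard tools. A minimal sketch using Python's `gzip` module follows. The file name `part-00000.gz`, the helper names, and the tab-separated sample lines are assumptions made for illustration, and the script writes its own small sample file so that it runs without a cluster.

```python
import gzip
import os
import tempfile


def read_compressed_lines(path):
    """Yield each line of a gzip-compressed text file, without the newline."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")


def write_sample(path):
    """Create a stand-in for a compressed output file (hypothetical content)."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write('"a"\t1\n"b"\t2\n')


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "part-00000.gz")  # assumed file name
        write_sample(path)
        for line in read_compressed_lines(path):
            # Split each line into key and value at the first tab.
            key, value = line.split("\t", 1)
            print(key, value)
```

Reading line by line keeps memory use flat even for large output files, since the file is decompressed as it is streamed.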
