pyspark.SparkContext.hadoopRDD

SparkContext.hadoopRDD(inputFormatClass, keyClass, valueClass, keyConverter=None, valueConverter=None, conf=None, batchSize=0)

Read an ‘old’ Hadoop InputFormat (one from the org.apache.hadoop.mapred API) with arbitrary key and value classes, using an arbitrary Hadoop configuration passed in as a Python dict. The dict is converted into a Hadoop Configuration in Java. The mechanism is the same as for sc.sequenceFile.

Parameters
  • inputFormatClass – fully qualified classname of Hadoop InputFormat (e.g. “org.apache.hadoop.mapred.TextInputFormat”)

  • keyClass – fully qualified classname of key Writable class (e.g. “org.apache.hadoop.io.Text”)

  • valueClass – fully qualified classname of value Writable class (e.g. “org.apache.hadoop.io.LongWritable”)

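  • keyConverter – fully qualified classname of a key converter class (None by default)

  • valueConverter – fully qualified classname of a value converter class (None by default)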

  • conf – Hadoop configuration, passed in as a dict (None by default)

  • batchSize – the number of Python objects represented as a single Java object (default 0, which chooses the batch size automatically)
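
The parameters above can be combined as in the following minimal sketch, which reads a plain text file through the old-API TextInputFormat. The helper names build_hadoop_conf and read_text_lines are hypothetical and exist only for illustration. The sketch also assumes that the input location is supplied through the "mapreduce.input.fileinputformat.inputdir" property (older Hadoop releases used "mapred.input.dir"), and that sc is an already running SparkContext.

```python
def build_hadoop_conf(input_path):
    """Return a Hadoop configuration as a plain dict of string keys and values.

    Assumption: the input format reads its input location from the
    "mapreduce.input.fileinputformat.inputdir" property.
    """
    return {"mapreduce.input.fileinputformat.inputdir": input_path}


def read_text_lines(sc, input_path):
    """Read a text file with the old-API TextInputFormat via sc.hadoopRDD.

    TextInputFormat produces (byte offset, line) records, so the key class
    is LongWritable and the value class is Text.
    """
    return sc.hadoopRDD(
        inputFormatClass="org.apache.hadoop.mapred.TextInputFormat",
        keyClass="org.apache.hadoop.io.LongWritable",
        valueClass="org.apache.hadoop.io.Text",
        conf=build_hadoop_conf(input_path),
    )
```

With a live SparkContext, read_text_lines(sc, path).collect() would be expected to return a list of (offset, line) pairs. The converters and batchSize are left at their defaults here.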