pyspark.RDD

class pyspark.RDD(jrdd, ctx, jrdd_deserializer=AutoBatchedSerializer(PickleSerializer()))

A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel.
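The three properties named above (immutable, partitioned, operated on in parallel) can be illustrated with a small pure-Python toy model. This is only a sketch and is not how Spark is implemented: `ToyRDD` and its internals are hypothetical, and it evaluates eagerly on local threads, whereas a real RDD is evaluated lazily across a cluster. The method names are borrowed from the real API (`SparkContext.parallelize`, `RDD.map`, `RDD.glom`, `RDD.collect`) purely for familiarity.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyRDD:
    """Hypothetical stand-in for an RDD; NOT part of pyspark."""

    def __init__(self, partitions):
        # Tuples keep each partition, and the set of partitions, immutable.
        self._partitions = tuple(tuple(p) for p in partitions)

    @classmethod
    def parallelize(cls, data, num_slices):
        # Split the input into num_slices contiguous chunks.
        data = list(data)
        size, extra = divmod(len(data), num_slices)
        parts, start = [], 0
        for i in range(num_slices):
            end = start + size + (1 if i < extra else 0)
            parts.append(data[start:end])
            start = end
        return cls(parts)

    def map(self, func):
        # Process every partition concurrently and return a NEW ToyRDD;
        # the original object is never modified.
        def apply(part):
            return [func(x) for x in part]

        with ThreadPoolExecutor() as pool:
            return ToyRDD(pool.map(apply, self._partitions))

    def glom(self):
        # Expose the partition structure as a list of lists.
        return [list(p) for p in self._partitions]

    def collect(self):
        # Flatten all partitions into a single list, in order.
        return [x for p in self._partitions for x in p]


numbers = ToyRDD.parallelize(range(1, 7), num_slices=3)
squares = numbers.map(lambda x: x * x)
```

Here `squares.glom()` shows the per-partition results, while `numbers.collect()` still returns the original elements, since `map` produced a new collection rather than changing the existing one.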

__init__(jrdd, ctx, jrdd_deserializer=AutoBatchedSerializer(PickleSerializer()))

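Initialize the RDD from a JVM-side RDD handle (jrdd), the owning SparkContext (ctx), and the deserializer used to read its data (jrdd_deserializer). Application code does not normally call this constructor directly; RDDs are usually obtained from SparkContext methods such as parallelize() or textFile(). See help(type(self)) for the accurate signature.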

Methods

Attributes