pyspark.RDD.sample

RDD.sample(withReplacement, fraction, seed=None)

Return a sampled subset of this RDD.

Parameters
  • withReplacement – whether an element can be sampled more than once (that is, put back after being drawn)

  • fraction – expected size of the sample as a fraction of this RDD’s size
      without replacement: probability that each element is chosen; fraction must be in [0, 1]
      with replacement: expected number of times each element is chosen; fraction must be >= 0

  • seed – seed for the random number generator

Note

This is not guaranteed to return exactly the specified fraction of this RDD’s total count.

>>> rdd = sc.parallelize(range(100), 4)
>>> 6 <= rdd.sample(False, 0.1, 81).count() <= 14
True
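
The two meanings of fraction, and the reason the sample size is only approximate, can be sketched in plain Python without a Spark context. This is an illustrative model, not Spark's implementation: sample_sketch is a hypothetical helper, the Bernoulli and Poisson draws are assumptions taken from the parameter descriptions above, and raising ValueError for an out-of-range fraction is this sketch's own choice.

```python
import math
import random


def sample_sketch(items, with_replacement, fraction, seed=None):
    """Hypothetical stand-in for the sampling semantics; not Spark's code."""
    # Range check follows the documented constraints on `fraction`.
    if fraction < 0 or (not with_replacement and fraction > 1):
        raise ValueError("fraction out of range for this sampling mode")
    rng = random.Random(seed)
    out = []
    for item in items:
        if with_replacement:
            # Poisson(fraction) draw via Knuth's method: the item is emitted
            # `copies` times, which averages out to `fraction`.
            threshold = math.exp(-fraction)
            copies, product = 0, rng.random()
            while product > threshold:
                copies += 1
                product *= rng.random()
        else:
            # Bernoulli trial: keep the item with probability `fraction`.
            copies = 1 if rng.random() < fraction else 0
        out.extend([item] * copies)
    return out


data = list(range(100))
without = sample_sketch(data, False, 0.1, seed=81)
with_repl = sample_sketch(data, True, 2.0, seed=81)
# Sizes land near 10 and 200 but are not guaranteed to hit them exactly.
print(len(without), len(with_repl))
```

Because every element is decided independently, the resulting size is a random quantity centered on fraction times the input size, which is why the doctest above checks a range rather than an exact count.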