pyspark.sql.DataFrame.sample
DataFrame.sample(withReplacement=None, fraction=None, seed=None)
Returns a sampled subset of this DataFrame.

Parameters
withReplacement – Sample with replacement or not (default False).
fraction – Fraction of rows to generate, range [0.0, 1.0].
seed – Seed for sampling (default a random seed).

Note
The sample is not guaranteed to contain exactly the specified fraction of the rows in the given DataFrame.

Note
fraction is required; withReplacement and seed are optional.

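
The first note can be illustrated without Spark. The sketch below is plain Python and assumes a per-row model in which each row is kept independently with probability fraction. The helper name bernoulli_sample is hypothetical, and this is an illustration of why the count varies, not Spark's implementation.

```python
import random


def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`.

    Hypothetical helper for illustration only; not part of PySpark.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0.0, 1.0]")
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    return [row for row in rows if rng.random() < fraction]


rows = list(range(10))

# The same seed on the same input gives the same sample.
sample_a = bernoulli_sample(rows, 0.5, seed=3)
sample_b = bernoulli_sample(rows, 0.5, seed=3)

# Across seeds the sample size varies; it is 5 only on average.
sizes = [len(bernoulli_sample(rows, 0.5, seed=s)) for s in range(200)]
print(sorted(set(sizes)))
```

Under this model only fraction=0.0 and fraction=1.0 give a fixed count, which is consistent with the sample(1.0) examples on this page always returning all 10 rows.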
>>> df = spark.range(10)
>>> df.sample(0.5, 3).count()
7
>>> df.sample(fraction=0.5, seed=3).count()
7
>>> df.sample(withReplacement=True, fraction=0.5, seed=3).count()
1
>>> df.sample(1.0).count()
10
>>> df.sample(fraction=1.0).count()
10
>>> df.sample(False, fraction=1.0).count()
10
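
The withReplacement=True example above returns a different count from the same fraction and seed. One way to picture sampling with replacement is sketched below in plain Python. It assumes each row is emitted a Poisson-distributed number of times with mean fraction; that model is an assumption made for illustration and is not stated on this page. The helpers poisson_draw and sample_with_replacement are hypothetical names, not PySpark functions.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw a Poisson(lam) count with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_with_replacement(rows, fraction, seed=None):
    """Emit each row a random number of times (hypothetical helper).

    Assumes a Poisson count per row with mean `fraction`.
    """
    rng = random.Random(seed)
    sampled = []
    for row in rows:
        sampled.extend([row] * poisson_draw(rng, fraction))
    return sampled


rows = list(range(10))
samples = [sample_with_replacement(rows, 0.5, seed=s) for s in range(200)]

# Unlike sampling without replacement, a row can appear more than once.
has_duplicates = any(len(s) != len(set(s)) for s in samples)
print(has_duplicates)
```

Because the number of copies per row is random, the total count can fall well below or above fraction times the input size for any single seed.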
New in version 1.3.