RandomRDDs

class pyspark.mllib.random.RandomRDDs[source]

Generator methods for creating RDDs comprised of i.i.d. samples from some distribution.

New in version 1.1.0.

Methods Documentation

static exponentialRDD(sc, mean, size, numPartitions=None, seed=None)[source]

Generates an RDD comprised of i.i.d. samples from the Exponential distribution with the input mean.

Parameters
  • sc – SparkContext used to create the RDD.

  • mean – Mean, or 1 / lambda, for the Exponential distribution.

  • size – Size of the RDD.

  • numPartitions – Number of partitions in the RDD (default: sc.defaultParallelism).

  • seed – Random seed (default: a random long integer).

Returns

RDD of float comprised of i.i.d. samples ~ Exp(mean).

>>> mean = 2.0
>>> x = RandomRDDs.exponentialRDD(sc, mean, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - mean) < 0.5
True
>>> from math import sqrt
>>> abs(stats.stdev() - sqrt(mean)) < 0.5
True

New in version 1.3.0.
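As a side note, the "mean, or 1 / lambda" parameterization above can be sanity-checked locally without Spark. A minimal sketch (illustrative values, not part of the Spark API) using Python's random.expovariate, which takes the rate lambd = 1 / mean:

```python
import random
import statistics

# Exp(mean) with mean = 1 / lambda: random.expovariate takes the rate
# lambd, so lambd = 1.0 / mean reproduces the documented parameterization.
random.seed(2)
mean = 2.0
samples = [random.expovariate(1.0 / mean) for _ in range(20000)]

# For the exponential distribution, the stdev equals the mean.
print(abs(statistics.mean(samples) - mean) < 0.1)
print(abs(statistics.stdev(samples) - mean) < 0.15)
```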

static exponentialVectorRDD(sc, mean, numRows, numCols, numPartitions=None, seed=None)[source]

Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Exponential distribution with the input mean.

Parameters
  • sc – SparkContext used to create the RDD.

  • mean – Mean, or 1 / lambda, for the Exponential distribution.

  • numRows – Number of Vectors in the RDD.

  • numCols – Number of elements in each Vector.

  • numPartitions – Number of partitions in the RDD (default: sc.defaultParallelism).

  • seed – Random seed (default: a random long integer).

Returns

RDD of Vector with vectors containing i.i.d. samples ~ Exp(mean).

>>> import numpy as np
>>> mean = 0.5
>>> rdd = RandomRDDs.exponentialVectorRDD(sc, mean, 100, 100, seed=1)
>>> mat = np.mat(rdd.collect())
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - mean) < 0.5
True
>>> from math import sqrt
>>> abs(mat.std() - sqrt(mean)) < 0.5
True

New in version 1.3.0.

static gammaRDD(sc, shape, scale, size, numPartitions=None, seed=None)[source]

Generates an RDD comprised of i.i.d. samples from the Gamma distribution with the input shape and scale.

Parameters
  • sc – SparkContext used to create the RDD.

  • shape – Shape (> 0) parameter for the Gamma distribution.

  • scale – Scale (> 0) parameter for the Gamma distribution.

  • size – Size of the RDD.

  • numPartitions – Number of partitions in the RDD (default: sc.defaultParallelism).

  • seed – Random seed (default: a random long integer).

Returns

RDD of float comprised of i.i.d. samples ~ Gamma(shape, scale).

>>> from math import sqrt
>>> shape = 1.0
>>> scale = 2.0
>>> expMean = shape * scale
>>> expStd = sqrt(shape * scale * scale)
>>> x = RandomRDDs.gammaRDD(sc, shape, scale, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - expMean) < 0.5
True
>>> abs(stats.stdev() - expStd) < 0.5
True

New in version 1.3.0.
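The expected moments computed in the doctest above (mean = shape * scale, stdev = sqrt(shape * scale * scale)) can be checked locally with Python's random.gammavariate. A sketch with illustrative parameter values, no Spark required:

```python
import random
import statistics
from math import sqrt

# Gamma(shape, scale) moments: mean = shape * scale,
# stdev = sqrt(shape) * scale, matching the doctest's expMean / expStd.
random.seed(5)
shape, scale = 2.0, 1.5
samples = [random.gammavariate(shape, scale) for _ in range(20000)]

print(abs(statistics.mean(samples) - shape * scale) < 0.1)
print(abs(statistics.stdev(samples) - sqrt(shape) * scale) < 0.1)
```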

static gammaVectorRDD(sc, shape, scale, numRows, numCols, numPartitions=None, seed=None)[source]

Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Gamma distribution.

Parameters
  • sc – SparkContext used to create the RDD.

  • shape – Shape (> 0) of the Gamma distribution.

  • scale – Scale (> 0) of the Gamma distribution.

  • numRows – Number of Vectors in the RDD.

  • numCols – Number of elements in each Vector.

  • numPartitions – Number of partitions in the RDD (default: sc.defaultParallelism).

  • seed – Random seed (default: a random long integer).

Returns

RDD of Vector with vectors containing i.i.d. samples ~ Gamma(shape, scale).

>>> import numpy as np
>>> from math import sqrt
>>> shape = 1.0
>>> scale = 2.0
>>> expMean = shape * scale
>>> expStd = sqrt(shape * scale * scale)
>>> mat = np.matrix(RandomRDDs.gammaVectorRDD(sc, shape, scale, 100, 100, seed=1).collect())
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - expMean) < 0.1
True
>>> abs(mat.std() - expStd) < 0.1
True

New in version 1.3.0.

static logNormalRDD(sc, mean, std, size, numPartitions=None, seed=None)[source]

Generates an RDD comprised of i.i.d. samples from the log normal distribution with the input mean and standard deviation.

Parameters
  • sc – SparkContext used to create the RDD.

  • mean – Mean for the log Normal distribution.

  • std – Standard deviation for the log Normal distribution.

  • size – Size of the RDD.

  • numPartitions – Number of partitions in the RDD (default: sc.defaultParallelism).

  • seed – Random seed (default: a random long integer).

Returns

RDD of float comprised of i.i.d. samples ~ log N(mean, std).

>>> from math import sqrt, exp
>>> mean = 0.0
>>> std = 1.0
>>> expMean = exp(mean + 0.5 * std * std)
>>> expStd = sqrt((exp(std * std) - 1.0) * exp(2.0 * mean + std * std))
>>> x = RandomRDDs.logNormalRDD(sc, mean, std, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - expMean) < 0.5
True
>>> abs(stats.stdev() - expStd) < 0.5
True

New in version 1.3.0.
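The moment formulas the doctest above relies on — E[X] = exp(mean + std^2 / 2) and Var[X] = (exp(std^2) - 1) * exp(2 * mean + std^2) — can be verified locally with Python's random.lognormvariate. A sketch with illustrative values:

```python
import random
import statistics
from math import exp, sqrt

# Log-normal moments for underlying N(mu, sigma^2), mirroring the
# expMean / expStd expressions in the doctest.
random.seed(3)
mu, sigma = 0.0, 0.5
expMean = exp(mu + 0.5 * sigma * sigma)
expStd = sqrt((exp(sigma * sigma) - 1.0) * exp(2.0 * mu + sigma * sigma))
samples = [random.lognormvariate(mu, sigma) for _ in range(20000)]

print(abs(statistics.mean(samples) - expMean) < 0.05)
print(abs(statistics.stdev(samples) - expStd) < 0.05)
```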

static logNormalVectorRDD(sc, mean, std, numRows, numCols, numPartitions=None, seed=None)[source]

Generates an RDD comprised of vectors containing i.i.d. samples drawn from the log normal distribution.

Parameters
  • sc – SparkContext used to create the RDD.

  • mean – Mean of the log normal distribution

  • std – Standard Deviation of the log normal distribution

  • numRows – Number of Vectors in the RDD.

  • numCols – Number of elements in each Vector.

  • numPartitions – Number of partitions in the RDD (default: sc.defaultParallelism).

  • seed – Random seed (default: a random long integer).

Returns

RDD of Vector with vectors containing i.i.d. samples ~ log N(mean, std).

>>> import numpy as np
>>> from math import sqrt, exp
>>> mean = 0.0
>>> std = 1.0
>>> expMean = exp(mean + 0.5 * std * std)
>>> expStd = sqrt((exp(std * std) - 1.0) * exp(2.0 * mean + std * std))
>>> m = RandomRDDs.logNormalVectorRDD(sc, mean, std, 100, 100, seed=1).collect()
>>> mat = np.matrix(m)
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - expMean) < 0.1
True
>>> abs(mat.std() - expStd) < 0.1
True

New in version 1.3.0.

static normalRDD(sc, size, numPartitions=None, seed=None)[source]

Generates an RDD comprised of i.i.d. samples from the standard normal distribution.

To transform the distribution in the generated RDD from standard normal to some other normal N(mean, sigma^2), use RandomRDDs.normalRDD(sc, n, p, seed).map(lambda v: mean + sigma * v).

Parameters
  • sc – SparkContext used to create the RDD.

  • size – Size of the RDD.

  • numPartitions – Number of partitions in the RDD (default: sc.defaultParallelism).

  • seed – Random seed (default: a random long integer).

Returns

RDD of float comprised of i.i.d. samples ~ N(0.0, 1.0).

>>> x = RandomRDDs.normalRDD(sc, 1000, seed=1)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - 0.0) < 0.1
True
>>> abs(stats.stdev() - 1.0) < 0.1
True

New in version 1.1.0.
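The affine transform noted above (standard normal to N(mean, sigma^2) via v -> mean + sigma * v) can be verified without Spark. A minimal local sketch with illustrative mean and sigma values:

```python
import random
import statistics

# Map standard-normal draws to N(mean, sigma^2) with v -> mean + sigma * v,
# the same transform the .map(lambda ...) in the note applies per element.
random.seed(1)
mean, sigma = 3.0, 2.0
std_normal = [random.gauss(0.0, 1.0) for _ in range(10000)]
shifted = [mean + sigma * v for v in std_normal]

print(abs(statistics.mean(shifted) - mean) < 0.1)   # sample mean near `mean`
print(abs(statistics.stdev(shifted) - sigma) < 0.1)  # sample stdev near `sigma`
```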

static normalVectorRDD(sc, numRows, numCols, numPartitions=None, seed=None)[source]

Generates an RDD comprised of vectors containing i.i.d. samples drawn from the standard normal distribution.

Parameters
  • sc – SparkContext used to create the RDD.

  • numRows – Number of Vectors in the RDD.

  • numCols – Number of elements in each Vector.

  • numPartitions – Number of partitions in the RDD (default: sc.defaultParallelism).

  • seed – Random seed (default: a random long integer).

Returns

RDD of Vector with vectors containing i.i.d. samples ~ N(0.0, 1.0).

>>> import numpy as np
>>> mat = np.matrix(RandomRDDs.normalVectorRDD(sc, 100, 100, seed=1).collect())
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - 0.0) < 0.1
True
>>> abs(mat.std() - 1.0) < 0.1
True

New in version 1.1.0.

static poissonRDD(sc, mean, size, numPartitions=None, seed=None)[source]

Generates an RDD comprised of i.i.d. samples from the Poisson distribution with the input mean.

Parameters
  • sc – SparkContext used to create the RDD.

  • mean – Mean, or lambda, for the Poisson distribution.

  • size – Size of the RDD.

  • numPartitions – Number of partitions in the RDD (default: sc.defaultParallelism).

  • seed – Random seed (default: a random long integer).

Returns

RDD of float comprised of i.i.d. samples ~ Pois(mean).

>>> mean = 100.0
>>> x = RandomRDDs.poissonRDD(sc, mean, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - mean) < 0.5
True
>>> from math import sqrt
>>> abs(stats.stdev() - sqrt(mean)) < 0.5
True

New in version 1.1.0.
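The relation the doctest above exploits — for Poisson, the variance equals the mean, so stdev == sqrt(mean) — can also be checked locally. Python's random module has no Poisson sampler, so the sketch below uses a hypothetical inverse-CDF helper (illustrative only, suitable for small means):

```python
import random
import statistics
from math import exp, sqrt

def poisson_sample(lam, rng):
    # Inverse-CDF sampling: walk the Poisson CDF until it exceeds a
    # uniform draw, using p_k = p_{k-1} * lam / k with p_0 = exp(-lam).
    u = rng.random()
    p = exp(-lam)
    cdf, k = p, 0
    while cdf < u:
        k += 1
        p *= lam / k
        cdf += p
    return k

rng = random.Random(4)
lam = 5.0
samples = [poisson_sample(lam, rng) for _ in range(20000)]

print(abs(statistics.mean(samples) - lam) < 0.1)
print(abs(statistics.stdev(samples) - sqrt(lam)) < 0.1)
```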

static poissonVectorRDD(sc, mean, numRows, numCols, numPartitions=None, seed=None)[source]

Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Poisson distribution with the input mean.

Parameters
  • sc – SparkContext used to create the RDD.

  • mean – Mean, or lambda, for the Poisson distribution.

  • numRows – Number of Vectors in the RDD.

  • numCols – Number of elements in each Vector.

  • numPartitions – Number of partitions in the RDD (default: sc.defaultParallelism).

  • seed – Random seed (default: a random long integer).

Returns

RDD of Vector with vectors containing i.i.d. samples ~ Pois(mean).

>>> import numpy as np
>>> mean = 100.0
>>> rdd = RandomRDDs.poissonVectorRDD(sc, mean, 100, 100, seed=1)
>>> mat = np.mat(rdd.collect())
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - mean) < 0.5
True
>>> from math import sqrt
>>> abs(mat.std() - sqrt(mean)) < 0.5
True

New in version 1.1.0.

static uniformRDD(sc, size, numPartitions=None, seed=None)[source]

Generates an RDD comprised of i.i.d. samples from the uniform distribution U(0.0, 1.0).

To transform the distribution in the generated RDD from U(0.0, 1.0) to U(a, b), use RandomRDDs.uniformRDD(sc, n, p, seed).map(lambda v: a + (b - a) * v).

Parameters
  • sc – SparkContext used to create the RDD.

  • size – Size of the RDD.

  • numPartitions – Number of partitions in the RDD (default: sc.defaultParallelism).

  • seed – Random seed (default: a random long integer).

Returns

RDD of float comprised of i.i.d. samples ~ U(0.0, 1.0).

>>> x = RandomRDDs.uniformRDD(sc, 100).collect()
>>> len(x)
100
>>> max(x) <= 1.0 and min(x) >= 0.0
True
>>> RandomRDDs.uniformRDD(sc, 100, 4).getNumPartitions()
4
>>> parts = RandomRDDs.uniformRDD(sc, 100, seed=4).getNumPartitions()
>>> parts == sc.defaultParallelism
True

New in version 1.1.0.
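The rescaling transform noted above (U(0.0, 1.0) to U(a, b) via v -> a + (b - a) * v) can be verified without Spark. A minimal local sketch with illustrative endpoints a and b:

```python
import random

# Map U(0, 1) draws to U(a, b) with v -> a + (b - a) * v, the same
# per-element transform the .map(lambda ...) in the note applies.
random.seed(7)
a, b = -2.0, 5.0
u01 = [random.random() for _ in range(10000)]
uab = [a + (b - a) * v for v in u01]

print(min(uab) >= a and max(uab) <= b)               # all samples land in [a, b]
print(abs(sum(uab) / len(uab) - (a + b) / 2) < 0.2)  # mean near (a + b) / 2
```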

static uniformVectorRDD(sc, numRows, numCols, numPartitions=None, seed=None)[source]

Generates an RDD comprised of vectors containing i.i.d. samples drawn from the uniform distribution U(0.0, 1.0).

Parameters
  • sc – SparkContext used to create the RDD.

  • numRows – Number of Vectors in the RDD.

  • numCols – Number of elements in each Vector.

  • numPartitions – Number of partitions in the RDD (default: sc.defaultParallelism).

  • seed – Seed for the RNG that generates the seed for the generator in each partition (default: a random long integer).

Returns

RDD of Vector with vectors containing i.i.d. samples ~ U(0.0, 1.0).

>>> import numpy as np
>>> mat = np.matrix(RandomRDDs.uniformVectorRDD(sc, 10, 10).collect())
>>> mat.shape
(10, 10)
>>> mat.max() <= 1.0 and mat.min() >= 0.0
True
>>> RandomRDDs.uniformVectorRDD(sc, 10, 10, 4).getNumPartitions()
4

New in version 1.1.0.