RandomRDDs

class pyspark.mllib.random.RandomRDDs

Generator methods for creating RDDs comprised of i.i.d. samples from some distribution.

New in version 1.1.0.
Methods Documentation
static exponentialRDD(sc, mean, size, numPartitions=None, seed=None)

Generates an RDD comprised of i.i.d. samples from the Exponential distribution with the input mean.
Parameters
    sc – SparkContext used to create the RDD.
    mean – Mean, or 1 / lambda, for the Exponential distribution.
    size – Size of the RDD.
    numPartitions – Number of partitions in the RDD (default: sc.defaultParallelism).
    seed – Random seed (default: a random long integer).

Returns
    RDD of float comprised of i.i.d. samples ~ Exp(mean).

>>> mean = 2.0
>>> x = RandomRDDs.exponentialRDD(sc, mean, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - mean) < 0.5
True
>>> from math import sqrt
>>> abs(stats.stdev() - sqrt(mean)) < 0.5
True
New in version 1.3.0.
static exponentialVectorRDD(sc, mean, numRows, numCols, numPartitions=None, seed=None)

Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Exponential distribution with the input mean.
Parameters
    sc – SparkContext used to create the RDD.
    mean – Mean, or 1 / lambda, for the Exponential distribution.
    numRows – Number of Vectors in the RDD.
    numCols – Number of elements in each Vector.
    numPartitions – Number of partitions in the RDD (default: sc.defaultParallelism).
    seed – Random seed (default: a random long integer).

Returns
    RDD of Vector with vectors containing i.i.d. samples ~ Exp(mean).

>>> import numpy as np
>>> mean = 0.5
>>> rdd = RandomRDDs.exponentialVectorRDD(sc, mean, 100, 100, seed=1)
>>> mat = np.mat(rdd.collect())
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - mean) < 0.5
True
>>> from math import sqrt
>>> abs(mat.std() - sqrt(mean)) < 0.5
True
New in version 1.3.0.
static gammaRDD(sc, shape, scale, size, numPartitions=None, seed=None)

Generates an RDD comprised of i.i.d. samples from the Gamma distribution with the input shape and scale.
Parameters
    sc – SparkContext used to create the RDD.
    shape – Shape (> 0) parameter for the Gamma distribution.
    scale – Scale (> 0) parameter for the Gamma distribution.
    size – Size of the RDD.
    numPartitions – Number of partitions in the RDD (default: sc.defaultParallelism).
    seed – Random seed (default: a random long integer).

Returns
    RDD of float comprised of i.i.d. samples ~ Gamma(shape, scale).

>>> from math import sqrt
>>> shape = 1.0
>>> scale = 2.0
>>> expMean = shape * scale
>>> expStd = sqrt(shape * scale * scale)
>>> x = RandomRDDs.gammaRDD(sc, shape, scale, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - expMean) < 0.5
True
>>> abs(stats.stdev() - expStd) < 0.5
True
New in version 1.3.0.
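The doctest above compares the sample moments against the closed-form Gamma moments, mean = shape * scale and stdev = sqrt(shape) * scale. The same sanity check can be run without a Spark cluster, using Python's standard-library random module as a stand-in for the RDD (a local sketch, not part of the PySpark API):

```python
import random
import statistics
from math import sqrt

# Local stand-in for RandomRDDs.gammaRDD: draw i.i.d. Gamma(shape, scale)
# samples and compare the sample moments to the closed-form values.
random.seed(2)
shape, scale = 1.0, 2.0
expMean = shape * scale       # E[X] = shape * scale
expStd = sqrt(shape) * scale  # Std[X] = sqrt(shape) * scale

samples = [random.gammavariate(shape, scale) for _ in range(20000)]

assert abs(statistics.mean(samples) - expMean) < 0.5
assert abs(statistics.stdev(samples) - expStd) < 0.5
```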
static gammaVectorRDD(sc, shape, scale, numRows, numCols, numPartitions=None, seed=None)

Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Gamma distribution.
Parameters
    sc – SparkContext used to create the RDD.
    shape – Shape (> 0) of the Gamma distribution.
    scale – Scale (> 0) of the Gamma distribution.
    numRows – Number of Vectors in the RDD.
    numCols – Number of elements in each Vector.
    numPartitions – Number of partitions in the RDD (default: sc.defaultParallelism).
    seed – Random seed (default: a random long integer).

Returns
    RDD of Vector with vectors containing i.i.d. samples ~ Gamma(shape, scale).

>>> import numpy as np
>>> from math import sqrt
>>> shape = 1.0
>>> scale = 2.0
>>> expMean = shape * scale
>>> expStd = sqrt(shape * scale * scale)
>>> mat = np.matrix(RandomRDDs.gammaVectorRDD(sc, shape, scale, 100, 100, seed=1).collect())
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - expMean) < 0.1
True
>>> abs(mat.std() - expStd) < 0.1
True
New in version 1.3.0.
static logNormalRDD(sc, mean, std, size, numPartitions=None, seed=None)

Generates an RDD comprised of i.i.d. samples from the log normal distribution with the input mean and standard deviation.
Parameters
    sc – SparkContext used to create the RDD.
    mean – Mean for the log normal distribution.
    std – Standard deviation for the log normal distribution.
    size – Size of the RDD.
    numPartitions – Number of partitions in the RDD (default: sc.defaultParallelism).
    seed – Random seed (default: a random long integer).

Returns
    RDD of float comprised of i.i.d. samples ~ log N(mean, std).

>>> from math import sqrt, exp
>>> mean = 0.0
>>> std = 1.0
>>> expMean = exp(mean + 0.5 * std * std)
>>> expStd = sqrt((exp(std * std) - 1.0) * exp(2.0 * mean + std * std))
>>> x = RandomRDDs.logNormalRDD(sc, mean, std, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - expMean) < 0.5
True
>>> abs(stats.stdev() - expStd) < 0.5
True
New in version 1.3.0.
static logNormalVectorRDD(sc, mean, std, numRows, numCols, numPartitions=None, seed=None)

Generates an RDD comprised of vectors containing i.i.d. samples drawn from the log normal distribution.
Parameters
    sc – SparkContext used to create the RDD.
    mean – Mean of the log normal distribution.
    std – Standard deviation of the log normal distribution.
    numRows – Number of Vectors in the RDD.
    numCols – Number of elements in each Vector.
    numPartitions – Number of partitions in the RDD (default: sc.defaultParallelism).
    seed – Random seed (default: a random long integer).

Returns
    RDD of Vector with vectors containing i.i.d. samples ~ log N(mean, std).

>>> import numpy as np
>>> from math import sqrt, exp
>>> mean = 0.0
>>> std = 1.0
>>> expMean = exp(mean + 0.5 * std * std)
>>> expStd = sqrt((exp(std * std) - 1.0) * exp(2.0 * mean + std * std))
>>> m = RandomRDDs.logNormalVectorRDD(sc, mean, std, 100, 100, seed=1).collect()
>>> mat = np.matrix(m)
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - expMean) < 0.1
True
>>> abs(mat.std() - expStd) < 0.1
True
New in version 1.3.0.
static normalRDD(sc, size, numPartitions=None, seed=None)

Generates an RDD comprised of i.i.d. samples from the standard normal distribution.

To transform the distribution in the generated RDD from standard normal to some other normal N(mean, sigma^2), use

RandomRDDs.normalRDD(sc, n, p, seed).map(lambda v: mean + sigma * v)
Parameters
    sc – SparkContext used to create the RDD.
    size – Size of the RDD.
    numPartitions – Number of partitions in the RDD (default: sc.defaultParallelism).
    seed – Random seed (default: a random long integer).

Returns
    RDD of float comprised of i.i.d. samples ~ N(0.0, 1.0).

>>> x = RandomRDDs.normalRDD(sc, 1000, seed=1)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - 0.0) < 0.1
True
>>> abs(stats.stdev() - 1.0) < 0.1
True
New in version 1.1.0.
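The map-based shift-and-scale transform described above can be illustrated without a Spark cluster. The sketch below applies the same lambda to plain-Python standard normal draws, which stand in locally for the RDD (an illustration of the math, not the PySpark API):

```python
import random
import statistics

# Local stand-in for RandomRDDs.normalRDD(...): i.i.d. N(0, 1) draws.
random.seed(1)
standard = [random.gauss(0.0, 1.0) for _ in range(10000)]

# Same transform as .map(lambda v: mean + sigma * v) in the docs:
# N(0, 1) samples become N(mean, sigma^2) samples.
mean, sigma = 3.0, 2.0
transformed = [mean + sigma * v for v in standard]

assert abs(statistics.mean(transformed) - mean) < 0.1
assert abs(statistics.stdev(transformed) - sigma) < 0.1
```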
static normalVectorRDD(sc, numRows, numCols, numPartitions=None, seed=None)

Generates an RDD comprised of vectors containing i.i.d. samples drawn from the standard normal distribution.
Parameters
    sc – SparkContext used to create the RDD.
    numRows – Number of Vectors in the RDD.
    numCols – Number of elements in each Vector.
    numPartitions – Number of partitions in the RDD (default: sc.defaultParallelism).
    seed – Random seed (default: a random long integer).

Returns
    RDD of Vector with vectors containing i.i.d. samples ~ N(0.0, 1.0).

>>> import numpy as np
>>> mat = np.matrix(RandomRDDs.normalVectorRDD(sc, 100, 100, seed=1).collect())
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - 0.0) < 0.1
True
>>> abs(mat.std() - 1.0) < 0.1
True
New in version 1.1.0.
static poissonRDD(sc, mean, size, numPartitions=None, seed=None)

Generates an RDD comprised of i.i.d. samples from the Poisson distribution with the input mean.
Parameters
    sc – SparkContext used to create the RDD.
    mean – Mean, or lambda, for the Poisson distribution.
    size – Size of the RDD.
    numPartitions – Number of partitions in the RDD (default: sc.defaultParallelism).
    seed – Random seed (default: a random long integer).

Returns
    RDD of float comprised of i.i.d. samples ~ Pois(mean).

>>> mean = 100.0
>>> x = RandomRDDs.poissonRDD(sc, mean, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - mean) < 0.5
True
>>> from math import sqrt
>>> abs(stats.stdev() - sqrt(mean)) < 0.5
True
New in version 1.1.0.
static poissonVectorRDD(sc, mean, numRows, numCols, numPartitions=None, seed=None)

Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Poisson distribution with the input mean.
Parameters
    sc – SparkContext used to create the RDD.
    mean – Mean, or lambda, for the Poisson distribution.
    numRows – Number of Vectors in the RDD.
    numCols – Number of elements in each Vector.
    numPartitions – Number of partitions in the RDD (default: sc.defaultParallelism).
    seed – Random seed (default: a random long integer).

Returns
    RDD of Vector with vectors containing i.i.d. samples ~ Pois(mean).

>>> import numpy as np
>>> mean = 100.0
>>> rdd = RandomRDDs.poissonVectorRDD(sc, mean, 100, 100, seed=1)
>>> mat = np.mat(rdd.collect())
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - mean) < 0.5
True
>>> from math import sqrt
>>> abs(mat.std() - sqrt(mean)) < 0.5
True
New in version 1.1.0.
static uniformRDD(sc, size, numPartitions=None, seed=None)

Generates an RDD comprised of i.i.d. samples from the uniform distribution U(0.0, 1.0).

To transform the distribution in the generated RDD from U(0.0, 1.0) to U(a, b), use

RandomRDDs.uniformRDD(sc, n, p, seed).map(lambda v: a + (b - a) * v)
Parameters
    sc – SparkContext used to create the RDD.
    size – Size of the RDD.
    numPartitions – Number of partitions in the RDD (default: sc.defaultParallelism).
    seed – Random seed (default: a random long integer).

Returns
    RDD of float comprised of i.i.d. samples ~ U(0.0, 1.0).

>>> x = RandomRDDs.uniformRDD(sc, 100).collect()
>>> len(x)
100
>>> max(x) <= 1.0 and min(x) >= 0.0
True
>>> RandomRDDs.uniformRDD(sc, 100, 4).getNumPartitions()
4
>>> parts = RandomRDDs.uniformRDD(sc, 100, seed=4).getNumPartitions()
>>> parts == sc.defaultParallelism
True
New in version 1.1.0.
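The U(0, 1) to U(a, b) transform described above can likewise be illustrated without a Spark cluster, using plain-Python uniform draws as a local stand-in for the RDD (a sketch of the affine map, not the PySpark API):

```python
import random

# Local stand-in for RandomRDDs.uniformRDD(...): i.i.d. U(0.0, 1.0) draws.
random.seed(4)
u = [random.random() for _ in range(10000)]

# Same transform as .map(lambda v: a + (b - a) * v) in the docs:
# U(0, 1) samples become U(a, b) samples.
a, b = -3.0, 3.0
scaled = [a + (b - a) * v for v in u]

print(min(scaled) >= a and max(scaled) <= b)  # True: the affine map preserves the bounds
```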
static uniformVectorRDD(sc, numRows, numCols, numPartitions=None, seed=None)

Generates an RDD comprised of vectors containing i.i.d. samples drawn from the uniform distribution U(0.0, 1.0).
Parameters
    sc – SparkContext used to create the RDD.
    numRows – Number of Vectors in the RDD.
    numCols – Number of elements in each Vector.
    numPartitions – Number of partitions in the RDD (default: sc.defaultParallelism).
    seed – Random seed (default: a random long integer).

Returns
    RDD of Vector with vectors containing i.i.d. samples ~ U(0.0, 1.0).

>>> import numpy as np
>>> mat = np.matrix(RandomRDDs.uniformVectorRDD(sc, 10, 10).collect())
>>> mat.shape
(10, 10)
>>> mat.max() <= 1.0 and mat.min() >= 0.0
True
>>> RandomRDDs.uniformVectorRDD(sc, 10, 10, 4).getNumPartitions()
4
New in version 1.1.0.