StandardScaler

class pyspark.mllib.feature.StandardScaler(withMean=False, withStd=True)

Standardizes features by removing the mean and scaling to unit variance; the per-column means and standard deviations are computed from the samples in the training set.

Parameters
  • withMean – False by default. If True, centers the data by subtracting the column means before scaling. Centering produces a dense output, so take care when applying it to sparse input (see the sketch after this list).

  • withStd – True by default. If True, scales each column to unit standard deviation.
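
To make the density caveat concrete, here is a minimal sketch (assuming a running SparkContext bound to sc, as in the example below; the toy vectors are made up): with the default withMean=False only a per-column scaling is applied and sparsity is preserved, while withMean=True shifts the zero entries and therefore produces dense vectors.

>>> from pyspark.mllib.linalg import Vectors
>>> sparse = sc.parallelize([Vectors.sparse(3, {0: -2.0}),
...                          Vectors.sparse(3, {0: 3.8, 2: 1.9})])
>>> # Default withMean=False: scaling only, zero entries stay zero
>>> type(StandardScaler().fit(sparse).transform(sparse).first())
<class 'pyspark.mllib.linalg.SparseVector'>
>>> # Centering shifts the zero entries, so the output must be dense
>>> type(StandardScaler(withMean=True).fit(sparse).transform(sparse).first())
<class 'pyspark.mllib.linalg.DenseVector'>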

>>> from pyspark.mllib.linalg import Vectors
>>> vs = [Vectors.dense([-2.0, 2.3, 0]), Vectors.dense([3.8, 0.0, 1.9])]
>>> dataset = sc.parallelize(vs)
>>> standardizer = StandardScaler(withMean=True, withStd=True)
>>> model = standardizer.fit(dataset)
>>> result = model.transform(dataset)
>>> for r in result.collect(): r
DenseVector([-0.7071, 0.7071, -0.7071])
DenseVector([0.7071, -0.7071, 0.7071])
>>> int(model.std[0])
4
>>> int(model.mean[0]*10)
9
>>> model.withStd
True
>>> model.withMean
True
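
The transformation applied above is, per column, (x - mean) / std, where std is the unbiased sample standard deviation. This can be checked independently with NumPy (NumPy is used here only for illustration, it is not part of the API):

>>> import numpy as np
>>> X = np.array([[-2.0, 2.3, 0.0], [3.8, 0.0, 1.9]])
>>> np.round((X - X.mean(axis=0)) / X.std(axis=0, ddof=1), 4)
array([[-0.7071,  0.7071, -0.7071],
       [ 0.7071, -0.7071,  0.7071]])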

New in version 1.2.0.

Methods

  • fit(dataset) – Computes the mean and variance and stores them as a model to be used for later scaling.

Methods Documentation

fit(dataset)

Computes the column means and variances and stores them as a model to be used for later scaling.

Parameters

dataset – The data used to compute the mean and variance that define the transformation model.

Returns

a StandardScalerModel

New in version 1.2.0.
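
The returned model can then standardize new data with the statistics learned from the training set; a minimal sketch (assuming sc and the toy data from the example above; the probe vector is illustrative):

>>> from pyspark.mllib.linalg import Vectors
>>> train = sc.parallelize([Vectors.dense([-2.0, 2.3, 0]),
...                         Vectors.dense([3.8, 0.0, 1.9])])
>>> model = StandardScaler(withMean=True, withStd=True).fit(train)
>>> # transform also accepts a single vector, scaled with the statistics
>>> # learned from train; transforming the training mean yields zeros.
>>> model.transform(Vectors.dense([0.9, 1.15, 0.95]))
DenseVector([0.0, 0.0, 0.0])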