StreamingKMeansModel¶
-
class
pyspark.mllib.clustering.
StreamingKMeansModel
(clusterCenters, clusterWeights)[source]¶ Clustering model which can perform an online update of the centroids.
The update formula for each centroid is given by
c_t+1 = ((c_t * n_t * a) + (x_t * m_t)) / (n_t + m_t)
n_t+1 = n_t * a + m_t
where
c_t: Centroid at the n_th iteration.
- n_t: Number of samples (or) weights associated with the centroid
at the n_th iteration.
x_t: Centroid of the new data closest to c_t.
m_t: Number of samples (or) weights of the new data closest to c_t
c_t+1: New centroid.
n_t+1: New number of weights.
a: Decay Factor, which gives the forgetfulness.
Note
If a is set to 1, it is the weighted mean of the previous and new data. If it set to zero, the old centroids are completely forgotten.
- Parameters
clusterCenters – Initial cluster centers.
clusterWeights – List of weights assigned to each cluster.
>>> initCenters = [[0.0, 0.0], [1.0, 1.0]] >>> initWeights = [1.0, 1.0] >>> stkm = StreamingKMeansModel(initCenters, initWeights) >>> data = sc.parallelize([[-0.1, -0.1], [0.1, 0.1], ... [0.9, 0.9], [1.1, 1.1]]) >>> stkm = stkm.update(data, 1.0, u"batches") >>> stkm.centers array([[ 0., 0.], [ 1., 1.]]) >>> stkm.predict([-0.1, -0.1]) 0 >>> stkm.predict([0.9, 0.9]) 1 >>> stkm.clusterWeights [3.0, 3.0] >>> decayFactor = 0.0 >>> data = sc.parallelize([DenseVector([1.5, 1.5]), DenseVector([0.2, 0.2])]) >>> stkm = stkm.update(data, 0.0, u"batches") >>> stkm.centers array([[ 0.2, 0.2], [ 1.5, 1.5]]) >>> stkm.clusterWeights [1.0, 1.0] >>> stkm.predict([0.2, 0.2]) 0 >>> stkm.predict([1.5, 1.5]) 1
New in version 1.5.0.
Methods
Attributes
Methods Documentation
-
computeCost
(rdd)¶ Return the K-means cost (sum of squared distances of points to their nearest center) for this model on the given data.
- Parameters
rdd – The RDD of points to compute the cost on.
New in version 1.4.0.
-
classmethod
load
(sc, path)¶ Load a model from the given path.
New in version 1.4.0.
-
predict
(x)¶ Find the cluster that each of the points belongs to in this model.
- Parameters
x – A data point (or RDD of points) to determine cluster index.
- Returns
Predicted cluster index or an RDD of predicted cluster indices if the input is an RDD.
New in version 0.9.0.
-
save
(sc, path)¶ Save this model to the given path.
New in version 1.4.0.
-
update
(data, decayFactor, timeUnit)[source]¶ Update the centroids, according to data
- Parameters
data – RDD with new data for the model update.
decayFactor – Forgetfulness of the previous centroids.
timeUnit – Can be “batches” or “points”. If points, then the decay factor is raised to the power of number of new points and if batches, then decay factor will be used as is.
New in version 1.5.0.
Attributes Documentation
-
clusterCenters
¶ Get the cluster centers, represented as a list of NumPy arrays.
New in version 1.0.0.
-
clusterWeights
¶ Return the cluster weights.
New in version 1.5.0.
-
k
¶ Total number of clusters.
New in version 1.4.0.