PowerIterationClustering

class pyspark.mllib.clustering.PowerIterationClustering[source]

Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by Lin and Cohen (http://www.cs.cmu.edu/~frank/papers/icml2010-pic-final.pdf). From the abstract: PIC finds a very low-dimensional embedding of a dataset using truncated power iteration on a normalized pair-wise similarity matrix of the data.
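To make the phrase "truncated power iteration on a normalized pair-wise similarity matrix" concrete, here is a minimal pure-Python sketch of the idea. It is an illustration under stated assumptions, not Spark's implementation: the function name `truncated_power_iteration`, the dense list-of-lists input, the degree-based start vector, and the fixed iteration count are all choices made for this example.

```python
def truncated_power_iteration(affinity, num_iterations=5):
    """Return a one-dimensional PIC-style embedding of a dense affinity matrix.

    Illustrative sketch only; `affinity` is a symmetric list of lists with
    nonnegative entries and at least one nonzero entry per row.
    """
    n = len(affinity)
    # Row sums (degrees) of the affinity matrix A.
    degrees = [sum(row) for row in affinity]
    total = sum(degrees)
    # Start vector: each vertex's share of the total similarity.
    v = [d / total for d in degrees]
    for _ in range(num_iterations):
        # Multiply by the row-normalized matrix W = D^-1 A.
        v = [
            sum(affinity[i][j] * v[j] for j in range(n)) / degrees[i]
            for i in range(n)
        ]
        # Rescale so the absolute values of the entries sum to 1.
        norm = sum(abs(x) for x in v)
        v = [x / norm for x in v]
    return v


def similarity(i, j):
    """Toy similarity: vertices 0-2 form one group, vertices 3-5 another."""
    if i == j:
        return 0.0
    if (i < 3) != (j < 3):
        return 0.01  # weak link between the two groups
    return 1.0 if i < 3 else 2.0


affinity = [[similarity(i, j) for j in range(6)] for i in range(6)]
embedding = truncated_power_iteration(affinity)
print(embedding)
```

Stopping after a few iterations matters: vertices in the same group end up with nearly identical values while the two groups stay apart, so a simple one-dimensional clustering of `embedding` recovers them. Running the iteration to convergence would instead drive every entry toward the same value.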

New in version 1.5.0.

Methods Documentation

classmethod train(rdd, k, maxIterations=100, initMode='random')[source]
Parameters
  • rdd – An RDD of (i, j, sij) tuples representing the affinity matrix, which is the matrix A in the PIC paper. The similarity sij must be nonnegative. The matrix is symmetric, so sij = sji. For any (i, j) with nonzero similarity, the input should contain either (i, j, sij) or (j, i, sji). Tuples with i = j are ignored, because sij is assumed to be 0.0 when i = j.

  • k – Number of clusters.

  • maxIterations – Maximum number of iterations of the PIC algorithm. (default: 100)

  • initMode – Initialization mode. This can be either “random” to use a random vector as vertex properties, or “degree” to use the normalized sum of similarities. (default: “random”)

New in version 1.5.0.
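
The following sketch shows one way to prepare input in the (i, j, sij) format described above and pass it to train. The helpers `upper_triangle_tuples` and `cluster_with_pic`, the toy matrix, and the chosen `maxIterations` value are assumptions made for illustration; the Spark call itself requires an existing SparkContext, which is not created here.

```python
def upper_triangle_tuples(affinity):
    """Hypothetical helper: emit (i, j, sij) once per pair with i < j and sij > 0.

    Diagonal entries are skipped, and only one orientation of each
    symmetric pair is emitted.
    """
    n = len(affinity)
    return [
        (i, j, float(affinity[i][j]))
        for i in range(n)
        for j in range(i + 1, n)
        if affinity[i][j] > 0
    ]


def cluster_with_pic(sc, similarities, k):
    """Hypothetical wrapper: run PIC on a list of (i, j, sij) tuples.

    `sc` is assumed to be an existing SparkContext.
    """
    # Imported lazily so the helper above works without a Spark installation.
    from pyspark.mllib.clustering import PowerIterationClustering

    rdd = sc.parallelize(similarities)
    model = PowerIterationClustering.train(
        rdd, k, maxIterations=20, initMode="degree"
    )
    # Each assignment carries a vertex id and the cluster it was placed in.
    return {a.id: a.cluster for a in model.assignments().collect()}


# Toy symmetric affinity matrix: strong links 0-1 and 2-3, a weak link 1-2.
affinity = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
similarities = upper_triangle_tuples(affinity)
print(similarities)

# With a running SparkContext named `sc`:
# clusters = cluster_with_pic(sc, similarities, k=2)
```

Emitting only the upper triangle keeps the RDD half the size of the full matrix while still satisfying the input requirement for rdd.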