KMeans

class pyspark.mllib.clustering.KMeans

New in version 0.9.0.

Methods

Methods Documentation

classmethod train(rdd, k, maxIterations=100, initializationMode='k-means||', seed=None, initializationSteps=2, epsilon=0.0001, initialModel=None)

Train a k-means clustering model.

Parameters
  • rdd – Training points as an RDD of Vector or convertible sequence types.

  • k – Number of clusters to create.

  • maxIterations – Maximum number of iterations allowed. (default: 100)

  • initializationMode – The initialization algorithm. This can be either “random” or “k-means||”. (default: “k-means||”)

  • seed – Random seed value for cluster initialization. Set to None to generate a seed based on system time. (default: None)

  • initializationSteps – Number of steps for the k-means|| initialization mode. This is an advanced setting – the default of 2 is almost always enough. (default: 2)

  • epsilon – Distance threshold within which a center will be considered to have converged. If all centers move less than this Euclidean distance, iterations are stopped. (default: 1e-4)

  • initialModel – Initial cluster centers, provided as a KMeansModel object. When given, these centers are used instead of the “random” or “k-means||” initializationMode. (default: None)

New in version 0.9.0.
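The interaction between maxIterations and epsilon can be sketched in plain Python. This is not Spark's implementation, only a small single-machine illustration of the stopping rule described for epsilon; the function name lloyd and its arguments are hypothetical, and the explicit starting centers play a role similar to initialModel.

```python
import math


def lloyd(points, centers, max_iterations=100, epsilon=1e-4):
    """Plain Lloyd iterations; returns (centers, iterations_run)."""
    for iteration in range(1, max_iterations + 1):
        # Assign each point to its nearest center (Euclidean distance).
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)

        # Recompute each center as the mean of its group;
        # a center with no assigned points is left where it was.
        new_centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else old
            for g, old in zip(groups, centers)
        ]

        # Stop once every center moved less than epsilon.
        moved = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if moved < epsilon:
            return centers, iteration

    # The iteration budget ran out before the centers settled.
    return centers, max_iterations


points = [(0.0, 0.0), (1.0, 1.0), (9.0, 8.0), (8.0, 9.0)]
centers, iterations = lloyd(points, centers=[(0.0, 0.0), (1.0, 1.0)])
```

In this sketch the loop ends for one of two reasons: the largest center movement fell below epsilon, or max_iterations was reached. A smaller epsilon therefore tends to cost more iterations, while maxIterations acts as the upper bound.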