GaussianMixture
-
class pyspark.ml.clustering.GaussianMixture(featuresCol='features', predictionCol='prediction', k=2, probabilityCol='probability', tol=0.01, maxIter=100, seed=None, aggregationDepth=2, weightCol=None)
GaussianMixture clustering. This class performs expectation maximization for multivariate Gaussian Mixture Models (GMMs). A GMM represents a composite distribution of independent Gaussian distributions with associated “mixing” weights that specify each Gaussian's contribution to the composite.
Given a set of sample points, this class will maximize the log-likelihood for a mixture of k Gaussians, iterating until the log-likelihood changes by less than the convergence tolerance (tol), or until it has reached the maximum number of iterations (maxIter). While this process is generally guaranteed to converge, it is not guaranteed to find a global optimum.
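For reference, the composite distribution and the log-likelihood mentioned above can be written in the standard textbook form below. The symbols w_j, \mu_j, \Sigma_j (mixing weight, mean, and covariance of the j-th Gaussian) and x_1, ..., x_n (the sample points) are notation introduced here, not identifiers from the API.

```latex
p(x) = \sum_{j=1}^{k} w_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j),
\qquad w_j \ge 0, \quad \sum_{j=1}^{k} w_j = 1

\log L = \sum_{i=1}^{n} \log \sum_{j=1}^{k} w_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)
```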
Note
For high-dimensional data (with many features), this algorithm may perform poorly. This is because high-dimensional data (a) is difficult to cluster at all (based on statistical/theoretical arguments) and (b) causes numerical issues with Gaussian distributions.
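The iteration described above (alternate between estimating memberships and re-estimating the Gaussians, and stop once the log-likelihood changes by less than a tolerance or the iteration cap is hit) can be sketched in plain Python for the one-dimensional case. This is an illustrative sketch only, not Spark's implementation: the function names, the fixed starting means, the unit starting variances, and the `min_var` floor are all assumptions made for the example.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(xs, weights, means, variances):
    """Log-likelihood of the data under the mixture."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in xs
    )


def fit_gmm_1d(xs, means, tol=0.01, max_iter=100, min_var=1e-6):
    """Hypothetical helper: EM for a 1-D mixture with len(means) components."""
    k = len(means)
    means = list(means)
    weights = [1.0 / k] * k   # assumed start: equal mixing weights
    variances = [1.0] * k     # assumed start: unit variances
    history = [log_likelihood(xs, weights, means, variances)]
    for _ in range(max_iter):
        # E-step: membership probability of each point in each component.
        resp = []
        for x in xs:
            joint = [w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate weight, mean, and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, min_var)  # floor guards against collapse
        history.append(log_likelihood(xs, weights, means, variances))
        # Stop when the log-likelihood changes by less than the tolerance.
        if abs(history[-1] - history[-2]) < tol:
            break
    return weights, means, variances, history


if __name__ == "__main__":
    points = [-1.1, -1.0, -0.9, 0.9, 1.0, 1.1]
    w, m, v, hist = fit_gmm_1d(points, means=[-0.5, 0.5], tol=1e-6)
    print("weights:", w)
    print("means:", m)
    print("iterations:", len(hist) - 1)
```

The returned `history` list makes the convergence behaviour visible: each entry should be no smaller than the previous one, but a different choice of starting means can end at a different (local) optimum.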
>>> from pyspark.ml.linalg import Vectors
>>> data = [(Vectors.dense([-0.1, -0.05]),),
...         (Vectors.dense([-0.01, -0.1]),),
...         (Vectors.dense([0.9, 0.8]),),
...         (Vectors.dense([0.75, 0.935]),),
...         (Vectors.dense([-0.83, -0.68]),),
...         (Vectors.dense([-0.91, -0.76]),)]
>>> df = spark.createDataFrame(data, ["features"])
>>> gm = GaussianMixture(k=3, tol=0.0001, seed=10)
>>> gm.getMaxIter()
100
>>> gm.setMaxIter(10)
GaussianMixture...
>>> gm.getMaxIter()
10
>>> model = gm.fit(df)
>>> model.getAggregationDepth()
2
>>> model.getFeaturesCol()
'features'
>>> model.setPredictionCol("newPrediction")
GaussianMixtureModel...
>>> model.predict(df.head().features)
2
>>> model.predictProbability(df.head().features)
DenseVector([0.0, 0.4736, 0.5264])
>>> model.hasSummary
True
>>> summary = model.summary
>>> summary.k
3
>>> summary.clusterSizes
[2, 2, 2]
>>> summary.logLikelihood
8.14636...
>>> weights = model.weights
>>> len(weights)
3
>>> gaussians = model.gaussians
>>> len(gaussians)
3
>>> gaussians[0].mean
DenseVector([0.825, 0.8675])
>>> gaussians[0].cov.toArray()
array([[ 0.005625 , -0.0050625 ],
       [-0.0050625 , 0.00455625]])
>>> gaussians[1].mean
DenseVector([-0.4777, -0.4096])
>>> gaussians[1].cov.toArray()
array([[ 0.1679695 , 0.13181786],
       [ 0.13181786, 0.10524592]])
>>> gaussians[2].mean
DenseVector([-0.4473, -0.3853])
>>> gaussians[2].cov.toArray()
array([[ 0.16730412, 0.13112435],
       [ 0.13112435, 0.10469614]])
>>> model.gaussiansDF.select("mean").head()
Row(mean=DenseVector([0.825, 0.8675]))
>>> model.gaussiansDF.select("cov").head()
Row(cov=DenseMatrix(2, 2, [0.0056, -0.0051, -0.0051, 0.0046], False))
>>> transformed = model.transform(df).select("features", "newPrediction")
>>> rows = transformed.collect()
>>> rows[4].newPrediction == rows[5].newPrediction
True
>>> rows[2].newPrediction == rows[3].newPrediction
True
>>> gmm_path = temp_path + "/gmm"
>>> gm.save(gmm_path)
>>> gm2 = GaussianMixture.load(gmm_path)
>>> gm2.getK()
3
>>> model_path = temp_path + "/gmm_model"
>>> model.save(model_path)
>>> model2 = GaussianMixtureModel.load(model_path)
>>> model2.hasSummary
False
>>> model2.weights == model.weights
True
>>> model2.gaussians[0].mean == model.gaussians[0].mean
True
>>> model2.gaussians[0].cov == model.gaussians[0].cov
True
>>> model2.gaussians[1].mean == model.gaussians[1].mean
True
>>> model2.gaussians[1].cov == model.gaussians[1].cov
True
>>> model2.gaussians[2].mean == model.gaussians[2].mean
True
>>> model2.gaussians[2].cov == model.gaussians[2].cov
True
>>> model2.gaussiansDF.select("mean").head()
Row(mean=DenseVector([0.825, 0.8675]))
>>> model2.gaussiansDF.select("cov").head()
Row(cov=DenseMatrix(2, 2, [0.0056, -0.0051, -0.0051, 0.0046], False))
>>> gm2.setWeightCol("weight")
GaussianMixture...
New in version 2.0.0.
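In the example above, predict returned 2 while predictProbability returned DenseVector([0.0, 0.4736, 0.5264]), whose largest entry sits at index 2. A minimal sketch of that relationship follows, assuming the hard assignment is simply the index of the largest membership probability. The helper name `hard_assignment` is hypothetical, and a plain list stands in for DenseVector.

```python
def hard_assignment(probabilities):
    """Return the index of the largest membership probability."""
    return max(range(len(probabilities)), key=lambda j: probabilities[j])


# Membership probabilities copied from the doctest output above.
probabilities = [0.0, 0.4736, 0.5264]
cluster = hard_assignment(probabilities)
print(cluster)
```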
Methods Documentation
-
clear(param)
Clears a param from the param map if it has been explicitly set.
-
copy(extra=None)
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
- Parameters
extra – Extra parameters to copy to the new instance
- Returns
Copy of this instance
-
explainParam(param)
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams()
Returns the documentation of all params with their optional default values and user-supplied values.
-
extractParamMap(extra=None)
Extracts the embedded default param values and user-supplied values, then merges them with the extra values from input into a flat param map. If there are conflicts, the later value wins, i.e., with ordering: default param values < user-supplied values < extra.
- Parameters
extra – extra param values
- Returns
merged param map
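The precedence described for extractParamMap (default param values < user-supplied values < extra) can be illustrated with plain dictionaries. This is a sketch of the ordering only: the helper `merge_param_maps` is hypothetical, and string keys stand in for the Param objects that a real param map is keyed by.

```python
def merge_param_maps(defaults, user_supplied, extra=None):
    """Hypothetical helper: later sources override earlier ones."""
    merged = dict(defaults)        # lowest precedence
    merged.update(user_supplied)   # overrides defaults
    merged.update(extra or {})     # highest precedence
    return merged


defaults = {"k": 2, "tol": 0.01, "maxIter": 100}
user_supplied = {"k": 3, "tol": 0.0001}
extra = {"tol": 0.001}

merged = merge_param_maps(defaults, user_supplied, extra)
print(merged)
```

Here `tol` appears in all three sources, so the value from `extra` is the one that survives; `k` comes from the user-supplied map and `maxIter` falls back to its default.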
-
fit(dataset, params=None)
Fits a model to the input dataset with optional parameters.
- Parameters
dataset – input dataset, which is an instance of pyspark.sql.DataFrame
params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
- Returns
fitted model(s)
New in version 1.3.0.
-
fitMultiple(dataset, paramMaps)
Fits a model to the input dataset for each param map in paramMaps.
- Parameters
dataset – input dataset, which is an instance of pyspark.sql.DataFrame.
paramMaps – A Sequence of param maps.
- Returns
A thread-safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model), where model was fit using paramMaps[index]. The index values may not arrive in sequential order.
New in version 2.3.0.
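Because the (index, model) pairs may arrive out of order, a caller that wants models aligned with paramMaps has to place each model by its index rather than append it. The sketch below shows that pattern in plain Python. `fake_fit_multiple` is a hypothetical stand-in for the iterator (it yields strings instead of fitted models, deliberately in reverse order), and `collect_models` is a hypothetical helper.

```python
def fake_fit_multiple(param_maps):
    """Hypothetical stand-in: yields (index, model) pairs in reverse order."""
    for index in reversed(range(len(param_maps))):
        yield index, "model fitted with k={}".format(param_maps[index]["k"])


def collect_models(model_iterator, n):
    """Place each model at its index so the result lines up with the param maps."""
    models = [None] * n
    for index, model in model_iterator:
        models[index] = model
    return models


param_maps = [{"k": 2}, {"k": 3}, {"k": 4}]
models = collect_models(fake_fit_multiple(param_maps), len(param_maps))
print(models)
```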
-
getAggregationDepth()
Gets the value of aggregationDepth or its default value.
-
getFeaturesCol()
Gets the value of featuresCol or its default value.
-
getK()
Gets the value of k or its default value.
New in version 2.0.0.
-
getMaxIter()
Gets the value of maxIter or its default value.
-
getOrDefault(param)
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam(paramName)
Gets a param by its name.
-
getPredictionCol()
Gets the value of predictionCol or its default value.
-
getProbabilityCol()
Gets the value of probabilityCol or its default value.
-
getSeed()
Gets the value of seed or its default value.
-
getTol()
Gets the value of tol or its default value.
-
getWeightCol()
Gets the value of weightCol or its default value.
-
hasDefault(param)
Checks whether a param has a default value.
-
hasParam(paramName)
Tests whether this instance contains a param with a given (string) name.
-
isDefined(param)
Checks whether a param is explicitly set by user or has a default value.
-
isSet(param)
Checks whether a param is explicitly set by user.
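The difference between hasDefault, isSet, and isDefined is easy to mix up, so here is a toy model of the semantics described above. `TinyParams` is a hypothetical class written for this illustration (it is not part of pyspark), it uses string names where the real API takes Param objects, and leaving weightCol out of the toy defaults is an assumption made for the example. Raising KeyError when nothing is defined is likewise a choice of this sketch.

```python
class TinyParams:
    """Hypothetical toy model of default vs. user-supplied param values."""

    def __init__(self, defaults):
        self._defaults = dict(defaults)  # default param values
        self._user = {}                  # values explicitly set by the user

    def set(self, name, value):
        self._user[name] = value

    def hasDefault(self, name):
        return name in self._defaults

    def isSet(self, name):
        return name in self._user

    def isDefined(self, name):
        # Defined means: explicitly set OR has a default value.
        return self.isSet(name) or self.hasDefault(name)

    def getOrDefault(self, name):
        if name in self._user:
            return self._user[name]
        if name in self._defaults:
            return self._defaults[name]
        raise KeyError(name)  # neither set nor defaulted


params = TinyParams({"k": 2, "maxIter": 100})
params.set("k", 3)

print(params.isSet("k"), params.isDefined("k"))              # user-supplied
print(params.isSet("maxIter"), params.isDefined("maxIter"))  # default only
print(params.isDefined("weightCol"))                         # neither
```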
-
classmethod load(path)
Reads an ML instance from the input path, a shortcut of read().load(path).
-
classmethod read()
Returns an MLReader instance for this class.
-
save(path)
Saves this ML instance to the given path, a shortcut of write().save(path).
-
set(param, value)
Sets a parameter in the embedded param map.
-
setAggregationDepth(value)
Sets the value of aggregationDepth.
New in version 3.0.0.
-
setFeaturesCol(value)
Sets the value of featuresCol.
New in version 2.0.0.
-
setParams(self, featuresCol='features', predictionCol='prediction', k=2, probabilityCol='probability', tol=0.01, maxIter=100, seed=None, aggregationDepth=2, weightCol=None)
Sets params for GaussianMixture.
New in version 2.0.0.
-
setPredictionCol(value)
Sets the value of predictionCol.
New in version 2.0.0.
-
setProbabilityCol(value)
Sets the value of probabilityCol.
New in version 2.0.0.
-
write()
Returns an MLWriter instance for this ML instance.
Attributes Documentation
-
aggregationDepth = Param(parent='undefined', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2).')
-
featuresCol = Param(parent='undefined', name='featuresCol', doc='features column name.')
-
k = Param(parent='undefined', name='k', doc='Number of independent Gaussians in the mixture model. Must be > 1.')
-
maxIter = Param(parent='undefined', name='maxIter', doc='max number of iterations (>= 0).')
-
params
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
-
predictionCol = Param(parent='undefined', name='predictionCol', doc='prediction column name.')
-
probabilityCol = Param(parent='undefined', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.')
-
seed = Param(parent='undefined', name='seed', doc='random seed.')
-
tol = Param(parent='undefined', name='tol', doc='the convergence tolerance for iterative algorithms (>= 0).')
-
weightCol = Param(parent='undefined', name='weightCol', doc='weight column name. If this is not set or empty, we treat all instance weights as 1.0.')