LDAModel¶
-
class pyspark.ml.clustering.LDAModel
(java_model=None)[source]¶ Latent Dirichlet Allocation (LDA) model. This abstraction permits different underlying representations, including local and distributed data structures.
New in version 2.0.0.
Methods Documentation
-
clear
(param)¶ Clears a param from the param map if it has been explicitly set.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with the extra params, so both the Python wrapper and the Java pipeline component get copied.
- Parameters
extra – Extra parameters to copy to the new instance
- Returns
Copy of this instance
-
describeTopics
(maxTermsPerTopic=10)[source]¶ Return the topics described by their top-weighted terms.
New in version 2.0.0.
-
estimatedDocConcentration
()[source]¶ Value for LDA.docConcentration estimated from data. If Online LDA was used and LDA.optimizeDocConcentration was set to false, then this returns the fixed (given) value for the LDA.docConcentration parameter.
New in version 2.0.0.
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optional default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- Parameters
extra – extra param values
- Returns
merged param map
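The merge ordering described for extractParamMap can be sketched with plain dictionaries. This is only an illustration of the precedence rule (default < user-supplied < extra); the helper name `extract_param_map` and all values are hypothetical, and it is not PySpark's implementation.

```python
# Plain-dict stand-in for the param map merge; not PySpark's implementation.
# All names and values below are hypothetical.
defaults = {"maxIter": 20, "seed": 42, "topicDistributionCol": "topicDistribution"}
user_supplied = {"maxIter": 50}
extra = {"maxIter": 100, "seed": 7}


def extract_param_map(defaults, user_supplied, extra=None):
    """Merge with ordering: defaults < user-supplied < extra."""
    merged = dict(defaults)        # start from the defaults
    merged.update(user_supplied)   # user-supplied values override defaults
    merged.update(extra or {})     # extra values override everything else
    return merged


merged = extract_param_map(defaults, user_supplied, extra)
print(merged)
```

When `extra` is omitted, the user-supplied value wins over the default, which matches the ordering stated above.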
-
getCheckpointInterval
()¶ Gets the value of checkpointInterval or its default value.
-
getDocConcentration
()¶ Gets the value of docConcentration or its default value.
New in version 2.0.0.
-
getFeaturesCol
()¶ Gets the value of featuresCol or its default value.
-
getKeepLastCheckpoint
()¶ Gets the value of keepLastCheckpoint or its default value.
New in version 2.0.0.
-
getLearningDecay
()¶ Gets the value of learningDecay or its default value.
New in version 2.0.0.
-
getLearningOffset
()¶ Gets the value of learningOffset or its default value.
New in version 2.0.0.
-
getMaxIter
()¶ Gets the value of maxIter or its default value.
-
getOptimizeDocConcentration
()¶ Gets the value of optimizeDocConcentration or its default value.
New in version 2.0.0.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam
(paramName)¶ Gets a param by its name.
-
getSeed
()¶ Gets the value of seed or its default value.
-
getSubsamplingRate
()¶ Gets the value of subsamplingRate or its default value.
New in version 2.0.0.
-
getTopicConcentration
()¶ Gets the value of topicConcentration or its default value.
New in version 2.0.0.
-
getTopicDistributionCol
()¶ Gets the value of topicDistributionCol or its default value.
New in version 2.0.0.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isDistributed
()[source]¶ Indicates whether this instance is of type DistributedLDAModel.
New in version 2.0.0.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
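The difference between hasDefault, isSet, isDefined, getOrDefault, and clear can be illustrated with a toy class. `ParamState` and its snake_case methods are hypothetical stand-ins that mirror the documented behaviour; they are not PySpark's Params class, and the error type raised here is a choice made for this sketch.

```python
class ParamState:
    """Toy stand-in for a params holder; not PySpark's Params class."""

    def __init__(self, defaults):
        self._defaults = dict(defaults)  # default values
        self._set = {}                   # values explicitly set by the user

    def set(self, name, value):
        self._set[name] = value

    def clear(self, name):
        # Removes only an explicitly set value; the default is untouched.
        self._set.pop(name, None)

    def has_default(self, name):
        return name in self._defaults

    def is_set(self, name):
        return name in self._set

    def is_defined(self, name):
        # Defined means: explicitly set OR has a default.
        return self.is_set(name) or self.has_default(name)

    def get_or_default(self, name):
        if name in self._set:
            return self._set[name]
        if name in self._defaults:
            return self._defaults[name]
        raise KeyError(f"param {name!r} is neither set nor has a default")


state = ParamState(defaults={"maxIter": 20})
state.set("seed", 7)
print(state.is_set("maxIter"), state.is_defined("maxIter"), state.get_or_default("seed"))
```

In this sketch a param with only a default is defined but not set, and clearing a param falls back to its default if one exists.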
-
logLikelihood
(dataset)[source]¶ Calculates a lower bound on the log likelihood of the entire corpus. See Equation (16) in the Online LDA paper (Hoffman et al., 2010).
WARNING: If this model is an instance of DistributedLDAModel (produced when optimizer is set to “em”), this involves collecting a large topicsMatrix() to the driver. This implementation may be changed in the future.
New in version 2.0.0.
-
logPerplexity
(dataset)[source]¶ Calculates an upper bound on perplexity. (Lower is better.) See Equation (16) in the Online LDA paper (Hoffman et al., 2010).
WARNING: If this model is an instance of DistributedLDAModel (produced when optimizer is set to “em”), this involves collecting a large topicsMatrix() to the driver. This implementation may be changed in the future.
New in version 2.0.0.
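One way to see why a lower bound on log likelihood yields an upper bound on perplexity: assuming log perplexity is the negated log likelihood bound divided by the number of tokens in the corpus (which is how Spark's Scala implementation appears to compute it), a larger bound gives a smaller perplexity. The function name and the numbers below are hypothetical.

```python
import math


def log_perplexity(log_likelihood_bound, token_count):
    """Per-token negated log likelihood bound (assumed relation, see lead-in)."""
    if token_count <= 0:
        raise ValueError("token_count must be positive")
    return -log_likelihood_bound / token_count


# Hypothetical values: a bound of -1200.0 over a corpus of 400 tokens.
bound = -1200.0
tokens = 400
lp = log_perplexity(bound, tokens)
perplexity = math.exp(lp)
print(lp, perplexity)
```

Because the true log likelihood is at least as large as the bound, the true per-token value can only be lower than `lp`, hence "upper bound".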
-
set
(param, value)¶ Sets a parameter in the embedded param map.
-
setFeaturesCol
(value)[source]¶ Sets the value of featuresCol.
New in version 3.0.0.
-
setTopicDistributionCol
(value)[source]¶ Sets the value of topicDistributionCol.
New in version 3.0.0.
-
topicsMatrix
()[source]¶ Inferred topics, where each topic is represented by a distribution over terms. This is a matrix of size vocabSize x k, where each column is a topic. No guarantees are given about the ordering of the topics.
WARNING: If this model is actually a DistributedLDAModel instance produced by the Expectation-Maximization (“em”) optimizer, then this method could involve collecting a large amount of data to the driver (on the order of vocabSize x k).
New in version 2.0.0.
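The vocabSize x k layout, with one topic per column, can be made concrete with a small nested list. The sketch below ranks the terms of each column by weight, which is the idea behind describeTopics; the matrix values and the helper `describe_topics` are hypothetical, and the real method returns a DataFrame rather than a list of tuples.

```python
# Hypothetical matrix: 4 terms (rows) x 2 topics (columns).
topics_matrix = [
    [0.50, 0.05],
    [0.30, 0.10],
    [0.15, 0.25],
    [0.05, 0.60],
]


def describe_topics(matrix, max_terms_per_topic=10):
    """Return (topic, term indices, term weights) for each column's top terms."""
    vocab_size = len(matrix)
    k = len(matrix[0])
    described = []
    for topic in range(k):
        # A topic is a column: one weight per vocabulary term.
        column = [matrix[term][topic] for term in range(vocab_size)]
        ranked = sorted(range(vocab_size), key=lambda t: column[t], reverse=True)
        top = ranked[:max_terms_per_topic]
        described.append((topic, top, [column[t] for t in top]))
    return described


top_terms = describe_topics(topics_matrix, max_terms_per_topic=2)
print(top_terms)
```

Since no ordering of topics is guaranteed, the column index identifies a topic only within a single fitted model.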
-
transform
(dataset, params=None)¶ Transforms the input dataset with optional parameters.
- Parameters
dataset – input dataset, which is an instance of pyspark.sql.DataFrame
params – an optional param map that overrides embedded params.
- Returns
transformed dataset
New in version 1.3.0.
-
vocabSize
()[source]¶ Vocabulary size (number of terms or words in the vocabulary).
New in version 2.0.0.
Attributes Documentation
-
checkpointInterval
= Param(parent='undefined', name='checkpointInterval', doc='set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext.')¶
-
docConcentration
= Param(parent='undefined', name='docConcentration', doc='Concentration parameter (commonly named "alpha") for the prior placed on documents\' distributions over topics ("theta").')¶
-
featuresCol
= Param(parent='undefined', name='featuresCol', doc='features column name.')¶
-
k
= Param(parent='undefined', name='k', doc='The number of topics (clusters) to infer. Must be > 1.')¶
-
keepLastCheckpoint
= Param(parent='undefined', name='keepLastCheckpoint', doc='(For EM optimizer) If using checkpointing, this indicates whether to keep the last checkpoint. If false, then the checkpoint will be deleted. Deleting the checkpoint can cause failures if a data partition is lost, so set this bit with care.')¶
-
learningDecay
= Param(parent='undefined', name='learningDecay', doc='Learning rate, set as an exponential decay rate. This should be between (0.5, 1.0] to guarantee asymptotic convergence.')¶
-
learningOffset
= Param(parent='undefined', name='learningOffset', doc='A (positive) learning parameter that downweights early iterations. Larger values make early iterations count less')¶
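How learningDecay and learningOffset interact can be sketched with the step-size schedule from the Online LDA paper (Hoffman et al., 2010), where the weight at iteration t is (offset + t) raised to the power of minus decay. That this is the exact schedule used internally is an assumption here; the function name and the sample values are illustrative only.

```python
def learning_rate(iteration, learning_offset=1024.0, learning_decay=0.51):
    """Step size (learning_offset + iteration) ** -learning_decay.

    Assumed schedule, following the Online LDA paper; values are illustrative.
    """
    if learning_offset <= 0:
        raise ValueError("learning_offset must be positive")
    return (learning_offset + iteration) ** (-learning_decay)


iterations = (1, 10, 100)
small_offset = [learning_rate(t, learning_offset=1.0) for t in iterations]
large_offset = [learning_rate(t, learning_offset=1024.0) for t in iterations]
print(small_offset)
print(large_offset)
```

With the larger offset, the early step sizes are much smaller, which is what "downweights early iterations" means; in both cases the step size shrinks as the iteration count grows.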
-
maxIter
= Param(parent='undefined', name='maxIter', doc='max number of iterations (>= 0).')¶
-
optimizeDocConcentration
= Param(parent='undefined', name='optimizeDocConcentration', doc='Indicates whether the docConcentration (Dirichlet parameter for document-topic distribution) will be optimized during training.')¶
-
optimizer
= Param(parent='undefined', name='optimizer', doc='Optimizer or inference algorithm used to estimate the LDA model. Supported: online, em')¶
-
params
¶ Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
-
seed
= Param(parent='undefined', name='seed', doc='random seed.')¶
-
subsamplingRate
= Param(parent='undefined', name='subsamplingRate', doc='Fraction of the corpus to be sampled and used in each iteration of mini-batch gradient descent, in range (0, 1].')¶
-
topicConcentration
= Param(parent='undefined', name='topicConcentration', doc='Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics\' distributions over terms.')¶
-
topicDistributionCol
= Param(parent='undefined', name='topicDistributionCol', doc='Output column with estimates of the topic mixture distribution for each document (often called "theta" in the literature). Returns a vector of zeros for an empty document.')¶
-