LDAModel
class pyspark.mllib.clustering.LDAModel(java_model)
A clustering model derived from the LDA method.
Latent Dirichlet Allocation (LDA), a topic model designed for text documents.

Terminology:
- “word” = “term”: an element of the vocabulary
- “token”: instance of a term appearing in a document
- “topic”: multinomial distribution over words representing some concept

References:
- Original LDA paper (journal version): Blei, Ng, and Jordan. “Latent Dirichlet Allocation.” JMLR, 2003.
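The distinction between a "token" and a "term" can be illustrated with a minimal sketch in plain Python (no Spark required). It assumes that the vectors in the example data on this page hold one count per vocabulary term. The helper `term_count_vector`, the vocabulary, and the token list are hypothetical and exist only for illustration.

```python
from collections import Counter


def term_count_vector(tokens, vocabulary):
    """Count how often each vocabulary term occurs among a document's tokens."""
    index = {term: i for i, term in enumerate(vocabulary)}
    # Tokens that are not in the vocabulary are ignored.
    counts = Counter(t for t in tokens if t in index)
    return [float(counts[term]) for term in vocabulary]


# Hypothetical two-term vocabulary and document, for illustration only.
vocabulary = ["spark", "topic"]
tokens = ["topic", "spark", "topic", "model"]  # "model" is out of vocabulary

vector = term_count_vector(tokens, vocabulary)
print(vector)  # [1.0, 2.0]
```

Here the document contains four tokens but only two distinct in-vocabulary terms; the resulting vector has one entry per term, so its length equals the vocabulary size.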
>>> from numpy import array
>>> from numpy.testing import assert_almost_equal, assert_equal
>>> from pyspark.mllib.clustering import LDA, LDAModel
>>> from pyspark.mllib.linalg import SparseVector, Vectors
>>> data = [
...     [1, Vectors.dense([0.0, 1.0])],
...     [2, SparseVector(2, {0: 1.0})],
... ]
>>> rdd = sc.parallelize(data)
>>> model = LDA.train(rdd, k=2, seed=1)
>>> model.vocabSize()
2
>>> model.describeTopics()
[([1, 0], [0.5..., 0.49...]), ([0, 1], [0.5..., 0.49...])]
>>> model.describeTopics(1)
[([1], [0.5...]), ([0], [0.5...])]

>>> topics = model.topicsMatrix()
>>> topics_expect = array([[0.5, 0.5], [0.5, 0.5]])
>>> assert_almost_equal(topics, topics_expect, 1)

>>> import tempfile
>>> from shutil import rmtree
>>> path = tempfile.mkdtemp()
>>> model.save(sc, path)
>>> sameModel = LDAModel.load(sc, path)
>>> assert_equal(sameModel.topicsMatrix(), model.topicsMatrix())
>>> sameModel.vocabSize() == model.vocabSize()
True
>>> try:
...     rmtree(path)
... except OSError:
...     pass
New in version 1.5.0.
Methods Documentation
call(name, *a)
Call method of java_model.
describeTopics(maxTermsPerTopic=None)
Return the topics described by weighted terms.
WARNING: If vocabSize and k are large, this can return a large object!
- Parameters
maxTermsPerTopic – Maximum number of terms to collect for each topic. (default: vocabulary size)
- Returns
Array over topics. Each topic is represented as a pair of matching arrays: (term indices, term weights in topic). Each topic’s terms are sorted in order of decreasing weight.
New in version 1.6.0.
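The return structure of describeTopics can be made more readable by mapping term indices back to vocabulary terms. The following is a minimal sketch in plain Python, not part of the LDAModel API: the helper `label_topics` and the vocabulary are hypothetical, and the weights are made-up values shaped like the documented return value rather than real model output.

```python
def label_topics(described, vocabulary):
    """Replace term indices with vocabulary terms, keeping the weight order."""
    return [
        [(vocabulary[i], w) for i, w in zip(indices, weights)]
        for indices, weights in described
    ]


# Hypothetical vocabulary; the weights below are illustrative stand-ins
# shaped like the result of describeTopics(), not real model output.
vocabulary = ["spark", "topic"]
described = [([1, 0], [0.6, 0.4]), ([0, 1], [0.7, 0.3])]

labeled = label_topics(described, vocabulary)
for topic_id, terms in enumerate(labeled):
    print(topic_id, terms)
```

Because each topic contributes at most maxTermsPerTopic (index, weight) pairs, the full result holds up to k × maxTermsPerTopic pairs, which is why the warning above applies when both vocabSize and k are large.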
classmethod load(sc, path)
Load the LDAModel from disk.
- Parameters
sc – SparkContext.
path – Path to where the model is stored.
New in version 1.5.0.
save(sc, path)
Save this model to the given path.
New in version 1.3.0.