Word2Vec¶
-
class
pyspark.ml.feature.
Word2Vec
(vectorSize=100, minCount=5, numPartitions=1, stepSize=0.025, maxIter=1, seed=None, inputCol=None, outputCol=None, windowSize=5, maxSentenceLength=1000)[source]¶ Word2Vec trains a model of Map(String, Vector), i.e. it maps each word to a vector (code) that can be used in further natural language processing or machine learning.
>>> sent = ("a b " * 100 + "a c " * 10).split(" ")
>>> doc = spark.createDataFrame([(sent,), (sent,)], ["sentence"])
>>> word2Vec = Word2Vec(vectorSize=5, seed=42, inputCol="sentence", outputCol="model")
>>> word2Vec.setMaxIter(10)
Word2Vec...
>>> word2Vec.getMaxIter()
10
>>> word2Vec.clear(word2Vec.maxIter)
>>> model = word2Vec.fit(doc)
>>> model.getMinCount()
5
>>> model.setInputCol("sentence")
Word2VecModel...
>>> model.getVectors().show()
+----+--------------------+
|word|              vector|
+----+--------------------+
|   a|[0.09511678665876...|
|   b|[-1.2028766870498...|
|   c|[0.30153277516365...|
+----+--------------------+
...
>>> model.findSynonymsArray("a", 2)
[('b', 0.015859870240092278), ('c', -0.5680795907974243)]
>>> from pyspark.sql.functions import format_number as fmt
>>> model.findSynonyms("a", 2).select("word", fmt("similarity", 5).alias("similarity")).show()
+----+----------+
|word|similarity|
+----+----------+
|   b|   0.01586|
|   c|  -0.56808|
+----+----------+
...
>>> model.transform(doc).head().model
DenseVector([-0.4833, 0.1855, -0.273, -0.0509, -0.4769])
>>> word2vecPath = temp_path + "/word2vec"
>>> word2Vec.save(word2vecPath)
>>> loadedWord2Vec = Word2Vec.load(word2vecPath)
>>> loadedWord2Vec.getVectorSize() == word2Vec.getVectorSize()
True
>>> loadedWord2Vec.getNumPartitions() == word2Vec.getNumPartitions()
True
>>> loadedWord2Vec.getMinCount() == word2Vec.getMinCount()
True
>>> modelPath = temp_path + "/word2vec-model"
>>> model.save(modelPath)
>>> loadedModel = Word2VecModel.load(modelPath)
>>> loadedModel.getVectors().first().word == model.getVectors().first().word
True
>>> loadedModel.getVectors().first().vector == model.getVectors().first().vector
True
New in version 1.4.0.
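The example above ranks synonyms by a similarity score. As a rough illustration, and assuming that score is a cosine similarity between word vectors, the ranking could be sketched in plain Python as follows. The names `cosine_similarity`, `find_synonyms` and `toy_vectors` are hypothetical and are not part of PySpark; the vectors are made up.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def find_synonyms(vectors, word, num):
    """Return the `num` words most similar to `word`, excluding `word` itself."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:num]

# Toy stand-in for the Map(String, Vector) that a fitted model holds.
toy_vectors = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [-1.0, 0.0]}
synonyms = find_synonyms(toy_vectors, "a", 2)
```

With these toy vectors, "b" ranks ahead of "c" because it points in a direction closer to "a".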
Methods Documentation
-
clear
(param)¶ Clears a param from the param map if it has been explicitly set.
-
copy
(extra=None)¶ Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
- Parameters
extra – Extra parameters to copy to the new instance
- Returns
Copy of this instance
-
explainParam
(param)¶ Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams
()¶ Returns the documentation of all params with their optional default values and user-supplied values.
-
extractParamMap
(extra=None)¶ Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- Parameters
extra – extra param values
- Returns
merged param map
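The precedence rule described for extractParamMap (default param values < user-supplied values < extra) can be illustrated with ordinary dictionaries. This is only a sketch of the ordering; `merge_param_maps` and the three dictionaries are hypothetical and do not reflect how PySpark stores params internally.

```python
def merge_param_maps(defaults, user_supplied, extra=None):
    """Merge three layers; later layers win on conflicts."""
    merged = dict(defaults)          # lowest priority
    merged.update(user_supplied)     # overrides defaults
    merged.update(extra or {})       # highest priority
    return merged

# Hypothetical values for the three layers.
defaults = {"vectorSize": 100, "minCount": 5, "maxIter": 1}
user_supplied = {"vectorSize": 5, "maxIter": 10}
extra = {"maxIter": 3}

merged = merge_param_maps(defaults, user_supplied, extra)
```

Here `maxIter` is present in all three layers, so the value from `extra` is the one that survives, while `minCount` keeps its default.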
-
fit
(dataset, params=None)¶ Fits a model to the input dataset with optional parameters.
- Parameters
dataset – input dataset, which is an instance of pyspark.sql.DataFrame
params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
- Returns
fitted model(s)
New in version 1.3.0.
-
fitMultiple
(dataset, paramMaps)¶ Fits a model to the input dataset for each param map in paramMaps.
- Parameters
dataset – input dataset, which is an instance of pyspark.sql.DataFrame.
paramMaps – A sequence of param maps.
- Returns
A thread-safe iterable that contains one model for each param map. Each call to next(modelIterator) returns (index, model), where model was fit using paramMaps[index]. The index values may not be sequential.
New in version 2.3.0.
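Because fitMultiple may yield its (index, model) pairs out of order, a caller that wants the models aligned with paramMaps needs to place each one by its index. The sketch below shows that pattern with a hypothetical stand-in generator, `fake_fit_multiple`, which deliberately yields in reverse order; it does not call PySpark.

```python
def fake_fit_multiple(param_maps):
    """Hypothetical stand-in: yields (index, model) pairs in non-sequential order."""
    for index in reversed(range(len(param_maps))):
        yield index, "model fitted with {}".format(param_maps[index])

param_maps = [{"vectorSize": 5}, {"vectorSize": 10}, {"vectorSize": 20}]

# Pre-allocate one slot per param map, then fill each slot by index
# rather than by arrival order.
models = [None] * len(param_maps)
for index, model in fake_fit_multiple(param_maps):
    models[index] = model
```

After the loop, `models[i]` corresponds to `param_maps[i]` regardless of the order in which the pairs arrived.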
-
getInputCol
()¶ Gets the value of inputCol or its default value.
-
getMaxIter
()¶ Gets the value of maxIter or its default value.
-
getMaxSentenceLength
()¶ Gets the value of maxSentenceLength or its default value.
New in version 2.0.0.
-
getMinCount
()¶ Gets the value of minCount or its default value.
New in version 1.4.0.
-
getNumPartitions
()¶ Gets the value of numPartitions or its default value.
New in version 1.4.0.
-
getOrDefault
(param)¶ Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getOutputCol
()¶ Gets the value of outputCol or its default value.
-
getParam
(paramName)¶ Gets a param by its name.
-
getSeed
()¶ Gets the value of seed or its default value.
-
getStepSize
()¶ Gets the value of stepSize or its default value.
-
getVectorSize
()¶ Gets the value of vectorSize or its default value.
New in version 1.4.0.
-
getWindowSize
()¶ Gets the value of windowSize or its default value.
New in version 2.0.0.
-
hasDefault
(param)¶ Checks whether a param has a default value.
-
hasParam
(paramName)¶ Tests whether this instance contains a param with a given (string) name.
-
isDefined
(param)¶ Checks whether a param is explicitly set by user or has a default value.
-
isSet
(param)¶ Checks whether a param is explicitly set by user.
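The difference between hasDefault, isSet, isDefined, getOrDefault and clear is easier to see on a toy object. The class below is a simplified, hypothetical model of the behaviour described in these entries, keyed by param name; it is not PySpark's Params implementation.

```python
class TinyParams:
    """Hypothetical, simplified stand-in for a Params-like object."""

    def __init__(self, defaults):
        self._defaults = dict(defaults)  # default values
        self._user = {}                  # explicitly set values

    def set(self, name, value):
        self._user[name] = value

    def clear(self, name):
        # Only removes an explicitly set value; defaults are untouched.
        self._user.pop(name, None)

    def is_set(self, name):
        return name in self._user

    def has_default(self, name):
        return name in self._defaults

    def is_defined(self, name):
        return self.is_set(name) or self.has_default(name)

    def get_or_default(self, name):
        if name in self._user:
            return self._user[name]
        if name in self._defaults:
            return self._defaults[name]
        raise KeyError("param {!r} is neither set nor has a default".format(name))

params = TinyParams(defaults={"maxIter": 1})
params.set("maxIter", 10)
value_when_set = params.get_or_default("maxIter")     # user-supplied value
params.clear("maxIter")
value_after_clear = params.get_or_default("maxIter")  # falls back to the default
```

This mirrors the setMaxIter(10) / clear(word2Vec.maxIter) sequence in the class example: once the explicit value is cleared, the param is no longer set but is still defined through its default.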
-
classmethod
load
(path)¶ Reads an ML instance from the input path, a shortcut of read().load(path).
-
classmethod
read
()¶ Returns an MLReader instance for this class.
-
save
(path)¶ Saves this ML instance to the given path, a shortcut of write().save(path).
-
set
(param, value)¶ Sets a parameter in the embedded param map.
-
setMaxSentenceLength
(value)[source]¶ Sets the value of maxSentenceLength.
New in version 2.0.0.
-
setNumPartitions
(value)[source]¶ Sets the value of numPartitions.
New in version 1.4.0.
-
setParams
(self, vectorSize=100, minCount=5, numPartitions=1, stepSize=0.025, maxIter=1, seed=None, inputCol=None, outputCol=None, windowSize=5, maxSentenceLength=1000)[source]¶ Sets params for this Word2Vec.
New in version 1.4.0.
-
setVectorSize
(value)[source]¶ Sets the value of vectorSize.
New in version 1.4.0.
-
setWindowSize
(value)[source]¶ Sets the value of windowSize.
New in version 2.0.0.
-
write
()¶ Returns an MLWriter instance for this ML instance.
Attributes Documentation
-
inputCol
= Param(parent='undefined', name='inputCol', doc='input column name.')¶
-
maxIter
= Param(parent='undefined', name='maxIter', doc='max number of iterations (>= 0).')¶
-
maxSentenceLength
= Param(parent='undefined', name='maxSentenceLength', doc='Maximum length (in words) of each sentence in the input data. Any sentence longer than this threshold will be divided into chunks up to the size.')¶
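As a rough picture of what "divided into chunks up to the size" means, the sketch below splits a long token list into consecutive pieces of at most `max_sentence_length` words. `chunk_sentence` is a hypothetical helper written for illustration; the exact chunking performed during training may differ.

```python
def chunk_sentence(words, max_sentence_length):
    """Split a list of words into consecutive chunks of at most the given length."""
    if max_sentence_length <= 0:
        raise ValueError("max_sentence_length must be positive")
    return [
        words[start:start + max_sentence_length]
        for start in range(0, len(words), max_sentence_length)
    ]

# Seven words with a limit of three give chunks of 3, 3 and 1 words.
chunks = chunk_sentence(["a", "b", "c", "d", "e", "f", "g"], 3)
```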
-
minCount
= Param(parent='undefined', name='minCount', doc="the minimum number of times a token must appear to be included in the word2vec model's vocabulary")¶
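The effect of minCount can be checked by hand on the data from the class example. The sketch below counts tokens over the two identical sentences and keeps those that appear at least `min_count` times; it assumes counts are taken over the whole input and is a plain-Python illustration, not PySpark's vocabulary-building code.

```python
from collections import Counter

# Same token list as in the class example; note that splitting on " "
# leaves one empty-string token at the end of the list.
sent = ("a b " * 100 + "a c " * 10).split(" ")
corpus = [sent, sent]

# Count every token across all sentences.
counts = Counter(token for sentence in corpus for token in sentence)

min_count = 5
vocabulary = sorted(token for token, n in counts.items() if n >= min_count)
```

The empty-string token occurs only twice, so it falls below the threshold, which is consistent with the example's getVectors() output listing only a, b and c.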
-
numPartitions
= Param(parent='undefined', name='numPartitions', doc='number of partitions for sentences of words')¶
-
outputCol
= Param(parent='undefined', name='outputCol', doc='output column name.')¶
-
params
¶ Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
-
seed
= Param(parent='undefined', name='seed', doc='random seed.')¶
-
stepSize
= Param(parent='undefined', name='stepSize', doc='Step size to be used for each iteration of optimization (>= 0).')¶
-
vectorSize
= Param(parent='undefined', name='vectorSize', doc='the dimension of codes after transforming from words')¶
-
windowSize
= Param(parent='undefined', name='windowSize', doc='the window size (context words from [-window, window]). Default value is 5')¶
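The phrase "context words from [-window, window]" refers to the neighbours of a word within that many positions on either side. The hypothetical helper below lists the maximal context for one position, clipped at the sentence boundaries. Word2vec implementations commonly sample a smaller effective window during training, so treat this as an upper bound rather than the exact training behaviour.

```python
def context_words(words, position, window):
    """Return the words within `window` positions of `position`, excluding the word itself."""
    start = max(0, position - window)
    end = min(len(words), position + window + 1)
    return [words[i] for i in range(start, end) if i != position]

words = ["the", "quick", "brown", "fox", "jumps", "over", "dog"]

# Context of "fox" (index 3) with a window of 2.
context = context_words(words, 3, 2)

# At the start of the sentence the window is clipped on the left.
context_at_start = context_words(words, 0, 2)
```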
-