6.2. Feature extraction¶
The sklearn.feature_extraction
module can be used to extract
features in a format supported by machine learning algorithms from datasets
consisting of formats such as text and images.
Note
Feature extraction is very different from Feature selection: the former consists in transforming arbitrary data, such as text or images, into numerical features usable for machine learning. The latter is a machine learning technique applied to these features.
6.2.1. Loading features from dicts¶
The class DictVectorizer
can be used to convert feature
arrays represented as lists of standard Python dict
objects to the
NumPy/SciPy representation used by scikit-learn estimators.
While not particularly fast to process, Python’s dict
has the
advantages of being convenient to use, being sparse (absent features
need not be stored) and storing feature names in addition to values.
DictVectorizer
implements what is called one-of-K or “one-hot”
coding for categorical (aka nominal, discrete) features. Categorical
features are “attribute-value” pairs where the value is restricted
to a list of discrete possibilities without ordering (e.g. topic
identifiers, types of objects, tags, names…).
In the following, “city” is a categorical attribute while “temperature” is a traditional numerical feature:
>>> measurements = [
... {'city': 'Dubai', 'temperature': 33.},
... {'city': 'London', 'temperature': 12.},
... {'city': 'San Francisco', 'temperature': 18.},
... ]
>>> from sklearn.feature_extraction import DictVectorizer
>>> vec = DictVectorizer()
>>> vec.fit_transform(measurements).toarray()
array([[ 1.,  0.,  0., 33.],
       [ 0.,  1.,  0., 12.],
       [ 0.,  0.,  1., 18.]])
>>> vec.get_feature_names_out()
array(['city=Dubai', 'city=London', 'city=San Francisco', 'temperature'], ...)
DictVectorizer
accepts multiple string values for one
feature, e.g., multiple categories for a movie.
Assume a database classifies each movie using some categories (not mandatory) and its year of release.
>>> movie_entry = [{'category': ['thriller', 'drama'], 'year': 2003},
... {'category': ['animation', 'family'], 'year': 2011},
... {'year': 1974}]
>>> vec.fit_transform(movie_entry).toarray()
array([[0.000e+00, 1.000e+00, 0.000e+00, 1.000e+00, 2.003e+03],
       [1.000e+00, 0.000e+00, 1.000e+00, 0.000e+00, 2.011e+03],
       [0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 1.974e+03]])
>>> vec.get_feature_names_out()
array(['category=animation', 'category=drama', 'category=family',
       'category=thriller', 'year'], ...)
>>> vec.transform({'category': ['thriller'],
... 'unseen_feature': '3'}).toarray()
array([[0., 0., 0., 1., 0.]])
DictVectorizer
is also a useful representation transformation
for training sequence classifiers in Natural Language Processing models
that typically work by extracting feature windows around a particular
word of interest.
For example, suppose that we have a first algorithm that extracts Part of Speech (PoS) tags that we want to use as complementary tags for training a sequence classifier (e.g. a chunker). The following dict could be such a window of features extracted around the word ‘sat’ in the sentence ‘The cat sat on the mat.’:
>>> pos_window = [
... {
... 'word-2': 'the',
... 'pos-2': 'DT',
... 'word-1': 'cat',
... 'pos-1': 'NN',
... 'word+1': 'on',
... 'pos+1': 'PP',
... },
... # in a real application one would extract many such dictionaries
... ]
This description can be vectorized into a sparse two-dimensional matrix
suitable for feeding into a classifier (maybe after being piped into a
TfidfTransformer
for normalization):
>>> vec = DictVectorizer()
>>> pos_vectorized = vec.fit_transform(pos_window)
>>> pos_vectorized
<1x6 sparse matrix of type '<... 'numpy.float64'>'
with 6 stored elements in Compressed Sparse ... format>
>>> pos_vectorized.toarray()
array([[1., 1., 1., 1., 1., 1.]])
>>> vec.get_feature_names_out()
array(['pos+1=PP', 'pos-1=NN', 'pos-2=DT', 'word+1=on', 'word-1=cat',
       'word-2=the'], ...)
As you can imagine, if one extracts such a context around each individual
word of a corpus of documents, the resulting matrix will be very wide
(many one-hot features), with most entries equal to zero. To make the
resulting data structure fit in memory, the DictVectorizer class uses a
scipy.sparse matrix by default instead of a numpy.ndarray.
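If the feature space is small enough for a dense array, this default can be overridden through the sparse constructor parameter; a minimal sketch, reusing the city/temperature example above:
from sklearn.feature_extraction import DictVectorizer

measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
]

# default: fit_transform returns a scipy.sparse matrix
X_sparse = DictVectorizer().fit_transform(measurements)

# sparse=False: fit_transform returns a dense numpy.ndarray instead
X_dense = DictVectorizer(sparse=False).fit_transform(measurements)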
6.2.2. Feature hashing¶
The class FeatureHasher
is a high-speed, low-memory vectorizer that
uses a technique known as
feature hashing,
or the “hashing trick”.
Instead of building a hash table of the features encountered in training,
as the vectorizers do, instances of FeatureHasher
apply a hash function to the features
to determine their column index in sample matrices directly.
The result is increased speed and reduced memory usage,
at the expense of inspectability;
the hasher does not remember what the input features looked like
and has no inverse_transform
method.
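As a minimal sketch of this behaviour (the feature dictionaries below are made up for illustration), note that no vocabulary has to be fitted before calling transform:
from sklearn.feature_extraction import FeatureHasher

# the column index of each feature is obtained by hashing its name,
# not by looking it up in a table built during training
hasher = FeatureHasher(n_features=16)
X = hasher.transform([
    {'dog': 1, 'cat': 2, 'elephant': 4},
    {'dog': 2, 'run': 5},
])
# X is a 2 x 16 scipy.sparse matrix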
Since the hash function might cause collisions between (unrelated) features,
a signed hash function is used and the sign of the hash value
determines the sign of the value stored in the output matrix for a feature.
This way, collisions are likely to cancel out rather than accumulate error,
and the expected mean of any output feature’s value is zero. This mechanism
is enabled by default with alternate_sign=True
and is particularly useful
for small hash table sizes (n_features < 10000
). For large hash table
sizes, it can be disabled, to allow the output to be passed to estimators like
MultinomialNB
or
chi2
feature selectors that expect non-negative inputs.
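For instance, a short sketch of disabling the alternating sign when a downstream estimator requires non-negative inputs (the feature dictionary is made up):
from sklearn.feature_extraction import FeatureHasher

# alternate_sign=False guarantees non-negative output values,
# at the cost of collisions accumulating instead of cancelling out
hasher = FeatureHasher(n_features=2 ** 18, alternate_sign=False)
X = hasher.transform([{'token=cat': 1, 'pos=NN': 1}])
# X can now be passed to MultinomialNB or a chi2 feature selector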
FeatureHasher
accepts either mappings
(like Python’s dict
and its variants in the collections
module),
(feature, value)
pairs, or strings,
depending on the constructor parameter input_type
.
Mappings are treated as lists of (feature, value)
pairs,
while single strings have an implicit value of 1,
so ['feat1', 'feat2', 'feat3']
is interpreted as
[('feat1', 1), ('feat2', 1), ('feat3', 1)]
.
If a single feature occurs multiple times in a sample,
the associated values will be summed
(so ('feat', 2)
and ('feat', 3.5)
become ('feat', 5.5)
).
The output from FeatureHasher
is always a scipy.sparse
matrix
in the CSR format.
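A small sketch of these input conventions (the feature names are arbitrary):
from sklearn.feature_extraction import FeatureHasher

# input_type='pair': each sample is an iterable of (feature, value) pairs;
# the repeated 'feat' entries below are summed into one column of magnitude 5.5
hasher = FeatureHasher(n_features=8, input_type='pair')
X_pairs = hasher.transform([[('feat', 2), ('feat', 3.5), ('other', 1)]])

# input_type='string': each string carries an implicit value of 1
hasher = FeatureHasher(n_features=8, input_type='string')
X_strings = hasher.transform([['feat1', 'feat2', 'feat3']])

# both results are scipy.sparse matrices in the CSR format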
Feature hashing can be employed in document classification,
but unlike CountVectorizer
,
FeatureHasher
does not do word
splitting or any other preprocessing except Unicode-to-UTF-8 encoding;
see Vectorizing a large text corpus with the hashing trick, below, for a combined tokenizer/hasher.
As an example, consider a word-level natural language processing task
that needs features extracted from (token, part_of_speech)
pairs.
One could use a Python generator function to extract features:
def token_features(token, part_of_speech):
    if token.isdigit():
        yield "numeric"
    else:
        yield "token={}".format(token.lower())
        yield "token,pos={},{}".format(token, part_of_speech)
    if token[0].isupper():
        yield "uppercase_initial"
    if token.isupper():
        yield "all_uppercase"
    yield "pos={}".format(part_of_speech)
Then, the raw_X
to be fed to FeatureHasher.transform
can be constructed using:
raw_X = (token_features(tok, pos_tagger(tok)) for tok in corpus)
and fed to a hasher with:
hasher = FeatureHasher(input_type='string')
X = hasher.transform(raw_X)
to get a scipy.sparse
matrix X
.
Note the use of a generator comprehension, which introduces laziness into the feature extraction: tokens are only processed on demand from the hasher.
Implementation details
FeatureHasher
uses the signed 32-bit variant of MurmurHash3.
As a result (and because of limitations in scipy.sparse
),
the maximum number of features supported is currently \(2^{31} - 1\).
The original formulation of the hashing trick by Weinberger et al. used two separate hash functions \(h\) and \(\xi\) to determine the column index and sign of a feature, respectively. The present implementation works under the assumption that the sign bit of MurmurHash3 is independent of its other bits.
Since a simple modulo is used to transform the hash function output to a column index,
it is advisable to use a power of two as the n_features
parameter;
otherwise the features will not be mapped evenly to the columns.
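Following this advice, a sketch of a hasher sized with a power of two (the default, 2 ** 20, already satisfies it):
from sklearn.feature_extraction import FeatureHasher

# 2 ** 18 columns: a power of two, so the modulo step maps hash
# values evenly across the columns
hasher = FeatureHasher(n_features=2 ** 18, input_type='string')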