6.2. Feature extraction

The sklearn.feature_extraction module can be used to extract features in a format supported by machine learning algorithms from datasets in formats such as text and image.

Note

Feature extraction is very different from Feature selection: the former consists of transforming arbitrary data, such as text or images, into numerical features usable for machine learning. The latter is a machine learning technique applied to these features.

6.2.1. Loading features from dicts

The class DictVectorizer can be used to convert feature arrays represented as lists of standard Python dict objects to the NumPy/SciPy representation used by scikit-learn estimators.

While not particularly fast to process, Python’s dict has the advantages of being convenient to use, being sparse (absent features need not be stored) and storing feature names in addition to values.

DictVectorizer implements what is called one-of-K or “one-hot” coding for categorical (aka nominal, discrete) features. Categorical features are “attribute-value” pairs where the value is restricted to a list of discrete possibilities without ordering (e.g. topic identifiers, types of objects, tags, names…).

In the following, “city” is a categorical attribute while “temperature” is a traditional numerical feature:

>>> measurements = [
...     {'city': 'Dubai', 'temperature': 33.},
...     {'city': 'London', 'temperature': 12.},
...     {'city': 'San Francisco', 'temperature': 18.},
... ]

>>> from sklearn.feature_extraction import DictVectorizer
>>> vec = DictVectorizer()

>>> vec.fit_transform(measurements).toarray()
array([[ 1.,  0.,  0., 33.],
       [ 0.,  1.,  0., 12.],
       [ 0.,  0.,  1., 18.]])

>>> vec.get_feature_names_out()
array(['city=Dubai', 'city=London', 'city=San Francisco', 'temperature'], ...)

DictVectorizer accepts multiple string values for one feature, for example, multiple categories for a movie.

Assume a database classifies each movie using some categories (which are optional) and its year of release.

>>> movie_entry = [{'category': ['thriller', 'drama'], 'year': 2003},
...                {'category': ['animation', 'family'], 'year': 2011},
...                {'year': 1974}]
>>> vec.fit_transform(movie_entry).toarray()
array([[0.000e+00, 1.000e+00, 0.000e+00, 1.000e+00, 2.003e+03],
       [1.000e+00, 0.000e+00, 1.000e+00, 0.000e+00, 2.011e+03],
       [0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 1.974e+03]])
>>> vec.get_feature_names_out()
array(['category=animation', 'category=drama', 'category=family',
       'category=thriller', 'year'], ...)
>>> vec.transform({'category': ['thriller'],
...                'unseen_feature': '3'}).toarray()
array([[0., 0., 0., 1., 0.]])
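
Because DictVectorizer stores the feature names it has seen, the transformation can also be reversed: inverse_transform rebuilds one dict of active features per sample. A minimal sketch, reusing the vectorizer fitted on movie_entry above:

>>> dicts = vec.inverse_transform(vec.transform(movie_entry))
>>> dicts[2]
{'year': 1974.0}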

DictVectorizer is also a useful representation transformation for training sequence classifiers in Natural Language Processing models that typically work by extracting feature windows around a particular word of interest.

For example, suppose that we have a first algorithm that extracts Part of Speech (PoS) tags that we want to use as complementary tags for training a sequence classifier (e.g. a chunker). The following dict could be such a window of features extracted around the word ‘sat’ in the sentence ‘The cat sat on the mat.’:

>>> pos_window = [
...     {
...         'word-2': 'the',
...         'pos-2': 'DT',
...         'word-1': 'cat',
...         'pos-1': 'NN',
...         'word+1': 'on',
...         'pos+1': 'PP',
...     },
...     # in a real application one would extract many such dictionaries
... ]

This description can be vectorized into a sparse two-dimensional matrix suitable for feeding into a classifier (maybe after being piped into a TfidfTransformer for normalization):

>>> vec = DictVectorizer()
>>> pos_vectorized = vec.fit_transform(pos_window)
>>> pos_vectorized
<1x6 sparse matrix of type '<... 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse ... format>
>>> pos_vectorized.toarray()
array([[1., 1., 1., 1., 1., 1.]])
>>> vec.get_feature_names_out()
array(['pos+1=PP', 'pos-1=NN', 'pos-2=DT', 'word+1=on', 'word-1=cat',
       'word-2=the'], ...)

As you can imagine, if one extracts such a context around each individual word of a corpus of documents, the resulting matrix will be very wide (many one-hot features), with most values being zero most of the time. To make the resulting data structure fit in memory, the DictVectorizer class uses a scipy.sparse matrix by default instead of a numpy.ndarray.
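
If a dense array is preferred, for instance for a small dataset, sparse output can be switched off with the sparse constructor parameter; a minimal sketch on the pos_window data above:

>>> vec_dense = DictVectorizer(sparse=False)
>>> vec_dense.fit_transform(pos_window)
array([[1., 1., 1., 1., 1., 1.]])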

6.2.2. Feature hashing

The class FeatureHasher is a high-speed, low-memory vectorizer that uses a technique known as feature hashing, or the “hashing trick”. Instead of building a hash table of the features encountered in training, as the vectorizers do, instances of FeatureHasher apply a hash function to the features to determine their column index in sample matrices directly. The result is increased speed and reduced memory usage, at the expense of inspectability; the hasher does not remember what the input features looked like and has no inverse_transform method.

Since the hash function might cause collisions between (unrelated) features, a signed hash function is used and the sign of the hash value determines the sign of the value stored in the output matrix for a feature. This way, collisions are likely to cancel out rather than accumulate error, and the expected mean of any output feature’s value is zero. This mechanism is enabled by default with alternate_sign=True and is particularly useful for small hash table sizes (n_features < 10000). For large hash table sizes, it can be disabled, to allow the output to be passed to estimators like MultinomialNB or chi2 feature selectors that expect non-negative inputs.

FeatureHasher accepts mappings (like Python’s dict and its variants in the collections module), (feature, value) pairs, or strings, depending on the constructor parameter input_type. Mappings are treated as lists of (feature, value) pairs, while single strings have an implicit value of 1, so ['feat1', 'feat2', 'feat3'] is interpreted as [('feat1', 1), ('feat2', 1), ('feat3', 1)]. If a single feature occurs multiple times in a sample, the associated values will be summed (so ('feat', 2) and ('feat', 3.5) become ('feat', 5.5)). The output from FeatureHasher is always a scipy.sparse matrix in the CSR format.
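
As a minimal sketch of these input modes (the exact column indices and signs depend on the hash of each feature string, so no output is shown):

from sklearn.feature_extraction import FeatureHasher

# input_type='dict' (the default): each sample is a mapping of feature name to value
h_dict = FeatureHasher(n_features=2**10)
X_dict = h_dict.transform([{'dog': 1, 'cat': 2}, {'dog': 2, 'run': 5}])

# input_type='pair': each sample is an iterable of (feature, value) pairs
h_pair = FeatureHasher(n_features=2**10, input_type='pair')
X_pair = h_pair.transform([[('dog', 1), ('cat', 2)], [('dog', 2), ('run', 5)]])

# input_type='string': each string carries an implicit value of 1
h_str = FeatureHasher(n_features=2**10, input_type='string')
X_str = h_str.transform([['feat1', 'feat2', 'feat3']])

# alternate_sign=False keeps the stored values non-negative,
# e.g. for MultinomialNB or chi2 feature selection
h_nonneg = FeatureHasher(n_features=2**10, alternate_sign=False)

All of these calls return scipy.sparse matrices in the CSR format.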

Feature hashing can be employed in document classification, but unlike CountVectorizer, FeatureHasher does not do word splitting or any other preprocessing except Unicode-to-UTF-8 encoding; see Vectorizing a large text corpus with the hashing trick, below, for a combined tokenizer/hasher.

As an example, consider a word-level natural language processing task that needs features extracted from (token, part_of_speech) pairs. One could use a Python generator function to extract features:

def token_features(token, part_of_speech):
    if token.isdigit():
        yield "numeric"
    else:
        yield "token={}".format(token.lower())
        yield "token,pos={},{}".format(token, part_of_speech)
    if token[0].isupper():
        yield "uppercase_initial"
    if token.isupper():
        yield "all_uppercase"
    yield "pos={}".format(part_of_speech)

Then, the raw_X to be fed to FeatureHasher.transform can be constructed using:

raw_X = (token_features(tok, pos_tagger(tok)) for tok in corpus)

and fed to a hasher with:

from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(input_type='string')
X = hasher.transform(raw_X)

to get a scipy.sparse matrix X.

Note the use of a generator expression, which introduces laziness into the feature extraction: tokens are only processed on demand from the hasher.
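
To run the snippet above end to end, corpus and pos_tagger have to be supplied by the application; the following hypothetical stand-ins (a two-token corpus and a trivial tagger) are only there to make the sketch executable:

# Hypothetical stand-ins; a real application would use a tokenized corpus
# and an actual part-of-speech tagger.
corpus = ["The", "cat"]

def pos_tagger(token):
    return "DT" if token.lower() == "the" else "NN"

With these defined before raw_X is built, X ends up as a scipy.sparse matrix with one row per token in corpus.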

Implementation details

FeatureHasher uses the signed 32-bit variant of MurmurHash3. As a result (and because of limitations in scipy.sparse), the maximum number of features supported is currently \(2^{31} - 1\).

The original formulation of the hashing trick by Weinberger et al. used two separate hash functions \(h\) and \(\xi\) to determine the column index and sign of a feature, respectively. The present implementation works under the assumption that the sign bit of MurmurHash3 is independent of its other bits.

Since a simple modulo is used to transform the hash function to a column index, it is advisable to use a power of two as the n_features parameter; otherwise the features will not be mapped evenly to the columns.
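
For instance, one might pick a moderately large power of two as a starting point; this is only a sketch of a reasonable default, not a recommendation for every dataset:

hasher = FeatureHasher(n_features=2**18, input_type='string')  # 2**18 = 262144 columns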