The ML.MULTI_HOT_ENCODER function

This document describes the ML.MULTI_HOT_ENCODER function, which lets you encode a string array expression by using a multi-hot encoding scheme.

The encoding vocabulary is sorted alphabetically. NULL values and categories that aren't in the vocabulary are encoded with an index value of 0.

When you use ML.MULTI_HOT_ENCODER in the TRANSFORM clause, the vocabulary calculated during training, together with the top_k and frequency_threshold values that you specified, is automatically applied during prediction.
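As a minimal sketch of the TRANSFORM behavior described above, the following statements train a model that encodes a string array column and then run prediction on raw, unencoded input. The dataset, table, model, and column names (`mydataset`, `training_articles`, `new_articles`, `tag_model`, `tags`, `label`) are hypothetical, and the model type and argument values are chosen only for illustration.

```sql
-- Hypothetical training table with an ARRAY<STRING> column `tags`
-- and a label column `label`.
CREATE OR REPLACE MODEL `mydataset.tag_model`
  TRANSFORM(
    -- Keep at most 10 categories, each occurring at least 2 times.
    ML.MULTI_HOT_ENCODER(tags, 10, 2) OVER () AS tags_encoded,
    label
  )
  OPTIONS (
    model_type = 'LOGISTIC_REG',
    input_label_cols = ['label']
  )
AS
SELECT tags, label
FROM `mydataset.training_articles`;

-- At prediction time, pass the raw `tags` column. The encoding defined
-- in the TRANSFORM clause is applied without being restated here.
SELECT *
FROM ML.PREDICT(
  MODEL `mydataset.tag_model`,
  (SELECT tags FROM `mydataset.new_articles`)
);
```

Because the encoding lives in the TRANSFORM clause, the prediction query does not need to call ML.MULTI_HOT_ENCODER itself.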

Syntax

ML.MULTI_HOT_ENCODER(array_expression [, top_k] [, frequency_threshold]) OVER()
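The syntax above marks top_k and frequency_threshold as optional. The following sketch shows the three call shapes that the syntax allows, assuming a hypothetical table `mydataset.articles` with an ARRAY<STRING> column named `tags`. The argument values are illustrative only.

```sql
-- `mydataset.articles` and `tags` are hypothetical names.
SELECT
  -- Required argument only.
  ML.MULTI_HOT_ENCODER(tags) OVER () AS encoded_no_limits,
  -- With top_k.
  ML.MULTI_HOT_ENCODER(tags, 10) OVER () AS encoded_top_10,
  -- With top_k and frequency_threshold.
  ML.MULTI_HOT_ENCODER(tags, 10, 2) OVER () AS encoded_top_10_min_2
FROM `mydataset.articles`;
```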

Arguments

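ML.MULTI_HOT_ENCODER takes the following arguments:

array_expression: the string array expression to encode.
top_k: an optional value that limits the encoding vocabulary to the categories that occur most frequently in the data. Categories outside the vocabulary are encoded with an index value of 0.
frequency_threshold: an optional value that limits the encoding vocabulary to the categories that occur at least this many times in the data. Categories outside the vocabulary are encoded with an index value of 0.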

Output

ML.MULTI_HOT_ENCODER returns an array of struct values in the form ARRAY<STRUCT<INT64, FLOAT64>>. The array contains one struct for each encoded category. The first element in each struct is the index of the encoded category, and the second element is the encoded value at that index.
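Because the result is an array of structs, it can be convenient to flatten it into one row per index and value pair. The following sketch does this with UNNEST. It assumes that the struct fields are named `index` and `value`; the CTE name `encoded` and the alias `o` are arbitrary.

```sql
WITH encoded AS (
  SELECT
    f[OFFSET(0)] AS f0,
    ML.MULTI_HOT_ENCODER(f, 3, 1) OVER () AS output
  FROM
    (
      SELECT ['a', 'b', 'b', 'c', NULL] AS f
      UNION ALL
      SELECT ['c', 'c', 'd', 'd', NULL] AS f
    )
)
-- Produce one row per (index, value) struct in the output array.
SELECT
  f0,
  o.index AS encoded_index,
  o.value AS encoded_value
FROM encoded, UNNEST(output) AS o
ORDER BY f0, encoded_index;
```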

Example

The following example performs multi-hot encoding on a set of string array expressions. It limits the encoding vocabulary to the three categories that occur the most frequently in the data and that also occur one or more times.

SELECT f[OFFSET(0)] AS f0, ML.MULTI_HOT_ENCODER(f, 3, 1) OVER () AS output
FROM
  (
    SELECT ['a', 'b', 'b', 'c', NULL] AS f
    UNION ALL
    SELECT ['c', 'c', 'd', 'd', NULL] AS f
  )
ORDER BY f[OFFSET(0)];

The output looks similar to the following:

+------+-----------------------------+
|  f0  | output.index | output.value |
+------+--------------+--------------+
|  a   |  1           |  1.0         |
|      |  2           |  1.0         |
|      |  3           |  1.0         |
|      |  0           |  1.0         |
|  c   |  3           |  1.0         |
|      |  0           |  1.0         |
+------+-----------------------------+
