The ML.LABEL_ENCODER function

This document describes the ML.LABEL_ENCODER function, which you can use to encode a string expression to an INT64 value in [0, <number of categories>].

The encoding vocabulary is sorted alphabetically, and each category in the vocabulary is encoded to its position in that order, starting at 1. NULL values and categories that aren't in the vocabulary are encoded to 0.
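As a rough illustration of these rules, the following Python sketch emulates the mapping outside of BigQuery. It is not BigQuery's implementation: the helper name `encode_with_vocabulary` and the sample values are hypothetical, and Python's default string ordering is assumed to stand in for the alphabetical sort.

```python
from typing import Dict, Iterable, List, Optional


def encode_with_vocabulary(
    values: Iterable[Optional[str]], vocabulary: Iterable[str]
) -> List[int]:
    """Map each value to its 1-based position in the sorted vocabulary."""
    # Codes start at 1 because 0 is reserved for NULL (None) values and
    # for categories that are not in the vocabulary.
    # ASSUMPTION: Python's default string ordering approximates the
    # alphabetical sort described in the text.
    index: Dict[str, int] = {
        category: code
        for code, category in enumerate(sorted(set(vocabulary)), start=1)
    }
    return [0 if value is None else index.get(value, 0) for value in values]


# 'kiwi' is not in the vocabulary and None stands in for NULL,
# so both are encoded to 0.
codes = encode_with_vocabulary(["pear", None, "apple", "kiwi"], ["pear", "apple"])
print(codes)  # [2, 0, 1, 0]
```

In this sketch the vocabulary is passed in explicitly; how the vocabulary itself is chosen depends on the top_k and frequency_threshold arguments.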

When you use ML.LABEL_ENCODER in the TRANSFORM clause of a CREATE MODEL statement, the vocabulary calculated during training, along with the top_k and frequency_threshold values that you specified, is automatically applied during prediction.

Syntax

ML.LABEL_ENCODER(string_expression [, top_k] [, frequency_threshold]) OVER()

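ML.LABEL_ENCODER takes the following arguments:

string_expression: the STRING expression to encode.

top_k: an optional INT64 value that limits the number of categories in the encoding vocabulary. The function keeps the top_k most frequent categories in the data; all other categories are encoded to 0.

frequency_threshold: an optional INT64 value that limits the encoding vocabulary by category frequency. The function keeps categories that occur at least frequency_threshold times; less frequent categories are encoded to 0.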

Output

ML.LABEL_ENCODER returns an INT64 value that represents the encoded string expression.

Example

The following example performs label encoding on a set of string expressions. It limits the encoding vocabulary to the two categories that occur the most frequently in the data and that also occur two or more times.

SELECT f, ML.LABEL_ENCODER(f, 2, 2) OVER () AS output
FROM UNNEST([NULL, 'a', 'b', 'b', 'c', 'c', 'c', 'd', 'd']) AS f
ORDER BY f;

The output looks similar to the following:

+------+--------+
|  f   | output |
+------+--------+
| NULL |      0 |
| a    |      0 |
| b    |      1 |
| b    |      1 |
| c    |      2 |
| c    |      2 |
| c    |      2 |
| d    |      0 |
| d    |      0 |
+------+--------+
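To make the effect of the two limits concrete, the following Python sketch reproduces the output above by hand. It is an emulation under stated assumptions, not BigQuery's implementation: the function name `label_encode` is hypothetical, and ties between equally frequent categories (here `b` and `d`) are assumed to be broken alphabetically, which is the choice that matches the documented output.

```python
from collections import Counter
from typing import Dict, List, Optional, Sequence


def label_encode(
    values: Sequence[Optional[str]], top_k: int, frequency_threshold: int
) -> List[int]:
    """Emulate label encoding with top_k and frequency_threshold limits."""
    # Count category occurrences; None stands in for SQL NULL and is skipped.
    counts = Counter(v for v in values if v is not None)

    # Keep only categories that occur at least frequency_threshold times.
    candidates = [c for c, n in counts.items() if n >= frequency_threshold]

    # Keep the top_k most frequent of those.
    # ASSUMPTION: ties are broken alphabetically.
    candidates.sort(key=lambda c: (-counts[c], c))
    vocabulary = sorted(candidates[:top_k])

    # Codes start at 1; 0 is reserved for NULL and out-of-vocabulary values.
    index: Dict[str, int] = {c: i for i, c in enumerate(vocabulary, start=1)}
    return [0 if v is None else index.get(v, 0) for v in values]


data = [None, "a", "b", "b", "c", "c", "c", "d", "d"]
encoded = label_encode(data, top_k=2, frequency_threshold=2)
print(encoded)  # [0, 0, 1, 1, 2, 2, 2, 0, 0]
```

Here `a` fails the frequency threshold, `c`, `b`, and `d` pass it, and only `c` and `b` fit within the top two, so the vocabulary is `b` (encoded to 1) and `c` (encoded to 2).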
