The ML.DESCRIBE_DATA function

This document describes the ML.DESCRIBE_DATA function, which generates descriptive statistics for the columns in a table or subquery. For example, you can use it to compute statistics for a table of training or serving data that you plan to use with a machine learning (ML) model. You can use the function's output for purposes such as feature preprocessing or model monitoring.

Syntax

ML.DESCRIBE_DATA(
  { TABLE `project_id.dataset.table` | (query_statement) },
  STRUCT(
    [num_quantiles AS num_quantiles]
    [, num_array_length_quantiles AS num_array_length_quantiles]
    [, top_k AS top_k])
)

Arguments

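ML.DESCRIBE_DATA takes the input data, specified either as TABLE `project_id.dataset.table` or as a parenthesized query statement, followed by a STRUCT whose fields num_quantiles, num_array_length_quantiles, and top_k are each optional.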

Details

ML.DESCRIBE_DATA handles input columns as follows:

Output

ML.DESCRIBE_DATA returns one row for each column in the input data. ML.DESCRIBE_DATA output contains the following columns:

Example

The following example returns statistics for a table, calculating five quantiles for numeric columns and returning the top three values for non-numeric columns:

SELECT *
FROM ML.DESCRIBE_DATA(
  TABLE `myproject.mydataset.mytable`,
  STRUCT(5 AS num_quantiles, 3 AS top_k)
);
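
The syntax also accepts a query statement in place of a TABLE reference, along with the num_array_length_quantiles option. The following is a minimal sketch of that form, not a definitive recipe: the column names feature_a, feature_b, and label are hypothetical, and the option values are arbitrary choices for illustration.

```sql
-- Sketch only: feature_a, feature_b, and label are hypothetical column names,
-- and the option values below are arbitrary.
SELECT *
FROM ML.DESCRIBE_DATA(
  (
    -- A parenthesized query statement replaces the TABLE reference.
    SELECT feature_a, feature_b, label
    FROM `myproject.mydataset.mytable`
    WHERE label IS NOT NULL
  ),
  STRUCT(
    4 AS num_quantiles,
    10 AS num_array_length_quantiles,
    5 AS top_k)
);
```

Passing a query lets you restrict the statistics to a subset of columns or rows without first materializing a separate table.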

Limitations

Input data for the ML.DESCRIBE_DATA function can only contain columns of the following data types: