The ML.TFDV_DESCRIBE function

This document describes the ML.TFDV_DESCRIBE function, which you can use to generate fine-grained statistics for the columns in a table. For example, you might want to know statistics for a table of training or serving data that you plan to use with a machine learning (ML) model. Calling this function provides the same behavior as calling the TensorFlow tfdv.generate_statistics_from_csv API. You can use the data output by this function for such purposes as feature preprocessing or model monitoring.

Syntax

ML.TFDV_DESCRIBE(
  { TABLE `project_id.dataset.table` | (query_statement) },
  STRUCT(
    [num_histogram_buckets AS num_histogram_buckets]
    [, num_quantiles_histogram_buckets AS num_quantiles_histogram_buckets]
    [, num_values_histogram_buckets AS num_values_histogram_buckets]
    [, num_rank_histogram_buckets AS num_rank_histogram_buckets])
)

Arguments

ML.TFDV_DESCRIBE takes the following arguments:

Output

ML.TFDV_DESCRIBE returns a column named dataset_feature_statistics_list that contains a TensorFlow DatasetFeatureStatisticsList protocol buffer in JSON format.
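
As a sketch of how the JSON output might be consumed, the following query splits the result into one row per feature. It assumes that the JSON follows the field layout of the TensorFlow Metadata DatasetFeatureStatisticsList message (a datasets array whose entries each hold a features array). Check the JSON path against your actual output before relying on it. The aliases stats and feature are illustrative names, and the bucket count is an arbitrary value.

```sql
-- Assumption: the JSON path '$.datasets[0].features' matches the layout of
-- the DatasetFeatureStatisticsList protocol buffer in the function output.
SELECT
  feature  -- one JSON object of statistics per input column
FROM
  ML.TFDV_DESCRIBE(
    TABLE `bigquery-public-data.ml_datasets.penguins`,
    STRUCT(10 AS num_histogram_buckets)
  ) AS stats,
  UNNEST(
    JSON_QUERY_ARRAY(
      stats.dataset_feature_statistics_list,
      '$.datasets[0].features'
    )
  ) AS feature;
```

Keeping each feature as a whole JSON object avoids depending on individual statistic field names, which you can extract afterward with JSON_VALUE or JSON_QUERY once you have confirmed them in the output.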

Example

The following example returns statistics for the penguins public dataset and uses 20 buckets for rank histograms for string values:

SELECT * FROM ML.TFDV_DESCRIBE(
  TABLE `bigquery-public-data.ml_datasets.penguins`,
  STRUCT(20 AS num_rank_histogram_buckets)
);
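
The syntax also accepts a query in place of a table. The following sketch, which is illustrative rather than part of the reference, limits the statistics to two columns of the same public table and sets two of the bucket options. The column selection and the bucket counts are arbitrary choices for demonstration.

```sql
-- Query-statement form: compute statistics for only two columns.
SELECT *
FROM ML.TFDV_DESCRIBE(
  (
    SELECT species, body_mass_g
    FROM `bigquery-public-data.ml_datasets.penguins`
  ),
  STRUCT(
    5 AS num_histogram_buckets,       -- applies to the numeric column
    10 AS num_rank_histogram_buckets  -- applies to the string column
  )
);
```

Passing a query is a convenient way to filter rows or drop columns before the statistics are computed.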

Limitations

Input data for the ML.TFDV_DESCRIBE function can only contain columns of the following data types: