The ML.VALIDATE_DATA_DRIFT function

This document describes the ML.VALIDATE_DATA_DRIFT function, which you can use to compute the data drift between two sets of serving data. This function computes and compares the statistics for the two data sets, and then identifies anomalous differences between them. For example, you might want to compare the current serving data to historical serving data from a table snapshot, or to the features served at a particular point in time, which you can get by using the ML.FEATURES_AT_TIME function. You can use the output of this function for model monitoring.

Syntax

ML.VALIDATE_DATA_DRIFT(
  { TABLE `project_id.dataset.base_table` | (base_query_statement) },
  { TABLE `project_id.dataset.study_table` | (study_query_statement) },
  STRUCT(
    [num_histogram_buckets AS num_histogram_buckets]
    [, num_quantiles_histogram_buckets AS num_quantiles_histogram_buckets]
    [, num_values_histogram_buckets AS num_values_histogram_buckets]
    [, num_rank_histogram_buckets AS num_rank_histogram_buckets]
    [, categorical_default_threshold AS categorical_default_threshold]
    [, categorical_metric_type AS categorical_metric_type]
    [, numerical_default_threshold AS numerical_default_threshold]
    [, numerical_metric_type AS numerical_metric_type]
    [, thresholds AS thresholds])
)

Arguments

ML.VALIDATE_DATA_DRIFT takes the following arguments:

Output

ML.VALIDATE_DATA_DRIFT returns one row for each column in the input data. ML.VALIDATE_DATA_DRIFT output contains the following columns:

Example

The following example computes data drift between a snapshot of the serving data table and the current serving data table, with a categorical feature threshold of 0.2:

SELECT *
FROM ML.VALIDATE_DATA_DRIFT(
  TABLE `myproject.mydataset.previous_serving_data`,
  TABLE `myproject.mydataset.serving`,
  STRUCT(0.2 AS categorical_default_threshold)
);
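
As the syntax shows, the base and study inputs can also be query statements instead of tables. The following is a minimal sketch, not a definitive recipe: it compares serving rows from the last seven days against older rows from the same table and sets both default thresholds. The `serving_timestamp` column and the numerical threshold of 0.3 are hypothetical and chosen only for illustration.

```sql
SELECT *
FROM ML.VALIDATE_DATA_DRIFT(
  -- Base data: rows older than seven days.
  -- serving_timestamp is a hypothetical column, excluded from the comparison.
  (
    SELECT * EXCEPT (serving_timestamp)
    FROM `myproject.mydataset.serving`
    WHERE serving_timestamp < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  ),
  -- Study data: rows from the last seven days.
  (
    SELECT * EXCEPT (serving_timestamp)
    FROM `myproject.mydataset.serving`
    WHERE serving_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  ),
  STRUCT(
    0.2 AS categorical_default_threshold,
    -- Illustrative value only, not a recommended setting.
    0.3 AS numerical_default_threshold)
);
```

Excluding the timestamp column keeps the split key itself from being reported as a drifting feature, because its values differ between the two windows by construction.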

Limitations

ML.VALIDATE_DATA_DRIFT doesn't conduct schema validation between the two sets of input data, and so handles type mismatch as follows:

However, when you run inference on the serving data, the ML.PREDICT function handles schema validation.