The ML.VALIDATE_DATA_SKEW function

This document describes the ML.VALIDATE_DATA_SKEW function, which you can use to compute the data skew between a model's training and serving data. This function computes the statistics for the serving data, compares them to the statistics that were computed for the training data at the time the model was created, and identifies where there are anomalous differences between the two data sets. You can use the data output by this function for model monitoring.

Statistics are only computed for feature columns in the serving data that match feature columns in the training data, in order to achieve better performance and lower cost. For models that were created with use of the TRANSFORM clause, the statistics are based on the raw feature data before feature preprocessing within the TRANSFORM clause.

Syntax

ML.VALIDATE_DATA_SKEW(
  MODEL `project_id.dataset.model`,
  { TABLE `project_id.dataset.table` | (query_statement) },
  STRUCT(
    [categorical_default_threshold AS categorical_default_threshold]
    [, categorical_metric_type AS categorical_metric_type]
    [, numerical_default_threshold AS numerical_default_threshold]
    [, numerical_metric_type AS numerical_metric_type]
    [, thresholds AS thresholds])
)

Arguments

ML.VALIDATE_DATA_SKEW takes the following arguments:

Output

ML.VALIDATE_DATA_SKEW returns one row for each column in the input data. ML.VALIDATE_DATA_SKEW output contains the following columns:

Examples

The following examples demonstrate how to use the ML.VALIDATE_DATA_SKEW function.

Run ML.VALIDATE_DATA_SKEW

The following example computes data skew between the serving data and the training data used to create the model, with a categorical feature threshold of 0.2:

SELECT *
FROM ML.VALIDATE_DATA_SKEW(
  MODEL `myproject.mydataset.mymodel`,
  TABLE `myproject.mydataset.serving`,
  STRUCT(0.2 AS categorical_default_threshold)
);

Automate skew detection

The following example shows how to automate skew detection for a linear regression model:

DECLARE anomalies ARRAY<STRING>;

SET anomalies = (
  SELECT ARRAY_AGG(input)
  FROM
    ML.VALIDATE_DATA_SKEW(
      MODEL mydataset.model_linear_reg,
      TABLE mydataset.serving,
      STRUCT(
        0.3 AS categorical_default_threshold,
        0.2 AS numerical_default_threshold,
        'JENSEN_SHANNON_DIVERGENCE' AS numerical_metric_type,
        [STRUCT('fare', 0.15), STRUCT('company', 0.25)] AS thresholds))
  WHERE is_anomaly
);

IF(ARRAY_LENGTH(anomalies) > 0)
  THEN
    CREATE OR REPLACE MODEL mydataset.model_linear_reg
    TRANSFORM(
      ML.MIN_MAX_SCALER(fare) OVER() AS f1,
      ML.ROBUST_SCALER(pickup_longitude) OVER() AS f2,
      ML.LABEL_ENCODER(company) OVER() AS f3,
      ML.ONE_HOT_ENCODER(payment_type) OVER() AS f4,
      label
    )
    OPTIONS (
      model_type = 'linear_reg',
      max_iterations = 1)
    AS (
      SELECT
        fare,
        pickup_longitude,
        company,
        payment_type,
        2 AS label
      FROM mydataset.new_training_data
    );

    SELECT
      ERROR(
        CONCAT(
          "Found data skew in features: ",
          ARRAY_TO_STRING(anomalies, ", "),
          ". Model is retrained with the latest data."));
  ELSE
    SELECT *
    FROM
      ML.PREDICT(
        MODEL mydataset.model_linear_reg,
        TABLE mydataset.serving);
END IF;

Limitations