The ML.EVALUATE function

This document describes the ML.EVALUATE function, which you can use to evaluate trained models by computing the evaluation metrics that apply to each model type.

Syntax

# Remote models over Vertex AI LLMs:
ML.EVALUATE(
  MODEL `project_id.dataset.model`,
  { TABLE `project_id.dataset.table` | (query_statement) },
  STRUCT(
    [task_type AS task_type]
    [, max_output_tokens AS max_output_tokens]
    [, temperature AS temperature]
    [, top_k AS top_k]
    [, top_p AS top_p])
)

# ARIMA_PLUS and ARIMA_PLUS_XREG models:
ML.EVALUATE(
  MODEL `project_id.dataset.model`
  [, { TABLE `project_id.dataset.table` | (query_statement) }]
  [, STRUCT(
      [threshold_value AS threshold]
      [, perform_aggregation AS perform_aggregation]
      [, horizon_value AS horizon]
      [, confidence_level AS confidence_level]
      [, trial_id AS trial_id])]
)

# All other types of models:
ML.EVALUATE(
  MODEL `project_id.dataset.model`
  [, { TABLE `project_id.dataset.table` | (query_statement) }]
  [, STRUCT(
      [threshold_value AS threshold]
      [, trial_id AS trial_id])]
)

Arguments

ML.EVALUATE takes the following arguments:

MODEL: the name of the model to evaluate, in the form `project_id.dataset.model`.

TABLE or query_statement: the table or query that provides the evaluation data. The column names and types of the input data must be compatible with those of the data that was used to train the model. If you don't provide input data, the function uses the evaluation metrics that were calculated during model creation.

threshold: a FLOAT64 value that specifies a custom positive-class threshold for binary classification models. The default value is 0.5.

perform_aggregation: a BOOL value that indicates whether to evaluate forecasting accuracy aggregated across the whole time series (TRUE) or separately for each forecasted timestamp (FALSE). The default value is TRUE.

horizon: an INT64 value that specifies the number of forecasted time points to evaluate.

confidence_level: a FLOAT64 value that specifies the confidence level to use when computing prediction intervals. The default value is 0.95.

trial_id: an INT64 value that identifies the hyperparameter tuning trial to evaluate. Only applies to models trained with hyperparameter tuning.

task_type: a STRING value that specifies the LLM task to evaluate. Valid values are TEXT_GENERATION, CLASSIFICATION, SUMMARIZATION, and QUESTION_ANSWERING. Only applies to remote models over Vertex AI LLMs.

max_output_tokens, temperature, top_k, top_p: optional values that control how the LLM generates responses during evaluation. Only apply to remote models over Vertex AI LLMs.

Output

ML.EVALUATE returns a single row of metrics applicable to the type of model specified.

For models that return them, the precision, recall, f1_score, log_loss, and roc_auc metrics are macro-averaged for all of the class labels. For a macro-average, metrics are calculated for each label and then an unweighted average is taken of those values.

For models that return the accuracy metric, accuracy is computed as a global total, or micro-average. For a micro-average, the metric is calculated globally by counting the total number of correctly predicted rows out of the total number of rows.
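To make the difference concrete, the following standalone GoogleSQL sketch computes both kinds of average over a small hypothetical set of true and predicted labels. The predictions data here is invented for illustration and is unrelated to ML.EVALUATE output:

WITH predictions AS (
  # Hypothetical per-row true labels and predicted labels.
  SELECT 'a' AS label, 'a' AS predicted UNION ALL
  SELECT 'a', 'b' UNION ALL
  SELECT 'b', 'b' UNION ALL
  SELECT 'b', 'b'
),
per_label AS (
  # Per-label precision: correct predictions of a label
  # divided by all predictions of that label.
  SELECT
    predicted,
    COUNTIF(label = predicted) / COUNT(*) AS label_precision
  FROM predictions
  GROUP BY predicted
)
SELECT
  # Macro-average: unweighted mean of the per-label values.
  (SELECT AVG(label_precision) FROM per_label) AS macro_precision,
  # Micro-average: correctly predicted rows out of all rows.
  (SELECT COUNTIF(label = predicted) / COUNT(*) FROM predictions) AS micro_accuracy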

Regression models

Regression models include the following:

ML.EVALUATE returns the following columns for regression models:
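For example, assuming a regression model named `mydataset.my_linear_model` trained with a label column named label (the model, table, and column names here are hypothetical), a query like the following returns the regression metrics, such as mean_absolute_error, mean_squared_error, and r2_score:

SELECT
  *
FROM
  ML.EVALUATE(MODEL `mydataset.my_linear_model`,
    (
    SELECT
      label,
      feature1,
      feature2
    FROM
      `mydataset.mytable`))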

Classification models

Classification models include the following:

ML.EVALUATE returns the following columns for classification models:
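For example, assuming a binary classification model named `mydataset.my_classifier` (hypothetical names), the following query returns the classification metrics computed at the default threshold of 0.5. To evaluate at a different threshold, pass a STRUCT with a threshold value, as shown in the examples later in this document:

SELECT
  *
FROM
  ML.EVALUATE(MODEL `mydataset.my_classifier`,
    (
    SELECT
      label,
      feature1,
      feature2
    FROM
      `mydataset.mytable`))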

K-means models

ML.EVALUATE returns the following columns for k-means models:
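Because k-means models are unsupervised, the input data doesn't need a label column, and you can omit input data entirely to get the metrics that were calculated during model creation. A minimal sketch, assuming a hypothetical model named `mydataset.my_kmeans_model`:

SELECT
  *
FROM
  ML.EVALUATE(MODEL `mydataset.my_kmeans_model`)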

Matrix factorization models

ML.EVALUATE returns the following columns for matrix factorization models with implicit feedback:

ML.EVALUATE returns the following columns for matrix factorization models with explicit feedback:

PCA models

ML.EVALUATE returns the following column for PCA models:

Time series models

ML.EVALUATE returns the following columns for ARIMA_PLUS or ARIMA_PLUS_XREG models when input data is provided and perform_aggregation is FALSE:

ML.EVALUATE returns the following columns for ARIMA_PLUS or ARIMA_PLUS_XREG models when input data is provided and perform_aggregation is TRUE:

ML.EVALUATE returns the following columns for an ARIMA_PLUS model when input data isn't provided:

Autoencoder models

ML.EVALUATE returns the following columns for autoencoder models:

Remote models over Vertex AI endpoints

ML.EVALUATE returns the following column:

Remote models over Vertex AI LLMs

ML.EVALUATE returns different columns for remote models over Vertex AI LLMs, depending on the task_type value that you specify.

When you specify the TEXT_GENERATION task type, the following columns are returned:

When you specify the CLASSIFICATION task type, the following columns are returned:

When you specify the SUMMARIZATION task type, the following columns are returned:

When you specify the QUESTION_ANSWERING task type, the following columns are returned:
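For example, the following sketch (the model, table, and column names are hypothetical) evaluates the TEXT_GENERATION task type and passes optional inference parameters; the lowercase task type value follows the same convention as the classification example later in this document:

SELECT
  *
FROM
  ML.EVALUATE(MODEL `mydataset.my_llm`,
    (
    SELECT
      prompt,
      label
    FROM
      `mydataset.mytable`),
    STRUCT('text_generation' AS task_type, 128 AS max_output_tokens,
      0.2 AS temperature))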

Limitations

ML.EVALUATE is subject to the following limitations:

Costs

When used with remote models over Vertex AI LLMs, ML.EVALUATE costs are calculated based on the following:

Examples

The following examples show how to use ML.EVALUATE.

ML.EVALUATE with no input data specified

The following query evaluates a model with no input data specified. In this case, ML.EVALUATE returns the evaluation metrics that were calculated during model creation:

SELECT
  *
FROM
  ML.EVALUATE(MODEL `mydataset.mymodel`)

ML.EVALUATE with a custom threshold and input data

The following query evaluates a model with input data and a custom threshold of 0.55:

SELECT
  *
FROM
  ML.EVALUATE(MODEL `mydataset.mymodel`,
    (
    SELECT
      custom_label,
      column1,
      column2
    FROM
      `mydataset.mytable`),
    STRUCT(0.55 AS threshold))

ML.EVALUATE to calculate forecasting accuracy of a time series

The following query evaluates forecasting accuracy for a time series model, aggregated over a 30-point forecast horizon:

SELECT
  *
FROM
  ML.EVALUATE(MODEL `mydataset.my_arima_model`,
    (
    SELECT
      timeseries_date,
      timeseries_metric
    FROM
      `mydataset.mytable`),
    STRUCT(TRUE AS perform_aggregation, 30 AS horizon))

ML.EVALUATE to calculate ARIMA_PLUS forecasting accuracy for each forecasted timestamp

The following query evaluates the forecasting accuracy for each of the 30 forecasted points of a time series model. It also computes the prediction interval based on a confidence level of 0.9.

SELECT
  *
FROM
  ML.EVALUATE(MODEL `mydataset.my_arima_model`,
    (
    SELECT
      timeseries_date,
      timeseries_metric
    FROM
      `mydataset.mytable`),
    STRUCT(FALSE AS perform_aggregation, 0.9 AS confidence_level,
    30 AS horizon))

ML.EVALUATE to calculate ARIMA_PLUS_XREG forecasting accuracy for each forecasted timestamp

The following query evaluates the forecasting accuracy for each of the 30 forecasted points of a time series model. It also computes the prediction interval based on a confidence level of 0.9. Note that you must include the model's side features in the evaluation data.

SELECT
  *
FROM
  ML.EVALUATE(MODEL `mydataset.my_arima_xreg_model`,
    (
    SELECT
      timeseries_date,
      timeseries_metric,
      feature1,
      feature2
    FROM
      `mydataset.mytable`),
    STRUCT(FALSE AS perform_aggregation, 0.9 AS confidence_level,
    30 AS horizon))

ML.EVALUATE to calculate LLM text generation accuracy

The following query evaluates the accuracy of LLM-generated text for the classification task type, computing metrics for each label in the evaluation table:

SELECT
  *
FROM
  ML.EVALUATE(MODEL `mydataset.my_llm`,
    (
    SELECT
      prompt,
      label
    FROM
      `mydataset.mytable`),
    STRUCT('classification' AS task_type))
