The ML.DETECT_ANOMALIES function

This document describes the ML.DETECT_ANOMALIES function, which lets you perform anomaly detection in BigQuery ML.

For time series data, ML.DETECT_ANOMALIES runs using ARIMA_PLUS and ARIMA_PLUS_XREG (preview) models. For independent and identically distributed random variables (IID) data, ML.DETECT_ANOMALIES runs using autoencoder, k-means, or principal component analysis (PCA) models.

Syntax

# ARIMA_PLUS and ARIMA_PLUS_XREG models:
ML.DETECT_ANOMALIES(
  MODEL `project_id.dataset.model`
  [, STRUCT(anomaly_prob_threshold AS anomaly_prob_threshold)]
  [, { TABLE `project_id.dataset.table` | (query_statement) }]
)

#Autoencoder, k-means, or PCA models:
ML.DETECT_ANOMALIES(
  MODEL `project_id.dataset.model`,
  STRUCT(contamination AS contamination),
  { TABLE `project_id.dataset.table` | (query_statement) }
)

Arguments

ML.DETECT_ANOMALIES takes the following arguments:

Input

The input requirements for the ML.DETECT_ANOMALIES function depend upon the input model type.

Time series model input

Anomaly detection with ARIMA_PLUS and ARIMA_PLUS_XREG models has the following requirements:

For a list of supported data types for the TIME_SERIES_TIMESTAMP_COL and TIME_SERIES_DATA_COL columns, see Supported data types for time series model inputs.

Autoencoder, k-means, or PCA model input

Anomaly detection with autoencoder, k-means, or PCA models has the following requirements:

For information about how BigQuery ML handles NULL values in the feature column of the input data, see Imputation.

Output

ML.DETECT_ANOMALIES always returns the is_anomaly column that contains the anomaly detection results. Other output columns differ based upon the input model type and input data table.

Time series model output

ARIMA_PLUS and ARIMA_PLUS_XREG model output includes the following columns, followed by the input table columns, if present. Output can include the following:

ML.DETECT_ANOMALIES output for time series models has the following properties:

Autoencoder and PCA model output

Autoencoder and PCA model output includes the following columns, followed by the input table columns:

K-means model output

K-means model output includes the following, followed by the input table columns:

Examples

The following examples show how to use ML.DETECT_ANOMALIES with different input models and settings.

ARIMA_PLUS model without specified settings

The following example detects anomalies using an ARIMA_PLUS model that has the DECOMPOSE_TIME_SERIES training option set to its default value of TRUE, without specifying the anomaly_prob_threshold argument.

SELECT
  *
FROM
  ML.DETECT_ANOMALIES(MODEL `mydataset.my_arima_plus_model`)

If the time series input column names are ts_timestamp and ts_data, then this query returns results similar to the following:

+-------------------------+----------+------------+-------------+-------------+---------------------+
|      ts_timestamp       | ts_data  | is_anomaly | lower_bound | upper_bound | anomaly_probability |
+-------------------------+----------+------------+-------------+-------------+---------------------+
| 2021-01-01 00:00:01 UTC |  125.3   |   FALSE    |  123.5      |  139.1      |  0.93               |
| 2021-01-02 00:00:01 UTC |  145.3   |   TRUE     |  128.5      |  143.1      |  0.96               |
+-------------------------+----------+------------+-------------+-------------+---------------------+

ARIMA_PLUS model with a custom anomaly_prob_threshold value

The following example detects anomalies using an ARIMA_PLUS model that has the DECOMPOSE_TIME_SERIES training option set to its default value of TRUE, using a custom anomaly_prob_threshold value of 0.8:

SELECT
  *
FROM
  ML.DETECT_ANOMALIES(MODEL `mydataset.my_arima_plus_model`,
    STRUCT(0.8 AS anomaly_prob_threshold))

If the time series input column names are ts_timestamp and ts_data, then this query returns results similar to the following:

+-------------------------+----------+------------+-------------+-------------+---------------------+
|      ts_timestamp       | ts_data  | is_anomaly | lower_bound | upper_bound | anomaly_probability |
+-------------------------+----------+------------+-------------+-------------+---------------------+
| 2021-01-01 00:00:01 UTC |  125.3   |    TRUE    |  129.5      |  133.6      |  0.93               |
| 2021-01-02 00:00:01 UTC |  145.3   |    TRUE    |  131.5      |  136.6      |  0.96               |
+-------------------------+----------+------------+-------------+-------------+---------------------+

ARIMA_PLUS model with input data as a query statement

The following example detects anomalies using an ARIMA_PLUS model, using a custom anomaly_prob_threshold value of 0.9 and passing an input data table into the query:

SELECT
  *
FROM
  ML.DETECT_ANOMALIES(MODEL `mydataset.my_arima_plus_model`,
    STRUCT(0.9 AS anomaly_prob_threshold),
    (
      SELECT
        state, city, date, temperature, weather
      FROM
        `mydataset.my_time_series_data_table`))

This example uses the following column values:

This example returns results similar to the following:

+-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+
| state |   city     |           date          | temperature | is_anomaly | lower_bound | upper_bound | anomaly_probability | weather |
+-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+
| "WA"  | "Kirkland" | 2021-01-01 00:00:00 UTC |   38.1      |   FALSE    |     36.4    |    42.0     |        0.8293       | "sunny" |
| "WA"  | "Kirkland" | 2021-01-02 00:00:00 UTC |   37.1      |   TRUE     |     37.4    |    43.3     |        0.9124       | "rainy" |
+-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+

ARIMA_PLUS model with input data as a table

The following example detects anomalies using an ARIMA_PLUS model, using a custom anomaly_prob_threshold value of 0.9 and passing an input data table into the query:

SELECT
  *
FROM
  ML.DETECT_ANOMALIES(MODEL `mydataset.my_arima_plus_model`,
    STRUCT(0.9 AS anomaly_prob_threshold),
    TABLE `mydataset.my_time_series_data_table`)

If the TIME_SERIES_ID_COL column names are state, city, and TIME_SERIES_TIMESTAMP_COL, and the TIME_SERIES_DATA_COL column names are date and temperature, and one additional column weather is in the input data table, then this query returns results similar to the following:

+-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+
| state |   city     |           date          | temperature | is_anomaly | lower_bound | upper_bound | anomaly_probability | weather |
+-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+
| "WA"  | "Kirkland" | 2021-01-01 00:00:00 UTC |   38.1      |   FALSE    |     36.4    |    42.0     |        0.8293       | "sunny" |
| "WA"  | "Kirkland" | 2021-01-02 00:00:00 UTC |   37.1      |   TRUE     |     37.4    |    43.3     |        0.9124       | "rainy" |
+-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+

Autoencoder model

The following example detects anomalies using an autoencoder model and a contamination value of 0.1.

SELECT
  *
FROM
  ML.DETECT_ANOMALIES(MODEL `mydataset.my_autoencoder_model`,
    STRUCT(0.1 AS contamination),
    TABLE `mydataset.mytable`)

If the feature column names are f1 and f2, then this query returns results similar to the following:

+------------+--------------------+---------+--------+
| is_anomaly | mean_squared_error |    f1   |   f2   |
+------------+--------------------+---------+--------+
|   FALSE    |     0.63456        |   120   |  "a"   |
|   TRUE     |     11.342         |  15000  |  "b"   |
+------------+--------------------+---------+--------+

K-means model

The following example detects anomalies using a k-means model and a contamination value of 0.2.

SELECT
  *
FROM
  ML.DETECT_ANOMALIES(MODEL `mydataset.my_kmeans_model`,
    STRUCT(0.2 AS contamination),
    (
      SELECT
        f1,
        f2
      FROM
        `mydataset.mytable`))

This query returns results similar to the following:

+------------+---------------------+-------------+--------+--------+
| is_anomaly | normalized_distance | centroid_id |   f1   |   f2   |
+------------+---------------------+-------------+--------+--------+
|   FALSE    |     0.63456         |     1       |  120   |  "a"   |
|   TRUE     |     6.3243          |     2       | 15000  |  "b"   |
+------------+---------------------+-------------+--------+--------+

PCA model

The following example detects anomalies using a PCA model and a contamination value of 0.1.

SELECT
  *
FROM
  ML.DETECT_ANOMALIES(MODEL `mydataset.my_pca_model`,
    STRUCT(0.1 AS contamination),
    TABLE `mydataset.mytable`)

If the feature column names are f1, f2 and f3, then this query returns results similar to the following:

+------------+--------------------+---------+--------+------+
| is_anomaly | mean_squared_error |    f1   |   f2   |  f3  |
+------------+--------------------+---------+--------+------+
|   FALSE    |     0.63456        |   120   |  "a"   |  0.9 |
|   TRUE     |     11.342         |  15000  |  "b"   |  25  |
+------------+--------------------+---------+--------+------+

Pricing

All queries that use the ML.DETECT_ANOMALIES function are billable, regardless of the pricing model.

What's next