The CREATE MODEL statement for AutoML models

This document describes the CREATE MODEL statement for creating AutoML classification and regression models in BigQuery. AutoML lets you quickly build large-scale machine learning models on tabular data.

BigQuery ML uses the default values for AutoML training options, including data splitting and optimization functions.

For information about the supported SQL statements and functions for each model type, see End-to-end user journey for each model.

CREATE MODEL syntax

{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL}
model_name
OPTIONS(model_option_list)
AS query_statement

model_option_list:
MODEL_TYPE = { 'AUTOML_REGRESSOR' | 'AUTOML_CLASSIFIER' }
    [, BUDGET_HOURS = float64_value ]
    [, OPTIMIZATION_OBJECTIVE = string_value ]
    [, INPUT_LABEL_COLS = string_array ]
    [, DATA_SPLIT_COL = string_value ]

CREATE MODEL

Creates and trains a new model in the specified dataset. If the model name exists, CREATE MODEL returns an error.

CREATE MODEL IF NOT EXISTS

Creates and trains a new model only if the model doesn't exist in the specified dataset.

CREATE OR REPLACE MODEL

Creates and trains a model and replaces an existing model with the same name in the specified dataset.

model_name

The name of the model you're creating or replacing. The model name must be unique in the dataset: no other model or table can have the same name. The model name must follow the same naming rules as a BigQuery table. A model name can:

model_name is not case-sensitive.

If you don't have a default project configured, then you must prepend the project ID to the model name in the following format, including backticks:

`[PROJECT_ID].[DATASET].[MODEL]`

For example, `myproject.mydataset.mymodel`.

MODEL_TYPE

Syntax

MODEL_TYPE = { 'AUTOML_REGRESSOR' | 'AUTOML_CLASSIFIER' }

Description

Specifies the model type. This option is required.

Arguments

This option accepts the following values:

BUDGET_HOURS

Syntax

BUDGET_HOURS = float64_value

Description

Sets the training budget in hours for AutoML training.

After training a AutoML model, BigQuery ML compresses the model to ensure it is small enough to import, which can take up to 50% of the training time. The time to compress the model is not included in the training budget time.

Arguments

A FLOAT64 value between 1.0 and 72.0. The default value is 1.0.

OPTIMIZATION_OBJECTIVE

Syntax

OPTIMIZATION_OBJECTIVE = string_value | struct_value

Description

Sets the optimization objective function to use for AutoML training.

For more details on the optimization objective functions, see the AutoML documentation.

Arguments

This option can be specified as a STRING or STRUCT value.

This option accepts the following string values for optimization objective functions:

For example:

OPTIMIZATION_OBJECTIVE = 'MAXIMIZE_AU_ROC'

For binary classification models, you can alternatively specify a struct value for this option. The struct must contain a STRING value and a FLOAT64 value in one of the following combinations:

For example:

OPTIMIZATION_OBJECTIVE = STRUCT('MAXIMIZE_PRECISION_AT_RECALL', 0.3)

INPUT_LABEL_COLS

Syntax

INPUT_LABEL_COLS = string_array

Description

The name of the label column in the training data.

Arguments

A one-element ARRAY of string values. Defaults to label.

Supported data types for input_label_cols include the following:

Model type Supported label types
automl_regressor INT64
NUMERIC
BIGNUMERIC
FLOAT64
automl_classifier Any groupable data type

DATA_SPLIT_COL

Syntax

DATA_SPLIT_COL = string_value

Description

The name of the column to use to sort input data into the training, validation, or test set. Defaults to random splitting.

Arguments

The string value must be the name of one of the columns in the training data. This column must have either a timestamp or string data type. This column is passed directly to AutoML.

If you use a string column, rows are assigned to the appropriate dataset based on the column's value, which must be one of the following options:

For more information about how to use these values, see Manual split.

Timestamp columns are used to perform a chronological split.

Supported data types for input columns

For columns other than the label column, any groupable data type is supported. The BigQuery column type is used to determine the feature column type in AutoML.

BigQuery type AutoML type
INT64
NUMERIC
BIGNUMERIC
FLOAT64
NUMERIC or TIMESTAMP if AutoML determines that it is a UNIX timestamp
BOOL CATEGORICAL
STRING
BYTES
Either CATEGORICAL or TEXT, auto-selected by AutoML.
TIMESTAMP
DATETIME
TIME
DATE
Either TIMESTAMP, CATEGORICAL, or TEXT, auto-selected by AutoML.

To force a numeric column to be treated as categorical, use the CAST function to cast it to a BigQuery string. Arrays of supported types are allowed and remain arrays during AutoML training.

Limitations

AutoML models have the following limitations:

CREATE MODEL example

The following example creates a model named mymodel in mydataset in your default project. It uses the public nyc-tlc.yellow.trips taxi trip data available in BigQuery. The job takes approximately 3 hours to complete, including training, model compression, temporary data movement (to AutoML), and setup tasks.

Create the model:

CREATE OR REPLACE MODEL `project_id.mydataset.mymodel`
       OPTIONS(model_type='AUTOML_REGRESSOR',
               input_label_cols=['fare_amount'],
               budget_hours=1.0)
AS SELECT
  (tolls_amount + fare_amount) AS fare_amount,
  pickup_longitude,
  pickup_latitude,
  dropoff_longitude,
  dropoff_latitude,
  passenger_count
FROM `nyc-tlc.yellow.trips`
WHERE ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), 100000)) = 1
AND
  trip_distance > 0
  AND fare_amount >= 2.5 AND fare_amount <= 100.0
  AND pickup_longitude > -78
  AND pickup_longitude < -70
  AND dropoff_longitude > -78
  AND dropoff_longitude < -70
  AND pickup_latitude > 37
  AND pickup_latitude < 45
  AND dropoff_latitude > 37
  AND dropoff_latitude < 45
  AND passenger_count > 0

Run predictions:

SELECT * FROM ML.PREDICT(MODEL `project_id.mydataset.mymodel`, (
    SELECT * FROM `nyc-tlc.yellow.trips` LIMIT 100))

Supported regions

Training AutoML models is not supported in all BigQuery ML regions. For a complete list of supported regions and multi-regions, see the BigQuery ML locations.