Working with collation

GoogleSQL for BigQuery supports collation. You can learn more about collation in this topic.

About collation

Collation determines how strings are sorted and compared in collation-supported operations. If you would like to customize collation for a collation-supported operation, you must assign a collation specification to at least one string in the operation. Some operations can't use collation, but can pass collation through them.

Operations affected by collation

When an operation is affected by collation, this means that the operation takes into consideration collation during the operation. These query operations are affected by collation when sorting and comparing strings:

Operations
Collation-supported comparison operations
Join operations
ORDER BY
GROUP BY
WINDOW for window functions
Collation-supported scalar functions
Collation-supported aggregate functions
Set operations

Operations that propagate collation

Collation can pass through some query operations to other parts of a query. When collation passes through an operation in a query, this is known as propagation. During propagation:

GoogleSQL has several functions, operators, and expressions that can propagate collation.

In the following example, the 'und:ci' collation specification is propagated from the character column to the ORDER BY operation.

-- With collation
SELECT *
FROM UNNEST([
  COLLATE('B', 'und:ci'),
  'b',
  'a'
]) AS character
ORDER BY character

/*-----------*
 | character |
 +-----------+
 | a         |
 | B         |
 | b         |
 *-----------*/
-- Without collation
SELECT *
FROM UNNEST([
  'B',
  'b',
  'a'
]) AS character
ORDER BY character

/*-----------*
 | character |
 +-----------+
 | B         |
 | a         |
 | b         |
 *-----------*/

Functions

These functions let collation propagate through them:

Function Notes
AEAD.DECRYPT_STRING
ANY_VALUE
ARRAY_AGG Collation on input arguments are propagated as collation on the array element.
ARRAY_TO_STRING Collation on array elements are propagated to output.
COLLATE
CONCAT
FORMAT Collation from format_string to the returned string is propagated.
FORMAT_DATE Collation from format_string to the returned string is propagated.
FORMAT_DATETIME Collation from format_string to the returned string is propagated.
FORMAT_TIME Collation from format_string to the returned string is propagated.
FORMAT_TIMESTAMP Collation from format_string to the returned string is propagated.
GREATEST
LAG
LEAD
LEAST
LEFT
LOWER
LPAD
MAX
MIN
NET.HOST
NET.PUBLIC_SUFFIX
NET.REG_DOMAIN
NTH_VALUE
NORMALIZE
NORMALIZE_AND_CASEFOLD
REPEAT
REPLACE
REVERSE
RIGHT
RPAD
SOUNDEX
SPLIT Collation on input arguments are propagated as collation on the array element.
STRING_AGG
SUBSTR
UPPER

Operators

These operators let collation propagate through them:

Operator Notes
|| concatenation operator
Array subscript operator Propagated to output.
Set operators Collation of an output column is decided by the collations of input columns at the same position.
STRUCT field access operator When getting a STRUCT, collation on the STRUCT field is propagated as the output collation.
UNNEST Collation on the input array element is propagated to output.

Expressions

These expressions let collation propagate through them:

Expression Notes
ARRAY When you construct an ARRAY, collation on input arguments is propagated on the elements in the ARRAY.
CASE
CASE expr
COALESCE
IF
IFNULL
NULLIF
STRUCT When you construct a STRUCT, collation on input arguments is propagated on the fields in the STRUCT.

Additional features that support collation

These features in BigQuery generally support collation:

Feature Notes
Views
Materialized views This feature supports collation, but limitations apply
Table functions This feature supports collation, but limitations apply
BI engine

Where you can assign a collation specification

A collation specification can be assigned to these collation-supported types:

In addition:

In summary:

You can define a default collation specification for a dataset. For example:

CREATE SCHEMA (...)
DEFAULT COLLATE 'und:ci'

You can define a default collation specification for a table. For example:

CREATE TABLE (...)
DEFAULT COLLATE 'und:ci'

You can define a collation specification for a collation-supported column. For example:

CREATE TABLE (
  case_insensitive_column STRING COLLATE 'und:ci'
)

You can specify a collation specification for a collation-supported expression with the COLLATE function. For example:

SELECT COLLATE('a', 'und:ci') AS character

DDL statements

Location Support Notes
Dataset CREATE SCHEMA Create a dataset and optionally add a default collation specification to the dataset.
Dataset ALTER SCHEMA Updates the default collation specification for a dataset.
Table CREATE TABLE Create a table and optionally add a default collation specification to a table or a collation specification to a collation-supported type in a column.

You can't have collation on a column used with CLUSTERING.
Table ALTER TABLE Update the default collation specification for collation-supported type in a table.
Column ADD COLUMN Add a collation specification to a collation-supported type in a new column in an existing table.

Data types

Type Notes
STRING You can apply a collation specification directly to this data type.
STRUCT You can apply a collation specification to a STRING field in a STRUCT. A STRUCT can have STRING fields with different collation specifications. A STRUCT can only be used in comparisons with the following operators and conditional expressions: =, !=, IN, and CASE.
ARRAY You can apply a collation specification to a STRING element in an ARRAY. An ARRAY can have STRING elements with different collation specifications.

Functions, operators, and conditional expressions

Functions

Type Support Notes
Scalar COLLATE
Scalar ENDS_WITH
Scalar GREATEST
Scalar INSTR
Scalar LEAST
Scalar REPLACE
Scalar SPLIT
Scalar STARTS_WITH
Scalar STRPOS
Aggregate COUNT This operator is only affected by collation when the input includes the DISTINCT argument.
Aggregate MAX
Aggregate MIN

Operators

Support Notes
<
<=
>
>=
=
!=
[NOT] BETWEEN
[NOT] IN Limitations apply.
[NOT] LIKE Limitations apply.
Quantified [NOT] LIKE Limitations apply.

Conditional expressions

Support
CASE
CASE expr
NULLIF

The preceding collation-supported operations (functions, operators, and conditional expressions) can include input with explicitly defined collation specifications for collation-supported types. In a collation-supported operation:

For example:

-- Assume there's a table with this column declaration:
CREATE TABLE table_a
(
    col_a STRING COLLATE 'und:ci',
    col_b STRING COLLATE '',
    col_c STRING,
    col_d STRING COLLATE 'und:ci'
);

-- This runs. Column 'b' has a collation specification and the
-- column 'c' doesn't.
SELECT STARTS_WITH(col_b_expression, col_c_expression)
FROM table_a;

-- This runs. Column 'a' and 'd' have the same collation specification.
SELECT STARTS_WITH(col_a_expression, col_d_expression)
FROM table_a;

-- This runs. Even though column 'a' and 'b' have different
-- collation specifications, column 'b' is considered the default collation
-- because it's assigned to an empty collation specification.
SELECT STARTS_WITH(col_a_expression, col_b_expression)
FROM table_a;

-- This works. Even though column 'a' and 'b' have different
-- collation specifications, column 'b' is updated to use the same
-- collation specification as column 'a'.
SELECT STARTS_WITH(col_a_expression, COLLATE(col_b_expression, 'und:ci'))
FROM table_a;

-- This runs. Column 'c' doesn't have a collation specification, so it uses the
-- collation specification of column 'd'.
SELECT STARTS_WITH(col_c_expression, col_d_expression)
FROM table_a;

Collation specification details

A collation specification determines how strings are sorted and compared in collation-supported operations. You can define the Unicode collation specification, und:ci, for collation-supported types.

If a collation specification isn't defined, the default collation specification is used. To learn more, see the next section.

Default collation specification

When a collation specification isn't assigned or is empty, 'binary' collation is used. Binary collation indicates that the operation should return data in Unicode code point order. You can't set binary collation explicitly.

In general, the following behavior occurs when an empty string is included in collation:

Unicode collation specification

collation_specification:
  'language_tag:collation_attribute'

A unicode collation specification indicates that the operation should use the Unicode Collation Algorithm to sort and compare strings. The collation specification can be a STRING literal or a query parameter.

The language tag

The language tag determines how strings are generally sorted and compared. Allowed values for language_tag are:

The collation attribute

In addition to the language tag, the unicode collation specification must have a collation_attribute, which enables additional rules for sorting and comparing strings. Allowed values are:

Collation specification example

This is what the ci collation attribute looks like when used with the und language tag in the COLLATE function:

COLLATE('orange1', 'und:ci')

Caveats

Limitations

Limitations for supported features are captured in the previous sections, but here are a few general limitations to keep in mind: