Using BigQuery with Pandas

Retrieve BigQuery data as a Pandas DataFrame

As of version 0.29.0, you can call the to_dataframe() method to retrieve query results or table rows as a pandas.DataFrame.

First, ensure that the pandas library is installed by running:

pip install --upgrade pandas

Alternatively, you can install the BigQuery Python client library together with pandas by running:

pip install --upgrade google-cloud-bigquery[pandas]

To retrieve query results as a pandas.DataFrame:

from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT name, SUM(number) as count
    FROM `bigquery-public-data.usa_names.usa_1910_current`
    GROUP BY name
    ORDER BY count DESC
    LIMIT 10
"""

df = client.query(sql).to_dataframe()
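
If the row limit needs to vary, the query above can be wrapped in a small helper. The following is a minimal sketch, not part of the client library: the names top_names_sql and top_names_dataframe are hypothetical, and client is assumed to be a bigquery.Client.

```python
def top_names_sql(limit=10):
    """Build the example query with a configurable LIMIT."""
    limit = int(limit)  # Reject non-numeric input before it reaches the SQL text.
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    return f"""
        SELECT name, SUM(number) AS count
        FROM `bigquery-public-data.usa_names.usa_1910_current`
        GROUP BY name
        ORDER BY count DESC
        LIMIT {limit}
    """


def top_names_dataframe(client, limit=10):
    """Run the query and return the result as a pandas.DataFrame.

    Assumption: `client` is a google.cloud.bigquery.Client.
    """
    return client.query(top_names_sql(limit)).to_dataframe()
```

For example, top_names_dataframe(client, limit=25) would return the 25 most common names.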

To retrieve table rows as a pandas.DataFrame:

from google.cloud import bigquery

client = bigquery.Client()

dataset_ref = bigquery.DatasetReference("bigquery-public-data", "samples")
table_ref = dataset_ref.table("shakespeare")
table = client.get_table(table_ref)

df = client.list_rows(table).to_dataframe()
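
Converting a whole table downloads every row. To inspect only the first few rows, list_rows() accepts a max_results argument. The following is a minimal sketch, assuming client is a bigquery.Client; preview_table is a hypothetical helper name, not a library function.

```python
def preview_table(client, table_id, max_rows=10):
    """Return at most `max_rows` rows of a table as a pandas.DataFrame.

    Assumptions: `client` is a google.cloud.bigquery.Client and `table_id`
    is a fully qualified ID such as "bigquery-public-data.samples.shakespeare".
    """
    # max_results caps the number of rows fetched from the table.
    rows = client.list_rows(table_id, max_results=max_rows)
    return rows.to_dataframe()
```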

Load a Pandas DataFrame to a BigQuery Table

As of version 1.3.0, you can use the load_table_from_dataframe() method to load data from a pandas.DataFrame into a Table. This method requires the pyarrow library in addition to pandas. You can install the BigQuery Python client library together with pandas and pyarrow by running:

pip install --upgrade google-cloud-bigquery[pandas,pyarrow]
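
To confirm that the optional dependencies are importable before starting a load, a check along the following lines may help. This is a minimal sketch that uses only the Python standard library; missing_packages is a hypothetical helper name.

```python
import importlib.util


def missing_packages(names=("pandas", "pyarrow")):
    """Return the top-level package names that cannot be found for import."""
    # find_spec returns None when no top-level module of that name is installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing optional packages:", ", ".join(missing))
```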

The following example demonstrates how to create a pandas.DataFrame and load it into a new table:

from google.cloud import bigquery

import pandas

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
table_id = "your-project.your_dataset.your_table_name"

records = [
    {"title": "The Meaning of Life", "release_year": 1983},
    {"title": "Monty Python and the Holy Grail", "release_year": 1975},
    {"title": "Life of Brian", "release_year": 1979},
    {"title": "And Now for Something Completely Different", "release_year": 1971},
]
dataframe = pandas.DataFrame(
    records,
    # In the loaded table, the column order reflects the order of the
    # columns in the DataFrame.
    columns=["title", "release_year"],
    # Optionally, set a named index, which can also be written to the
    # BigQuery table.
    index=pandas.Index(
        ["Q24980", "Q25043", "Q24953", "Q16403"], name="wikidata_id"
    ),
)
job_config = bigquery.LoadJobConfig(
    # Specify a (partial) schema. All columns are always written to the
    # table. The schema is used to assist in data type definitions.
    schema=[
        # Specify the type of columns whose type cannot be auto-detected. For
        # example the "title" column uses pandas dtype "object", so its
        # data type is ambiguous.
        bigquery.SchemaField("title", bigquery.enums.SqlTypeNames.STRING),
        # Indexes are written if included in the schema by name.
        bigquery.SchemaField("wikidata_id", bigquery.enums.SqlTypeNames.STRING),
    ],
    # Optionally, set the write disposition. BigQuery appends loaded rows
    # to an existing table by default, but with WRITE_TRUNCATE write
    # disposition it replaces the table with the loaded data.
    write_disposition="WRITE_TRUNCATE",
)

job = client.load_table_from_dataframe(
    dataframe,
    table_id,
    job_config=job_config,
    location="US",  # Must match the destination dataset location.
)  # Make an API request.
job.result()  # Waits for the job to complete.

table = client.get_table(table_id)  # Make an API request.
print(
    "Loaded {} rows and {} columns to {}".format(
        table.num_rows, len(table.schema), table_id
    )
)
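
The load, wait, and verify steps above can be collected into a reusable helper, with the choice between appending and replacing made explicit. The following is a minimal sketch: write_disposition_for and load_and_summarize are hypothetical names, client is assumed to be a bigquery.Client, and job_config is assumed to be a bigquery.LoadJobConfig or None.

```python
def write_disposition_for(replace_existing):
    """Pick a write disposition string for a load job configuration.

    WRITE_TRUNCATE replaces the table contents; WRITE_APPEND adds rows.
    """
    return "WRITE_TRUNCATE" if replace_existing else "WRITE_APPEND"


def load_and_summarize(client, dataframe, table_id, job_config=None):
    """Load a DataFrame into a table and return (row count, column count)."""
    job = client.load_table_from_dataframe(
        dataframe, table_id, job_config=job_config
    )
    job.result()  # Wait for the load job to complete.
    table = client.get_table(table_id)  # Fetch the updated table metadata.
    return table.num_rows, len(table.schema)
```

The string returned by write_disposition_for() can be passed as the write_disposition argument when constructing the job configuration, in place of the literal "WRITE_TRUNCATE" used above.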