diff --git a/notebooks/dataframes/struct_and_array_dtypes.ipynb b/notebooks/dataframes/struct_and_array_dtypes.ipynb new file mode 100644 index 0000000000..3bcdaf40f7 --- /dev/null +++ b/notebooks/dataframes/struct_and_array_dtypes.ipynb @@ -0,0 +1,656 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "# Copyright 2023 Google LLC\n", + "#\n", + "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# A Guide to Array and Struct Data Types in BigQuery DataFrames" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Set up your environment\n", + "\n", + "To get started, follow the instructions in the notebooks within the `getting_started` folder to set up your environment. Once your environment is ready, you can import the necessary packages by running the following code:" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "import bigframes.pandas as bpd\n", + "import bigframes.bigquery as bbq\n", + "import pyarrow as pa" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "REGION = \"US\" # @param {type: \"string\"}\n", + "\n", + "bpd.options.display.progress_bar = None\n", + "bpd.options.bigquery.location = REGION" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Array Data Types\n", + "\n", + "In BigQuery, an [array](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#array_type) (also called a repeated column) is an ordered list of zero or more elements of the same data type. Arrays cannot contain other arrays or `NULL` elements.\n", + "\n", + "BigQuery DataFrames map BigQuery array types to `pandas.ArrowDtype(pa.list_())`. The following code examples illustrate how to work with array columns in BigQuery DataFrames." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create DataFrames with array columns\n", + "\n", + "Create a DataFrame in BigQuery DataFrames from local sample data. Use a list of lists to create a column with the `list[pyarrow]` dtype, which corresponds to the `ARRAY` type in BigQuery." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
NameScores
0Alice[95 88 92]
1Bob[78 81]
2Charlie[ 82 89 94 100]
\n", + "

3 rows × 2 columns

\n", + "
[3 rows x 2 columns in total]" + ], + "text/plain": [ + " Name Scores\n", + "0 Alice [95 88 92]\n", + "1 Bob [78 81]\n", + "2 Charlie [ 82 89 94 100]\n", + "\n", + "[3 rows x 2 columns]" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df = bpd.DataFrame({\n", + " 'Name': ['Alice', 'Bob', 'Charlie'],\n", + " 'Scores': [[95, 88, 92], [78, 81], [82, 89, 94, 100]],\n", + "})\n", + "df" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Name string[pyarrow]\n", + "Scores list[pyarrow]\n", + "dtype: object" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.dtypes" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Operate on array data\n", + "\n", + "While pandas offers vectorized operations and lambda expressions for array manipulation, BigQuery DataFrames leverages the computational power of BigQuery itself. You can access a variety of native BigQuery array operations, such as [`array_agg`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery#bigframes_bigquery_array_agg) and [`array_length`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery#bigframes_bigquery_array_length), through the [`bigframes.bigquery`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery) package (abbreviated as `bbq` in the following code samples)." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 3\n", + "1 2\n", + "2 4\n", + "Name: Scores, dtype: Int64" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Find the length in each array.\n", + "bbq.array_length(df['Scores'])" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 95\n", + "0 88\n", + "0 92\n", + "1 78\n", + "1 81\n", + "2 82\n", + "2 89\n", + "2 94\n", + "2 100\n", + "Name: Scores, dtype: Int64" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Transforms array elements into individual rows, preserving original order when in ordering\n", + "# mode. If an array has multiple elements, exploded rows are ordered by the element's index\n", + "# within its original array.\n", + "scores = df['Scores'].explode()\n", + "scores" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 100.0\n", + "0 93.0\n", + "0 97.0\n", + "1 83.0\n", + "1 86.0\n", + "2 87.0\n", + "2 94.0\n", + "2 99.0\n", + "2 105.0\n", + "Name: Scores, dtype: Float64" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Adjust the scores.\n", + "adj_scores = scores + 5.0\n", + "adj_scores" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 [100. 93. 97.]\n", + "1 [83. 86.]\n", + "2 [ 87. 94. 99. 
105.]\n", + "Name: Scores, dtype: list[pyarrow]" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Aggregate adjusted scores back into arrays.\n", + "adj_scores_arr = bbq.array_agg(adj_scores.groupby(level=0))\n", + "adj_scores_arr" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
NameScoresNewScores
0Alice[95 88 92][100. 93. 97.]
1Bob[78 81][83. 86.]
2Charlie[ 82 89 94 100][ 87. 94. 99. 105.]
\n", + "

3 rows × 3 columns

\n", + "
[3 rows x 3 columns in total]" + ], + "text/plain": [ + " Name Scores NewScores\n", + "0 Alice [95 88 92] [100. 93. 97.]\n", + "1 Bob [78 81] [83. 86.]\n", + "2 Charlie [ 82 89 94 100] [ 87. 94. 99. 105.]\n", + "\n", + "[3 rows x 3 columns]" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Add adjusted scores into the DataFrame. This operation requires an implicit join \n", + "# between the two tables, necessitating a unique index in the DataFrame (guaranteed \n", + "# in the default ordering and index mode).\n", + "df['NewScores'] = adj_scores_arr\n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Struct Data Types\n", + "\n", + "In BigQuery, a [struct](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#struct_type) (also known as a `record`) is a collection of ordered fields, each with a defined data type (required) and an optional field name. BigQuery DataFrames maps BigQuery struct types to the pandas equivalent, `pandas.ArrowDtype(pa.struct())`. This section provides practical code examples illustrating how to use struct columns with BigQuery DataFrames." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create DataFrames with struct columns \n", + "\n", + "Create a DataFrame with an `Address` struct column by using dictionaries for the data and setting the dtype to `struct[pyarrow]`." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/usr/local/google/home/chelsealin/src/bigframes/venv/lib/python3.12/site-packages/google/cloud/bigquery/_pandas_helpers.py:570: UserWarning: Pyarrow could not determine the type of columns: bigframes_unnamed_index.\n", + " warnings.warn(\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
NameAddress
0Alice{'City': 'New York', 'State': 'NY'}
1Bob{'City': 'San Francisco', 'State': 'CA'}
2Charlie{'City': 'Seattle', 'State': 'WA'}
\n", + "

3 rows × 2 columns

\n", + "
[3 rows x 2 columns in total]" + ], + "text/plain": [ + " Name Address\n", + "0 Alice {'City': 'New York', 'State': 'NY'}\n", + "1 Bob {'City': 'San Francisco', 'State': 'CA'}\n", + "2 Charlie {'City': 'Seattle', 'State': 'WA'}\n", + "\n", + "[3 rows x 2 columns]" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "names = bpd.Series(['Alice', 'Bob', 'Charlie'])\n", + "address = bpd.Series(\n", + " [\n", + " {'City': 'New York', 'State': 'NY'},\n", + " {'City': 'San Francisco', 'State': 'CA'},\n", + " {'City': 'Seattle', 'State': 'WA'}\n", + " ],\n", + " dtype=bpd.ArrowDtype(pa.struct(\n", + " [('City', pa.string()), ('State', pa.string())]\n", + " )))\n", + "\n", + "df = bpd.DataFrame({'Name': names, 'Address': address})\n", + "df" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Name string[pyarrow]\n", + "Address struct[pyarrow]\n", + "dtype: object" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.dtypes" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Operate on struct data\n", + "\n", + "Similar to pandas, BigQuery DataFrames provides a [`StructAccessor`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.operations.structs.StructAccessor). Use the methods provided in this accessor to manipulate struct data." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "City string[pyarrow]\n", + "State string[pyarrow]\n", + "dtype: object" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Return the dtype object of each child field of the struct.\n", + "df['Address'].struct.dtypes()" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 New York\n", + "1 San Francisco\n", + "2 Seattle\n", + "Name: City, dtype: string" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Extract a child field as a Series\n", + "city = df['Address'].struct.field(\"City\")\n", + "city" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
CityState
0New YorkNY
1San FranciscoCA
2SeattleWA
\n", + "

3 rows × 2 columns

\n", + "
[3 rows x 2 columns in total]" + ], + "text/plain": [ + " City State\n", + "0 New York NY\n", + "1 San Francisco CA\n", + "2 Seattle WA\n", + "\n", + "[3 rows x 2 columns]" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Extract all child fields of a struct as a DataFrame.\n", + "address_df = df['Address'].struct.explode()\n", + "address_df" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.1" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}