
Commit e172cf8

committed
address comments
1 parent f3ce531 commit e172cf8

File tree

1 file changed

+45
-47
lines changed


notebooks/dataframes/struct_and_array_dtypes.ipynb

Lines changed: 45 additions & 47 deletions
@@ -34,12 +34,12 @@
 "source": [
 "# Set up your environment\n",
 "\n",
-"Please refer to the notebooks in the `getting_started` folder for instructions on setting up your environment. Once your environment is ready, run the following code to import the necessary packages for working with BigFrames arrays:"
+"To get started, follow the instructions in the notebooks within the `getting_started` folder to set up your environment. Once your environment is ready, you can import the necessary packages by running the following code:"
 ]
 },
 {
 "cell_type": "code",
-"execution_count": 17,
+"execution_count": 2,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -50,13 +50,14 @@
 },
 {
 "cell_type": "code",
-"execution_count": 18,
+"execution_count": 3,
 "metadata": {},
 "outputs": [],
 "source": [
 "REGION = \"US\" # @param {type: \"string\"}\n",
+"\n",
 "bpd.options.display.progress_bar = None\n",
-"bpd.options.bigquery.location = REGION\n"
+"bpd.options.bigquery.location = REGION"
 ]
 },
 {
@@ -65,18 +66,18 @@
 "source": [
 "# Array Data Types\n",
 "\n",
-"In BigQuery, an [array](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#array_type), also referred to as a `repeated` column, is an ordered list of zero or more non-array elements. These elements must be of the same data type, and arrays cannot contain other arrays. Furthermore, query results cannot include arrays with `NULL` elements.\n",
+"In BigQuery, an [array](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#array_type) (also called a repeated column) is an ordered list of zero or more elements of the same data type. Arrays cannot contain other arrays or `NULL` elements.\n",
 "\n",
-"BigFrames DataFrames, inheriting these properties, map BigQuery array types to `pandas.ArrowDtype(pa.list_())`. This section provides code examples demonstrating how to effectively work with array columns within BigFrames DataFrames."
+"BigQuery DataFrames map BigQuery array types to `pandas.ArrowDtype(pa.list_())`. The following code examples illustrate how to work with array columns in BigQuery DataFrames."
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Create DataFrames with array columns \n",
+"## Create DataFrames with array columns\n",
 "\n",
-"Let's create a sample BigFrames DataFrame where the `Scores` column holds array data of type `list<int64>[pyarrow]`:"
+"Create a DataFrame in BigQuery DataFrames from local sample data. Use a list of lists to create a column with the `list<int64>[pyarrow]` dtype, which corresponds to the `ARRAY<INT64>` type in BigQuery."
 ]
 },
 {
@@ -178,11 +179,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## CRUD operations for array data\n",
-"\n",
-"While Pandas offers vectorized operations and lambda expressions to manipulate array data, BigFrames leverages BigQuery's computational power. BigFrames introduces the [`bigframes.bigquery`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery) package to provide access to a variety of native BigQuery array operations, such as [array_agg](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery#bigframes_bigquery_array_agg), [array_length](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery#bigframes_bigquery_array_length), and others. This module allows you to seamlessly perform create, read, update, and delete (CRUD) operations on array data within your BigFrames DataFrames.\n",
+"## Operate on array data\n",
 "\n",
-"Let's delve into how you can utilize these functions to effectively manipulate array data in BigFrames."
+"While pandas offers vectorized operations and lambda expressions for array manipulation, BigQuery DataFrames leverages the computational power of BigQuery itself. You can access a variety of native BigQuery array operations, such as [`array_agg`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery#bigframes_bigquery_array_agg) and [`array_length`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery#bigframes_bigquery_array_length), through the [`bigframes.bigquery`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery) package (abbreviated as `bbq` in the following code samples)."
 ]
 },
 {
@@ -205,7 +204,7 @@
 }
 ],
 "source": [
-"# Find the length in each array\n",
+"# Find the length of each array.\n",
 "bbq.array_length(df['Scores'])"
 ]
 },
@@ -235,7 +234,9 @@
 }
 ],
 "source": [
-"# Explode array elements into rows\n",
+"# Transform array elements into individual rows, preserving the original order when in\n",
+"# ordering mode. If an array has multiple elements, exploded rows are ordered by the\n",
+"# element's index within its original array.\n",
 "scores = df['Scores'].explode()\n",
 "scores"
 ]
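The `explode` step in this hunk has the same semantics in local pandas, which makes it easy to try without a BigQuery session. This sketch uses plain object-dtype lists (an assumption for simplicity; in BigQuery DataFrames the column carries the Arrow list dtype):

```python
import pandas as pd

# Local analogue of df['Scores'].explode(): each array element becomes
# its own row, and the original row label repeats for every element.
scores = pd.Series([[95, 88, 92], [78, 81], [82, 89, 94, 100]], name="Scores")
exploded = scores.explode()
print(exploded.index.tolist())  # [0, 0, 0, 1, 1, 2, 2, 2, 2]
```

Because the exploded rows keep their original row label, a later group-by on the index can reassemble the arrays.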
@@ -248,15 +249,15 @@
 {
 "data": {
 "text/plain": [
-"0 95.238095\n",
-"0 88.571429\n",
-"0 92.380952\n",
-"1 79.047619\n",
-"1 81.904762\n",
-"2 82.857143\n",
-"2 89.52381\n",
-"2 94.285714\n",
-"2 100.0\n",
+"0 100.0\n",
+"0 93.0\n",
+"0 97.0\n",
+"1 83.0\n",
+"1 86.0\n",
+"2 87.0\n",
+"2 94.0\n",
+"2 99.0\n",
+"2 105.0\n",
 "Name: Scores, dtype: Float64"
 ]
 },
@@ -266,8 +267,8 @@
 }
 ],
 "source": [
-"# Adjust the scores\n",
-"adj_scores = (scores + 5) / 105.0 * 100.0\n",
+"# Adjust the scores.\n",
+"adj_scores = scores + 5.0\n",
 "adj_scores"
 ]
 },
@@ -279,9 +280,9 @@
 {
 "data": {
 "text/plain": [
-"0 [95.23809524 88.57142857 92.38095238]\n",
-"1 [79.04761905 81.9047619 ]\n",
-"2 [ 82.85714286 89.52380952 94.28571429 100. ...\n",
+"0 [100. 93. 97.]\n",
+"1 [83. 86.]\n",
+"2 [ 87. 94. 99. 105.]\n",
 "Name: Scores, dtype: list<item: double>[pyarrow]"
 ]
 },
@@ -291,7 +292,7 @@
 }
 ],
 "source": [
-"# Aggregate adjusted scores back into arrays\n",
+"# Aggregate adjusted scores back into arrays.\n",
 "adj_scores_arr = bbq.array_agg(adj_scores.groupby(level=0))\n",
 "adj_scores_arr"
 ]
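The explode/adjust/re-aggregate round trip in these hunks can be mimicked in local pandas, with `groupby(level=0).agg(list)` standing in for `bbq.array_agg(adj_scores.groupby(level=0))`. This is an analogue under that assumption, not the bigframes API itself:

```python
import pandas as pd

# Explode, adjust each element, then gather values back into lists by
# the original row label -- a local analogue of
# bbq.array_agg(adj_scores.groupby(level=0)).
scores = pd.Series([[95, 88, 92], [78, 81]], name="Scores")
adj = scores.explode().astype("float64") + 5.0
regrouped = adj.groupby(level=0).agg(list)
print(regrouped.iloc[0])  # [100.0, 93.0, 97.0]
```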
@@ -332,35 +333,30 @@
 " <th>0</th>\n",
 " <td>Alice</td>\n",
 " <td>[95 88 92]</td>\n",
-" <td>[95.23809524 88.57142857 92.38095238]</td>\n",
+" <td>[100. 93. 97.]</td>\n",
 " </tr>\n",
 " <tr>\n",
 " <th>1</th>\n",
 " <td>Bob</td>\n",
 " <td>[78 81]</td>\n",
-" <td>[79.04761905 81.9047619 ]</td>\n",
+" <td>[83. 86.]</td>\n",
 " </tr>\n",
 " <tr>\n",
 " <th>2</th>\n",
 " <td>Charlie</td>\n",
 " <td>[ 82 89 94 100]</td>\n",
-" <td>[ 82.85714286 89.52380952 94.28571429 100. ...</td>\n",
+" <td>[ 87. 94. 99. 105.]</td>\n",
 " </tr>\n",
 " </tbody>\n",
 "</table>\n",
 "<p>3 rows × 3 columns</p>\n",
 "</div>[3 rows x 3 columns in total]"
 ],
 "text/plain": [
-" Name Scores \\\n",
-"0 Alice [95 88 92] \n",
-"1 Bob [78 81] \n",
-"2 Charlie [ 82 89 94 100] \n",
-"\n",
-" NewScores \n",
-"0 [95.23809524 88.57142857 92.38095238] \n",
-"1 [79.04761905 81.9047619 ] \n",
-"2 [ 82.85714286 89.52380952 94.28571429 100. ... \n",
+" Name Scores NewScores\n",
+"0 Alice [95 88 92] [100. 93. 97.]\n",
+"1 Bob [78 81] [83. 86.]\n",
+"2 Charlie [ 82 89 94 100] [ 87. 94. 99. 105.]\n",
 "\n",
 "[3 rows x 3 columns]"
 ]
@@ -371,7 +367,9 @@
 }
 ],
 "source": [
-"# Incorporate adjusted scores into the DataFrame\n",
+"# Add adjusted scores into the DataFrame. This operation requires an implicit join\n",
+"# between the two tables, necessitating a unique index in the DataFrame (guaranteed\n",
+"# in the default ordering and index mode).\n",
 "df['NewScores'] = adj_scores_arr\n",
 "df"
 ]
@@ -382,7 +380,7 @@
 "source": [
 "# Struct Data Types\n",
 "\n",
-"In BigQuery, a [struct](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#struct_type) (also known as a `record`) is a collection of ordered fields, each with a defined data type (required) and an optional field name. BigFrames maps BigQuery struct types to the Pandas equivalent, `pandas.ArrowDtype(pa.struct())`. In this section, we'll explore practical code examples illustrating how to work with struct columns within your BigFrames DataFrames."
+"In BigQuery, a [struct](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#struct_type) (also known as a `record`) is a collection of ordered fields, each with a defined data type (required) and an optional field name. BigQuery DataFrames maps BigQuery struct types to the pandas equivalent, `pandas.ArrowDtype(pa.struct())`. This section provides practical code examples illustrating how to use struct columns with BigQuery DataFrames."
 ]
 },
 {
@@ -391,7 +389,7 @@
 "source": [
 "## Create DataFrames with struct columns\n",
 "\n",
-"Let's create a sample BigFrames DataFrame where the `Address` column holds struct data of type `struct<City: string, State: string>[pyarrow]`:"
+"Create a DataFrame with an `Address` struct column by using dictionaries for the data and setting the dtype to `struct<City: string, State: string>[pyarrow]`."
 ]
 },
 {
{
@@ -403,7 +401,7 @@
 "name": "stderr",
 "output_type": "stream",
 "text": [
-"/usr/local/google/home/chelsealin/src/bigframes2/venv/lib/python3.12/site-packages/google/cloud/bigquery/_pandas_helpers.py:537: UserWarning: Pyarrow could not determine the type of columns: bigframes_unnamed_index.\n",
+"/usr/local/google/home/chelsealin/src/bigframes/venv/lib/python3.12/site-packages/google/cloud/bigquery/_pandas_helpers.py:570: UserWarning: Pyarrow could not determine the type of columns: bigframes_unnamed_index.\n",
 " warnings.warn(\n"
 ]
 },
@@ -509,9 +507,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## CRUD operations for struct data\n",
+"## Operate on struct data\n",
 "\n",
-"Similar to Pandas, BigFrames provides a [`StructAccessor`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.operations.structs.StructAccessor) to streamline the manipulation of struct data. Let's explore how you can utilize this feature for efficient CRUD operations on your nested struct columns."
+"Similar to pandas, BigQuery DataFrames provides a [`StructAccessor`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.operations.structs.StructAccessor). Use the methods provided in this accessor to manipulate struct data."
 ]
 },
 {
