
Commit e172cf8

committed
address comments
1 parent f3ce531 commit e172cf8

File tree

1 file changed

+45
-47
lines changed


notebooks/dataframes/struct_and_array_dtypes.ipynb

Lines changed: 45 additions & 47 deletions
@@ -34,12 +34,12 @@
 "source": [
 "# Set up your environment\n",
 "\n",
-"Please refer to the notebooks in the `getting_started` folder for instructions on setting up your environment. Once your environment is ready, run the following code to import the necessary packages for working with BigFrames arrays:"
+"To get started, follow the instructions in the notebooks within the `getting_started` folder to set up your environment. Once your environment is ready, you can import the necessary packages by running the following code:"
 ]
 },
 {
 "cell_type": "code",
-"execution_count": 17,
+"execution_count": 2,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -50,13 +50,14 @@
 },
 {
 "cell_type": "code",
-"execution_count": 18,
+"execution_count": 3,
 "metadata": {},
 "outputs": [],
 "source": [
 "REGION = \"US\" # @param {type: \"string\"}\n",
+"\n",
 "bpd.options.display.progress_bar = None\n",
-"bpd.options.bigquery.location = REGION\n"
+"bpd.options.bigquery.location = REGION"
 ]
 },
 {
@@ -65,18 +66,18 @@
 "source": [
 "# Array Data Types\n",
 "\n",
-"In BigQuery, an [array](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#array_type), also referred to as a `repeated` column, is an ordered list of zero or more non-array elements. These elements must be of the same data type, and arrays cannot contain other arrays. Furthermore, query results cannot include arrays with `NULL` elements.\n",
+"In BigQuery, an [array](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#array_type) (also called a repeated column) is an ordered list of zero or more elements of the same data type. Arrays cannot contain other arrays or `NULL` elements.\n",
 "\n",
-"BigFrames DataFrames, inheriting these properties, map BigQuery array types to `pandas.ArrowDtype(pa.list_())`. This section provides code examples demonstrating how to effectively work with array columns within BigFrames DataFrames."
+"BigQuery DataFrames map BigQuery array types to `pandas.ArrowDtype(pa.list_())`. The following code examples illustrate how to work with array columns in BigQuery DataFrames."
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Create DataFrames with array columns \n",
+"## Create DataFrames with array columns\n",
 "\n",
-"Let's create a sample BigFrames DataFrame where the `Scores` column holds array data of type `list<int64>[pyarrow]`:"
+"Create a DataFrame in BigQuery DataFrames from local sample data. Use a list of lists to create a column with the `list<int64>[pyarrow]` dtype, which corresponds to the `ARRAY<INT64>` type in BigQuery."
 ]
 },
 {
@@ -178,11 +179,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## CRUD operations for array data\n",
-"\n",
-"While Pandas offers vectorized operations and lambda expressions to manipulate array data, BigFrames leverages BigQuery's computational power. BigFrames introduces the [`bigframes.bigquery`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery) package to provide access to a variety of native BigQuery array operations, such as [array_agg](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery#bigframes_bigquery_array_agg), [array_length](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery#bigframes_bigquery_array_length), and others. This module allows you to seamlessly perform create, read, update, and delete (CRUD) operations on array data within your BigFrames DataFrames.\n",
+"## Operate on array data\n",
 "\n",
-"Let's delve into how you can utilize these functions to effectively manipulate array data in BigFrames."
+"While pandas offers vectorized operations and lambda expressions for array manipulation, BigQuery DataFrames leverages the computational power of BigQuery itself. You can access a variety of native BigQuery array operations, such as [`array_agg`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery#bigframes_bigquery_array_agg) and [`array_length`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery#bigframes_bigquery_array_length), through the [`bigframes.bigquery`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery) package (abbreviated as `bbq` in the following code samples)."
 ]
 },
 {
@@ -205,7 +204,7 @@
 }
 ],
 "source": [
-"# Find the length in each array\n",
+"# Find the length of each array.\n",
 "bbq.array_length(df['Scores'])"
 ]
 },
@@ -235,7 +234,9 @@
 }
 ],
 "source": [
-"# Explode array elements into rows\n",
+"# Transform array elements into individual rows, preserving the original order when in\n",
+"# ordering mode. If an array has multiple elements, exploded rows are ordered by the\n",
+"# element's index within its original array.\n",
 "scores = df['Scores'].explode()\n",
 "scores"
 ]
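The `explode` step in this hunk has the same semantics in local pandas, which makes it easy to try without a BigQuery session. This sketch uses plain object-dtype lists (an assumption for simplicity; in BigQuery DataFrames the column carries the Arrow list dtype):

```python
import pandas as pd

# Local analogue of df['Scores'].explode(): each array element becomes
# its own row, and the original row label repeats for every element.
scores = pd.Series([[95, 88, 92], [78, 81], [82, 89, 94, 100]], name="Scores")
exploded = scores.explode()
print(exploded.index.tolist())  # [0, 0, 0, 1, 1, 2, 2, 2, 2]
```

Because the exploded rows keep their original row label, a later group-by on the index can reassemble the arrays.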
@@ -248,15 +249,15 @@
 {
 "data": {
 "text/plain": [
-"0 95.238095\n",
-"0 88.571429\n",
-"0 92.380952\n",
-"1 79.047619\n",
-"1 81.904762\n",
-"2 82.857143\n",
-"2 89.52381\n",
-"2 94.285714\n",
-"2 100.0\n",
+"0 100.0\n",
+"0 93.0\n",
+"0 97.0\n",
+"1 83.0\n",
+"1 86.0\n",
+"2 87.0\n",
+"2 94.0\n",
+"2 99.0\n",
+"2 105.0\n",
 "Name: Scores, dtype: Float64"
 ]
 },
@@ -266,8 +267,8 @@
 }
 ],
 "source": [
-"# Adjust the scores\n",
-"adj_scores = (scores + 5) / 105.0 * 100.0\n",
+"# Adjust the scores.\n",
+"adj_scores = scores + 5.0\n",
 "adj_scores"
 ]
 },
@@ -279,9 +280,9 @@
 {
 "data": {
 "text/plain": [
-"0 [95.23809524 88.57142857 92.38095238]\n",
-"1 [79.04761905 81.9047619 ]\n",
-"2 [ 82.85714286 89.52380952 94.28571429 100. ...\n",
+"0 [100. 93. 97.]\n",
+"1 [83. 86.]\n",
+"2 [ 87. 94. 99. 105.]\n",
 "Name: Scores, dtype: list<item: double>[pyarrow]"
 ]
 },
@@ -291,7 +292,7 @@
 }
 ],
 "source": [
-"# Aggregate adjusted scores back into arrays\n",
+"# Aggregate adjusted scores back into arrays.\n",
 "adj_scores_arr = bbq.array_agg(adj_scores.groupby(level=0))\n",
 "adj_scores_arr"
 ]
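The explode/adjust/re-aggregate round trip in these hunks can be mimicked in local pandas, with `groupby(level=0).agg(list)` standing in for `bbq.array_agg(adj_scores.groupby(level=0))`. This is an analogue under that assumption, not the bigframes API itself:

```python
import pandas as pd

# Explode, adjust each element, then gather values back into lists by
# the original row label -- a local analogue of
# bbq.array_agg(adj_scores.groupby(level=0)).
scores = pd.Series([[95, 88, 92], [78, 81]], name="Scores")
adj = scores.explode().astype("float64") + 5.0
regrouped = adj.groupby(level=0).agg(list)
print(regrouped.iloc[0])  # [100.0, 93.0, 97.0]
```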
@@ -332,35 +333,30 @@
 " <th>0</th>\n",
 " <td>Alice</td>\n",
 " <td>[95 88 92]</td>\n",
-" <td>[95.23809524 88.57142857 92.38095238]</td>\n",
+" <td>[100. 93. 97.]</td>\n",
 " </tr>\n",
 " <tr>\n",
 " <th>1</th>\n",
 " <td>Bob</td>\n",
 " <td>[78 81]</td>\n",
-" <td>[79.04761905 81.9047619 ]</td>\n",
+" <td>[83. 86.]</td>\n",
 " </tr>\n",
 " <tr>\n",
 " <th>2</th>\n",
 " <td>Charlie</td>\n",
 " <td>[ 82 89 94 100]</td>\n",
-" <td>[ 82.85714286 89.52380952 94.28571429 100. ...</td>\n",
+" <td>[ 87. 94. 99. 105.]</td>\n",
 " </tr>\n",
 " </tbody>\n",
 "</table>\n",
 "<p>3 rows × 3 columns</p>\n",
 "</div>[3 rows x 3 columns in total]"
 ],
 "text/plain": [
-" Name Scores \\\n",
-"0 Alice [95 88 92] \n",
-"1 Bob [78 81] \n",
-"2 Charlie [ 82 89 94 100] \n",
-"\n",
-" NewScores \n",
-"0 [95.23809524 88.57142857 92.38095238] \n",
-"1 [79.04761905 81.9047619 ] \n",
-"2 [ 82.85714286 89.52380952 94.28571429 100. ... \n",
+" Name Scores NewScores\n",
+"0 Alice [95 88 92] [100. 93. 97.]\n",
+"1 Bob [78 81] [83. 86.]\n",
+"2 Charlie [ 82 89 94 100] [ 87. 94. 99. 105.]\n",
 "\n",
 "[3 rows x 3 columns]"
 ]
@@ -371,7 +367,9 @@
 }
 ],
 "source": [
-"# Incorporate adjusted scores into the DataFrame\n",
+"# Add adjusted scores into the DataFrame. This operation requires an implicit join\n",
+"# between the two tables, necessitating a unique index in the DataFrame (guaranteed\n",
+"# in the default ordering and index mode).\n",
 "df['NewScores'] = adj_scores_arr\n",
 "df"
 ]
@@ -382,7 +380,7 @@
 "source": [
 "# Struct Data Types\n",
 "\n",
-"In BigQuery, a [struct](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#struct_type) (also known as a `record`) is a collection of ordered fields, each with a defined data type (required) and an optional field name. BigFrames maps BigQuery struct types to the Pandas equivalent, `pandas.ArrowDtype(pa.struct())`. In this section, we'll explore practical code examples illustrating how to work with struct columns within your BigFrames DataFrames."
+"In BigQuery, a [struct](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#struct_type) (also known as a `record`) is a collection of ordered fields, each with a defined data type (required) and an optional field name. BigQuery DataFrames maps BigQuery struct types to the pandas equivalent, `pandas.ArrowDtype(pa.struct())`. This section provides practical code examples illustrating how to use struct columns with BigQuery DataFrames."
 ]
 },
 {
@@ -391,7 +389,7 @@
 "source": [
 "## Create DataFrames with struct columns\n",
 "\n",
-"Let's create a sample BigFrames DataFrame where the `Address` column holds struct data of type `struct<City: string, State: string>[pyarrow]`:"
+"Create a DataFrame with an `Address` struct column by using dictionaries for the data and setting the dtype to `struct<City: string, State: string>[pyarrow]`."
 ]
 },
 {
{
@@ -403,7 +401,7 @@
 "name": "stderr",
 "output_type": "stream",
 "text": [
-"/usr/local/google/home/chelsealin/src/bigframes2/venv/lib/python3.12/site-packages/google/cloud/bigquery/_pandas_helpers.py:537: UserWarning: Pyarrow could not determine the type of columns: bigframes_unnamed_index.\n",
+"/usr/local/google/home/chelsealin/src/bigframes/venv/lib/python3.12/site-packages/google/cloud/bigquery/_pandas_helpers.py:570: UserWarning: Pyarrow could not determine the type of columns: bigframes_unnamed_index.\n",
 " warnings.warn(\n"
 ]
 },
@@ -509,9 +507,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## CRUD operations for struct data\n",
+"## Operate on struct data\n",
 "\n",
-"Similar to Pandas, BigFrames provides a [`StructAccessor`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.operations.structs.StructAccessor) to streamline the manipulation of struct data. Let's explore how you can utilize this feature for efficient CRUD operations on your nested struct columns."
+"Similar to pandas, BigQuery DataFrames provides a [`StructAccessor`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.operations.structs.StructAccessor). Use the methods provided in this accessor to manipulate struct data."
 ]
 },
 {
