fix!: use nullable `Int64` and `boolean` dtypes in `to_dataframe` #786
Conversation
…`to_dataframe`. To override this behavior, specify the types for the desired columns with the `dtype` argument.
I'll take a closer look at #776 before finishing this one, as it might mean fewer code paths to cover. I think the BQ Storage API will always be used for …
I did a little bit of experimentation to see what the intermediate … It appears https://issuetracker.google.com/144712110 was fixed for FLOAT columns in #314 as of google-cloud-bigquery >= 2.2.0 (that was technically a breaking change, oops). I might still keep this open so that we can have some explicit tests for different data types. Also, we're relying on PyArrow -> pandas to pick the right data types, so maybe there are some dtype defaults we can still help with.
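For context on that last point, here is a minimal sketch of the general PyArrow-to-pandas mechanism involved (not the code in this PR): pyarrow's Table.to_pandas accepts a types_mapper callable that lets the caller choose pandas extension dtypes instead of the lossy defaults, where an int64 column containing NULL would otherwise become float64 with NaN. The column name and values below are illustrative.

    import pandas as pd
    import pyarrow as pa

    # Illustrative table: an int64 column containing a NULL.
    table = pa.table({"num": pa.array([1, None, 3], type=pa.int64())})

    # Default conversion: the NULL forces float64 + NaN.
    print(table.to_pandas().dtypes)  # num    float64

    # With a types_mapper, integer and boolean columns map to the
    # nullable pandas extension dtypes instead.
    nullable = table.to_pandas(
        types_mapper={pa.int64(): pd.Int64Dtype(), pa.bool_(): pd.BooleanDtype()}.get
    )
    print(nullable.dtypes)  # num    Int64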
Re: system test failure: Didn't we increase the default deadline to 10 minutes? Maybe the v3 branch needs a sync?
Two nits, but they're not essential; looks good.
@@ -14,12 +14,12 @@ First, ensure that the :mod:`pandas` library is installed by running:

    pip install --upgrade pandas

-Alternatively, you can install the BigQuery python client library with
+Alternatively, you can install the BigQuery Python client library with
(nit)
Since we're already at this, there's at least one other occurrence of "python" not capitalized (line 69), which can also be fixed.
    loss-of-precision.

    Returns:
        Dict[str, str]: mapping from column names to dtypes
(nit) This can be expressed as the function's return type annotation.
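A tiny sketch of what the nit suggests, using a hypothetical helper (the name and body are illustrative, not the code under review): the Dict[str, str] moves into the return annotation, so the docstring's Returns section only needs to describe the value.

    from typing import Dict, Iterable

    def default_dtypes(column_names: Iterable[str]) -> Dict[str, str]:
        """Pick a default pandas dtype for each column.

        Returns:
            Mapping from column names to dtypes.
        """
        # Hypothetical default: nullable Int64 for every listed column.
        return {name: "Int64" for name in column_names}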
("max_results",), ((None,), (10,),) # Use BQ Storage API. # Use REST API. | ||
) | ||
def test_list_rows_nullable_scalars_dtypes(bigquery_client, scalars_table, max_results): | ||
df = bigquery_client.list_rows( |
Note to self: I'll need to exclude the INTERVAL column next time we sync with master
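The excerpt above cuts off mid-call; for illustration only, here is a hedged sketch of the kind of check such a test performs (the int64_col/bool_col column names and the helper's shape are assumptions, not the PR's actual test body):

    def check_nullable_dtypes(bigquery_client, table_id, max_results):
        # Download the rows and confirm that integer and boolean columns use
        # the nullable pandas extension dtypes instead of float64/object.
        rows = bigquery_client.list_rows(table_id, max_results=max_results)
        df = rows.to_dataframe()
        assert str(df.dtypes["int64_col"]) == "Int64"
        assert str(df.dtypes["bool_col"]) == "boolean"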
deps!: BigQuery Storage and pyarrow are required dependencies (#776)
fix!: use nullable `Int64` and `boolean` dtypes in `to_dataframe` (#786)
feat!: destination tables are no longer removed by `create_job` (#891)
feat!: in `to_dataframe`, use `dbdate` and `dbtime` dtypes from the db-dtypes package for BigQuery DATE and TIME columns (#972)
fix!: automatically convert out-of-bounds dates in `to_dataframe`, remove `date_as_object` argument (#972)
feat!: mark the package as type-checked (#1058)
feat!: default to DATETIME type when loading timezone-naive datetimes from Pandas (#1061)
feat: add `api_method` parameter to `Client.query` to select `INSERT` or `QUERY` API (#967)
fix: improve type annotations for mypy validation (#1081)
feat: use `StandardSqlField` class for `Model.feature_columns` and `Model.label_columns` (#1117)
docs: add migration guide from version 2.x to 3.x (#1027)

Release-As: 3.0.0
To override this behavior, specify the types for the desired columns with the `dtype` argument.

BREAKING CHANGE: uses the `Int64` type by default to avoid loss of precision in results with large integer values.
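A minimal sketch of the behavior described above, assuming credentials and a default project are configured; the query and the column-to-dtype mapping passed to `to_dataframe` are illustrative, not taken from this PR:

    from google.cloud import bigquery

    client = bigquery.Client()
    query = "SELECT 1 AS value UNION ALL SELECT CAST(NULL AS INT64)"

    # New default: the integer column comes back as nullable "Int64",
    # with <NA> for the NULL row instead of coercion to float64.
    df_default = client.query(query).to_dataframe()
    print(df_default.dtypes)

    # Per-column override via the dtype mapping accepted by to_dataframe.
    df_float = client.query(query).to_dataframe(dtypes={"value": "float64"})
    print(df_float.dtypes)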
Fixes https://issuetracker.google.com/144712110 🦕
Fixes #793