
fix!: use nullable Int64 and boolean dtypes in to_dataframe #786

Merged

Conversation

tswast
Contributor

@tswast tswast commented Jul 20, 2021

To override this behavior, specify the types for the desired columns with the `dtype` argument.

BREAKING CHANGE: uses Int64 type by default to avoid loss-of-precision in results with large integer values
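To see why the `Int64` default matters, here is a small pandas-only sketch (independent of BigQuery, assuming only that pandas is installed) of the precision loss that occurs when large integers are forced into `float64` because a column contains NULLs:

```python
import pandas as pd

big = 2**53 + 1  # smallest positive integer that float64 cannot represent exactly

# With NULLs present, a float64 column silently rounds large integers...
as_float = pd.Series([big, None], dtype="float64")
# ...while the nullable Int64 extension dtype keeps the exact value.
as_int64 = pd.Series([big, None], dtype="Int64")

print(int(as_float.iloc[0]) == big)  # False: value was rounded to 2**53
print(int(as_int64.iloc[0]) == big)  # True: exact integer preserved
```

The NULL in the second row is what forces the `float64` fallback in the old behavior; with `Int64` it is represented as `pd.NA` instead of `NaN`, so the integer values stay exact.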

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes https://issuetracker.google.com/144712110 🦕
Fixes #793

…frame`

To override this behavior, specify the types for the desired columns with the
`dtype` argument.
@tswast tswast requested a review from a team July 20, 2021 21:11
@tswast tswast requested a review from a team as a code owner July 20, 2021 21:11
@tswast tswast requested review from stephaniewang526 and removed request for a team July 20, 2021 21:11
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery API. label Jul 20, 2021
@google-cla google-cla bot added the cla: yes This human has signed the Contributor License Agreement. label Jul 20, 2021
@tswast tswast marked this pull request as draft July 20, 2021 21:11
@tswast tswast removed request for a team and stephaniewang526 July 20, 2021 21:12
@tswast
Contributor Author

tswast commented Jul 20, 2021

I'll take a closer look at #776 before finishing this one, as it might mean fewer code paths to cover. I think the BQ Storage API will always be used for to_dataframe after that PR.

@tswast
Contributor Author

tswast commented Jul 21, 2021

I did a little bit of experimentation to see what the intermediate pyarrow.Table types are for both REST and BQ Storage API on all scalar types. They do align, so that's good. Also, floating points appear to be handled correctly, even with null values in the table.

It appears https://issuetracker.google.com/144712110 was fixed for FLOAT columns in #314 as of google-cloud-bigquery >= 2.2.0. (That was technically a breaking change [oops].)

I might still keep this open so that we can have some explicit tests for different data types. Also, we're relying on the PyArrow -> pandas conversion to pick the right data types, so there may still be some dtype defaults we can help with.

@tswast tswast changed the title feat!: use nullable types like float and Int64 by default in to_dataframe feat!: use nullable Int64 dtype by default in to_dataframe Jul 21, 2021
@tswast tswast changed the title feat!: use nullable Int64 dtype by default in to_dataframe fix!: use nullable Int64 dtype by default in to_dataframe Jul 21, 2021
@plamut plamut added the semver: major Hint for users that this is an API breaking change. label Jul 27, 2021
@tswast tswast changed the base branch from master to v3.x.x July 27, 2021 17:07
@tswast tswast changed the base branch from v3.x.x to v3 July 27, 2021 17:27
@tswast tswast marked this pull request as ready for review August 9, 2021 20:05
@tswast tswast changed the title fix!: use nullable Int64 dtype by default in to_dataframe fix!: use nullable Int64 and boolean dtype by default in to_dataframe Aug 9, 2021
@tswast tswast changed the title fix!: use nullable Int64 and boolean dtype by default in to_dataframe fix!: use nullable Int64 and boolean dtypes in to_dataframe Aug 9, 2021
@tswast
Contributor Author

tswast commented Aug 11, 2021

Re: system test failure:

_____________ TestBigQuery.test_load_avro_from_uri_then_dump_table _____________
.
.
.
E   google.api_core.exceptions.RetryError: Deadline of 120.0s exceeded while calling functools.partial(<bound method PollingFuture._done_or_raise of <google.cloud.bigquery.job.load.LoadJob object at 0x7ff6eae07f40>>), last exception:

Didn't we increase the default deadline to 10 minutes? Maybe the v3 branch needs a sync?

@tswast tswast requested a review from plamut August 11, 2021 16:45
Contributor

@plamut plamut left a comment


Two nits, but not essential, looks good.

@@ -14,12 +14,12 @@ First, ensure that the :mod:`pandas` library is installed by running:

     pip install --upgrade pandas

-Alternatively, you can install the BigQuery python client library with
+Alternatively, you can install the BigQuery Python client library with
Contributor

(nit)
Since we're already at this, there's at least one other occurrence of "python" not capitalized (line 69), which could also be fixed.

    loss-of-precision.

    Returns:
        Dict[str, str]: mapping from column names to dtypes
Contributor

(nit) Can be expressed as the annotation of the function return type.
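A sketch of the reviewer's suggestion: the mapping described in the docstring can be expressed through the function's return annotation instead. The function name and dtype values below are illustrative, not the library's actual helper:

```python
from typing import Dict

def default_dtypes() -> Dict[str, str]:
    """Return a mapping from column names to pandas dtype names.

    The return type is carried by the annotation, so the docstring no
    longer needs to repeat ``Dict[str, str]``.
    """
    return {"int_col": "Int64", "bool_col": "boolean"}

print(default_dtypes()["int_col"])  # Int64
```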

    ("max_results",),
    (
        (None,),  # Use BQ Storage API.
        (10,),  # Use REST API.
    ),
)
def test_list_rows_nullable_scalars_dtypes(bigquery_client, scalars_table, max_results):
    df = bigquery_client.list_rows(
Contributor Author

Note to self: I'll need to exclude the INTERVAL column next time we sync with master

@tswast tswast added the automerge Merge the pull request once unit tests and other checks pass. label Aug 16, 2021
@tswast tswast mentioned this pull request Aug 16, 2021
2 tasks
@gcf-merge-on-green gcf-merge-on-green bot merged commit dcd78c7 into googleapis:v3 Aug 16, 2021
@gcf-merge-on-green gcf-merge-on-green bot removed the automerge Merge the pull request once unit tests and other checks pass. label Aug 16, 2021
@tswast tswast deleted the b144712110-nullable-pandas-types branch August 16, 2021 15:44
tswast added a commit that referenced this pull request Mar 29, 2022
deps!: BigQuery Storage and pyarrow are required dependencies (#776)

fix!: use nullable `Int64` and `boolean` dtypes in `to_dataframe` (#786) 

feat!: destination tables are no longer removed by `create_job` (#891)

feat!: In `to_dataframe`, use `dbdate` and `dbtime` dtypes from db-dtypes package for BigQuery DATE and TIME columns (#972)

fix!: automatically convert out-of-bounds dates in `to_dataframe`, remove `date_as_object` argument (#972)

feat!: mark the package as type-checked (#1058)

feat!: default to DATETIME type when loading timezone-naive datetimes from Pandas (#1061)

feat: add `api_method` parameter to `Client.query` to select `INSERT` or `QUERY` API (#967)

fix: improve type annotations for mypy validation (#1081)

feat: use `StandardSqlField` class for `Model.feature_columns` and `Model.label_columns` (#1117)

docs: Add migration guide from version 2.x to 3.x (#1027)

Release-As: 3.0.0
waltaskew pushed a commit to waltaskew/python-bigquery that referenced this pull request Jul 20, 2022 (same 3.0.0 release commit message as above)
abdelmegahedgoogle pushed a commit to abdelmegahedgoogle/python-bigquery that referenced this pull request Apr 17, 2023 (same 3.0.0 release commit message as above)
Labels
api: bigquery Issues related to the googleapis/python-bigquery API. cla: yes This human has signed the Contributor License Agreement. semver: major Hint for users that this is an API breaking change.
Development

Successfully merging this pull request may close these issues.

use pandas Int64 type by default to avoid precision loss
2 participants