
Optional switch to non-nullable dtypes in to_dataframe #1345

Closed
derHeinzer opened this issue Sep 8, 2022 · 5 comments
Labels
api: bigquery Issues related to the googleapis/python-bigquery API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@derHeinzer

Release 3.0.0 introduced the use of nullable Int64 and boolean dtypes (pandas extension dtypes) (#786).

However, pandas extension dtypes are not widely supported across the pandas API. This can lead to issues when using numba in pandas operations.
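A minimal sketch of the incompatibility: converting a nullable Int64 column that contains missing values to NumPy yields an object array, which numba (and most NumPy-based code paths) cannot work with, while the classic NumPy-backed dtype stays machine-native.

```python
import pandas as pd

# Pandas extension dtype, as returned by to_dataframe since 3.0.0:
nullable = pd.Series([1, 2, pd.NA], dtype="Int64")
print(nullable.dtype)             # Int64
print(nullable.to_numpy().dtype)  # object -- NA cannot live in an int64 array

# Classic NumPy-backed dtype keeps a machine-native representation:
plain = pd.Series([1, 2, 3], dtype="int64")
print(plain.to_numpy().dtype)     # int64
```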

Please see my bug report in the pandas project for example:
pandas-dev/pandas#46867

I propose introducing an optional switch back to the pre-3.0.0 behaviour.

@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery API. label Sep 8, 2022
@derHeinzer
Author

To make things worse, not all widely-used machine learning libraries can handle pandas extension dtypes. I just found out that training a lightgbm regressor is not possible with such input.
I did not check every library I know, but I am pretty sure there must be others.

@tswast tswast added the type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. label Sep 14, 2022
@tswast
Contributor

tswast commented Sep 14, 2022

Thank you so much for the feedback! I hope that the nullable dtypes can stay as the default, as they more accurately reflect the BigQuery data types, but you make some good points that there are still use cases where the old behavior would be desirable.

@tswast
Contributor

tswast commented Mar 9, 2023

Related: #954 Support for string[pyarrow] dtype (edit: fixed issue number)

I'm thinking we add int_dtype, bool_dtype, string_dtype, float_dtype, time_dtype, timestamp_dtype, date_dtype, datetime_dtype as arguments to to_dataframe.

Alternative 1: expose the typemapper directly, but it's doubtful that a lambda function that inspects an arrow schema would be all that understandable to the average pandas user.

Alternative 2: expose a use_nullable_dtypes argument to match https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html and similar methods. It defaults to True to avoid lossy conversions, as we have attempted to do in google-cloud-bigquery v3 -- technically another breaking change though, since we'll want to change to use Float64Dtype and StringDtype too, which we aren't currently using.

@tswast
Contributor

tswast commented Mar 23, 2023

With #1529, you'll be able to explicitly set int_dtype to None to use legacy dtypes.

@chelsea-lin
Contributor

The changes have been merged.
