
Optional switch to non-nullable dtypes in to_dataframe #1345

Closed
derHeinzer opened this issue Sep 8, 2022 · 5 comments
Labels
api: bigquery Issues related to the googleapis/python-bigquery API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@derHeinzer

Release 3.0.0 introduced the use of nullable Int64 and boolean dtypes (pandas extension dtypes) (#786).

However, pandas extension dtypes are not widely supported across the pandas API. This can lead to issues when using numba in pandas operations.
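A minimal sketch of the incompatibility: converting a nullable Int64 column that contains missing values to NumPy yields an object array, which numba (and most NumPy-based code paths) cannot work with, while the classic NumPy-backed dtype stays machine-native.

```python
import pandas as pd

# Pandas extension dtype, as returned by to_dataframe since 3.0.0:
nullable = pd.Series([1, 2, pd.NA], dtype="Int64")
print(nullable.dtype)             # Int64
print(nullable.to_numpy().dtype)  # object -- NA cannot live in an int64 array

# Classic NumPy-backed dtype keeps a machine-native representation:
plain = pd.Series([1, 2, 3], dtype="int64")
print(plain.to_numpy().dtype)     # int64
```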

Please see my bug report in the pandas project for example:
pandas-dev/pandas#46867

I propose introducing an optional switch back to the pre-3.0.0 behaviour.

@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery API. label Sep 8, 2022
@derHeinzer
Author

To make things worse, not all widely-used machine learning libraries can handle pandas extension dtypes. I just found out that training a lightgbm regressor is not possible with such input.
I did not check every library I know, but I am pretty sure there must be others.

@tswast tswast added the type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. label Sep 14, 2022
@tswast
Contributor

tswast commented Sep 14, 2022

Thank you so much for the feedback! I hope that the nullable dtypes can stay as the default, as they more accurately reflect the BigQuery data types, but you make some good points that there are still use cases where the old behavior would be desirable.

@tswast
Contributor

tswast commented Mar 9, 2023

Related: #954 Support for string[pyarrow] dtype (edit: fixed issue number)

I'm thinking we add int_dtype, bool_dtype, string_dtype, float_dtype, time_dtype, timestamp_dtype, date_dtype, datetime_dtype as arguments to to_dataframe.

Alternative 1: expose the typemapper directly, but it's doubtful that a lambda function that inspects an arrow schema would be all that understandable to the average pandas user.

Alternative 2: expose a use_nullable_dtypes argument to match https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html and similar methods. It defaults to True to avoid lossy conversions, as we have attempted to do in google-cloud-bigquery v3 -- technically another breaking change though, since we'll want to change to use Float64Dtype and StringDtype too, which we aren't currently using.

@tswast
Contributor

tswast commented Mar 23, 2023

With #1529, you'll be able to explicitly set int_dtype to None to use legacy dtypes.

@chelsea-lin
Contributor

The changes have been merged.
