BigQuery: Allow choice of compression for loading from dataframe #7701

Closed
awgymer opened this issue Apr 12, 2019 · 5 comments · Fixed by #8938
Labels
api: bigquery Issues related to the BigQuery API. good first issue This issue is a good place to start contributing to this repository. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@awgymer

awgymer commented Apr 12, 2019

pandas.DataFrame.to_parquet() allows a choice of compression, with a default of snappy. When using client.load_table_from_dataframe, the call to this function only allows the use of the default.
Is it possible to allow a choice, so that gzip can be used instead and installing snappy on the system is not necessary?

@tswast tswast added api: bigquery Issues related to the BigQuery API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. good first issue This issue is a good place to start contributing to this repository. labels Apr 12, 2019
@tswast
Contributor

tswast commented Apr 12, 2019

Thanks for the report, @awgymer. This is a great suggestion. I agree that the snappy system requirement can be painful to install. I think we can mirror the compression argument from pandas in the load_table_from_dataframe method.

Happy to accept a PR for this.
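
A minimal sketch of what mirroring the pandas argument could look like. The helper name `dataframe_to_parquet` and its signature are hypothetical and are not part of google-cloud-bigquery; the sketch only assumes the object passed in has a pandas-style `to_parquet(path, compression=...)` method.

```python
def dataframe_to_parquet(dataframe, path, compression="snappy"):
    """Hypothetical helper: write `dataframe` to `path` as Parquet.

    `compression` is forwarded unchanged to the pandas-style
    `to_parquet` method, so a caller can pass "gzip" and avoid
    needing snappy on the system. The default stays "snappy" to
    match the pandas default described in this issue.
    """
    dataframe.to_parquet(path, compression=compression)
    return path
```

With a real `pandas.DataFrame` this would be called as `dataframe_to_parquet(df, "data.parquet", compression="gzip")`, which still requires a Parquet engine such as pyarrow to be installed.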

@awgymer
Author

awgymer commented Apr 20, 2019

So on a little more looking, pandas.DataFrame.to_parquet() allows three options (four if you count uncompressed): snappy, gzip, and brotli, whilst the official docs say that:

> BigQuery supports Snappy, GZip, and LZO_1X codecs for compressed data blocks in Parquet files

So I don't know now if it is in fact more confusing to move to allowing 2/3 of the options (again uncompressed too?) rather than remaining fixed to 1 and simply making the requirement for snappy more clear in the docs?
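
The mismatch described above can be written out as a set comparison. This is only an illustration built from the two lists given in this comment; the lowercase codec names and the variable names are assumptions made for the sketch, not identifiers from either library.

```python
# Codecs pandas.DataFrame.to_parquet() accepts, as listed in the comment
# above (None stands for "uncompressed").
PANDAS_CODECS = {"snappy", "gzip", "brotli", None}

# Codecs BigQuery accepts in Parquet files, per the quoted documentation.
BIGQUERY_CODECS = {"snappy", "gzip", "lzo_1x"}

# Codecs that both sides list: the "2/3 of the options" from the comment.
usable = PANDAS_CODECS & BIGQUERY_CODECS

# Named codecs pandas offers that the quoted BigQuery docs do not mention.
pandas_only = PANDAS_CODECS - BIGQUERY_CODECS - {None}
```

Here `usable` contains snappy and gzip, and `pandas_only` contains brotli. Whether uncompressed files should also be exposed is left open, as in the comment.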

@tswast
Contributor

tswast commented Apr 24, 2019

What dependencies are needed for gzip? Is it a noticeable difference in performance? I'm open to changing the default if it's easier to install and not a noticeable difference.

Also I'm actually a bit wary of exposing compression, since I'd like to keep the option open to change the file format we serialize to. Parquet happens to be the best match to BigQuery's data types right now, but it'd be good to keep the option open to move to something else in the future.

@awgymer
Author

awgymer commented Apr 24, 2019

As far as I can tell, it uses the standard Python gzip/zlib modules, which use the zlib library. I couldn't speak for the performance difference at the moment, but I appreciate your point about not exposing the parameter, so perhaps just making it clearer in the docs that snappy is a dependency for that function would be a suitable solution?
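
As a small illustration of the dependency point, the sketch below round-trips some bytes using only the `gzip` and `zlib` modules from the Python standard library. It shows that gzip support is available without extra packages; it does not establish which gzip implementation the Parquet writer actually uses, which this thread leaves unconfirmed.

```python
import gzip
import zlib

# Arbitrary sample data for the round trip.
payload = b"example bytes " * 100

# Compress and decompress with the standard-library gzip module.
compressed = gzip.compress(payload)
restored = gzip.decompress(compressed)

# The same gzip stream can be decoded by zlib directly:
# wbits=31 (16 + 15) tells zlib to expect a gzip header and trailer.
restored_via_zlib = zlib.decompress(compressed, wbits=31)
```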

@tswast
Contributor

tswast commented Apr 24, 2019

> making it clearer that snappy is a dependency for that function in the docs would be a suitable solution?

Yeah, that seems reasonable. We already talk about needing pyarrow. We should mention snappy, too.

Related: The required system packages for snappy are one of the reasons I think conda is still relevant despite wheels, and why I publish the google-cloud-bigquery package to conda-forge.
