Use BigQuery Dataframes as Read-Connector to BigQuery #17326
Comments
Thanks @OELSJAN for the feature request! I'm curious to hear if there's anything we can do in BigQuery DataFrames to make these filter pushdown features easier to implement. Also, for your awareness, we have on our backlog a request to make an official polars connector for BigQuery (watch issue googleapis/python-bigquery#1979 for updates). I suspect that might be a good place to implement such optimizations, as a separate package could make dependencies a little easier to manage. See also the request for some polars support on the BigQuery DataFrames repo (googleapis/python-bigquery-dataframes#735), which is mostly focused on the I/O piece, similar to this request.
A question: are there other I/O methods that support pushdown to the storage layer? I'm curious what hooks are available for such functionality. Edit: Two reasons for asking: (1) it'd be lovely to hook into the existing optimizations somehow via some extension mechanism (note that many of these, such as row filters and column filters, are supported via the BQ Storage Read API) and (2) it'd be great to introduce even more pushdown types, as BigQuery DataFrames supports aggregations, joins (to other BigQuery data sources, or even to local data if it is uploaded to a temp table or small enough to inline in SQL), and more.
@tswast: Yes - for example, the Polars Iceberg integration supports various pushdown optimisations including predicates, range queries, and suchlike 👍 On a side-note, I've been meaning to look at integrating the BigQuery Client object as a valid connection type for our
Very cool! From what I've heard from folks (e.g. googleapis/python-bigquery#1979), it's important to avoid unnecessary dependencies on pyarrow, so it's good to see that How stable is this interface? I'm curious if this sort of connector is best contributed directly to the polars package or should be provided by a separate package (similar to how pandas refactored BigQuery support out into pandas-gbq years ago). Since this proposed feature is to go beyond predicates to potentially turning aggregates and such into BigQuery queries, maybe it's best to stick to the pola-rs/polars repo for now, as that scan functionality improves/extends?
I've started a BigQuery + polars gist with some ideas. I'll try to keep that up to date as I experiment with reads and writes. The first experiment, bigquery-to-polars-no-pyarrow-ipynb, is a barebones read API that doesn't require This could be extended further to support a Edit: Note that making this work for queries could go a few ways. It's complicated because we're dealing with multiple APIs: the BigQuery REST API for queries and the BigQuery Storage Read API for tables:
That said, maybe query support is not necessary, since we will have
I am really overwhelmed by the stone I have set rolling here, and I am delighted that this topic is now being addressed. Can't wait to process BigQuery data in a more native way.
Description
BigQuery launched a feature named "BigQuery Dataframes":
With this you can execute pandas operations directly on the BigQuery engine. So maybe this API could be used to implement a better connector to BigQuery, one that also supports lazy optimizations like filter pushdown, instead of using from_arrow with a hardcoded query executed by the Python BigQuery client.
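As a rough sketch of the approach described above: the query text is fixed up front, run through the Python BigQuery client, and the Arrow result is handed to polars with from_arrow. The `build_query` and `read_to_polars` helpers and the table name are hypothetical; `client` is assumed to be a `google.cloud.bigquery.Client`, and pyarrow is assumed to be installed for `to_arrow()`.

```python
def build_query(table, columns=None, where=None):
    """Assemble the SQL by hand; any 'pushdown' must be written into the string."""
    select = ", ".join(columns) if columns else "*"
    sql = f"SELECT {select} FROM `{table}`"
    if where:
        sql += f" WHERE {where}"
    return sql

def read_to_polars(client, sql):
    """Run the query and convert the Arrow result into a polars DataFrame.

    Assumes `client` is a google.cloud.bigquery.Client.
    """
    import polars as pl  # imported lazily so build_query works without polars

    return pl.from_arrow(client.query(sql).to_arrow())

# Hypothetical table and column names, for illustration only.
sql = build_query(
    "my-project.my_dataset.my_table",
    columns=["name", "year"],
    where="year > 2000",
)
print(sql)
```

Because the query is a plain string, a filter applied to the resulting DataFrame afterwards runs locally on data that has already been downloaded. That is the gap a lazy connector with filter pushdown would close.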