Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support BigQuery client or URI in database I/O functions #18547

Open
henryharbeck opened this issue Sep 4, 2024 · 4 comments
Open

Support BigQuery client or URI in database I/O functions #18547

henryharbeck opened this issue Sep 4, 2024 · 4 comments
Labels
A-io-database Area: reading/writing to databases enhancement New feature or an improvement of an existing feature

Comments

@henryharbeck
Copy link
Contributor

henryharbeck commented Sep 4, 2024

Description

At the moment writing to BigQuery requires some code that is not all that obvious, and differs significantly from every other supported database/driver. Reading data is more obvious, but could also be simplified further. It would be really nice if BigQuery could fit a bit more nicely into the existing Polars read/write database functions.

Below are some proposed options of what this could look like, compared against existing working solutions.

from io import BytesIO
import polars as pl
from google.cloud import bigquery

# setup
client = bigquery.Client()
query = "SELECT * FROM `project.dataset.table`"

### READING DATA ###

# Currently working solution - uses the BigQuery client rather than read_database[_uri]
rows = client.query_and_wait(query)
df = pl.from_arrow(rows.to_arrow())

# Currently working solution
# Has a dependency on connector-x where it is not really needed to get Arrow data
# Also forces auth via a (fully qualified) file name, which is not always desired
# Request to relax constraint raised over a year ago with connector-x
# https://github.com/sfu-db/connector-x/issues/499
authentication_file_path = "/home/path/to/credentials.json"
conn = "bigquery://" + authentication_file_path
df = pl.read_database_uri(query, conn)

# Reading - proposed option 1, use the client as the connection
df = pl.read_database(query, client)

# Reading - proposed option 2, provide a sqlalchemy-bigquery style connection string
# See https://github.com/googleapis/python-bigquery-sqlalchemy for reference
uri = "bigquery://project" # the BigQuery client is instantiated with this project
# OR
uri = "bigquery://" # the project is inferred from the environment
df = pl.read_database_uri(query, uri, engine="bigquery")


### WRITING DATA ###

df = pl.DataFrame({"foo": 1})

# Currently working solution - uses the BigQuery client rather than write_database
with BytesIO() as stream:
    df.write_parquet(stream)
    job = client.load_table_from_file(
        stream,
        destination="dataset.table_name",
        project="project_name",
        rewind=True,
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.PARQUET,
        ),
    )
job.result()

# Writing- proposed option 1, use the client as the connection
df.write_datbase("dataset.table_name", client)


# Writing - proposed option 2, provide a sqlalchemy-bigquery style connection string
uri = "bigquery://project" # the BigQuery client is instantiated with this project
# OR
uri = "bigquery://" # the project is inferred from the environment
df.write_database("dataset.table_name", uri, engine="bigquery")
@henryharbeck henryharbeck added the enhancement New feature or an improvement of an existing feature label Sep 4, 2024
@pola-rs pola-rs deleted a comment Sep 4, 2024
@henryharbeck
Copy link
Contributor Author

@alexander-beedie - I was mid way through drafting this when I saw your comment on #17326, so I figured I'd create it now. Would be keen to get your thoughts on which approach you prefer.

I would also be keen to work on this or help out with an implementation.

@henryharbeck henryharbeck changed the title Support BigQuery client in database I/O functions Support BigQuery client or URI in database I/O functions Sep 4, 2024
@alexander-beedie alexander-beedie added the A-io-database Area: reading/writing to databases label Sep 4, 2024
@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Sep 4, 2024

@alexander-beedie - I was mid way through drafting this when I saw your comment on #17326, so I figured I'd create it now. Would be keen to get your thoughts on which approach you prefer.

I'll poke at it for a bit and see what appears to work best; my feeling is we likely want to accept the BigQuery Client object itself in read_database as if it was any other connection, and then shim/proxy any differences arising from that internally (as we do with ODBCCursorProxy and SurrealDBCursorProxy) 🤔

Luckily I have access to a decent-sized BigQuery instance, so I should be able to try a few things out in the near-ish future! At that point some help testing, validating, improving, etc would definitely be appreciated ✌️

@henryharbeck
Copy link
Contributor Author

henryharbeck commented Sep 4, 2024

Have filled out the rest of the issue now 😄

I'll poke at it for a bit and see what appears to work best; my feeling is we likely want to accept the BigQuery Client object itself in read_database

Cool. The URI option also seems viable for both read and write. Saves the user instantiating the client if all they are going to do is pass it to the function. It can be created for them. Obviously if the user wants more control over how to authenticate or which project to bill, they can provide the client.

So, perhaps both could be supported? This would be similar to how I can a give a Postgres URI to read_database_uri and write_database, and a SQLAlchemy connection wrapping that same URI to read_database and write_database. Except in this particular BigQuery case, read_database_uri and read_database should be as performant as each other.

I should be able to try a few things out in the near-ish future! At that point some help testing, validating, improving, etc would definitely be appreciated ✌️

Sounds good! And appreciate the prompt response 😀

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Sep 4, 2024

So, perhaps both could be supported?

Yup, could make sense 👌

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-io-database Area: reading/writing to databases enhancement New feature or an improvement of an existing feature
Projects
None yet
Development

No branches or pull requests

2 participants