Fail to create datasets from a generator when using Google Big Query #5750
Comments
from datasets import Dataset
from google.cloud import bigquery

client = bigquery.Client()

def gen()
    # Perform a query.
    QUERY = (
        'SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013` '
        'WHERE state = "TX" '
        'LIMIT 100')
    query_job = client.query(QUERY)  # API request
    yield from query_job.result()  # Waits for query to finish

ds = Dataset.from_generator(rows)

for r in ds:
    print(r)

@mariosasko your code was incomplete, so I tried to fix it:

from datasets import Dataset
from google.cloud import bigquery

client = bigquery.Client()

def gen():
    # Perform a query.
    QUERY = (
        'SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013` '
        'WHERE state = "TX" '
        'LIMIT 100')
    query_job = client.query(QUERY)  # API request
    yield from query_job.result()  # Waits for query to finish

ds = Dataset.from_generator(gen)

for r in ds:
    print(r)

The error is also present in this case:
I think it doesn't matter whether the generator is an object or a function. The problem is that the generator references an object that is not picklable (the client in this case).
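
To make the pickling point concrete, here is a minimal standard-library sketch. `FakeClient` is a hypothetical stand-in for `bigquery.Client` (it just holds a `threading.Lock`, which cannot be pickled), and `is_picklable` is a helper name made up for this example. This is not how `datasets` computes its hash internally; it only shows the general failure mode.

```python
import pickle
import threading


class FakeClient:
    """Hypothetical stand-in for a client that holds an unpicklable resource."""

    def __init__(self):
        self._lock = threading.Lock()  # lock objects cannot be pickled


def is_picklable(obj):
    """Return True if pickle.dumps(obj) succeeds, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False


print(is_picklable({"state": "TX"}))  # True: plain data pickles fine
print(is_picklable(FakeClient()))     # False: the lock inside blocks pickling
```

Any hashing scheme that serializes its inputs first will hit the same wall as soon as one of those inputs holds a resource like this.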
It does matter: this function expects a generator function, as stated in the docs. This should work:

from datasets import Dataset
from google.cloud import bigquery

def gen():
    client = bigquery.Client()
    # Perform a query.
    QUERY = (
        'SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013` '
        'WHERE state = "TX" '
        'LIMIT 100')
    query_job = client.query(QUERY)  # API request
    yield from query_job.result()  # Waits for query to finish

ds = Dataset.from_generator(gen)

for r in ds:
    print(r)

We could allow passing non-picklable objects and use a random hash for the generated Arrow file. In that case, the caching mechanism would not work: repeated calls with the same set of arguments would generate new datasets instead of reusing the cached version. Still, this behavior is better than raising an error.
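
The random-hash fallback proposed above could look roughly like the sketch below. The function name `fingerprint_or_random` and the use of plain `pickle` plus SHA-256 are assumptions made for illustration; the actual `datasets` hashing code is different, so read this as the shape of the idea rather than a patch.

```python
import hashlib
import pickle
import threading
import uuid


def fingerprint_or_random(obj):
    """Return (hash, cacheable).

    Deterministic hash if obj can be pickled; otherwise a random one,
    which means the result can never match a previously cached entry.
    """
    try:
        payload = pickle.dumps(obj)
    except Exception:
        # Not picklable: fall back to a random identifier, caching is lost.
        return uuid.uuid4().hex, False
    return hashlib.sha256(payload).hexdigest(), True


# Picklable arguments: same input gives the same hash, so a cache can be reused.
args = {"state": "TX", "limit": 100}
print(fingerprint_or_random(args) == fingerprint_or_random(args))

# Unpicklable argument (a lock as a stand-in): a fresh random hash each call.
lock = threading.Lock()
print(fingerprint_or_random(lock)[0] != fingerprint_or_random(lock)[0])
```

The trade-off is the one described in the comment: the call no longer fails, but each run with an unpicklable input writes a new file instead of hitting the cache.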
Thank you @mariosasko. Your last code does indeed work. Curiously, the important detail here was to wrap the client instantiation within the generator itself. If the line
I now also see your point regarding the generator being a generator function. We can close the issue if you want.
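
A possible explanation for why moving the client inside the generator helps, sketched with stand-ins: a function that uses a module-level `client` carries a reference to that global, while a function that builds its own client does not. As far as I understand, a serializer that pickles functions by value (which `dill` can do) has to capture those referenced globals too. `FakeClient`, `gen_global` and `gen_local` are hypothetical names for this example, and the rows are made up.

```python
class FakeClient:
    """Hypothetical stand-in for bigquery.Client; returns canned rows."""

    def query(self, sql):
        return [{"name": "Mary"}, {"name": "James"}]


client = FakeClient()  # module-level client, as in the failing snippet


def gen_global():
    # Refers to the global `client`, so the function depends on that object.
    yield from client.query("SELECT name ...")


def gen_local():
    # Builds the client when the generator runs; no global client is referenced.
    local_client = FakeClient()
    yield from local_client.query("SELECT name ...")


# co_names lists the global and attribute names the function body refers to.
print("client" in gen_global.__code__.co_names)  # True
print("client" in gen_local.__code__.co_names)   # False
print(list(gen_global()) == list(gen_local()))   # True: same rows either way
```

Both generators yield the same rows; only the second avoids dragging the client object into whatever gets hashed.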
Describe the bug
Creating a dataset from a generator using Dataset.from_generator() fails if the generator uses the Google Big Query Python client. The problem is that the Big Query client is not picklable, and the function create_config_id tries to compute a hash of the generator by pickling it. So the following error is generated:
Steps to reproduce the bug
pip install google-cloud-bigquery datasets
Expected behavior
Two options:
Environment info
python 3.9
google-cloud-bigquery 3.9.0
datasets 2.11.0