
Fail to create datasets from a generator when using Google Big Query #5750

Closed
ivanprado opened this issue Apr 14, 2023 · 4 comments

Comments

@ivanprado

Describe the bug

Creating a dataset from a generator using Dataset.from_generator() fails if the generator relies on the Google BigQuery Python client. The problem is that the BigQuery client is not picklable, and the function create_config_id tries to compute a hash of the generator by pickling it. So the following error is raised:

_pickle.PicklingError: Pickling client objects is explicitly not supported.
Clients have non-trivial state that is local and unpickleable.

Steps to reproduce the bug

  1. Install the BigQuery client and datasets: pip install google-cloud-bigquery datasets
  2. Run the following code:
from datasets import Dataset
from google.cloud import bigquery

client = bigquery.Client()

# Perform a query.
QUERY = (
    'SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013` '
    'WHERE state = "TX" '
    'LIMIT 100')
query_job = client.query(QUERY)  # API request
rows = query_job.result()  # Waits for query to finish

ds = Dataset.from_generator(rows)

for r in ds:
    print(r)

Expected behavior

Two options:

  1. Ignore pickling errors when computing the hash.
  2. Provide an escape hatch so that we can avoid computing the hash for the generator, for example by allowing the user to supply a hash.
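
As an illustration of option 2 (the names here are hypothetical, not part of the datasets API): the user could derive a stable fingerprint from the query text alone, so caching would still work even though the client itself cannot be hashed:

```python
import hashlib


def query_fingerprint(sql: str) -> str:
    """Hypothetical user-supplied hash: fingerprint the SQL text,
    ignoring the unpicklable client object entirely."""
    return hashlib.sha256(sql.encode("utf-8")).hexdigest()[:16]


# Identical queries yield identical fingerprints, so the cache could be reused.
print(query_fingerprint("SELECT 1") == query_fingerprint("SELECT 1"))  # True
```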

Environment info

python 3.9
google-cloud-bigquery 3.9.0
datasets 2.11.0

@mariosasko
Collaborator

from_generator expects a generator function, not a generator object, so this should work:

from datasets import Dataset
from google.cloud import bigquery

client = bigquery.Client()

def gen()
    # Perform a query.
    QUERY = (
        'SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013` '
         'WHERE state = "TX" '
         'LIMIT 100')
     query_job = client.query(QUERY)  # API request
     yield from query_job.result()  # Waits for query to finish

ds = Dataset.from_generator(rows)

for r in ds:
    print(r)

@ivanprado
Author

@mariosasko your code was incomplete, so I tried to fix it:

from datasets import Dataset
from google.cloud import bigquery

client = bigquery.Client()

def gen():
    # Perform a query.
    QUERY = (
        'SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013` '
        'WHERE state = "TX" '
        'LIMIT 100')
    query_job = client.query(QUERY)  # API request
    yield from query_job.result()  # Waits for query to finish

ds = Dataset.from_generator(gen)

for r in ds:
    print(r)

The error is also present in this case:

_pickle.PicklingError: Pickling client objects is explicitly not supported.
Clients have non-trivial state that is local and unpickleable.

I think it doesn't matter whether the generator is an object or a function. The problem is that the generator references an object that is not picklable (the client, in this case).

@mariosasko
Collaborator

It does matter: this function expects a generator function, as stated in the docs.

This should work:

from datasets import Dataset
from google.cloud import bigquery

def gen():
    client = bigquery.Client()
    # Perform a query.
    QUERY = (
        'SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013` '
        'WHERE state = "TX" '
        'LIMIT 100')
    query_job = client.query(QUERY)  # API request
    yield from query_job.result()  # Waits for query to finish

ds = Dataset.from_generator(gen)

for r in ds:
    print(r)

We could allow passing non-picklable objects and use a random hash for the generated arrow file. In that case, the caching mechanism would not work, meaning repeated calls with the same set of arguments would generate new datasets instead of reusing the cached version, but this behavior is still better than raising an error.
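
A rough sketch of that fallback (fingerprint_or_random is a hypothetical helper, not the actual datasets implementation): hash the pickle when possible, otherwise fall back to a random fingerprint, which disables caching but avoids the error:

```python
import hashlib
import pickle
import threading
import uuid


def fingerprint_or_random(obj) -> str:
    """Hypothetical fallback: deterministic hash for picklable objects,
    random fingerprint (no cache reuse) for unpicklable ones."""
    try:
        return hashlib.sha256(pickle.dumps(obj)).hexdigest()[:16]
    except Exception:
        return uuid.uuid4().hex[:16]


# Picklable input: stable fingerprint, so caching works.
print(fingerprint_or_random((1, 2)) == fingerprint_or_random((1, 2)))  # True

# Unpicklable input: a fresh fingerprint on every call, so no cache hits.
lock = threading.Lock()
print(fingerprint_or_random(lock) == fingerprint_or_random(lock))  # False
```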

@ivanprado
Author

Thank you @mariosasko. Your last snippet does work. Curiously, the important detail was to wrap the client instantiation inside the generator function itself: if the line client = bigquery.Client() is moved outside, the error comes back.

I also see your point now about from_generator expecting a generator function. We can close the issue if you want.
