BigQuery: DB-API is very slow #9185

Closed
haibin opened this issue Sep 6, 2019 · 6 comments · Fixed by #9199

haibin commented Sep 6, 2019

DB-API is very slow.

google-cloud-bigquery version: 1.19.0

from datetime import datetime

from google.cloud import bigquery
from google.cloud.bigquery import dbapi

client = bigquery.Client()
conn = dbapi.Connection(client)
curr = conn.cursor()

# Time the raw client API: run the query and wait for all results.
start = datetime.now()
QUERY = """SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013` LIMIT 100"""
query_job = client.query(QUERY)
query_job.result()
print('API', datetime.now() - start)

# Time the same query through the DB-API cursor.
start = datetime.now()
curr.execute(QUERY)
result = curr.fetchall()
print('DB-API', datetime.now() - start)

Output

API 0:00:01.623182
DB-API 0:01:36.157141
@tseaver added the api: bigquery, performance, and type: question labels on Sep 6, 2019

tswast commented Sep 9, 2019

Peter, can you look into this? 1 minute versus 1 second is quite the difference!

I don't know why this would be the case, as the DB-API should be creating a QueryJob behind the scenes, but maybe there's something we're doing wrong to wait for results (such as sleeping between requests or something)?

plamut commented Sep 10, 2019

This is indeed quite a difference; I will check. I can confirm that the issue is reproducible.

Update: The reason is that results are requested one at a time, because the page size is set to 1, meaning that 100 requests are made to the backend.
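
For reference, that page size comes straight from the cursor's arraysize attribute. A simplified sketch of what the cursor does internally when fetching results (an approximation, not the exact library code):

# Approximation of Cursor._try_fetch in google-cloud-bigquery 1.19.0:
rows_iter = client.list_rows(
    query_job.destination,       # temporary table holding the query results
    page_size=cursor.arraysize,  # defaults to 1 -> one backend request per row
)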

plamut commented Sep 10, 2019

If using a cursor directly, one should set the arraysize attribute on it:

curr.execute(QUERY)
curr.arraysize = 100  # <-- THIS
result = curr.fetchall()

The default value is 1, as specified in PEP 249, meaning that if the attribute is not explicitly set, only one row at a time is fetched from the backend. There is also a note on this in the fetchall() documentation.

PEP 249 also specifies a fetchmany() method with an optional size parameter, but the BigQuery implementation ignores it and instead requires explicitly setting the aforementioned arraysize attribute.
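
For example, with the implementation at the time, a call like the following returns up to 50 rows, but the page size of the underlying requests is still taken from arraysize:

curr.execute(QUERY)
# Returns up to 50 rows, but with arraysize left at its default of 1,
# each row is still fetched from the backend in a separate request.
rows = curr.fetchmany(50)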

@tswast Do you know the reason why the size parameter is ignored in fetchmany()'s helper method _try_fetch()? Adding support for that seems straightforward.

Also, the fetchall() documentation should at least mention this aspect, since it is quite easy to introduce a performance issue by accident with the default settings. I will classify this as a docs issue for now and open a PR to fix it.

@plamut added the type: docs label and removed the type: question label on Sep 10, 2019

tswast commented Sep 10, 2019

Do you know the reason why the size parameter is ignored in fetchmany()'s helper method _try_fetch()? Adding support for that seems straightforward.

I don't recall the reason. Possibly, I just didn't see the size parameter?

tswast commented Sep 10, 2019

Oh, now I think I remember. I think it's because we call list_rows before anyone even gets to make a call to fetchmany(). We'd have to implement our own pagination (multiple calls to list_rows, manually populating the pagination token each time) to support the size parameter (which is possible, but was more than I was willing to do at the time).
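
For reference, a minimal sketch of what that manual pagination could look like, using list_rows's page_size and page_token parameters (a hypothetical helper, not the library's actual code):

def fetch_pages(client, table, size):
    """Hypothetical helper: yield pages of `size` rows by paginating
    list_rows manually via page tokens."""
    token = None
    while True:
        rows_iter = client.list_rows(table, page_size=size, page_token=token)
        page = next(rows_iter.pages)  # pull exactly one page from the iterator
        yield list(page)
        token = rows_iter.next_page_token  # token for the following page, if any
        if token is None:
            break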

plamut commented Sep 11, 2019

As discussed on the PR, setting the default page size to None (to let the backend choose it) is preferable to merely documenting the arraysize attribute, so this is no longer a docs-type issue.
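
Assuming the change lands as described, the original reproduction should then be fast without touching arraysize at all (a sketch of the expected post-fix behavior):

curr = conn.cursor()
curr.execute(QUERY)
# With arraysize left unset, the backend chooses the page size,
# so fetchall() no longer issues one request per row.
result = curr.fetchall()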

@plamut removed the type: docs label on Sep 11, 2019
@plamut added the priority: p2 and type: bug labels and removed the triage me and 🚨 This issue needs some love. labels on Sep 11, 2019