BigQuery: DB-API is very slow #9185

Closed
haibin opened this issue Sep 6, 2019 · 6 comments · Fixed by #9199

haibin commented Sep 6, 2019

DB-API is very slow.

google-cloud-bigquery version: 1.19.0

from datetime import datetime

from google.cloud import bigquery
from google.cloud.bigquery import dbapi

client = bigquery.Client()
conn = dbapi.Connection(client)
curr = conn.cursor()

# Time the raw client API: run the query and wait for all results.
start = datetime.now()
QUERY = """SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013` LIMIT 100"""
query_job = client.query(QUERY)
query_job.result()
print('API', datetime.now() - start)

# Time the same query through the DB-API cursor.
start = datetime.now()
curr.execute(QUERY)
result = curr.fetchall()
print('DB-API', datetime.now() - start)

Output

API 0:00:01.623182
DB-API 0:01:36.157141
@tseaver added the api: bigquery, performance, and type: question labels on Sep 6, 2019

tswast commented Sep 9, 2019

Peter, can you look into this? 1 minute versus 1 second is quite the difference!

I don't know why this would be the case, as the DB-API should be creating a QueryJob behind the scenes, but maybe there's something we're doing wrong to wait for results (such as sleeping between requests or something)?

plamut commented Sep 10, 2019

This is indeed quite a difference; I will check. I can confirm that the issue is reproducible.

Update: The reason is that results are requested one at a time, because the page size is set to 1, meaning that 100 requests are made to the backend.
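
For reference, that page size comes straight from the cursor's arraysize attribute. A simplified sketch of what the cursor does internally when fetching results (an approximation, not the exact library code):

# Approximation of Cursor._try_fetch in google-cloud-bigquery 1.19.0:
rows_iter = client.list_rows(
    query_job.destination,       # temporary table holding the query results
    page_size=cursor.arraysize,  # defaults to 1 -> one backend request per row
)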

plamut commented Sep 10, 2019

If using a cursor directly, one should set the arraysize attribute on it:

curr.execute(QUERY)
curr.arraysize = 100  # <-- THIS
result = curr.fetchall()

The default value is 1, as specified in PEP 249, meaning that if the attribute is not explicitly set, only one row at a time is fetched from the backend. There is also a note on this in the fetchall() documentation.

PEP 249 also specifies a fetchmany() method with an optional size parameter, but the BigQuery implementation ignores it and instead requires explicitly setting the aforementioned arraysize attribute.
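
For example, with the implementation at the time, a call like the following returns up to 50 rows, but the page size of the underlying requests is still taken from arraysize:

curr.execute(QUERY)
# Returns up to 50 rows, but with arraysize left at its default of 1,
# each row is still fetched from the backend in a separate request.
rows = curr.fetchmany(50)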

@tswast Do you know the reason why the size parameter is ignored in fetchmany()'s helper method _try_fetch()? Adding support for that seems straightforward.

Also, the fetchall() documentation should at least mention this aspect, since it is quite easy to introduce a performance issue by accident with the default settings. I will classify this as a docs issue for now and open a PR to fix it.

@plamut added the type: docs label and removed the type: question label on Sep 10, 2019

tswast commented Sep 10, 2019

Do you know the reason why the size parameter is ignored in fetchmany()'s helper method _try_fetch()? Adding support for that seems straightforward.

I don't recall the reason. Possibly, I just didn't see the size parameter?

tswast commented Sep 10, 2019

Oh, now I think I remember. I think it's because we call list_rows before anyone even gets to make a call to fetchmany(). We'd have to implement our own pagination (multiple calls to list_rows, manually populating the pagination token each time) to support the size parameter (which is possible, but was more than I was willing to do at the time).
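
For reference, a minimal sketch of what that manual pagination could look like, using list_rows's page_size and page_token parameters (a hypothetical helper, not the library's actual code):

def fetch_pages(client, table, size):
    """Hypothetical helper: yield pages of `size` rows by paginating
    list_rows manually via page tokens."""
    token = None
    while True:
        rows_iter = client.list_rows(table, page_size=size, page_token=token)
        page = next(rows_iter.pages)  # pull exactly one page from the iterator
        yield list(page)
        token = rows_iter.next_page_token  # token for the following page, if any
        if token is None:
            break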

plamut commented Sep 11, 2019

As discussed on the PR, setting the default page size to None (to let the backend choose it) is preferable to merely documenting the arraysize attribute, so this is no longer a docs-type issue.
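
Assuming the change lands as described, the original reproduction should then be fast without touching arraysize at all (a sketch of the expected post-fix behavior):

curr = conn.cursor()
curr.execute(QUERY)
# With arraysize left unset, the backend chooses the page size,
# so fetchall() no longer issues one request per row.
result = curr.fetchall()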

@plamut removed the type: docs label on Sep 11, 2019
@plamut added the priority: p2 and type: bug labels and removed the triage me and 🚨 This issue needs some love. labels on Sep 11, 2019