perf: DB-API uses more efficient `query_and_wait` when no job ID is provided #1747

tswast · 2023-12-08T01:36:50Z

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
Ensure the tests and linter pass
Code coverage does not decrease (if any source code was changed)
Appropriate docs were updated (if necessary)

Fixes #1745 🦕

tswast · 2023-12-12T16:21:23Z

I tested this manually in ipython as well, since it seems we don't have many (any?) system tests or samples in this repo that exercise this functionality.

# In [1]:
import google.cloud.bigquery.dbapi as bqdb
conn = bqdb.connect()
cur = conn.cursor()
cur.execute("SELECT 1")
cur.fetchall()
# Out[1]: [Row((1,), {'f0_': 0})]

# In [2]:
cur._query_rows._should_use_bqstorage(None, create_bqstorage_client=True)
# Out[2]: False

# In [4]:
cur.execute("SELECT name, SUM(`number`) AS total FROM `bigquery-public-data.usa_names.usa_1910_2013` GROUP BY name")
cur._query_rows._should_use_bqstorage(None, create_bqstorage_client=True)
# Out[4]: False

# In [5]:
r = cur.fetchall()
len(r)
# Out[5]: 29828

# In [7]:
cur.execute("SELECT name, number, year FROM `bigquery-public-data.usa_names.usa_1910_2013`")
cur._query_rows._should_use_bqstorage(None, create_bqstorage_client=True)
# Out[7]: True

# In [8]:
r = cur.fetchall()
len(r)
# Out[8]: 5552452
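
To make the False/False/True pattern above easier to read, here is a rough sketch of the kind of decision involved: skip the alternative download path when the first page of results already holds all or most of the rows. This is not the library's code. The function names, the `ratio` cutoff, and the first-page row count assumed for the large query are hypothetical and chosen only for illustration.

```python
from typing import Optional


def is_mostly_cached(
    cached_rows: int,
    total_rows: Optional[int],
    has_more_pages: bool,
    ratio: float = 1 / 3,  # hypothetical cutoff, not taken from the library
) -> bool:
    """Return True when the first page already holds most of the result."""
    if not has_more_pages:
        # Everything arrived in the first response.
        return True
    if total_rows is None:
        # Without a total there is no way to tell how much is left.
        return False
    return cached_rows >= total_rows * ratio


def prefer_storage_api(
    cached_rows: int, total_rows: Optional[int], has_more_pages: bool
) -> bool:
    """Only reach for a bulk download path when most rows are still remote."""
    return not is_mostly_cached(cached_rows, total_rows, has_more_pages)


# Row totals come from the session above; the assumption that the first two
# results fit in one page, and the 100_000-row first page, are made up.
print(prefer_storage_api(1, 1, has_more_pages=False))
print(prefer_storage_api(29828, 29828, has_more_pages=False))
print(prefer_storage_api(100_000, 5552452, has_more_pages=True))
```

Under those assumptions the three calls line up with the three outputs observed above (False, False, True).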

I have also tested these same queries with SQLAlchemy to make sure this doesn't somehow break the connector.

# In [1]:
from sqlalchemy import *
from sqlalchemy.engine import create_engine
from sqlalchemy.schema import *
engine = create_engine('bigquery://swast-scratch')

# In [2]:
table = Table(
    'usa_1910_2013',
    MetaData(bind=engine),
    autoload=True,
    schema='bigquery-public-data.usa_names',
)
select([func.count('*')], from_obj=table).scalar()
# Out[2]: 5552452

# In [3]:
len(select(
    [table.c.name, func.sum(table.c.number).label('total')],
    from_obj=table
).group_by(table.c.name).execute().all())
# Out[3]: 29828

# In [4]:
len(select(
    [table.c.name, table.c.number, table.c.year],
    from_obj=table
).execute().all())
# Out[4]: 5552452

For sanity, I checked that there is a speedup when using this change:

After this change:

# In [1]:
import google.cloud.bigquery.dbapi as bqdb
conn = bqdb.connect()
cur = conn.cursor()

# In [2]:
%%timeit -n10 -r10
cur.execute("SELECT 1")
r = cur.fetchall()
# Out [2]:
# 319 ms ± 19.5 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

# In [3]:
%%timeit -n5 -r4
cur.execute("SELECT name, SUM(`number`) AS total FROM `bigquery-public-data.usa_names.usa_1910_2013` GROUP BY name")
cur.fetchall()
# Out [3]:
# 1.63 s ± 80.3 ms per loop (mean ± std. dev. of 4 runs, 5 loops each)

Before this change:

# In [1]:
import google.cloud.bigquery.dbapi as bqdb
conn = bqdb.connect()
cur = conn.cursor()

# In [2]:
%%timeit -n10 -r10
cur.execute("SELECT 1")
r = cur.fetchall()
# Out [2]:
# 1.18 s ± 73.2 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

# In [3]:
%%timeit -n5 -r4
cur.execute("SELECT name, SUM(`number`) AS total FROM `bigquery-public-data.usa_names.usa_1910_2013` GROUP BY name")
cur.fetchall()
# Out [3]:
# 1.67 s ± 62.2 ms per loop (mean ± std. dev. of 4 runs, 5 loops each)

This means that small query results (SELECT 1) have a (1.18 / 0.319) = 3.7x speedup! For medium-sized results/queries, the gain is much smaller at (1.67 / 1.63) = 1.02x, which is within the measurement noise and not statistically significant.
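
The ratios can be recomputed from the `%%timeit` means reported above. This is just the arithmetic, with the numbers copied from the two sessions; the `speedup` helper and the dictionary labels are names chosen here for illustration.

```python
# Mean wall-clock times in seconds, copied from the %%timeit output above.
timings = {
    "SELECT 1": {"before": 1.18, "after": 0.319},
    "GROUP BY name": {"before": 1.67, "after": 1.63},
}


def speedup(before: float, after: float) -> float:
    """How many times faster the 'after' run is compared to 'before'."""
    return before / after


for label, t in timings.items():
    print(f"{label}: {speedup(t['before'], t['after']):.2f}x")
# SELECT 1: 3.70x
# GROUP BY name: 1.02x
```

For the GROUP BY query, the 0.04 s difference between the means is smaller than either run's reported standard deviation (80.3 ms and 62.2 ms), which is why it should not be read as a real improvement.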

Linchin · 2023-12-17T05:45:09Z

google/cloud/bigquery/table.py

@@ -1635,7 +1647,10 @@ def _is_almost_completely_cached(self):
        This is useful to know, because we can avoid alternative download
        mechanisms.
        """
-        if self._first_page_response is None:
+        if (
+            not hasattr(self, "_first_page_response")


Since line 1591 sets self._first_page_response = first_page_response, won't this attribute always exist? Maybe we can just check whether the value is None.

tswast · 2023-12-19

This was also needed for some tests where we have a mock row iterator but want to exercise the real implementation of this method.

Linchin · 2023-12-17T05:56:11Z

google/cloud/bigquery/table.py

-        if self.next_page_token is not None:
+        # The developer has already started paging through results if
+        # next_page_token is set.
+        if hasattr(self, "next_page_token") and self.next_page_token is not None:


Just for my education: it looks like the attribute next_page_token is inherited from the "grandparent" class Iterator in the core library, which creates this attribute at init. Is it necessary to check whether this attribute exists?

tswast · 2023-12-19

This was purely for some failing unit tests where this superclass was mocked out.
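
The situation described in both replies can be reproduced with `unittest.mock` alone. The sketch below uses a hypothetical `PagedIterator` class and `is_cached` method as stand-ins (they are not the classes from this PR) to show why `hasattr` guards matter when a real method is called on a spec'd mock: attributes assigned only in `__init__` are not part of the class, so the mock does not have them.

```python
from unittest import mock


class PagedIterator:
    """Hypothetical stand-in for a paged iterator; not the real RowIterator."""

    def __init__(self, first_page_response=None):
        # Both attributes exist only on instances, never on the class itself.
        self._first_page_response = first_page_response
        self.next_page_token = None

    def is_cached(self):
        # Guarded the same way as the diff above, so a mock that skipped
        # __init__ does not raise AttributeError here.
        if (
            not hasattr(self, "_first_page_response")
            or self._first_page_response is None
        ):
            return False
        if hasattr(self, "next_page_token") and self.next_page_token is not None:
            return False
        return True


# An autospec'd instance never runs __init__, so the instance attributes
# are missing and plain attribute access would raise AttributeError.
fake = mock.create_autospec(PagedIterator, instance=True)
print(hasattr(fake, "_first_page_response"))  # False
print(hasattr(fake, "next_page_token"))  # False

# Call the real method with the mock standing in for `self`.
print(PagedIterator.is_cached(fake))  # False

# A normally constructed instance still takes the ordinary path.
print(PagedIterator(first_page_response={"rows": []}).is_cached())  # True
```

Without the `hasattr` checks, the call on `fake` would fail before reaching any of the logic under test.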

Linchin · 2023-12-17T05:58:18Z

Thank you Tim for the timely fix! LGTM, except for some nits.

chalmerlowe

LGTM

tswast added 30 commits November 15, 2023 14:19

perf: use the first page a results when query(api_method="QUERY")

b8c583a

add tests

6a8059d

respect max_results with cached page

1f0e38e

respect page_size, also avoid bqstorage if almost fully downloaded

4401725

skip true test if bqstorage not installed

d078941

Merge remote-tracking branch 'origin/main' into issue589-RowIterator.…

660aa76

…_is_completely_cached

coverage

476bcd7

Merge remote-tracking branch 'origin/main' into issue589-RowIterator.…

05d6a3e

…_is_completely_cached

feat: add Client.query_and_wait which directly returns a `RowIterat…

c16e4be

…or` of results Set the `QUERY_PREVIEW_ENABLED=TRUE` environment variable to use this with the new JOB_CREATION_OPTIONAL mode (currently in preview).

implement basic query_and_wait and add code sample to test

222f91b

avoid duplicated QueryJob construction

73e5817

update unit tests

9508121

Merge remote-tracking branch 'origin/main' into issue589-query_and_wait

543481d

fix merge conflict in rowiterator

85f1cab

support max_results, add tests

c0e6c86

retry tests

d4a322d

Merge remote-tracking branch 'origin/main' into issue589-query_and_wait

e0b2d2e

unit test coverage

9daccbd

Merge remote-tracking branch 'origin/main' into issue589-query_and_wait

bba36d2

dont retry twice

adf0b49

Merge remote-tracking branch 'origin/main' into issue589-query_and_wait

6dfbf92

fix mypy_samples session

765a644

consolidate docstrings for query_and_wait

e461ebe

remove mention of job ID

895b6d0

fallback to jobs.insert for unsupported features

d5345cd

distinguish API timeout from wait timeout

f75d8ab

add test for jobs.insert fallback

baff9d6

populate default job config

221898d

refactor default config

18f825a

Merge remote-tracking branch 'origin/main' into issue589-query_and_wait

5afbc41

tswast requested review from chalmerlowe and Linchin and removed request for obada-ab December 11, 2023 19:49

tswast added 5 commits December 11, 2023 13:50

fix unit tests

4771602

unit test coverage

b435e41

more coverage

64bf4a4

coverage for real

225049e

Merge branch 'main' into b1745-dbapi-query_and_wait

5154866

tswast mentioned this pull request Dec 12, 2023

dbapi: Skip storage client fetch when results cached #1745

Closed

tswast and others added 2 commits December 14, 2023 12:06

Merge branch 'main' into b1745-dbapi-query_and_wait

3b60b5b

Merge branch 'main' into b1745-dbapi-query_and_wait

83679ab

Linchin reviewed Dec 17, 2023

Linchin self-requested a review December 17, 2023 05:58

Linchin approved these changes Dec 17, 2023

chalmerlowe approved these changes Dec 18, 2023

Merge branch 'main' into b1745-dbapi-query_and_wait

0bd35e5

tswast added the automerge label (Merge the pull request once unit tests and other checks pass.) Dec 19, 2023

gcf-merge-on-green bot merged commit d225a94 into main Dec 19, 2023
21 of 22 checks passed

gcf-merge-on-green bot deleted the b1745-dbapi-query_and_wait branch December 19, 2023 22:00

gcf-merge-on-green bot removed the automerge label (Merge the pull request once unit tests and other checks pass.) Dec 19, 2023

release-please bot mentioned this pull request Dec 19, 2023

chore(main): release 3.15.0 #1752

Merged

hsheth2 mentioned this pull request Jan 10, 2024

fix(ingest/bigquery): support google-cloud-bigquery 3.15.0 datahub-project/datahub#9595

Merged

This was referenced May 21, 2024

Use faster query_and_wait method with %%bigquery magics googleapis/python-bigquery-magics#31

Closed

feat: use faster query_and_wait method from google-cloud-bigquery when available googleapis/python-bigquery-pandas#722

Merged

This was referenced Jan 17, 2024

January 15, 2024 kitta65/bq-extension-vscode#268

Closed

January 15, 2024 kitta65/prettier-plugin-bq#274

Closed

January 15, 2024 kitta65/bq2cst#284

Closed
