-
Notifications
You must be signed in to change notification settings - Fork 308
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf: if page_size
or max_results
is set on QueryJob.result()
, use to download first page of results
#1942
Conversation
…use to download first page of results
…ait' into b344008814-page_size-query_and_wait
# regards to allowing the user to tune how many results to download | ||
# while we wait for the query to finish. See internal issue: | ||
# 344008814. | ||
if page_size is None and max_results is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for my education, why are setting max_results and page_size equivalent here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Imagine the user has selected a value for max_results
. That means we should not download more than that number of rows, in which case it's important to set the page size.
Also, it means the user is expecting to download the results and do something with them, which means it wouldn't be a waste to ask for the first page of results while we wait for the query to finish.
|
||
if timeout is not None: | ||
if not isinstance(timeout, (int, float)): | ||
timeout = _MIN_GET_QUERY_RESULTS_TIMEOUT | ||
else: | ||
timeout = max(timeout, _MIN_GET_QUERY_RESULTS_TIMEOUT) | ||
|
||
if page_size > 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also for my education, why is the timestamp option related to maxResults?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we accept any results in the response, we want to make sure the data is encoded correctly. Without this option, BigQuery encodes timestamps as floating point numbers, which results in a loss of microsecond-level precision.
The reason it matters is that previously we always asked for 0 rows, in which case we didn't include this option.
I suppose we could always included it, but it's not necessary if we don't download any data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM and thank you :)
If I understood correctly, in the future if a user wants to make query_and_wait()
more efficient, they pass a page_size
value so we can use it to reduce an additional API call. Should we add some documentation for this use case?
@@ -1139,114 +1156,92 @@ def test_result_with_none_timeout(self): | |||
timeout=None, | |||
) | |||
|
|||
def test_result_with_done_jobs_query_response_and_page_size_invalidates_cache(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"""We don't call jobs.query with a page size, so if the user explicitly requests a certain size, invalidate the cache. """
This is no longer true. We do set the page size in the initial getQueryResults call.
Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
Fixes internal issue 344008814
🦕