
fix: avoid unnecessary API call in QueryJob.result() when job is already finished #1900

Merged
14 commits merged on Apr 18, 2024

Conversation

tswast (Contributor) commented Apr 12, 2024:

BEGIN_COMMIT_OVERRIDE
perf: avoid unnecessary API call in QueryJob.result() when job is already finished (#1900)

fix: retry query jobs that fail even with ambiguous jobs.getQueryResults REST errors (#1903, #1900)
END_COMMIT_OVERRIDE

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

The product-auto-label bot added labels on Apr 12, 2024: size: m (Pull request size is medium), api: bigquery (Issues related to the googleapis/python-bigquery API).
tswast changed the title from "fix: avoid unnecessary API call in QueryJob.result() when job is alre…" to "fix: avoid unnecessary API call in QueryJob.result() when job is already finished" on Apr 12, 2024.
tswast (Contributor, Author) commented Apr 12, 2024:

CC @chalmerlowe

@@ -1316,37 +1343,6 @@ def test_result_w_empty_schema(self):
self.assertEqual(result.location, "asia-northeast1")
self.assertEqual(result.query_id, "xyz-abc")

def test_result_invokes_begins(self):
tswast (Contributor, Author) commented:

As of #967 released in google-cloud-bigquery 3.0.0, the _begin method is no longer used for query jobs.

tswast marked this pull request as ready for review on April 13, 2024 14:44.
tswast requested review from a team as code owners on April 13, 2024 14:44.
tswast requested a review from obada-ab on April 13, 2024 14:44.
tswast requested a review from chalmerlowe on April 13, 2024 14:44.
The product-auto-label bot added the label size: l (Pull request size is large) and removed size: m (Pull request size is medium) on Apr 13, 2024.
@@ -1430,40 +1430,6 @@ def _reload_query_results(
timeout=transport_timeout,
)

def _done_or_raise(self, retry=DEFAULT_RETRY, timeout=None):
tswast (Contributor, Author) commented:

This was overridden because we wanted result() from the superclass to call jobs.getQueryResults, not just jobs.get (i.e. job.reload() in Python). Now that we aren't using the superclass for result(), this method is no longer necessary.
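
For readers outside the codebase, here is a minimal, hypothetical sketch of the override pattern being removed; the class and method names below are stand-ins rather than the library's actual classes:

```python
# Hypothetical sketch: the base polling job decides it is "done" by reloading
# job state via jobs.get, while the old QueryJob override polled
# jobs.getQueryResults instead, which blocks until the query finishes and so
# reports completion sooner.

class PollingJobSketch:
    def reload(self, retry=None, timeout=None):
        print("jobs.get")  # stand-in for the REST call

    def _done_or_raise(self, retry=None, timeout=None):
        self.reload(retry=retry, timeout=timeout)


class QueryJobSketch(PollingJobSketch):
    def _reload_query_results(self, retry=None, timeout=None):
        print("jobs.getQueryResults")  # stand-in for the REST call

    def _done_or_raise(self, retry=None, timeout=None):
        # Route polling through jobs.getQueryResults instead of jobs.get.
        self._reload_query_results(retry=retry, timeout=timeout)
```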

try:
    self.reload(retry=retry, timeout=transport_timeout)
except exceptions.GoogleAPIError as exc:
    self.set_exception(exc)
tswast (Contributor, Author) commented on Apr 13, 2024:

Thought: We probably should have been calling set_exception based on the job status. Need to look into this further.

tswast (Contributor, Author) commented:

OK. We are. 😅

`self.set_exception(exception)` is called inside `_set_future_result`, which we call from `_set_properties`:

`self._set_future_result()`

...which we call from `reload`:

`self._set_properties(api_response)`

So the chain is `reload()` -> `_set_properties()` -> `_set_future_result()` -> `set_exception()`.
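
A compressed, hypothetical sketch of that chain (the real methods are spread across google-cloud-bigquery and google-api-core and do considerably more):

```python
# Hypothetical sketch of reload() -> _set_properties() -> _set_future_result()
# -> set_exception()/set_result(); error mapping in the real library is richer.

class JobFutureSketch:
    def __init__(self, fetch_job_resource):
        self._fetch_job_resource = fetch_job_resource  # stand-in for jobs.get
        self._properties = {}
        self.exception = None
        self.result = None

    def reload(self):
        api_response = self._fetch_job_resource()
        self._set_properties(api_response)

    def _set_properties(self, api_response):
        self._properties = api_response
        self._set_future_result()

    def _set_future_result(self):
        status = self._properties.get("status", {})
        error = status.get("errorResult")
        if error is not None:
            # Resolve the future with the job's error (set_exception).
            self.exception = RuntimeError(error.get("message", "job failed"))
        elif status.get("state") == "DONE":
            # Resolve the future successfully (set_result).
            self.result = self


job = JobFutureSketch(lambda: {"status": {"state": "DONE"}})
job.reload()
assert job.exception is None and job.result is job
```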

chalmerlowe self-assigned this on Apr 15, 2024.
# wait for the query to finish. Unlike most methods,
# jobs.getQueryResults hangs as long as it can to ensure we
# know when the query has finished as soon as possible.
self._reload_query_results(retry=retry, timeout=timeout)
tswast (Contributor, Author) commented:

Uh oh: if jobs.getQueryResults fails because the job failed, it can throw an exception, but restart_query_job will still be False.

But we don't want to blindly set restart_query_job = True, because this call can raise an ambiguous exception such as "quota exceeded", where we can't tell whether it's the job-level quota (meaning the job really failed) or something at a higher level (e.g. the Google Frontend, GFE), in which case the job might actually still be running and/or have succeeded.

tswast (Contributor, Author) commented:

This isn't the worst way to fail, but it would be nice to make the jobs.get call above when we hit an exception, so that we get a chance to retry the job if it actually failed.

tswast (Contributor, Author) commented:

Filed #1903 to track improvements to ambiguous errors. 12fa9fb fixes an issue where we weren't actually retrying after an ambiguous failure even though we thought we were.
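
A hedged sketch of the behavior this thread converges on (all names below are illustrative, not the library's internals): when jobs.getQueryResults raises, resolve the ambiguity against the job's own state before deciding whether to re-issue the query.

```python
# Illustrative only: on an ambiguous jobs.getQueryResults error (e.g. quota
# exceeded), consult jobs.get before deciding to restart the query job.

class RestartQueryJob(Exception):
    """The query job itself failed and may be re-issued, subject to the job
    retry policy and overall deadline."""


def wait_for_results(get_query_results, reload_job, job_failed):
    try:
        # jobs.getQueryResults deliberately hangs as long as it can so that
        # query completion is reported as soon as possible.
        return get_query_results()
    except Exception:
        reload_job()  # jobs.get gives an unambiguous view of the job state
        if job_failed():
            # The job really failed; restarting it is safe.
            raise RestartQueryJob()
        # Otherwise the error came from a higher layer (e.g. the Google
        # Frontend); the job may still be running or may have succeeded,
        # so keep waiting rather than re-issuing the query.
        raise
```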

chalmerlowe (Collaborator) left a review comment:

I am halfway through my review.
Releasing these comments for now.
Will come back to this to finish out my review as soon as possible.

google/cloud/bigquery/job/query.py (review thread; outdated, resolved)

do_get_result()
# timeout can be `None` or an object from our superclass
chalmerlowe (Collaborator) commented:

Which superclass are we discussing here?

tswast (Contributor, Author) commented:

google.api_core.future.polling.PollingFuture._DEFAULT_VALUE introduced in googleapis/python-api-core#462.

I've updated the comments with some more info, as well as some things to consider in case we want to have a default value for timeout in the future.
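
For context, a minimal sketch of the sentinel-default pattern that _DEFAULT_VALUE follows; the fallback number below is made up for illustration:

```python
# A unique sentinel distinguishes "caller passed no timeout" from
# "caller explicitly passed timeout=None" (wait indefinitely).
_DEFAULT_VALUE = object()


def result(timeout=_DEFAULT_VALUE):
    if timeout is _DEFAULT_VALUE:
        timeout = 128  # illustrative library-chosen default, not the real one
    return timeout


assert result() == 128       # default applied
assert result(None) is None  # explicit None respected
assert result(30) == 30      # explicit value respected
```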

google/cloud/bigquery/retry.py (review thread; outdated, resolved)
@@ -39,7 +39,7 @@
# Allow for a few retries after the API request times out. This is relevant
# for rateLimitExceeded errors, which can be raised either by the Google load
# balancer or the BigQuery job server.
-_DEFAULT_JOB_DEADLINE = 3.0 * _DEFAULT_RETRY_DEADLINE
+_DEFAULT_JOB_DEADLINE = 4.0 * _DEFAULT_RETRY_DEADLINE
chalmerlowe (Collaborator) commented:

What is the purpose of using 4.0 here?
Can we get a comment indicating why 4.0?

tswast (Contributor, Author) commented:

Updated to 2.0 * (2.0 * _DEFAULT_RETRY_DEADLINE) and added some explanation both here and in QueryJob.result().

Note: This still only gets us one query retry in the face of the problematic ambiguous error codes from jobs.getQueryResults(), but that's better than the zero retries we were actually getting before in some cases. I don't feel comfortable bumping this much further, though maybe 3.0 * 2.0 * _DEFAULT_RETRY_DEADLINE would be slightly less arbitrary at 1 hour?
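
Working through that arithmetic, assuming _DEFAULT_RETRY_DEADLINE is the 10-minute API-level retry window in retry.py (the 1-hour option from the note is shown for comparison):

```python
_DEFAULT_RETRY_DEADLINE = 10.0 * 60.0  # 600 s API-level retry window (assumed)

# Roughly 2x the API retry deadline covers one query attempt, so doubling it
# again leaves room for about one full query retry after an ambiguous failure.
_DEFAULT_JOB_DEADLINE = 2.0 * (2.0 * _DEFAULT_RETRY_DEADLINE)  # 2400 s = 40 min

_ONE_HOUR_ALTERNATIVE = 3.0 * 2.0 * _DEFAULT_RETRY_DEADLINE    # 3600 s = 1 hour
```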

google/cloud/bigquery/retry.py (review thread; outdated, resolved)
@@ -970,7 +877,12 @@ def test_result(self):
"rows": [{"f": [{"v": "abc"}]}],
}
conn = make_connection(
query_resource, query_resource_done, job_resource_done, query_page_resource
job_resource,
chalmerlowe (Collaborator) commented:

Not sure I am tracking the relationship between the make_connection inputs and the assert_has_calls checks.

Can you explain how these tests are supposed to work?

tswast (Contributor, Author) commented:

make_connection is a convention in google-cloud-bigquery unit tests that actually predates our use of the "mock" package. It mocks out the responses to REST API calls, previously with a fake implementation of our "Connection" class from the _http module and now with a true mock object. For every request that our test makes, there should be a corresponding response. As with Mock.side_effect, any exceptions in this list are raised instead of returned.

I'm guessing your question also relates to "Why this particular set of requests / responses?". I've added some comments explaining why we're expecting this sequence of API calls. I've also updated this test to more explicitly check for a possible cause of customer issue b/332850329.
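
A hedged, simplified stand-in for that helper (the real make_connection lives in the google-cloud-bigquery unit tests and does more setup):

```python
from unittest import mock


def make_connection(*responses):
    """Return a mock Connection whose api_request answers one call per item.

    As with Mock.side_effect, an exception instance in `responses` is raised
    instead of being returned.
    """
    conn = mock.Mock(name="Connection")
    conn.api_request.side_effect = list(responses)
    return conn


# Usage: queue one response per REST call the test will make, then assert the
# expected call sequence afterwards.
conn = make_connection(
    {"status": {"state": "RUNNING"}},
    {"status": {"state": "DONE"}},
)
assert conn.api_request(method="GET", path="/jobs/x")["status"]["state"] == "RUNNING"
assert conn.api_request(method="GET", path="/jobs/x")["status"]["state"] == "DONE"
conn.api_request.assert_has_calls([mock.call(method="GET", path="/jobs/x")] * 2)
```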

@@ -1289,7 +1217,18 @@ def test_result_w_retry(self):
)

connection.api_request.assert_has_calls(
[query_results_call, query_results_call, reload_call]
[
reload_call,
chalmerlowe (Collaborator) commented:

Same thing here. Can I get some clarity on what we are doing and looking for?

tswast (Contributor, Author) commented:

Added some explanation here as well as above in the make_connection() call.

Co-authored-by: Chalmer Lowe <chalmerlowe@google.com>
chalmerlowe (Collaborator) left a review comment:

LGTM

tests/unit/job/test_query.py (review thread; resolved)

job.result()
with freezegun.freeze_time("1970-01-01 00:00:00", tick=False):
    job.result(timeout=1.125)
chalmerlowe (Collaborator) commented:

Is there a reason we are using such a specific number, 1.125?

Can I get a comment here to let future me know why we picked this number?

method="GET",
path=f"/projects/{self.PROJECT}/queries/{self.JOB_ID}",
query_params={
"maxResults": 0,
chalmerlowe (Collaborator) commented:

Is maxResults of 0 synonymous with asking for all results, or is it really asking for zero results?

tswast (Contributor, Author) commented:

We actually want 0 rows. If we omit this or ask for a non-zero number of rows, the jobs.getQueryResults API can hang when the query has wide rows (many columns).
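
A hedged sketch of that completion poll; the path and query_params mirror the call the test expects, while api_request, project, and job_id are illustrative parameters:

```python
def poll_query_done(api_request, project, job_id, location=None, timeout=None):
    query_params = {
        # Deliberately request zero rows: only the completion status is needed
        # here, and asking for rows can make jobs.getQueryResults hang on
        # queries with very wide rows (many columns).
        "maxResults": 0,
    }
    if location is not None:
        query_params["location"] = location
    return api_request(
        method="GET",
        path=f"/projects/{project}/queries/{job_id}",
        query_params=query_params,
        timeout=timeout,
    )
```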

tests/unit/test_job_retry.py (review thread; outdated, resolved)
tswast requested a review from chalmerlowe on April 17, 2024 19:34.
chalmerlowe (Collaborator) left a review comment:

Thanks for all the comments, etc.
Future me thanks you as well.

LGTM, APPROVED.

google/cloud/bigquery/job/query.py (review thread; outdated, resolved)