Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move BigQuery list_ methods to use iterators #2565

Merged
merged 5 commits into from
Nov 1, 2016

Conversation

dhermes
Copy link
Contributor

@dhermes dhermes commented Oct 19, 2016

Follow up to #2561. This only covers Dataset.list_tables() and Table.fetch_data(). It may also be "correct" to use an Iterator in

  • QueryResults.fetch_data
  • QueryResults.run (via QueryResults._build_resource)

but I wanted to get eyes on this change first / opinion from the original author (@tseaver) before moving forward.

@dhermes dhermes added the api: bigquery Issues related to the BigQuery API. label Oct 19, 2016
@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label Oct 19, 2016
@@ -72,7 +72,7 @@
}
TEST_RC_REPLACEMENTS = {
'FORMAT': {
'max-module-lines': 1960,
'max-module-lines': 2000,

This comment was marked as spam.

This comment was marked as spam.

page_token=page_token, max_results=max_results,
page_start=_rows_page_start)
iterator.schema = self._schema
# Over-ride the key used to retrieve the next page token.

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

@@ -86,20 +86,34 @@ def _string_from_json(value, _):
}


def _row_from_json(row, schema):
"""Convert JSON row data to row w/ appropriate types.

This comment was marked as spam.

"""Convert JSON row data to row w/ appropriate types.

:type row: dict
:param row:

This comment was marked as spam.

return _row_from_json(resource, iterator.schema)


# pylint: disable=unused-argument

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

max_results=MAX,
page_token=TOKEN)
iterator = table.fetch_data(
client=client2, max_results=MAX, page_token=TOKEN)

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

@dhermes
Copy link
Contributor Author

dhermes commented Oct 19, 2016

@tseaver Can you weigh in here? I'm especially curious how you feel about fetch_data (both in Table and QueryResults) and QueryResults.run.

@tseaver
Copy link
Contributor

tseaver commented Oct 20, 2016

I'm especially curious how you feel about fetch_data (both in Table and QueryResults) and QueryResults.run.

+0 for adding it to fetch_data: those methods currently return a 3-tuple, row_data, total_rows, page_token, which means we would have to choose what to do with the total_rows count.

-1 for adding it to run: that doesn't return row data, paged or not.

@dhermes
Copy link
Contributor Author

dhermes commented Oct 20, 2016

+0 for adding it to fetch_data: those methods currently return a 3-tuple, row_data, total_rows, page_token, which means we would have to choose what to do with the total_rows count.

Check out the implementation, I put total_rows on iterator.page.

"""
total_rows = response.get('totalRows')
if total_rows is not None:
page.total_rows = int(total_rows)

This comment was marked as spam.

@tseaver
Copy link
Contributor

tseaver commented Oct 20, 2016

LGTM, once you decide about my question about where total_rows should be exposed.

@dhermes
Copy link
Contributor Author

dhermes commented Oct 20, 2016

@tseaver I'll put total_rows on both once back at keyboard. Now that you've seen the implementation, does it seem appropriate to change QueryResults.fetch_data over as well?

@dhermes
Copy link
Contributor Author

dhermes commented Oct 20, 2016

@tseaver I started writing the code to put total_rows on the Iterator instance and realized something. If while paging, the first request has totalRows: 1200 and the second request somehow has the key missing, then on the second page we'd either have to del iterator.total_rows or iterator.total_rows = None or have an invalid value.

Given this scenario, it seems like my_iter.page.total_rows is the only real correct place for total_rows since it is actually a page-specific value (it may even change between pages if the backend is processing more rows between requests?)

@tseaver
Copy link
Contributor

tseaver commented Oct 20, 2016

@dhermes For queries, the docs say totalRows is only present on a page of results if the query has already completed: we can therefore presume that i will not change.

For tables, the docs say only that totalRows represents "[t]he total number of rows in the complete table." We would have to ask the back-end team whether the page_token encodes a "timestamped" cursor which would make totalRows constant across pages, even if new values were inserted into the table in the meanwhile.

@tseaver
Copy link
Contributor

tseaver commented Oct 25, 2016

@fhoffa Can you chime in on how stable the totalRows value is for tabledata?

Also fixing BigQuery system test in the process.
@dhermes
Copy link
Contributor Author

dhermes commented Nov 1, 2016

@tseaver I put total_rows only on the Iterator and just allowed it to be None if missing in the response.

@dhermes dhermes merged commit 4b4023f into googleapis:master Nov 1, 2016
@dhermes dhermes deleted the bigquery-iterators-2 branch November 1, 2016 15:55
richkadel pushed a commit to richkadel/google-cloud-python that referenced this pull request May 6, 2017
Move BigQuery list_ methods to use iterators
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the BigQuery API. backend cla: yes This human has signed the Contributor License Agreement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants