Preserve order in to_dataframe with BQ Storage from queries containing ORDER BY #7793

Merged
tswast merged 3 commits into googleapis:master from issue7759-preserve-order on Apr 30, 2019

Conversation

@tswast tswast commented Apr 23, 2019

This fixes an issue where, because rows are read from multiple streams in
parallel, the order of rows is not preserved. Normally this isn't an
issue, but it is when the rows are query results from an ORDER BY query.
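
To illustrate why parallel reads can scramble sorted results, here is a toy model. It does not use the BigQuery Storage client: `read_interleaved` is a hypothetical helper, the sample rows are made up, and the assumption that rows are split into contiguous chunks per stream is only for illustration.

```python
from itertools import chain, zip_longest

_MISSING = object()  # sentinel so that falsy rows are not dropped


def read_interleaved(streams):
    """Toy model of parallel reads: take one row from each stream in turn."""
    mixed = chain.from_iterable(zip_longest(*streams, fillvalue=_MISSING))
    return [row for row in mixed if row is not _MISSING]


# Rows as an ORDER BY query would return them (illustrative data).
sorted_rows = [1, 2, 3, 4, 5, 6]

# Assumption: the rows are split into contiguous chunks, one per stream.
two_streams = [sorted_rows[:3], sorted_rows[3:]]
one_stream = [sorted_rows]

parallel_result = read_interleaved(two_streams)  # rows arrive mixed together
single_result = read_interleaved(one_stream)     # rows arrive in query order
```

With a single stream there is nothing to interleave, which is why falling back to one stream keeps the ORDER BY ordering intact.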

Note: The detection of ORDER BY is not perfect. It will have false positives (such as ORDER BY in a string or sub-query), but hopefully no false negatives. Since a false positive only means we read from a single stream, behavior is correct but perhaps slower than expected.
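
A minimal sketch of how a regex-based check like this behaves on the cases mentioned in the note. The function name `contains_order_by` and the sample queries are assumptions made for illustration, not code taken from this PR.

```python
import re


def contains_order_by(query):
    """Return True if the query text mentions ORDER BY anywhere (hypothetical helper)."""
    if query is None:
        return False
    return re.search(r"ORDER\s+BY", query, re.IGNORECASE) is not None


# True positive: the outer query really is ordered.
ordered = "SELECT name FROM people ORDER BY name"
# False positive: ORDER BY only appears inside a string literal.
in_string = "SELECT 'order by' AS label FROM people"
# False positive: ORDER BY only appears inside a sub-query.
in_subquery = "SELECT * FROM (SELECT name FROM people ORDER BY name LIMIT 5)"
# Negative: no ORDER BY at all, so parallel streams remain safe.
unordered = "SELECT name FROM people"

for label, sql in [("ordered", ordered), ("in_string", in_string),
                   ("in_subquery", in_subquery), ("unordered", unordered)]:
    print(label, contains_order_by(sql))
```

Both false positives only cause the slower single-stream path to be chosen, so the result is still correct.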

Closes #7759

@tswast tswast requested a review from crwilcox as a code owner April 23, 2019 23:59
@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label Apr 23, 2019
@tswast tswast added api: bigquery Issues related to the BigQuery API. api: bigquerystorage Issues related to the BigQuery Storage API. labels Apr 23, 2019
@tswast tswast requested a review from shollyman April 23, 2019 23:59
if query is None:
    return False

return re.search(r"ORDER\s+BY", query, re.IGNORECASE) is not None
Should we have a comment indicating we know there's a class of false positives (ordered window functions)?
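
For reference, a small sketch of the window-function case raised here. The query text is an assumed example: its only ORDER BY sits inside an OVER clause, so the final rows have no guaranteed order, yet the pattern still matches.

```python
import re

# Assumed sample query: ORDER BY appears only inside the OVER (...) clause.
WINDOW_QUERY = """
SELECT name,
       ROW_NUMBER() OVER (PARTITION BY team ORDER BY score DESC) AS rank_in_team
FROM players
"""

# Same pattern as the check under review.
matched = re.search(r"ORDER\s+BY", WINDOW_QUERY, re.IGNORECASE) is not None
print(matched)
```

The match means such a query would be treated as ordered and read from a single stream, which is safe but possibly slower than necessary.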

if query is None:
    return False

return re.search(r"ORDER\s+BY", query, re.IGNORECASE) is not None
Please compile the pattern as a module-scope global, and use "raw" strings e.g.:

_CONTAINS_ORDER_BY = re.compile(r"ORDER\s+BY", re.IGNORECASE)

...
def _contains_order_by(query):
    """Do we need to preserve the order of the query results?"""
    return query and _CONTAINS_ORDER_BY.search(query)

tswast added 2 commits April 29, 2019 13:57
…ng ORDER BY
@tswast tswast force-pushed the issue7759-preserve-order branch from 3b0c7ae to babdd66 Compare April 29, 2019 21:18
@tswast tswast requested a review from a team April 29, 2019 21:18
@tswast tswast merged commit 64b66da into googleapis:master Apr 30, 2019
@tswast tswast deleted the issue7759-preserve-order branch April 30, 2019 17:53
Successfully merging this pull request may close these issues.

BigQuery: order not preserved when downloading ORDER BY query results to dataframe with BQ Storage API
4 participants