perf: Directly read gbq table for simple plans #1607

Merged
merged 5 commits into from
Apr 15, 2025

Conversation

TrevorBergeron
Contributor

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

@product-auto-label product-auto-label bot added size: xl Pull request size is extra large. api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. labels Apr 9, 2025
@TrevorBergeron TrevorBergeron requested a review from tswast April 9, 2025 22:03
@TrevorBergeron TrevorBergeron marked this pull request as ready for review April 9, 2025 22:03
@TrevorBergeron TrevorBergeron requested review from a team as code owners April 9, 2025 22:03
@@ -574,6 +583,31 @@ def with_id(self, id: identifiers.ColumnId) -> ScanItem:
class ScanList:
Collaborator

Could we get a docstring for ScanList, please?

Contributor Author

added

@@ -574,6 +583,31 @@ def with_id(self, id: identifiers.ColumnId) -> ScanItem:
class ScanList:
    items: typing.Tuple[ScanItem, ...]

    def filter(
Collaborator

Should we call this project to align with https://en.wikipedia.org/wiki/Projection_(relational_algebra) ?

Contributor Author

renamed to project

Contributor Author

actually renamed this one to filter_cols

    ) -> ScanList:
        result = ScanList(tuple(item for item in self.items if item.id in ids))
        if len(result.items) == 0:
            # We need to select something, or stuff breaks
Collaborator

Ideally we'd pick the smallest size column, right? Let's add a TODO if so. Or does this column not actually get scanned?

Would be good to describe more what stuff breaks, too.

Contributor Author

Specified that it's the SQL syntax that breaks. The column won't actually get scanned, since the optimizer will prune it out anyway. Amended the comment to say that it's the SQL syntax.
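
A minimal sketch of the behavior discussed in this thread, assuming a simplified `ScanItem` with only an `id` field (the real class carries more, and the real ids are `identifiers.ColumnId` values rather than strings). `filter_cols` keeps the requested columns and, if none remain, falls back to the first item so the generated SQL still selects something:

```python
from __future__ import annotations

import dataclasses
from typing import AbstractSet, Tuple


@dataclasses.dataclass(frozen=True)
class ScanItem:
    # Hypothetical simplification: the real ScanItem has more fields.
    id: str


@dataclasses.dataclass(frozen=True)
class ScanList:
    items: Tuple[ScanItem, ...]

    def filter_cols(self, ids: AbstractSet[str]) -> ScanList:
        """Keep only the items whose id is in `ids`."""
        result = ScanList(tuple(item for item in self.items if item.id in ids))
        if len(result.items) == 0:
            # An empty SELECT list is invalid SQL, so keep one column.
            # Per the thread, the optimizer prunes it, so it is not scanned.
            result = ScanList(self.items[:1])
        return result


scan = ScanList((ScanItem("a"), ScanItem("b"), ScanItem("c")))
kept = scan.filter_cols({"b", "c"})
fallback = scan.filter_cols(set())
```

Here `kept` holds the items for `b` and `c` in their original order, and `fallback` holds only the first item, `a`.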

            result = ScanList(self.items[:1])
        return result

    def select(
Collaborator

A docstring would be great. Are we talking about https://en.wikipedia.org/wiki/Selection_(relational_algebra) (aka row predicates) here?
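
For reference, the two relational-algebra operations linked in this review differ in what they restrict: projection keeps a subset of columns, while selection keeps the rows that satisfy a predicate. A small sketch over plain Python dicts follows; the `project` and `select` helpers are illustrative only and are not the `ScanList` methods under review:

```python
from typing import Any, Callable, Dict, Iterable, List

Row = Dict[str, Any]


def project(rows: Iterable[Row], columns: List[str]) -> List[Row]:
    """Projection: keep only the named columns of every row.

    Uses bag semantics (no de-duplication), as SQL does without DISTINCT.
    """
    return [{col: row[col] for col in columns} for row in rows]


def select(rows: Iterable[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    """Selection: keep only the rows for which the predicate holds."""
    return [row for row in rows if predicate(row)]


table = [
    {"id": 1, "name": "a", "size": 10},
    {"id": 2, "name": "b", "size": 25},
]
projected = project(table, ["id", "name"])
selected = select(table, lambda row: row["size"] > 20)
```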

@@ -0,0 +1,81 @@
from typing import Any, Optional
Collaborator

Needs license header.

Suggested change
from typing import Any, Optional
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import annotations
from typing import Any, Optional

Contributor Author

added

from bigframes.session import executor, semi_executor


class ReadApiSemiExecutor(semi_executor.SemiExecutor):
Collaborator

A docstring would be great. I assume the "Semi" part means that not all plans/expressions are supported?

Contributor Author

Added docstrings to the SemiExecutor abc as well as ReadApiSemiExecutor. And yeah, SemiExecutor means exactly that: it can execute a subset of possible plans.
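
A rough sketch of the pattern described here, assuming a hypothetical string-based `Plan` and a one-method interface; the actual `SemiExecutor` signature in `bigframes.session.semi_executor` is not shown in this thread. Each semi-executor returns `None` for plans it cannot handle, and the caller falls back to a full executor:

```python
import abc
from typing import Callable, List, Optional, Sequence

Plan = str  # Hypothetical stand-in for a logical plan.
Result = List[int]  # Hypothetical stand-in for an execution result.


class SemiExecutor(abc.ABC):
    """Executes a subset of plans; returns None for any plan it can't run."""

    @abc.abstractmethod
    def execute(self, plan: Plan) -> Optional[Result]:
        ...


class TableScanSemiExecutor(SemiExecutor):
    """Toy semi-executor that only understands plain table scans."""

    def execute(self, plan: Plan) -> Optional[Result]:
        if plan != "scan":
            return None
        return [1, 2, 3]


def run(
    plan: Plan,
    semi_executors: Sequence[SemiExecutor],
    fallback: Callable[[Plan], Result],
) -> Result:
    # Try each semi-executor in order; the first non-None result wins.
    for semi in semi_executors:
        result = semi.execute(plan)
        if result is not None:
            return result
    return fallback(plan)


def full_executor(plan: Plan) -> Result:
    # Stand-in for the general-purpose (reference) execution path.
    return [0]


fast = run("scan", [TableScanSemiExecutor()], full_executor)
slow = run("join", [TableScanSemiExecutor()], full_executor)
```

The `"scan"` plan is served by the semi-executor, while the `"join"` plan is declined and goes through `full_executor`.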

Comment on lines +38 to +39
if node.source.sql_predicate:
    read_options["row_restriction"] = node.source.sql_predicate
Collaborator

Aside: How are we protecting against plans that we can't run? Hopefully we've got an allowlist of properties. I'm worried about adding a property to BigFrameNode and forgetting to add it to the executor.

Contributor Author

Ultimately, each semi-executor has to specify the subset of plans it can or cannot execute, whether through allowlists, if-else statements, or some other mechanism. The real problem is drift from the expected reference behavior. That shouldn't be a problem for now, since every test that can use this semi-executor will, and so will catch issues. However, as we add more semi-executors, we should compare results directly against the reference implementation, and maybe even do some fuzzing.
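
One way to guard against the concern raised here is an explicit allowlist: the semi-executor declines any plan that uses a property it doesn't recognize, so a newly added property causes a fallback rather than a silently wrong result. The sketch below assumes a hypothetical dict-based plan; of the names used, only `sql_predicate` and `row_restriction` appear in the diff, and `selected_fields` is assumed here as the column-selection option:

```python
from typing import Any, Dict, Optional

Plan = Dict[str, Any]  # Hypothetical: property name -> value.

# Properties this toy semi-executor knows how to honor.
SUPPORTED_PROPERTIES = frozenset({"table", "columns", "sql_predicate"})


def build_read_options(plan: Plan) -> Optional[Dict[str, Any]]:
    """Return read options for the plan, or None if it can't be handled."""
    unknown = set(plan) - SUPPORTED_PROPERTIES
    if unknown:
        # Unrecognized property: decline instead of silently ignoring it.
        return None
    read_options: Dict[str, Any] = {"selected_fields": list(plan["columns"])}
    if plan.get("sql_predicate"):
        read_options["row_restriction"] = plan["sql_predicate"]
    return read_options


simple = build_read_options(
    {"table": "t", "columns": ["a"], "sql_predicate": "a > 1"}
)
unsupported = build_read_options(
    {"table": "t", "columns": ["a"], "order_by": "a"}
)
```

The second plan carries an `order_by` property that is not on the allowlist, so `build_read_options` returns `None` and the caller would fall back to the general execution path.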

return None

import google.cloud.bigquery_storage_v1.types as bq_storage_types
from google.protobuf.timestamp_pb2 import Timestamp
Collaborator

Nit: Use import statements for packages and modules only, not for individual types, classes, or functions.

https://google.github.io/styleguide/pyguide.html#22-imports

Timestamp isn't in the list of exemptions. https://google.github.io/styleguide/pyguide.html#2241-exemptions

Contributor Author

fixed
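
For illustration, the cited rule in standard-library terms. The same shape applies to protobuf, i.e. `from google.protobuf import timestamp_pb2` followed by `timestamp_pb2.Timestamp` at the use site; how the PR actually fixed the import isn't shown in this thread.

```python
# Discouraged by the style guide: importing a class directly.
#   from datetime import timezone
# Preferred: import the module and qualify the name at the use site.
import datetime

utc_epoch = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)
```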

@TrevorBergeron TrevorBergeron requested a review from tswast April 13, 2025 05:35
@TrevorBergeron TrevorBergeron requested a review from sycai April 15, 2025 17:16
@@ -1068,6 +1113,11 @@ def variables_introduced(self) -> int:
        # This operation only renames variables, doesn't actually create new ones
        return 0

    @property
    def double_references_ids(self) -> bool:
Contributor

naming nit: maybe "has_multi_reference_ids" is better, since the return type is bool?

Plus, technically speaking you can have triple references too 🧐

Contributor Author

done
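
A sketch of what such a property might check, assuming a hypothetical selection node that stores `(input_id, output_id)` pairs; the real node's fields aren't shown in this hunk, and the property name follows the reviewer's suggestion. It returns `True` when any input id is referenced two or more times, which covers the triple-reference case as well:

```python
import dataclasses
from typing import Tuple


@dataclasses.dataclass(frozen=True)
class SelectionNode:
    # Hypothetical shape: each pair maps an input column id to an output id.
    input_output_pairs: Tuple[Tuple[str, str], ...]

    @property
    def has_multi_reference_ids(self) -> bool:
        """True if any input id is referenced more than once."""
        input_ids = [in_id for in_id, _ in self.input_output_pairs]
        return len(set(input_ids)) < len(input_ids)


renamed = SelectionNode((("a", "x"), ("b", "y")))
duplicated = SelectionNode((("a", "x"), ("a", "y"), ("a", "z")))
```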



def try_reduce_to_table_scan(node: nodes.BigFrameNode) -> Optional[nodes.ReadTableNode]:
    if not all(
Contributor

Shall we re-write this into a plain-old "for" loop?

for n in node.unique_nodes():
    if not isinstance(n, ...):
        return None

This might be more readable.

Contributor Author

yeah, makes sense, I tend too much towards "everything is an expression". fixed
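
The two styles side by side, on a toy list of nodes. The node classes and the allowed-type tuple are placeholders, since the actual `isinstance` target is elided in the review comment:

```python
from typing import List


class ReadTableNode:
    pass


class SelectionNode:
    pass


class JoinNode:
    pass


# Placeholder for the node types a table scan can absorb.
REDUCIBLE_TYPES = (ReadTableNode, SelectionNode)


def reducible_expression_style(nodes: List[object]) -> bool:
    return all(isinstance(n, REDUCIBLE_TYPES) for n in nodes)


def reducible_loop_style(nodes: List[object]) -> bool:
    # Same check as above, written as a loop with an early exit.
    for n in nodes:
        if not isinstance(n, REDUCIBLE_TYPES):
            return False
    return True


simple_plan = [SelectionNode(), ReadTableNode()]
complex_plan = [SelectionNode(), JoinNode(), ReadTableNode()]
```

Both functions agree on every input; the loop form simply makes the early exit explicit, which is the readability point raised in the review.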

# Unstable interface, in development
class SemiExecutor(abc.ABC):
    """
    A semi executor executes a subset of possible plans, returns None if it cannot execute a given plan.
Contributor

"returns None if it cannot execute a given plan."

Maybe it should be "... if cannot execute any/all given plans?"

Contributor Author

Rephrased, but it is given a single (logical) plan at a time, and if it cannot find a way to physically execute it, it returns None.

@TrevorBergeron TrevorBergeron enabled auto-merge (squash) April 15, 2025 19:00
@TrevorBergeron TrevorBergeron merged commit 6ad38e8 into main Apr 15, 2025
17 of 24 checks passed
@TrevorBergeron TrevorBergeron deleted the semi_executor branch April 15, 2025 19:25
Labels
api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. size: xl Pull request size is extra large.
4 participants