-
Notifications
You must be signed in to change notification settings - Fork 48
perf: Directly read gbq table for simple plans #1607
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
3a9dc4b
to
84c4aa0
Compare
84c4aa0
to
83312f0
Compare
@@ -574,6 +583,31 @@ def with_id(self, id: identifiers.ColumnId) -> ScanItem: | |||
class ScanList: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we get a docstring for ScanList, please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
bigframes/core/nodes.py
Outdated
@@ -574,6 +583,31 @@ def with_id(self, id: identifiers.ColumnId) -> ScanItem: | |||
class ScanList: | |||
items: typing.Tuple[ScanItem, ...] | |||
|
|||
def filter( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we call this project to align with https://en.wikipedia.org/wiki/Projection_(relational_algebra) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed to project
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually renamed this one to filter_cols
bigframes/core/nodes.py
Outdated
) -> ScanList: | ||
result = ScanList(tuple(item for item in self.items if item.id in ids)) | ||
if len(result.items) == 0: | ||
# We need to select something, or stuff breaks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we'd pick the smallest size column, right? Let's add a TODO if so. Or does this column not actually get scanned?
Would be good to describe more what stuff breaks, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
specified that its the sql syntax that breaks. The columns won't actually get scanned, as it will just be pruned out by the optimizer anyways. Amended commend to say that its the sql syntax
bigframes/core/nodes.py
Outdated
result = ScanList(self.items[:1]) | ||
return result | ||
|
||
def select( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A docstring would be great. Are we talking about https://en.wikipedia.org/wiki/Selection_(relational_algebra) (aka row predicates) here?
@@ -0,0 +1,81 @@ | |||
from typing import Any, Optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs license header.
from typing import Any, Optional | |
# Copyright 2025 Google LLC | |
# | |
# Licensed under the Apache License, Version 2.0 (the "License"); | |
# you may not use this file except in compliance with the License. | |
# You may obtain a copy of the License at | |
# | |
# http://www.apache.org/licenses/LICENSE-2.0 | |
# | |
# Unless required by applicable law or agreed to in writing, software | |
# distributed under the License is distributed on an "AS IS" BASIS, | |
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |
# See the License for the specific language governing permissions and | |
# limitations under the License. | |
from __future__ import annotations | |
from typing import Any, Optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
from bigframes.session import executor, semi_executor | ||
|
||
|
||
class ReadApiSemiExecutor(semi_executor.SemiExecutor): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A docstring would be great. I assume the "Semi" part means that not all plans/expressions are supported?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added docstring to SemiExecutor abc, as well as ReadApiSemiExecutor. And yeah, SemiExecutor means exactly that, they can execute a subset of possible plans.
if node.source.sql_predicate: | ||
read_options["row_restriction"] = node.source.sql_predicate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aside: How are we protecting against plans that we can't run. Hopefully we've got an allowlist of properties. I'm worried about adding a property to BigFrameNode and forgetting to add it to the executor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ultimately, each semi-executor has to specify the subset of plans it can or cannot execute, whether through allowlists, if-else statements, or some other mechanism. The real problem is if it drifts out of sync with the expected reference behavior. This shouldn't be a problem for now, as all tests that can use this semi-executor, will, and will catch issues. However, as we add more semi-executors, we should directly compare the results against the reference implementation, and maybe even do some fuzzing.
return None | ||
|
||
import google.cloud.bigquery_storage_v1.types as bq_storage_types | ||
from google.protobuf.timestamp_pb2 import Timestamp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Use import statements for packages and modules only, not for individual types, classes, or functions.
https://google.github.io/styleguide/pyguide.html#22-imports
Timestamp isn't in the list of excemptions. https://google.github.io/styleguide/pyguide.html#2241-exemptions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
bigframes/core/nodes.py
Outdated
@@ -1068,6 +1113,11 @@ def variables_introduced(self) -> int: | |||
# This operation only renames variables, doesn't actually create new ones | |||
return 0 | |||
|
|||
@property | |||
def double_references_ids(self) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
naming nit: maybe "has_multi_reference_ids" is better, since the return type is bool?
Plus, technically speaking you can have triple references too 🧐
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
||
|
||
def try_reduce_to_table_scan(node: nodes.BigFrameNode) -> Optional[nodes.ReadTableNode]: | ||
if not all( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we re-write this into a plain-old "for" loop?
for n in node.unique_nodes():
if not isinstance(n, ...):
return None
This might be more readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, makes sense, I tend too much towards "everything is an expression". fixed
bigframes/session/semi_executor.py
Outdated
# Unstable interface, in development | ||
class SemiExecutor(abc.ABC): | ||
""" | ||
A semi executor executes a subset of possible plans, returns None if it cannot execute a given plan. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"returns None if it cannot execute a given plan."
Maybe it should be "... if cannot execute any/all given plans?"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rephrased, but it is give a single (logical) plan at a time, and if it cannot find a way to physically execute it, it returns None.
Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
Fixes #<issue_number_goes_here> 🦕