refactor: Cache dtypes for scalar expressions for SQLGlot compiler #1759

sycai · 2025-05-21T20:12:10Z

Make DerefOp hold a field named id_or_field, and resolve the column ID to an actual Field instance upon node creation.

TrevorBergeron · 2025-05-22T17:49:54Z

bigframes/core/expression.py

 class DerefOp(Expression):
-    """A variable expression representing an unbound variable."""
+    """A variable expression representing a column reference."""


Can we just create a new FieldRefOp instead of overloading the existing minimal reference class with optional semantics?

I think that would require a lot more code changes than necessary.

I updated the deref's attribute to this union type: ColumnId | Field. ColumnId means type unresolved. Field means otherwise. This can keep the code change to minimum.

There is a certain irreducible complexity here. So something somewhere is checking. I'd rather have the callers deal with this than trying to paper over the differences with a single class.
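To make the trade-off in this thread concrete, here is a minimal sketch of the two designs under discussion. The ColumnId and Field classes below are simplified stand-ins, not the real bigframes types, and UnionDerefOp and FieldRefOp are illustrative names only (FieldRefOp is taken from the reviewer's suggestion above).

```python
from __future__ import annotations

import dataclasses
from typing import Union


# Simplified stand-ins for ids.ColumnId and Field (assumptions, not the real classes).
@dataclasses.dataclass(frozen=True)
class ColumnId:
    name: str


@dataclasses.dataclass(frozen=True)
class Field:
    id: ColumnId
    dtype: str
    nullable: bool = True


# Option 1: a single class whose attribute is a union.
# Every consumer has to check which variant it received.
@dataclasses.dataclass(frozen=True)
class UnionDerefOp:
    id_or_field: Union[ColumnId, Field]

    @property
    def is_resolved(self) -> bool:
        return isinstance(self.id_or_field, Field)


# Option 2: two classes, so the class itself tells the caller
# whether a dtype is available.
@dataclasses.dataclass(frozen=True)
class DerefOp:
    id: ColumnId


@dataclasses.dataclass(frozen=True)
class FieldRefOp:
    field: Field

    @property
    def dtype(self) -> str:
        return self.field.dtype


col = ColumnId("int64_col")
field = Field(col, dtype="Int64", nullable=False)
```

With the union, the check lives inside the class and leaks into every consumer; with two classes, the check moves to the caller, which is the preference expressed in the last comment.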

TrevorBergeron · 2025-05-22T17:51:39Z

bigframes/core/expression.py

+    @abc.abstractmethod
+    def resolve_deferred_types(
+        self, col_dtypes: Dict[ids.ColumnId, dtypes.ExpressionType]
+    ) -> Expression:
        ...


bind_refs should be able to do this, don't think we need a new method

I think that would make things more complicated:

We cannot bake in any short-circuit mechanism. For type resolution, if an expression already has a resolved type, we can skip the binding. This is crucial for gradually updating deref ops with types. We cannot do so with bind_refs because we don't know whether the caller's intention is to resolve types or to replace expressions.

What we want is essentially this:

    def resolved(expr, node) -> Expression:
        if expr.resolved:
            return expr
        else:
            return expr.bind_refs(
                {id: ex.field_ref(node.id_to_field[id]) for id in node.ids}
            )

I have encountered an edge case where the ProjectionNode has a resolved expression whose column reference does not appear in the child node. The column ID is "level_0", so it looks like a default index.

For cases like this, this logic would still fail because "level_0" is not present in the map from column IDs to Fields.

That sounds like a bug. If it doesn't fail later, that is due to sheer luck. Every reference should match a column in the child schema.

Specifically, the case comes from this test:

python-bigquery-dataframes/tests/unit/core/compile/sqlglot/test_compile_projection.py

Lines 24 to 31 in 50dca4c

    def test_compile_projection(
        scalars_types_pandas_df: pd.DataFrame, compiler_session: bigframes.Session, snapshot
    ):
        bf_df = bpd.DataFrame(
            scalars_types_pandas_df[["int64_col"]], session=compiler_session
        )
        bf_df["int64_col"] = bf_df["int64_col"] + 1
        snapshot.assert_match(bf_df.sql, "out.sql")

Not sure whether this is supposed to happen.

What we want is essentially this:

    def resolved(expr, node) -> Expression:
        if expr.resolved:
            return expr
        else:
            return expr.bind_refs(
                {id: ex.field_ref(node.id_to_field[id]) for id in node.ids}
            )

This worked! Thanks
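For readers following the thread, the short-circuit-then-bind idea can be sketched in a self-contained form. Everything below is a toy model: Expression, DerefOp, FieldRef, OpExpression and resolve are hypothetical stand-ins, column IDs are plain strings, and the toy bind_refs raises on an unmatched reference, which is an assumption chosen to mirror the "every reference should match a column in the child schema" comment, not a statement about the real method.

```python
from __future__ import annotations

import dataclasses
from typing import Mapping, Tuple


@dataclasses.dataclass(frozen=True)
class Field:
    id: str
    dtype: str


class Expression:
    """Toy base class; not the bigframes Expression."""

    @property
    def resolved(self) -> bool:
        raise NotImplementedError

    def bind_refs(self, bindings: Mapping[str, Expression]) -> Expression:
        raise NotImplementedError


@dataclasses.dataclass(frozen=True)
class DerefOp(Expression):
    id: str

    @property
    def resolved(self) -> bool:
        return False

    def bind_refs(self, bindings: Mapping[str, Expression]) -> Expression:
        # Assumption: an unmatched reference is treated as an error.
        if self.id not in bindings:
            raise KeyError(f"No column {self.id!r} in child schema")
        return bindings[self.id]


@dataclasses.dataclass(frozen=True)
class FieldRef(Expression):
    field: Field

    @property
    def resolved(self) -> bool:
        return True

    def bind_refs(self, bindings: Mapping[str, Expression]) -> Expression:
        return self  # Already bound to a field; nothing to replace.


@dataclasses.dataclass(frozen=True)
class OpExpression(Expression):
    op: str
    inputs: Tuple[Expression, ...]

    @property
    def resolved(self) -> bool:
        return all(i.resolved for i in self.inputs)

    def bind_refs(self, bindings: Mapping[str, Expression]) -> Expression:
        return OpExpression(self.op, tuple(i.bind_refs(bindings) for i in self.inputs))


def resolve(expr: Expression, id_to_field: Mapping[str, Field]) -> Expression:
    # Short-circuit: a fully resolved expression is returned untouched.
    if expr.resolved:
        return expr
    return expr.bind_refs({id: FieldRef(f) for id, f in id_to_field.items()})
```

In this model, resolving an already-resolved expression returns the same object, and a reference such as "level_0" that is absent from the child schema surfaces immediately as a KeyError instead of passing through silently.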

TrevorBergeron · 2025-05-22T18:02:47Z

bigframes/core/expression.py

-        operand_types = tuple(
-            map(lambda x: x.output_type(input_types=input_types), self.inputs)
+    @functools.cached_property
+    def output_type(self) -> dtypes.ExpressionType | dtypes.AbsentDtype:


Dealing with AbsentDtype in post-binding contexts will be a pain. Maybe a strict mode with a typing overload?

Right. Removed the AbsentDtype

Instead, I added a property is_type_resolved, which behaves like the resolved field on Spark expressions.

TrevorBergeron · 2025-05-22T18:04:56Z

bigframes/core/expression.py


    id: ids.ColumnId
+    dtype: dtypes.ExpressionType | dtypes.AbsentDtype = dtypes.ABSENT_DTYPE


I would prefer to just use the field object. Nullability information can be useful for some operation simplifications.

Now deref ops are able to hold both ColumnId and Field values
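As an illustration of why nullability on the field can matter, here is a small sketch of one possible simplification. The Field class and the simplify_isnull helper are hypothetical and only show the kind of rewrite that a nullable flag would allow; they are not part of this PR.

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class Field:
    # Simplified stand-in for the real Field object (assumption).
    id: str
    dtype: str
    nullable: bool


def simplify_isnull(field: Field) -> str:
    """Render a null check, folding it to FALSE when the column cannot be null."""
    if not field.nullable:
        return "FALSE"
    return f"`{field.id}` IS NULL"
```

A bare dtype attribute on the deref op could not support this rewrite, because the dtype alone says nothing about whether nulls can occur.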

TrevorBergeron · 2025-05-28T19:17:05Z

bigframes/core/expression.py

 class DerefOp(Expression):
-    """A variable expression representing an unbound variable."""
+    """A variable expression representing a column reference."""


There is a certain irreducible complexity here. So something somewhere is checking. I'd rather have the callers deal with this than trying to paper over the differences with a single class.

TrevorBergeron · 2025-05-28T19:18:24Z

bigframes/core/expression.py

+    @functools.cached_property
+    def output_type(self) -> dtypes.ExpressionType:
+        if not self.is_type_resolved:
+            raise ValueError(f"Type of expression {self.op.name} has not been fixed.")
+
+        input_types = [input.output_type for input in self.inputs]
+
+        return self.op.output_type(*input_types)


Not sure this really needs to be cached? Will build a lot of expressions and not care about type a lot of the time.

This is for enabling dynamic dispatch of the SQLGlot compiler.
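The caching behaviour in question can be sketched with toy classes. Leaf, OpExpression, the type_rule_calls counter and render_add are all hypothetical; the counter exists only to make the caching observable, and the string-versus-numeric rule in render_add is an invented example of type-based dispatch, not the SQLGlot compiler's actual logic.

```python
from __future__ import annotations

import functools
from typing import Callable, Optional, Tuple


class Leaf:
    """Toy column reference; dtype is None until it has been resolved."""

    def __init__(self, dtype: Optional[str] = None):
        self._dtype = dtype

    @property
    def is_type_resolved(self) -> bool:
        return self._dtype is not None

    @property
    def output_type(self) -> str:
        if self._dtype is None:
            raise ValueError("Type of column reference has not been fixed.")
        return self._dtype


class OpExpression:
    def __init__(self, name: str, type_rule: Callable[..., str], inputs: Tuple):
        self.name = name
        self.type_rule = type_rule
        self.inputs = inputs
        self.type_rule_calls = 0  # Only here to make the caching observable.

    @property
    def is_type_resolved(self) -> bool:
        return all(i.is_type_resolved for i in self.inputs)

    @functools.cached_property
    def output_type(self) -> str:
        # Nothing is cached if this raises, so a later access retries.
        if not self.is_type_resolved:
            raise ValueError(f"Type of expression {self.name} has not been fixed.")
        self.type_rule_calls += 1
        return self.type_rule(*(i.output_type for i in self.inputs))


def render_add(expr: OpExpression, left_sql: str, right_sql: str) -> str:
    # Hypothetical dispatch rule: strings concatenate, everything else uses "+".
    if expr.output_type == "string":
        return f"CONCAT({left_sql}, {right_sql})"
    return f"{left_sql} + {right_sql}"
```

Because cached_property computes lazily, expressions whose type is never read pay nothing, while a compiler that consults output_type at every dispatch site runs the type rule once per expression.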

bigframes/core/nodes.py

sycai added 7 commits May 21, 2025 20:00

feat: include bq schema and query string in dry run results

dc4ad3b

rename key

5fe5a79

fix tests

11e186b

refactor: cache dtypes for scalar expressions:

a504a07

fix deref expr type resolution bug

318e91d

add test

5a11564

remove dry_run changes from another branch

8538c77

product-auto-label bot added the size: m and api: bigquery labels May 21, 2025

sycai and others added 5 commits May 21, 2025 20:13

remove more changes from dry_run PR

4afbdac

rename DeferredDtype to AbsentDtype

d849811

Merge branch 'main' into sycai_expr

83b1389

Merge branch 'main' into sycai_expr

d4a1d68

Merge branch 'main' into sycai_expr

155d11e

TrevorBergeron reviewed May 22, 2025

removed absentDtype and reuse bind_refs

074a259

product-auto-label bot added the size: l label and removed the size: m label May 22, 2025

sycai and others added 4 commits May 22, 2025 22:14

use a separate resolver for fields

77877c3

Merge branch 'main' into sycai_expr

366604d

fix lint

9dfe38c

move field resolutions to a separate function

505d737

sycai marked this pull request as ready for review May 23, 2025 20:32

sycai requested review from a team as code owners May 23, 2025 20:32

sycai requested a review from tswast May 23, 2025 20:32

blunderbuss-gcf bot assigned shobsi May 23, 2025

sycai requested a review from TrevorBergeron May 23, 2025 20:32

sycai added 2 commits May 23, 2025 16:40

Merge branch 'main' into sycai_expr

9f27e96

Merge branch 'main' into sycai_expr

7f0cedb

sycai and others added 3 commits May 28, 2025 11:06

Merge branch 'main' into sycai_expr

b615d0e

update helper function name

ba692cb

update doc and function names

869186d

TrevorBergeron previously approved these changes May 28, 2025

bind schema at compile time for SQLGlot compiler

71a57da

sycai dismissed TrevorBergeron’s stale review via 71a57da May 28, 2025 22:47

sycai changed the title ~~refactor: Cache dtypes for scalar expressions.~~ refactor: Cache dtypes for scalar expressions for SQLGlot compiler May 28, 2025

Merge branch 'main' into sycai_expr

968e968

TrevorBergeron previously approved these changes May 28, 2025

define a separate expression for field reference

f1775d8

sycai dismissed TrevorBergeron’s stale review via f1775d8 May 29, 2025 00:12

Merge branch 'main' into sycai_expr

c36c447

TrevorBergeron approved these changes May 29, 2025

sycai merged commit 8e71b03 into main May 29, 2025
24 checks passed

sycai deleted the sycai_expr branch May 29, 2025 05:41
