refactor: Cache dtypes for scalar expressions for SQLGlot compiler #1759

Merged: sycai merged 26 commits into main from sycai_expr on May 29, 2025

Contributor

@sycai sycai commented May 21, 2025

Make DerefOp hold a field, id_or_field, and resolve the column ID to an actual Field instance upon node creation.

@product-auto-label product-auto-label bot added size: m Pull request size is medium. api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. labels May 21, 2025
Comment on lines 368 to 369
 class DerefOp(Expression):
-    """A variable expression representing an unbound variable."""
+    """A variable expression representing a column reference."""
Contributor

Can we just create a new FieldRefOp instead of overloading the existing minimal reference class with optional semantics?

Contributor Author

I think that would require a lot more code changes than necessary.

I updated the deref's attribute to this union type: ColumnId | Field. ColumnId means the type is unresolved; Field means it is resolved. This keeps the code change to a minimum.
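A minimal sketch of the union-typed attribute described above, assuming simplified stand-ins for ColumnId, Field, and DerefOp. These are illustrative classes, not the bigframes definitions: the `id` helper property and the string dtype are assumptions, and `is_type_resolved` borrows the property name that comes up later in this thread.

```python
import dataclasses
from typing import Union


# Hypothetical stand-ins for illustration; not the bigframes classes.
@dataclasses.dataclass(frozen=True)
class ColumnId:
    name: str


@dataclasses.dataclass(frozen=True)
class Field:
    id: ColumnId
    dtype: str  # a string stands in for a real dtype object
    nullable: bool = True


@dataclasses.dataclass(frozen=True)
class DerefOp:
    """A column reference that may or may not carry type information."""

    id_or_field: Union[ColumnId, Field]

    @property
    def id(self) -> ColumnId:
        # Both variants can report the column ID they point at.
        if isinstance(self.id_or_field, Field):
            return self.id_or_field.id
        return self.id_or_field

    @property
    def is_type_resolved(self) -> bool:
        # ColumnId means the type is unresolved; Field means it is known.
        return isinstance(self.id_or_field, Field)


unresolved = DerefOp(ColumnId("int64_col"))
resolved = DerefOp(Field(ColumnId("int64_col"), dtype="Int64"))
```

Callers that need the type still have to check `is_type_resolved` (or use `isinstance`) before reading it, which is the trade-off debated in the replies.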

Contributor

There is a certain irreducible complexity here. So something somewhere is checking. I'd rather have the callers deal with this than trying to paper over the differences with a single class.

Comment on lines 208 to 212
@abc.abstractmethod
def resolve_deferred_types(
    self, col_dtypes: Dict[ids.ColumnId, dtypes.ExpressionType]
) -> Expression:
    ...
Contributor

bind_refs should be able to do this, don't think we need a new method

Contributor Author

I think that would make things more complicated:

With bind_refs, we cannot bake in any short-circuit mechanism. For type resolution, if an expression already has a resolved type, we can skip the binding. This is crucial for gradually updating deref ops with types. We cannot do so with bind_refs because we don't know whether the caller intends to resolve types or to replace expressions.

Contributor

What we want is essentially this:

def resolved(expr, node) -> Expression:
   if expr.resolved:
      return expr
   else:
      return expr.bind_refs({id: ex.field_ref(node.id_to_field[id]) for id in node.ids})
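The helper above can be fleshed out into something runnable under a few assumptions. The classes below (Field, Deref, Add, Node) are hypothetical stand-ins with plain string column IDs, and this `bind_refs` is a simplified version written for the sketch, not the bigframes method.

```python
import dataclasses
from typing import Dict, Mapping, Tuple, Union


# Hypothetical stand-ins for illustration; not the bigframes classes.
@dataclasses.dataclass(frozen=True)
class Field:
    id: str
    dtype: str


@dataclasses.dataclass(frozen=True)
class Deref:
    ref: Union[str, Field]  # str = unresolved column ID, Field = resolved

    @property
    def resolved(self) -> bool:
        return isinstance(self.ref, Field)

    def bind_refs(self, bindings: Mapping[str, "Expr"]) -> "Expr":
        col_id = self.ref.id if isinstance(self.ref, Field) else self.ref
        return bindings[col_id]  # KeyError if the ID is not in the mapping


@dataclasses.dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"

    @property
    def resolved(self) -> bool:
        return self.left.resolved and self.right.resolved

    def bind_refs(self, bindings: Mapping[str, "Expr"]) -> "Expr":
        return Add(self.left.bind_refs(bindings), self.right.bind_refs(bindings))


Expr = Union[Deref, Add]


@dataclasses.dataclass(frozen=True)
class Node:
    fields: Tuple[Field, ...]

    @property
    def id_to_field(self) -> Dict[str, Field]:
        return {f.id: f for f in self.fields}


def resolve(expr: Expr, node: Node) -> Expr:
    if expr.resolved:
        return expr  # short-circuit: nothing to rebind
    return expr.bind_refs({i: Deref(f) for i, f in node.id_to_field.items()})


node = Node((Field("int64_col", "Int64"),))
expr = Add(Deref("int64_col"), Deref("int64_col"))
typed = resolve(expr, node)
```

In this sketch the short-circuit lives in the caller rather than in `bind_refs`, so `bind_refs` itself stays a plain substitution. A reference whose ID is absent from the node's schema raises `KeyError` here.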

Contributor Author

I have encountered an edge case where the ProjectionNode has a resolved expression whose column reference does not appear in the child node. The column ID is "level_0", so it looks like a default index.

For cases like this, this logic would still fail because "level_0" is not present in the map from column IDs to Fields.

Contributor

That sounds like a bug. If it doesn't fail later, that is due to sheer luck. Every reference should match a column in the child schema.
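One way to surface such a mismatch early is an explicit check at node creation. This is only a sketch of the invariant stated above: `check_refs` is a hypothetical helper, not something in the code base, and column IDs are plain strings here.

```python
from typing import Iterable


def check_refs(referenced: Iterable[str], child_columns: Iterable[str]) -> None:
    """Raise if any referenced column ID is missing from the child schema."""
    missing = sorted(set(referenced) - set(child_columns))
    if missing:
        raise ValueError(f"References not found in child schema: {missing}")


# Passes silently: the only reference is present in the child schema.
check_refs(["int64_col"], ["int64_col", "string_col"])
```

Calling it with a reference such as "level_0" that the child does not expose raises `ValueError` instead of letting the problem slip through to later stages.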

Contributor Author

Specifically, the case comes from this test:

def test_compile_projection(
    scalars_types_pandas_df: pd.DataFrame, compiler_session: bigframes.Session, snapshot
):
    bf_df = bpd.DataFrame(
        scalars_types_pandas_df[["int64_col"]], session=compiler_session
    )
    bf_df["int64_col"] = bf_df["int64_col"] + 1
    snapshot.assert_match(bf_df.sql, "out.sql")

Not sure whether this is supposed to happen.

Contributor Author

What we want is essentially this:

def resolved(expr, node) -> Expression:
   if expr.resolved:
      return expr
   else:
      return expr.bind_refs({id: ex.field_ref(node.id_to_field[id]) for id in node.ids})

This worked! Thanks

-        operand_types = tuple(
-            map(lambda x: x.output_type(input_types=input_types), self.inputs)
+    @functools.cached_property
+    def output_type(self) -> dtypes.ExpressionType | dtypes.AbsentDtype:
Contributor

Dealing with AbsentDtype in post-binding contexts will be a pain. Maybe a strict mode with a typing overload?

Contributor Author

Right. Removed the AbsentDtype

Instead, I added a property is_type_resolved, which behaves like that field in Spark.


id: ids.ColumnId
dtype: dtypes.ExpressionType | dtypes.AbsentDtype = dtypes.ABSENT_DTYPE
Contributor

I would prefer to just use the field object. Nullability information can be useful for some operation simplifications.
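As an illustration of the kind of simplification nullability can enable (an assumed example, not something this PR implements): a null check on a column known to be non-nullable can be folded to a constant. `Field` and `fold_isnull` below are hypothetical names.

```python
import dataclasses
from typing import Optional


# Hypothetical stand-in for illustration; not the bigframes Field class.
@dataclasses.dataclass(frozen=True)
class Field:
    name: str
    dtype: str
    nullable: bool


def fold_isnull(field: Field) -> Optional[bool]:
    """Return a constant for isnull(field) if it is known statically, else None."""
    if not field.nullable:
        return False  # a non-nullable column never contains nulls
    return None  # must be evaluated at query time


required = Field("int64_col", "Int64", nullable=False)
optional = Field("string_col", "String", nullable=True)
```

A bare dtype on the deref op would not carry enough information for this rewrite, whereas a Field does.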

Contributor Author

Now deref ops can hold either a ColumnId or a Field value.

@product-auto-label product-auto-label bot added size: l Pull request size is large. and removed size: m Pull request size is medium. labels May 22, 2025
@sycai sycai marked this pull request as ready for review May 23, 2025 20:32
@sycai sycai requested review from a team as code owners May 23, 2025 20:32
@sycai sycai requested a review from tswast May 23, 2025 20:32
@sycai sycai requested a review from TrevorBergeron May 23, 2025 20:32
TrevorBergeron previously approved these changes May 28, 2025
Comment on lines 471 to 478
@functools.cached_property
def output_type(self) -> dtypes.ExpressionType:
    if not self.is_type_resolved:
        raise ValueError(f"Type of expression {self.op.name} has not been fixed.")

    input_types = [input.output_type for input in self.inputs]

    return self.op.output_type(*input_types)
Contributor

Not sure this really needs to be cached. We will build a lot of expressions and not care about the type much of the time.

Contributor Author

This enables dynamic dispatch in the SQLGlot compiler.
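For context on the caching question, `functools.cached_property` is lazy: nothing is computed until the attribute is first read, and repeated reads reuse the stored value. The sketch below shows that behavior with a hypothetical `Expr` class and a toy type rule; string dtypes and the `computations` counter are assumptions made for the demonstration.

```python
import functools
from typing import Optional, Sequence


class Expr:
    """Hypothetical stand-in for a scalar expression; not the bigframes class."""

    def __init__(
        self,
        name: str,
        inputs: Sequence["Expr"] = (),
        leaf_type: Optional[str] = None,
    ) -> None:
        self.name = name
        self.inputs = tuple(inputs)
        self.leaf_type = leaf_type
        self.computations = 0  # counts how often the type is actually derived

    @property
    def is_type_resolved(self) -> bool:
        if not self.inputs:
            return self.leaf_type is not None
        return all(i.is_type_resolved for i in self.inputs)

    @functools.cached_property
    def output_type(self) -> str:
        if not self.is_type_resolved:
            raise ValueError(f"Type of expression {self.name} has not been fixed.")
        self.computations += 1
        if not self.inputs:
            assert self.leaf_type is not None
            return self.leaf_type
        # Toy rule: the result is Float64 if any input is, otherwise Int64.
        input_types = [i.output_type for i in self.inputs]
        return "Float64" if "Float64" in input_types else "Int64"


a = Expr("a", leaf_type="Int64")
b = Expr("b", leaf_type="Float64")
total = Expr("add", inputs=[a, b])
untyped = Expr("u")  # a leaf without a type: reading output_type raises
```

Expressions whose type is never read pay no cost, and a compiler that reads the type several times while choosing how to compile a node derives it only once. An exception raised inside the property is not cached, so a later read after resolution would recompute.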

@sycai sycai changed the title refactor: Cache dtypes for scalar expressions. refactor: Cache dtypes for scalar expressions for SQLGlot compiler May 28, 2025
TrevorBergeron previously approved these changes May 28, 2025
@sycai sycai merged commit 8e71b03 into main May 29, 2025
24 checks passed
@sycai sycai deleted the sycai_expr branch May 29, 2025 05:41
Labels
api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. size: l Pull request size is large.