refactor: Cache dtypes for scalar expressions for SQLGlot compiler #1759
bigframes/core/expression.py
Outdated
class DerefOp(Expression):
    """A variable expression representing an unbound variable."""
    """A variable expression representing a column reference."""
Can we just create a new FieldRefOp instead of overloading the existing minimal reference class with optional semantics?
I think that would require a lot more code changes than necessary.
I updated the deref's attribute to this union type: ColumnId | Field. ColumnId means the type is unresolved; Field means it is resolved. This keeps the code change to a minimum.
There is a certain irreducible complexity here, so something somewhere has to check. I'd rather have the callers deal with this than try to paper over the differences with a single class.
bigframes/core/expression.py
Outdated
@abc.abstractmethod
def resolve_deferred_types(
    self, col_dtypes: Dict[ids.ColumnId, dtypes.ExpressionType]
) -> Expression:
    ...
bind_refs should be able to do this; I don't think we need a new method.
I think that would make things more complicated:
We cannot bake in any short-circuit mechanism. For type resolution, if an expression already has a resolved type, we can skip the binding. This is crucial for gradually updating deref ops with types. We cannot do so with bind_refs because we don't know whether the caller's intention is to resolve types or to replace expressions.
What we want is essentially this:
def resolved(expr, node) -> Expression:
    if expr.resolved:
        return expr
    else:
        return expr.bind_refs({id: ex.field_ref(node.id_to_field[id]) for id in node.ids})
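The helper above can be exercised end to end with a small self-contained sketch. Every class here (Field, Deref, Const, Op, Node) is a hypothetical toy, not the bigframes implementation, and the assumption that bind_refs leaves unknown references untouched is mine.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Dict, Tuple, Union


@dataclass(frozen=True)
class Field:
    id: str
    dtype: str


@dataclass(frozen=True)
class Deref:
    ref: Union[str, Field]  # bare column id, or a resolved field

    @property
    def resolved(self) -> bool:
        return isinstance(self.ref, Field)

    def bind_refs(self, bindings: Dict[str, Expr]) -> Expr:
        key = self.ref.id if isinstance(self.ref, Field) else self.ref
        # Assumption: references missing from the bindings are left as-is.
        return bindings.get(key, self)


@dataclass(frozen=True)
class Const:
    value: object

    @property
    def resolved(self) -> bool:
        return True

    def bind_refs(self, bindings: Dict[str, Expr]) -> Expr:
        return self


@dataclass(frozen=True)
class Op:
    name: str
    inputs: Tuple[Expr, ...]

    @property
    def resolved(self) -> bool:
        return all(i.resolved for i in self.inputs)

    def bind_refs(self, bindings: Dict[str, Expr]) -> Expr:
        return Op(self.name, tuple(i.bind_refs(bindings) for i in self.inputs))


Expr = Union[Deref, Const, Op]


@dataclass(frozen=True)
class Node:
    fields: Tuple[Field, ...]

    @property
    def id_to_field(self) -> Dict[str, Field]:
        return {f.id: f for f in self.fields}


def resolve(expr: Expr, node: Node) -> Expr:
    """Short-circuit when already resolved; otherwise bind ids to fields."""
    if expr.resolved:
        return expr
    return expr.bind_refs({cid: Deref(f) for cid, f in node.id_to_field.items()})


child = Node((Field("int64_col", "Int64"),))
expr = Op("add", (Deref("int64_col"), Const(1)))
out = resolve(expr, child)
```

The short-circuit lives in the caller, so bind_refs itself keeps a single meaning (replace references) and does not need to know why it is being called.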
I have encountered an edge case where the ProjectionNode has a resolved expression whose column reference does not appear in the child node. The column ID is "level_0", so it looks like a default index.
For cases like this, this logic would still fail because "level_0" is not present in the map from column IDs to Fields.
That sounds like a bug. If it doesn't fail later, that is due to sheer luck. Every reference should match a column in the child schema.
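As a rough illustration of that invariant, a check along these lines would flag the dangling reference. The function name and the use of plain strings for column IDs are hypothetical.

```python
from typing import Iterable, Set


def unmatched_refs(
    referenced_ids: Iterable[str], child_schema_ids: Iterable[str]
) -> Set[str]:
    """Return the referenced column ids that the child schema does not provide."""
    return set(referenced_ids) - set(child_schema_ids)


# The edge case described above: "level_0" is referenced but not in the child.
missing = unmatched_refs(["int64_col", "level_0"], ["int64_col"])
```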
Specifically, the case comes from this test:
python-bigquery-dataframes/tests/unit/core/compile/sqlglot/test_compile_projection.py
Lines 24 to 31 in 50dca4c
def test_compile_projection(
    scalars_types_pandas_df: pd.DataFrame, compiler_session: bigframes.Session, snapshot
):
    bf_df = bpd.DataFrame(
        scalars_types_pandas_df[["int64_col"]], session=compiler_session
    )
    bf_df["int64_col"] = bf_df["int64_col"] + 1
    snapshot.assert_match(bf_df.sql, "out.sql")
What we want is essentially this:
def resolved(expr, node) -> Expression:
    if expr.resolved:
        return expr
    else:
        return expr.bind_refs({id: ex.field_ref(node.id_to_field[id]) for id in node.ids})
This worked! Thanks
bigframes/core/expression.py
Outdated
operand_types = tuple(
    map(lambda x: x.output_type(input_types=input_types), self.inputs)
@functools.cached_property
def output_type(self) -> dtypes.ExpressionType | dtypes.AbsentDtype:
Dealing with AbsentDtype in post-binding contexts will be a pain. Maybe a strict mode with a typing overload?
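One possible reading of "strict mode with a typing overload" is sketched below. TypedExpr and its strict flag are hypothetical and exist only to show how typing.overload lets the strict call site see a non-optional return type.

```python
from typing import Literal, Optional, overload


class TypedExpr:
    """Hypothetical expression whose dtype may not be known yet."""

    def __init__(self, dtype: Optional[str]) -> None:
        self._dtype = dtype

    @overload
    def output_type(self, strict: Literal[True] = ...) -> str: ...

    @overload
    def output_type(self, strict: Literal[False]) -> Optional[str]: ...

    def output_type(self, strict: bool = True) -> Optional[str]:
        # Strict callers (post-binding) never see a missing type; they get an error.
        if strict and self._dtype is None:
            raise ValueError("Type of expression has not been resolved.")
        return self._dtype
```

With this shape, post-binding code calls output_type() and a type checker treats the result as str, while pre-binding code opts in to Optional with strict=False.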
Right. Removed the AbsentDtype. Instead, I added a property is_type_resolved, which behaves like that field in Spark.
bigframes/core/expression.py
Outdated
id: ids.ColumnId
dtype: dtypes.ExpressionType | dtypes.AbsentDtype = dtypes.ABSENT_DTYPE
I would prefer to just use the field object. Nullability information can be useful for some operation simplifications.
Now deref ops can hold either a ColumnId or a Field value.
bigframes/core/expression.py
Outdated
@functools.cached_property
def output_type(self) -> dtypes.ExpressionType:
    if not self.is_type_resolved:
        raise ValueError(f"Type of expression {self.op.name} has not been fixed.")

    input_types = [input.output_type for input in self.inputs]

    return self.op.output_type(*input_types)
Not sure this really needs to be cached. We will build a lot of expressions and not care about the type much of the time.
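For context on the cost side of this question: functools.cached_property is lazy, so the type is computed only on first access and expressions whose type is never requested pay nothing beyond the descriptor. The toy classes below are hypothetical and only mirror the shape of the diff above; the call counter is there purely to make the behavior observable.

```python
import functools
from dataclasses import dataclass
from typing import Optional, Tuple

TYPE_COMPUTATIONS = {"count": 0}  # instrumentation for the demo only


@dataclass(frozen=True)
class Leaf:
    dtype: Optional[str]  # None means the type is not resolved yet

    @property
    def is_type_resolved(self) -> bool:
        return self.dtype is not None

    @property
    def output_type(self) -> str:
        if self.dtype is None:
            raise ValueError("Type of leaf has not been fixed.")
        return self.dtype


@dataclass(frozen=True)
class OpExpr:
    op_name: str
    inputs: Tuple[Leaf, ...]

    @property
    def is_type_resolved(self) -> bool:
        return all(i.is_type_resolved for i in self.inputs)

    @functools.cached_property
    def output_type(self) -> str:
        if not self.is_type_resolved:
            raise ValueError(f"Type of expression {self.op_name} has not been fixed.")
        TYPE_COMPUTATIONS["count"] += 1
        input_types = [i.output_type for i in self.inputs]
        return input_types[0]  # toy rule: result type equals first operand type


typed = OpExpr("add", (Leaf("Int64"), Leaf("Int64")))
untyped = OpExpr("add", (Leaf(None), Leaf("Int64")))
```

Constructing typed does not trigger any type computation; the first read of typed.output_type computes it once, and later reads hit the cache. A raised ValueError is not cached, so an unresolved expression can still be resolved and queried later.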
This is for enabling dynamic dispatch of the SQLGlot compiler.
Make DerefOp hold a field id_or_field, and resolve the column ID to the actual Field instance upon node creation.
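To illustrate why resolved types matter for dispatch, here is a rough sketch of type-keyed compilation. The registry, the dtype names, and the emitted SQL strings are all hypothetical; the actual SQLGlot compiler in this PR may be structured quite differently.

```python
from typing import Callable, Dict, Tuple

# Hypothetical registry keyed by (op name, operand dtype).
_COMPILERS: Dict[Tuple[str, str], Callable[[str, str], str]] = {
    ("add", "Int64"): lambda left, right: f"{left} + {right}",
    ("add", "string"): lambda left, right: f"CONCAT({left}, {right})",
}


def compile_binary(op: str, dtype: str, left: str, right: str) -> str:
    """Pick the SQL emitter from the operand dtype, which must already be known."""
    try:
        emit = _COMPILERS[(op, dtype)]
    except KeyError:
        raise NotImplementedError(f"No compiler for {op} on {dtype}") from None
    return emit(left, right)


int_sql = compile_binary("add", "Int64", "`int64_col`", "1")
str_sql = compile_binary("add", "string", "`string_col`", "'x'")
```

The same op name compiles to different SQL depending on the operand dtype, which is only possible if every column reference already carries its Field by the time the compiler sees it.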