
Consider introducing nullability on SqlExpression #33889

Open
Tracked by #34524
roji opened this issue Jun 4, 2024 · 15 comments

roji (Member) commented Jun 4, 2024

The following are just some architecture thoughts at this point; we shouldn't invest time in this in the near future.

Our current query architecture doesn't represent nullability in the tree; certain specific nodes do (e.g. ColumnExpression has IsNullable), but for the rest, it's impossible for translation/visitation code to know whether it's dealing with a node that could contain nulls. Instead, our current design concentrates nullability knowledge at a single visitation step - SqlNullabilityProcessor - where nullability is calculated as part of the visitation pass (in ephemeral state), rather than represented inside the tree itself.

One negative result of this is that we can't know that nodes aren't nullable in earlier phases (e.g. in translation), which prevents some possible SQL optimizations - we sometimes do best-effort checks for non-nullable columns and optimize for those, but otherwise can't do anything for more complex expression types.

We could introduce nullability information on SqlExpression itself, alongside (or possibly even inside) the type mapping. Note that we currently disallow nullable value types on SqlExpression; we'd start allowing that, and introduce an additional bool flag for reference types only (since reference nullability isn't represented on .NET Type). Bubbling nullability up the tree would happen as nodes are constructed; for example, when a SqlBinaryExpression is constructed, at that point we'd calculate its nullability based on its operands - similar to what we do today for type mapping. Note that we propose to do the same for collations in #29620.
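
[Editor's note] A minimal sketch of that idea, using hypothetical toy types rather than EF Core's actual SqlExpression API: nullability is computed bottom-up, once, as each node is constructed - the same way type mappings are inferred today.

public abstract class SqlNode
{
    protected SqlNode(bool isNullable) => IsNullable = isNullable;

    // For reference types this is the extra bool flag mentioned above;
    // for value types a Nullable<T> CLR type could carry the same information.
    public bool IsNullable { get; }
}

public sealed class ColumnNode : SqlNode
{
    public ColumnNode(string name, bool isNullable) : base(isNullable) => Name = name;
    public string Name { get; }
}

public sealed class BinaryNode : SqlNode
{
    // A binary operator such as + yields NULL if either operand is NULL, so the
    // node's nullability is fixed here, at construction, and bubbles up naturally
    // as more nodes are composed on top.
    public BinaryNode(SqlNode left, SqlNode right)
        : base(left.IsNullable || right.IsNullable)
        => (Left, Right) = (left, right);

    public SqlNode Left { get; }
    public SqlNode Right { get; }
}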

One problem with the above approach is that we have no knowledge of parameter nullability anywhere in the pipeline until SqlNullabilityProcessor; this processor runs after our 2nd-level cache, which is there precisely to sniff parameter nullability and cache SQL based on that. In other words, in the translation phase, all parameters would have to be considered nullable, and the nullability processor would have to recalculate and bubble up nullability anyway, because at that point parameter nullability is known. The mechanism for calculating and bubbling up nullability could be reused, though, between translation and SqlNullabilityProcessor (i.e. both would use SqlExpressionFactory, which would do this). In effect, SqlNullabilityProcessor would simply replace a nullable SqlParameterExpression with a non-nullable one, and then bubble that up, reconstructing nodes on top as necessary.

An additional thing to keep in mind is that any replacing of a SQL node in any visitor anywhere would in theory require recalculating nullability up the tree - unless we decide that nullability changes aren't supported. Note that we already have this problem with type mappings, at least in principle: if some visitor modifies the type mapping of some node, other nodes whose type mappings depend on it don't change, and could be incorrect. We could think of a general visitor architecture to deal with this, but this would be quite a big change.

/cc @maumar @ranma42

ranma42 (Contributor) commented Jun 4, 2024

The main advantage of placing nullability information in the expression is that it would be readily available across the pipeline.
There is at least one major disadvantage: it would be harder to track contextual nullability (which would be a regression from the current state).

For example, currently the SqlNullabilityProcessor, when visiting

CASE
  WHEN x.NullableInt IS NOT NULL THEN x.NullableInt + 1
  ELSE x.NullableInt - 1
END

propagates from the WHEN test to the x.NullableInt + 1 subexpression the fact that x.NullableInt cannot be null (conversely, IIRC, it does not propagate to the ELSE branch the fact that x.NullableInt is null... room for improvement ;) ).

Obviously it would be impossible to properly perform this kind of non-local analysis while constructing the expressions, assuming they are immutable (hence x.NullableInt is determined to be nullable at construction time) and/or that subexpressions can be re-used (even if SqlExpression were mutable, reusing the same x.NullableInt in the three places where it appears in the example would make it impossible to state that the first occurrence might be null, the second one is guaranteed to be non-null and the third one is guaranteed to be null).
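
[Editor's note] To illustrate with a toy version (reusing the hypothetical node types from the earlier sketch; the real SqlNullabilityProcessor is far more elaborate): the contextual pass carries ephemeral state alongside the visit, which a flag baked into an immutable, shareable node cannot express.

using System.Collections.Generic;

// Columns known to be non-null in the branch currently being visited.
var nonNullColumns = new HashSet<string>();

bool IsNullable(SqlNode node) => node switch
{
    ColumnNode c => c.IsNullable && !nonNullColumns.Contains(c.Name),
    BinaryNode b => IsNullable(b.Left) || IsNullable(b.Right),
    _ => true, // be conservative about anything else
};

// While visiting "WHEN x.NullableInt IS NOT NULL THEN ...", the processor adds
// "x.NullableInt" to the set; the very same ColumnNode instance is then reported
// non-nullable inside the THEN branch and nullable everywhere else.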

Note that there is still value in performing a (contextless) bottom-up nullability analysis; what I am worried about is that

  • doing only that would regress the current translation
  • having 2 different nullability analyses (one contextual, one contextless) and passing one to all of the pipeline while sometimes relying on the other one might make it harder to reason about what is going on and why (especially if this crossed barriers such as core library / external providers)

> An additional thing to keep in mind is that any replacing of a SQL node in any visitor anywhere would in theory require recalculating nullability up the tree - unless we decide that nullability changes aren't supported. Note that we already have this problem with type mappings, at least in principle: if some visitor modifies the type mapping of some node, other nodes whose type mappings depend on it don't change, and could be incorrect. We could think of a general visitor architecture to deal with this, but this would be quite a big change.

I am not really worried about this: as long as the transformation is replacing a node with an equivalent one, the nullability analysis would not need to be invalidated/updated (actually, if the analysis is contextual, it might need to run on the sub-expression, not on the upper tree).
The cases where the replacement is not equivalent should be looked at with great attention; they might indeed invalidate the analysis, but they also risk changing the semantics of the whole query in other ways.
Do you know of some cases where EFCore does this kind of transformation?

(I am trying to apply standard notions from data-flow analysis to the nullability analysis performed by EFCore; I am not yet sure about the context it takes into account nor about the propagation of the information, but I still believe that comparing it to the standard/general framework is both convenient and useful)

roji (Member, Author) commented Jun 4, 2024

> Obviously it would be impossible to properly perform this kind of non-local analysis while constructing the expressions, assuming they are immutable (hence x.NullableInt is determined to be nullable at construction time) and/or that subexpressions can be re-used (even if SqlExpression were mutable, reusing the same x.NullableInt in the three places where it appears in the example would make it impossible to state that the first occurrence might be null, the second one is guaranteed to be non-null and the third one is guaranteed to be null).

On this, take a look at what we do with type mapping inference... When you do Where(b => b.Name == foo), SqlExpressionFactory applies the type mapping from b.Name across to foo, which is what ensures that e.g. any value converter on b.Name is properly applied to foo before it's sent to the database. In other words, we already (immutably) apply contextual information on an existing expression, replacing a SqlExpression that has no known type mapping with one that does. Note that this can also bubble down if instead of foo we have a more complex subtree.

In other words, to go back to your CASE/WHEN example above, applying the knowledge that x.NullableInt isn't null to x.NullableInt + 1 doesn't seem very far from what we're already doing for type mapping... Does that make sense?
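
[Editor's note] The shape of that inference, heavily simplified (hypothetical types; the real logic lives in SqlExpressionFactory and handles many more cases):

public sealed record TypeMapping(string StoreType);

public sealed record Expr(string Sql, TypeMapping? Mapping)
{
    public Expr WithMapping(TypeMapping? mapping) => this with { Mapping = mapping };
}

public static class Inference
{
    // When building "left == right", whichever side already has a type mapping
    // lends it to the side that doesn't - contextual information applied
    // immutably, by rewriting the mapping-less subtree after construction.
    public static (Expr Left, Expr Right) Infer(Expr left, Expr right)
    {
        var mapping = left.Mapping ?? right.Mapping;
        return (left.Mapping is null ? left.WithMapping(mapping) : left,
                right.Mapping is null ? right.WithMapping(mapping) : right);
    }
}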

> I am not really worried about this: as long as the transformation is replacing a node with an equivalent one, the nullability analysis would not need to be invalidated/updated (actually, if the analysis is contextual, it might need to run on the sub-expression, not on the upper tree).

I'm not sure I follow... Imagine that some post-processing visitor somewhere decides to replace some non-nullable string node with a nullable string node (not sure if we have a case like that). If that node was embedded in e.g. a concatenation, then the result of the concatenation must now become nullable as well, right?

> (I am trying to apply standard notions from data-flow analysis to the nullability analysis performed by EFCore; I am not yet sure about the context it takes into account nor about the propagation of the information, but I still believe that comparing it to the standard/general framework is both convenient and useful)

Absolutely, I very much welcome this viewpoint from "real" compilers. I'm sure we have much to learn (though in some cases our problem area is different and can justify divergences).

ranma42 (Contributor) commented Jun 4, 2024

> On this, take a look at what we do with type mapping inference... When you do Where(b => b.Name == foo), SqlExpressionFactory applies the type mapping from b.Name across to foo, which is what ensures that e.g. any value converter on b.Name is properly applied to foo before it's sent to the database. In other words, we already (immutably) apply contextual information on an existing expression, replacing a SqlExpression that has no known type mapping with one that does. Note that this can also bubble down if instead of foo we have a more complex subtree.
>
> In other words, to go back to your CASE/WHEN example above, applying the knowledge that x.NullableInt isn't null to x.NullableInt + 1 doesn't seem very far from what we're already doing for type mapping... Does that make sense?

I was incorrectly assuming that SqlExpressions were simple ASTs, while it looks like they also convey some semantic analysis results.

IIUC this approach works by instantiating new nodes with the appropriate type mapping, effectively solving the reuse problem by creating separate instances; in a sense, it is "transforming" the original expression by rewriting it with the typeMapping applied.
This approach definitely makes sense and can be used in general for any property, but some care should be taken if propagation is "non-local" (not for correctness, but for performance reasons).

It should be noted that as the information is collected, the AST needs to be re-visited and rewritten. For example if, after building an expression X (maybe with a biggish AST), the analysis found that the NullableInt column is actually non-null in the context being built (WHEN NullableInt IS NOT NULL THEN X), the expression builder would be required to (see the sketch after this list):

  1. search for all of the occurrences of NullableInt in X
  2. create corresponding AST nodes that are marked as non-nullable
  3. re-create all of the ancestors from those nodes up to the context that is currently being introduced

If no additional data structure is used, the search can be implemented with a simple visit, but that would mean that each enrichment of the context could require O(N) operations, effectively making the worst case for the construction of trees O(N^2).
The construction of new nodes would have a smaller impact (except for very deep list-like trees, which would also cause O(N) allocations), but it might cause some memory churn.
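
[Editor's note] A sketch of those three steps (toy node types from the earlier sketch again, not EF Core code); untouched subtrees are reused, while every ancestor of a changed occurrence is re-created:

public static class Rewriter
{
    public static SqlNode MarkNonNull(SqlNode node, string column) => node switch
    {
        // Steps 1 and 2: replace each matching occurrence with a non-nullable copy.
        ColumnNode c when c.Name == column && c.IsNullable
            => new ColumnNode(c.Name, isNullable: false),
        BinaryNode b
            => Rebuild(b, MarkNonNull(b.Left, column), MarkNonNull(b.Right, column)),
        _ => node,
    };

    // Step 3: re-create the ancestors of any change (which also recomputes their
    // nullability); reuse the node as-is when nothing changed below it.
    private static SqlNode Rebuild(BinaryNode b, SqlNode left, SqlNode right)
        => ReferenceEquals(left, b.Left) && ReferenceEquals(right, b.Right)
            ? b
            : new BinaryNode(left, right);
}

Each enrichment of the context triggers one such full visit of the subtree, which is where the O(N)-per-step (and O(N^2) worst-case) cost described above comes from.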

If I understand ApplyTypeMapping correctly, creating the expression b.Name == foo might rewrite both the LHS and RHS (recursively), but on each side it immediately stops if a typeMapping is found; I believe the common case (maybe even an invariant) is that the recursion won't ever be very deep.

I expect nullability propagation to often be "less local", i.e. the WHEN clause might be constraining a node far deep in the THEN sub-expression; this would involve revisiting/rewriting the full subtree (as opposed to a "local" change, where only the top expression in the THEN would be replaced and its sub-expressions reused).

Note that as long as the expression trees are smallish, the approach used for type mappings works just fine and can be applied to any property with no problem at all (regardless of changes being local); it is definitely possible that I am overthinking this as I don't yet have a proper grasp of the common patterns/sizes/... of EFCore queries and of the placement/effectiveness of its caches.

> I am not really worried about this: as long as the transformation is replacing a node with an equivalent one, the nullability analysis would not need to be invalidated/updated (actually, if the analysis is contextual, it might need to run on the sub-expression, not on the upper tree).

> I'm not sure I follow... Imagine that some post-processing visitor somewhere decides to replace some non-nullable string node with a nullable string node (not sure if we have a case like that). If that node was embedded in e.g. a concatenation, then the result of the concatenation must now become nullable as well, right?

If the assumption is that the new node has the same behavior as the old node, then they really should have the same nullability.
Let's see if I can make an example of the replacement you suggest (the replacement is not really coming from EFCore, just a random example that might match your question):

CASE
  WHEN nonNullableComplexExpression(x.NonNullableColumnA, x.NonNullableColumnB) = x.NullableString
  THEN nonNullableComplexExpression(x.NonNullableColumnA, x.NonNullableColumnB) || 'foo'
END

could be transformed into

CASE
  WHEN nonNullableComplexExpression(x.NonNullableColumnA, x.NonNullableColumnB) = x.NullableString
  THEN x.NullableString || 'foo'
END

i.e. the nonNullableComplexExpression(...) could be replaced with x.NullableString... but obviously in that context this expression would be non-nullable, hence the concatenation would also remain non-nullable.

> (I am trying to apply standard notions from data-flow analysis to the nullability analysis performed by EFCore; I am not yet sure about the context it takes into account nor about the propagation of the information, but I still believe that comparing it to the standard/general framework is both convenient and useful)

> Absolutely, I very much welcome this viewpoint from "real" compilers. I'm sure we have much to learn (though in some cases our problem area is different and can justify divergences).

I agree, there is quite a lot besides the simple model "expression language", but re-using some knowledge from the field of compilers/languages makes the EFCore library much more approachable (at least to me).
OTOH I don't yet know many of the peculiarities of EFCore, so I might sometimes be working under incorrect assumptions (just like I was mistaking SqlExpressions for simple ASTs). I will try my best to provide valuable contributions (and waste as little as possible of your and the team's time 😇 ), but I will certainly need further guidance as I learn about EFCore.

roji (Member, Author) commented Jun 5, 2024

@ranma42 thank you for all your thoughts, I pretty much agree with everything.

> It should be noted that as the information is collected, the AST needs to be re-visited and rewritten.

That's right. Here's an example relevant for today: since EF 8.0's support for primitive (scalar) collections, we now have to deal with the following sort of queries: Where(b => someList.Where(i => i > 8).Contains(b.Foo)). Now, when we encounter someList in translation, we don't yet know what type mapping it has - we only have a CLR type. As we continue translating, we encounter the Where(), which still doesn't provide us with any information. But when we finally get to Contains, we see that the elements in someList are compared to b.Foo, at which point we can finally infer a collection type mapping over b.Foo's type mapping. This "advanced" type of type mapping inference doesn't actually happen during translation (to avoid repeated recursive visitation each time we figure out a type mapping), but is rather done in a single pass in post-processing (take a look at RelationalTypeMappingPostprocessor in the main branch if you're interested).

(@cincuranet you may be interested in the above too - it's an important part of the machinery behind primitive/scalar collections)

> If I understand ApplyTypeMapping correctly, creating the expression b.Name == foo might rewrite both the LHS and RHS (recursively), but on each side it immediately stops if a typeMapping is found; I believe the common case (maybe even an invariant) is that the recursion won't ever be very deep.

Yes, and this is crucial - in the current scheme, if a node already has a type mapping, that's never clobbered, so we only ever recurse once into a given sub-tree; whereas you're right that with nullability analysis we'd have to potentially recurse multiple times. And you're also right that contrary to regular compiler scenarios, in the EF case the trees may be shallow enough that repeated visitation for these cases may be acceptable.

But I'd like to stress again that we currently assume that type mappings never change once determined; that seems to have worked out so far (though we know we have various bugs around type mapping), but one could imagine a situation where some post-processing step decides to change a type mapping. If that ever needs to happen, we're currently lacking the machinery to propagate the effects of that change throughout the tree.

> If the assumption is that the new node has the same behavior as the old node, then they really should have the same nullability.

That might be true; and if that's the case, then yeah, once a node's type mapping and/or nullability has been determined, it never changes. But I'm not sure it is in the general case - I just haven't thought about it enough.

> I agree, there is quite a lot besides the simple model "expression language", but re-using some knowledge from the field of compilers/languages makes the EFCore library much more approachable (at least to me).

💯

> I will try my best to provide valuable contributions (and waste as little as possible of your and the team's time 😇 ), but I will certainly need further guidance as I learn about EFCore.

Nothing you've submitted (or written) so far has been a waste of time - quite to the contrary, it's great to engage at such a deep level. Please continue to do so!

ranma42 (Contributor) commented Jul 3, 2024

Another option I am investigating is a context-independent notion of nullability (which needs to be richer than a simple boolean).
As a sketch (unfortunately the whole thing is a little more convoluted), let's say that I am looking for the nullability of x + @y (x is a column, @y is a parameter). The current approach relies on:

  • knowing that the x column/@y parameter cannot ever contain a NULL value (aka ColumnExpression.IsNullable/SqlParameterExpression.IsNullable)
  • being in a visitor that keeps a list of columns that are known to be NULL/non-NULL (afaik currently only SqlNullabilityProcessor)

I am considering using an approach similar to SqlNullabilityProcessor.ProcessNullNotNull to extract an expression like:

  • x IS NOT NULL AND @y IS NOT NULL if neither is known to be non-null
  • @y IS NOT NULL if only the column is known to be non-null
  • TRUE if both are known to be non-null (i.e. the sum is known to be non-null)

This has the advantage of being independent of the context. This makes it easy to have this as a property of the expression (possibly even computed just once, regardless of the expression being re-used or moved around; it's trivially immutable) and to use it across all of the pipeline (possibly enabling reasoning on nullability before choosing the values of the parameters).
Assuming reasonable representations for these expressions, it should also be trivial to combine them with the current context, apply the usual simplifications and get either a TRUE/FALSE constant (meaning that the expression is definitely NULL/non-NULL) or an expression (meaning that the expression would be NULL in some cases).
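
[Editor's note] A very rough sketch of what such a property could look like (hypothetical types, loosely inspired by ProcessNullNotNull; not a worked-out design): each expression exposes the condition under which it is non-null, eagerly simplified as conditions are combined.

public abstract record NonNull;
public sealed record Always : NonNull;                            // TRUE: known non-null
public sealed record IsNotNull(string Operand) : NonNull;         // "x IS NOT NULL"
public sealed record Both(NonNull Left, NonNull Right) : NonNull; // AND

public static class Nullability
{
    // x + @y is non-null exactly when both operands are.
    public static NonNull And(NonNull left, NonNull right) => (left, right) switch
    {
        (Always, _) => right, // TRUE AND c  =>  c
        (_, Always) => left,  // c AND TRUE  =>  c
        _ => new Both(left, right),
    };
}

// neither known non-null: And(new IsNotNull("x"), new IsNotNull("@y"))  =>  Both(...)
// column known non-null:  And(new Always(), new IsNotNull("@y"))        =>  IsNotNull("@y")
// both known non-null:    And(new Always(), new Always())               =>  Always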

roji (Member, Author) commented Jul 3, 2024

@ranma42 is this a (recursive) calculation which any arbitrary visitor would do at a point where it's interested in knowing the nullability of something? Does the information somehow get reused across visitors, or would each visitor need to recompute it? Even within a single visitor, would it be bubbled up in some way (as nullability is currently bubbled up in SqlNullabilityProcessor)?

A quick and dirty pseudo-code sample may help illustrate what you have in mind...

ranma42 (Contributor) commented Jul 3, 2024

> @ranma42 is this a (recursive) calculation which any arbitrary visitor would do at a point where it's interested in knowing the nullability of something?

It is a (recursive) construction of a property that is inherent to the expression, not computed by a visitor (would it make sense to compute it lazily? maybe).

> Does the information somehow get reused across visitors, or would each visitor need to recompute it?

The information is only related to the expression; any visitor that cares about it can re-use it and it never makes sense to recompute it (assumption: immutable expressions).

> Even within a single visitor, would it be bubbled up in some way (as nullability is currently bubbled up in SqlNullabilityProcessor)?

It is associated with the expressions; in a sense it "bubbles up" with the visit... although that is quite a weird perspective.

> A quick and dirty pseudo-code sample may help illustrate what you have in mind...

ranma42@69069af might help a little bit.
Note that it requires some care, for example to avoid infinite recursion.
I might be able to provide more interesting (and possibly working) examples in a future branch in which I start experimenting with this in a more structured way.

roji (Member, Author) commented Jul 4, 2024

OK, thanks for the code - so yeah, that corresponds indeed more or less to what I had in mind when opening the issue: we calculate the nullability of a node when it is constructed - based on its operands - and that naturally bubbles up as we compose more nodes on top. It's similar to how we deal with the type mapping right now, though note that the type mapping sometimes can't just be calculated from its constituent operands - it must also be inferred e.g. from another operand that the node is being compared to.

That connects to your point about contextual/contextless nullability analysis above. Following the type inference model, we can still have a system where we rebuild nodes in order to apply contextual nullability information. For example, given your example above:

CASE
  WHEN x.NullableInt IS NOT NULL THEN x.NullableInt + 1
  ELSE x.NullableInt - 1
END

The factory method for the CASE/WHEN as a whole could get information about non-nullable columns that's bubbled up via the x.NullableInt IS NOT NULL node, and then re-process the x.NullableInt + 1 node, rewriting it in light of the contextual information (that's what we do for type mapping inference, though obviously here things are more complex). We'd have to carefully consider performance here, of course, to not do endless revisitations - this is a danger whenever we introduce recursive behavior on construction (just like #34149). This could be at least partly mitigated by calculating richer nullability information, like which currently nullable nodes are contained somewhere within a node; this would allow us to revisit only nodes which contain a nullable node which new context information tells us can no longer be nullable. But it would require careful thought to keep it from getting out of hand.
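
[Editor's note] Sketching that factory-level rewrite with the toy types from the earlier sketches (hypothetical; a real CASE factory would have to handle far more shapes of test):

public sealed class IsNotNullNode : SqlNode
{
    public IsNotNullNode(SqlNode operand) : base(isNullable: false) => Operand = operand;
    public SqlNode Operand { get; }
}

public sealed class CaseNode : SqlNode
{
    public CaseNode(SqlNode test, SqlNode whenTrue, SqlNode orElse)
        : base(whenTrue.IsNullable || orElse.IsNullable)
        => (Test, WhenTrue, OrElse) = (test, whenTrue, orElse);

    public SqlNode Test { get; }
    public SqlNode WhenTrue { get; }
    public SqlNode OrElse { get; }
}

public static class CaseFactory
{
    // The non-null fact carried by the test is applied contextually to the branch
    // where it holds, reusing the MarkNonNull rewrite sketched earlier.
    public static CaseNode Case(SqlNode test, SqlNode whenTrue, SqlNode orElse)
        => test is IsNotNullNode { Operand: ColumnNode c }
            ? new CaseNode(test, Rewriter.MarkNonNull(whenTrue, c.Name), orElse)
            : new CaseNode(test, whenTrue, orElse);
}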

BTW re immutability, while it's true that SelectExpression is mutable, that's the case only when the Select is the (current) top-level thing in the tree; the moment something else is composed on top of the SelectExpression (e.g. it is pushed down into a subquery, enclosed in an InExpression or ExistsExpression), it becomes immutable (at least that's how the design currently works). So it may actually be safe to cache information like nullability and hashcodes in composing nodes.

roji (Member, Author) commented Jul 4, 2024

One thing to keep in mind is that any non-trivial thing we add to SqlExpression creates complications for precompiled queries (AOT). When we can't precompile a final SQL representation (because there are too many nullable parameters, or because we must sniff parameter values and can't cache), the SQL expression tree must itself be serialized in code form, so that the last part of the query pipeline can be executed on it at runtime, in order to produce the final SQL then. So we'd need to generate code which preserves all the extra information proposed above (obviously a simple IsNull is easy, but starting to reference contextual information is something else).

FWIW we already have this problem with type mapping, but this rich contextual nullability information makes it worse... In that sense, just doing the analysis in a single visitation step (SqlNullabilityProcessor) rather than storing it in the tree makes things easier.

roji (Member, Author) commented Jul 4, 2024

And one last thought - I'd be interested in seeing which optimizations something like this would actually unlock in practice... The direction definitely makes sense in principle, but I'd want to see which optimizations we can't currently apply in SqlNullabilityProcessor (or which would unlock further optimizations if applied earlier, at construction). We also need to keep in mind that we wouldn't be able to reason about parameter nullability in any case before the SQL nullability cache.

ranma42 (Contributor) commented Jul 4, 2024

The idea is not to remove SqlNullabilityProcessor or nullability information, but rather to represent it in a richer way in part of the pipeline. When lowering to "boolean nullability" (which could happen in the SqlNullabilityProcessor) this richness would be used for optimizations (or null rewriting, as needed) and flattened to a boolean value.

I am not sure that adding this information to SqlExpressions is easy/feasible; it is one of the properties I am investigating on a possible IR.

> That connects to your point about contextual/contextless nullability analysis above. Following the type inference model, we can still have a system where we rebuild nodes in order to apply contextual nullability information. For example, given your example above:
>
> CASE
>   WHEN x.NullableInt IS NOT NULL THEN x.NullableInt + 1
>   ELSE x.NullableInt - 1
> END
>
> The factory method for the CASE/WHEN as a whole could get information about non-nullable columns that's bubbled up via the x.NullableInt IS NOT NULL node, and then re-process the x.NullableInt + 1 node, rewriting it in light of the contextual information.

The interesting rewrite here is on the other branch (and a bit more convoluted):

  • in the ELSE we know NOT(x.NullableInt IS NOT NULL), aka x.NullableInt IS NULL; we can propagate the constant:

CASE
  WHEN x.NullableInt IS NOT NULL THEN x.NullableInt + 1
  ELSE NULL - 1
END

  • we also know NULL - 1 is just NULL; we can fold the -:

CASE
  WHEN x.NullableInt IS NOT NULL THEN x.NullableInt + 1
  ELSE NULL
END

  • finally, since x.NullableInt + 1 is NULL exactly when x.NullableInt IS NULL, the whole CASE collapses to:

x.NullableInt + 1

For this specific case, propagating the x.NullableInt IS NOT NULL into x.NullableInt + 1 is really a no-op, but using its negation in the ELSE branch (and then some other optimizations) is really effective.

Going back to the wider topic, I acknowledge that a richer notion of nullability is more complex to construct, but it also opens the way for another class of optimizations; in particular it unifies the propagation of nullability both from the context into the sub-expression and the other way around (which is something I hit and mark as FIXME in https://github.com/dotnet/efcore/pull/34127/files#diff-74ef06f9894035919a0ed82dcc0afbf2502715e712c0989cd4bc0a04a742c092R634).

ranma42 (Contributor) commented Jul 4, 2024

> And one last thought - I'd be interested in seeing which optimizations something like this would actually unlock in practice... The direction definitely makes sense in principle, but I'd want to see which optimizations we can't currently apply in SqlNullabilityProcessor (or which would unlock further optimizations if applied earlier, at construction).

Well, we can obviously re-encode everything to run within SqlNullabilityProcessor, so the point is really: can we compose simpler optimizations and still obtain as much as SqlNullabilityProcessor does (or more)? Yes :)
Doing optimizations like #34127 earlier would, for example, make it possible to simplify nested queries/joins.

The ability to perform more complex optimizations on expressions containing parameters would enable applying optimizations to a wider range of expressions.

> We also need to keep in mind that we wouldn't be able to reason about parameter nullability in any case before the SQL nullability cache.

I am not sure I follow; why would that be the case?
We should be able to reason (in a "rich" way) about parameters across the whole pipeline. The SQL nullability cache only provides a constant answer to the IS NOT NULL question. Currently the only possible response in SqlNullabilityProcessor is a constant "this expression is not null" or "it is possibly null"; the nullability cache provides a definitive answer of the form "it is null" or "it is non-null". With a richer representation we could have "it is null in these cases".

I see no reason why

CASE
  WHEN @myparam IS NOT NULL THEN x.NullableInt + @myparam
  ELSE @myparam - 1
END

should not optimize to

x.NullableInt + @myparam 

before the nullability cache. 🤔

roji (Member, Author) commented Jul 4, 2024

> > We also need to keep in mind that we wouldn't be able to reason about parameter nullability in any case before the SQL nullability cache.
>
> I am not sure I follow; why would that be the case?

Right, I should have been clearer - I meant to say that we can't do "bottom-up" reasoning on parameters before SqlNullabilityProcessor, since all parameters are by definition nullable at that point. We can indeed do at least some contextual reasoning over them (like over anything at all) - like what you propose above with the CASE/WHEN - since that doesn't depend on knowing the nullness of the (parameter) expression itself.

In other words, there's a whole range of additional stuff that may get unlocked once we actually know that some parameter is null (at SqlNullabilityProcessor), and that may cascade into lots of other optimizations. But that's only possible at that point; so it seems like we indeed have to make sure that a lot/most of these optimizations are repeated in SqlNullabilityProcessor as we're rebuilding the tree based on the new parameter information that's suddenly available.

As to concrete examples, I'm not doubting that this is at least in theory a good approach - it's just that it implies considerably more complexity/changes than anything we've been doing here so far, so I just want to make sure it's worth it, is all.

But I don't expect you to prove anything or "defend" this - it's more something for us to think about...

ranma42 (Contributor) commented Jul 4, 2024

> In other words, there's a whole range of additional stuff that may get unlocked once we actually know that some parameter is null (at SqlNullabilityProcessor), and that may cascade into lots of other optimizations. But that's only possible at that point; so it seems like we indeed have to make sure that a lot/most of these optimizations are repeated in SqlNullabilityProcessor as we're rebuilding the tree based on the new parameter information that's suddenly available.

Not really; all of the optimizations that are implemented in the same fashion as the most recent ones (aka in the expression tree construction) would simply re-use the same code. In fact, you might have noticed that the changes I have recently been making were mostly about moving re-usable optimizations out of the SqlNullabilityProcessor; ideally I expect that SqlNullabilityProcessor could take care of just lowering nullability and possibly not even do any explicit optimizations (instead it would rely on them being applied automatically, as in #34072).

> As to concrete examples, I'm not doubting that this is at least in theory a good approach - it's just that it implies considerably more complexity/changes than anything we've been doing here so far, so I just want to make sure it's worth it, is all.
>
> But I don't expect you to prove anything or "defend" this - it's more something for us to think about...

Yes, this is currently speculation and I am afraid that it would be quite an invasive change if we want to take full advantage of this (aka pass this information along the pipeline as much as possible).

I believe it would have some interesting features, especially regarding the complexity of the resulting system:

  • it would make nullability computation more modular; for example, some designs would make it easy to extend the system with custom ways in which nulls are propagated; currently this would require customizing SqlNullabilityProcessor heavily (basically whenever ProcessNullNotNull must be extended); this is in a sense related to Functions: Allow specifying in metadata if a function propagates nulls #13718
  • it would enable some optimizations that currently would require re-running the SqlNullabilityProcessor; these optimization could possibly even be performed implicitly
  • it would avoid duplicating top-down and bottom-up reasoning on nullability (I hit it in Use null propagation to optimize away IS NOT NULL checks #34127 and am not really happy about my current solution)

As an experiment, I could try and implement it in a class similar to SqlNullabilityProcessor that instead of computing a bool nullability, extracts the rich expression. This would not be sufficient to get the features I am aiming for, but it might help evaluating the complexity of the approach. @roji What do you think?

Note that to avoid major regressions while fixing #34001 some kind of nullability analysis seems to be required, so I am seriously considering this approach as a way to propagate nullability information 🤯

roji (Member, Author) commented Jul 4, 2024

> Not really; all of the optimizations that are implemented in the same fashion as the most recent ones [...]

Once again I didn't express myself precisely enough - I didn't mean to say that there are new types of optimizations, but that there are many more opportunities for applying those optimizations, since parameter expressions were previously assumed to be nullable, whereas now we know they are not. I've definitely been appreciating the reuse of the same optimizations both when constructing during translation and in SqlNullabilityProcessor.

> Yes, this is currently speculation and I am afraid that it would be quite an invasive change if we want to take full advantage of this (aka pass this information along the pipeline as much as possible).

Exactly, that's my only point. All the changes you've been submitting up to now are both impactful and relatively non-invasive, this one is another category of change, is all.

In any case, of course I'm OK with any experiment you'd like to try, and would be happy to review too :) Just as long as you understand we may end up not being able to complete and merge this work for 9.0 (FYI we are starting to see the end of the feature work phase on the horizon). So in a sense it may be better to work on less invasive changes for 9.0, and defer the more ambitious architectural changes to 10 - but I'm fully OK with whatever you want to work on, and in whatever order.
