Consider introducing nullability on SqlExpression #33889
The main advantage of placing nullability information in the expression is that it would be readily available across the pipeline. For example, currently the
CASE
WHEN x.NullableInt IS NOT NULL THEN x.NullableInt + 1
ELSE x.NullableInt - 1
END
propagates from the
Obviously it would be impossible to properly perform this kind of non-local analysis while constructing the expressions, assuming they are immutable (hence
Note that there is still value in performing a (contextless) bottom-up nullability analysis; what I am worried about is that
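The (contextless) bottom-up analysis mentioned here can be sketched as follows - a minimal Python model with made-up node types, not EF Core's actual classes. Being contextless, it learns nothing from the WHEN predicate, so the THEN branch above would still be considered nullable:

```python
from dataclasses import dataclass

# Minimal stand-ins for SQL expression nodes; each (immutable) node's
# nullability is fixed once, bottom-up, at construction time.

@dataclass(frozen=True)
class Column:
    name: str
    nullable: bool

@dataclass(frozen=True)
class Constant:
    value: object

    @property
    def nullable(self) -> bool:
        return self.value is None

@dataclass(frozen=True)
class Add:
    left: object
    right: object

    @property
    def nullable(self) -> bool:
        # SQL arithmetic propagates NULL: nullable iff either operand is.
        return self.left.nullable or self.right.nullable

x = Column("x.NullableInt", nullable=True)
# x.NullableInt + 1: the contextless analysis must flag this as nullable,
# even if it only ever appears under "WHEN x.NullableInt IS NOT NULL".
assert Add(x, Constant(1)).nullable is True
assert Add(Constant(2), Constant(1)).nullable is False
```

Because the nodes are immutable, the computed flag can safely live on the node itself and be reused wherever the sub-expression is shared.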
I am not really worried about this: as long as the transformation replaces a node with an equivalent one, the nullability analysis would not need to be invalidated/updated (actually, if the analysis is contextual, it might need to run on the sub-expression, not on the upper tree). (I am trying to apply standard notions from data-flow analysis to the nullability analysis performed by EFCore; I am not yet sure about the context it takes into account nor about the propagation of the information, but I still believe that comparing it to the standard/general framework is both convenient and useful.)
On this, take a look at what we do with type mapping inference... When you do
In other words, to go back to your CASE/WHEN example above, applying the knowledge that
I'm not sure I follow... Imagine that some post-processing visitor somewhere decides to replace some non-nullable string node with a nullable string node (not sure if we have a case like that). If that node was embedded in e.g. a concatenation, then the result of the concatenation must now become nullable as well, right?
Absolutely, I very much welcome this viewpoint from "real" compilers. I'm sure we have much to learn (though in some cases our problem area is different and can justify divergences).
I was incorrectly assuming that
IIUC this approach works by instantiating new nodes with the appropriate type mapping, effectively solving the reuse problem by creating separate instances; in a sense, it is "transforming" the original expression by rewriting it with the typeMapping applied. It should be noted that, as the information is collected, the AST needs to be re-visited and rewritten. For example if, after building an expression X (maybe with a biggish AST), the analysis found that the
If I understand correctly, I expect nullability propagation to often be "less local", i.e. the
Note that as long as the expression trees are smallish, the approach used for type mappings works just fine and can be applied to any property with no problem at all (regardless of changes being local); it is definitely possible that I am overthinking this, as I don't yet have a proper grasp of the common patterns/sizes/... of EFCore queries and of the placement/effectiveness of its caches.
If the assumption is that the new node has the same behavior as the old node, then they really should have the same nullability.
could be transformed into
i.e. the
I agree, there is quite a lot besides the simple "expression language" model, but re-using some knowledge from the field of compilers/languages makes the EFCore library much more approachable (at least to me).
@ranma42 thank you for all your thoughts, I pretty much agree with everything.
That's right. Here's an example relevant for today: since EF 8.0's support for primitive (scalar) collections, we now have to deal with the following sort of queries:
(@cincuranet you may be interested in the above too - it's an important part of the machinery behind primitive/scalar collections)
Yes, and this is crucial - in the current scheme, if a node already has a type mapping, that's never clobbered, so we only ever recurse once into a given sub-tree; whereas you're right that with nullability analysis we'd have to potentially recurse multiple times. And you're also right that, contrary to regular compiler scenarios, in the EF case the trees may be shallow enough that repeated visitation for these cases may be acceptable. But I'd like to stress again that we currently assume that type mappings never change once determined; that seems to have worked out so far (though we know we have various bugs around type mapping), but one could imagine a situation where some post-processing step decides to change a type mapping. If that ever needs to happen, we're currently lacking the machinery to propagate the effects of that change throughout the tree.
That might be true; and if that's the case, then yeah, once a node's type mapping and/or nullability has been determined, it never changes. But I'm not sure it is in the general case - I just haven't thought about it enough.
💯
Nothing you've submitted (or written) so far has been a waste of time - quite to the contrary, it's great to engage at such a deep level. Please continue to do so!
Another option I am investigating is a context-independent notion of nullability (which needs to be richer than a simple boolean).
I am considering using an approach similar to
This has the advantage of being independent of the context. This makes it easy to have this as a property of the expression (possibly even computed just once, regardless of the expression being re-used or moved around; it's trivially immutable) and to use it across all of the pipeline (possibly enabling reasoning on nullability before choosing the values of the parameters).
@ranma42 is this a (recursive) calculation which any arbitrary visitor would do at a point where it's interested in knowing the nullability of something? Does the information somehow get reused across visitors, or would each visitor need to recompute it? Even within a single visitor, would it be bubbled up in some way (as nullability is currently bubbled up in SqlNullabilityProcessor)? A quick and dirty pseudo-code sample may help illustrate what you have in mind...
It is a (recursive) construction of a property that is inherent to the expression, not computed by a visitor (would it make sense to compute it lazily? maybe).
The information is only related to the expression; any visitor that cares about it can re-use it and it never makes sense to recompute it (assumption: immutable expressions).
It is associated with the expressions; in a sense it "bubbles up" with the visit... although that is quite a weird perspective.
ranma42@69069af might help a little bit.
OK, thanks for the code - so yeah, that corresponds indeed more or less to what I had in mind when opening the issue: we calculate the nullability of a node when it is constructed - based on its operands - and that naturally bubbles up as we compose more nodes on top. It's similar to how we deal with the type mapping right now, though note that the type mapping sometimes can't just be calculated from its constituent operands - it must also be inferred e.g. from another operand that the node is being compared to. That connects to your point about contextual/contextless nullability analysis above. Following the type inference model, we can still have a system where we rebuild nodes in order to apply contextual nullability information. For example, given your example above:
CASE
WHEN x.NullableInt IS NOT NULL THEN x.NullableInt + 1
ELSE x.NullableInt - 1
END
The factory method for the CASE/WHEN as a whole could get information about non-nullable columns that's bubbled up via the
BTW re immutability, while it's true that SelectExpression is mutable, that's the case only when the Select is the (current) top-level thing in the tree; the moment something else is composed on top of the SelectExpression (e.g. it is pushed down into a subquery, enclosed in an InExpression or ExistsExpression), it becomes immutable (at least that's how the design currently works). So it may actually be safe to cache information like nullability and hashcodes in composing nodes.
One thing to keep in mind is that any non-trivial thing we add to SqlExpression creates complications for precompiled queries (AOT). When we can't precompile a final SQL representation (because there are too many nullable parameters, or because we must sniff parameter values and can't cache), the SQL expression tree must itself be serialized in code form, so that the last part of the query pipeline can be executed on it at runtime to produce the final SQL. So we'd need to generate code which preserves all the extra information proposed above (obviously a simple IsNull is easy, but starting to reference contextual information is something else). FWIW we already have this problem with type mapping, but this rich contextual nullability information makes it worse... In that sense, just doing the analysis in a single visitation step (SqlNullabilityProcessor), rather than storing it in the tree, makes things easier.
And one last thought - I'd be interested in seeing which optimizations something like this would actually unlock in practice... The direction definitely makes sense in principle, but I'd want to see which optimizations we can't currently apply in SqlNullabilityProcessor (or which would unlock further optimizations if applied earlier, at construction). We also need to keep in mind that we wouldn't be able to reason about parameter nullability in any case before the SQL nullability cache.
The idea is not to remove
I am not sure that adding this information to
The interesting rewrite here is on the other branch (and a bit more convoluted):
CASE
WHEN x.NullableInt IS NOT NULL THEN x.NullableInt + 1
ELSE NULL - 1
END
which, since NULL - 1 is NULL, becomes
CASE
WHEN x.NullableInt IS NOT NULL THEN x.NullableInt + 1
ELSE NULL
END
and finally collapses to
x.NullableInt + 1
For this specific case, propagating the
Going back to the wider topic, I acknowledge that a richer notion of nullability is more complex to construct, but it also opens the way for another class of optimizations; in particular it unifies the propagation of nullability both from the context into the sub-expression and the other way around (which is something I hit and mark as
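The chain of rewrites above can be expressed as a tiny bottom-up simplifier. The following is a purely illustrative Python sketch over a made-up tuple encoding of the SQL nodes: in the ELSE branch the WHEN predicate is known false, so the tested column is NULL there; NULL - 1 then folds to NULL; and a CASE whose ELSE is NULL collapses to its THEN branch when that branch is NULL exactly when the tested column is.

```python
NULL = ("null",)

def substitute(expr, col, repl):
    """Replace occurrences of a column inside add/sub nodes."""
    if expr == col:
        return repl
    if isinstance(expr, tuple) and expr[0] in ("add", "sub"):
        return (expr[0], substitute(expr[1], col, repl),
                substitute(expr[2], col, repl))
    return expr

def null_iff(expr, col):
    """Crude check: is expr NULL exactly when col is NULL?"""
    if expr == col:
        return True
    if isinstance(expr, tuple) and expr[0] in ("add", "sub"):
        a, b = expr[1], expr[2]
        return (null_iff(a, col) and isinstance(b, int)) or \
               (null_iff(b, col) and isinstance(a, int))
    return False

def simplify(expr):
    if isinstance(expr, tuple) and expr[0] in ("add", "sub"):
        l, r = simplify(expr[1]), simplify(expr[2])
        if l == NULL or r == NULL:
            return NULL  # arithmetic propagates NULL
        return (expr[0], l, r)
    if isinstance(expr, tuple) and expr[0] == "case_notnull":
        # CASE WHEN col IS NOT NULL THEN then_ ELSE else_ END
        _, col, then_, else_ = expr
        else_ = simplify(substitute(else_, col, NULL))  # col is NULL in ELSE
        then_ = simplify(then_)
        if else_ == NULL and null_iff(then_, col):
            return then_
        return ("case_notnull", col, then_, else_)
    return expr

x = ("col", "x.NullableInt")
before = ("case_notnull", x, ("add", x, 1), ("sub", x, 1))
assert simplify(before) == ("add", x, 1)
```

The collapse in the last step is exactly where a richer nullability (knowing the THEN branch is NULL iff the column is) pays off; a boolean "nullable" flag alone cannot justify it.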
Well, we can obviously re-encode everything to run within
Being able to perform more complex optimizations on expressions containing parameters would let those optimizations apply to a wider range of expressions.
I am not sure I follow; why would that be the case? I see no reason why
CASE
WHEN @myparam IS NOT NULL THEN x.NullableInt + @myparam
ELSE @myparam - 1
END
should not optimize to x.NullableInt + @myparam before the nullability cache. 🤔
Right, I should have been clearer - I meant to say that we can't do "bottom-up" reasoning on parameters before SqlNullabilityProcessor, since all parameters are by definition nullable at that point. We can indeed do at least some contextual reasoning over them (like over anything at all) - like what you propose above with the CASE/WHEN - since that doesn't depend on knowing the nullness of the (parameter) expression itself. In other words, there's a whole range of additional stuff that may get unlocked once we actually know that some parameter is null (at SqlNullabilityProcessor), and that may cascade into lots of other optimizations. But that's only possible at that point; so it seems like we indeed have to make sure that a lot/most of these optimizations are repeated in SqlNullabilityProcessor as we're rebuilding the tree based on the new parameter information that's suddenly available. As to concrete examples, I'm not doubting that this is at least in theory a good approach - it's just that it implies considerably more complexity/changes than anything we've been doing here so far, so I just want to make sure it's worth it, is all. But I don't expect you to prove anything or "defend" this - it's more something for us to think about...
Not really; all of the optimizations that are implemented in the same fashion as the most recent ones (aka in the expression tree construction) would simply re-use the same code; in fact, you might have noticed that the changes I have recently been making were mostly about moving re-usable optimizations out of the
Yes, this is currently speculation and I am afraid that it would be quite an invasive change if we want to take full advantage of this (aka pass this information along the pipeline as much as possible). I believe it would have some interesting features, especially regarding the complexity of the resulting system:
As an experiment, I could try and implement it in a class similar to
Note that to avoid major regressions while fixing #34001 some kind of nullability analysis seems to be required, so I am seriously considering this approach as a way to propagate nullability information 🤯
Once again I didn't express myself precisely enough - I didn't mean to say that there are new types of optimizations, but that there are many more opportunities for applying those optimizations, since parameter expressions were previously assumed to be nullable, whereas now we know they are not. I've definitely been appreciating the reuse of the same optimizations both when constructing during translation and in SqlNullabilityProcessor.
Exactly, that's my only point. All the changes you've been submitting up to now are both impactful and relatively non-invasive; this one is another category of change, is all. In any case, of course I'm OK with any experiment you'd like to try, and would be happy to review too :) Just as long as you understand we may end up not being able to complete and merge this work for 9.0 (FYI we are starting to see the end of the feature work phase on the horizon). So in a sense it may be better to work on less invasive changes for 9.0, and defer the more ambitious architectural changes to 10 - but I'm fully OK with whatever you want to work on, and in whatever order.
The following are just some architecture thoughts at this point; we shouldn't invest time in this in the near future.
Our query architecture currently doesn't represent nullability in the tree; certain specific nodes do (e.g. ColumnExpression has IsNullable), but for the rest, it's impossible for translation/visitation code to know whether it's dealing with a node that could contain nulls. Instead, our current design concentrates nullability knowledge at a single visitation step - SqlNullabilityProcessor - where nullability is calculated as part of the visitation pass (in ephemeral state), rather than represented inside the tree itself.
One negative result of this is that we can't know that nodes aren't nullable in earlier phases (e.g. in translation), which prevents some possible SQL optimizations - we sometimes do best-effort checks for non-nullable columns and optimize for those, but otherwise can't do anything for more complex expression types.
We could introduce nullability information on SqlExpression itself, alongside (or possibly even inside) the type mapping. Note that we currently disallow nullable value types on SqlExpression; we'd start allowing that, and introduce an additional bool flag for reference types only (since reference nullability isn't represented on .NET Type). Bubbling nullability up the tree would happen as nodes are constructed; for example, when a SqlBinaryExpression is constructed, at that point we'd calculate its nullability based on its operands - similar to what we do today for type mapping. Note that we propose to do the same for collations in #29620.
One problem with the above approach is that we have no knowledge of parameter nullability anywhere in the pipeline until SqlNullabilityProcessor; this processor runs after our 2nd-level cache, which is there precisely to sniff parameter nullability and cache SQL based on that. In other words, in the translation phase, all parameters would have to be considered nullable, and the nullability processor would have to recalculate and bubble up nullability anyway, because at that point parameter nullability is known. The mechanism for calculating and bubbling up nullability could be reused though, between translation and SqlNullabilityProcessor (i.e. both would use SqlExpressionFactory, which would do this). In effect, SqlNullabilityProcessor would simply replace a nullable SqlParameterExpression with a non-nullable one, and then bubble that up, reconstructing nodes on top as necessary.
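That "replace and bubble up" step could look roughly like this - a Python sketch with hypothetical stand-in node types (the real SqlNullabilityProcessor is of course far more involved). Rebuilding each node on the way up means construction-time nullability is recomputed for free:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Parameter:
    name: str
    nullable: bool = True  # unknown until parameter values are sniffed

@dataclass(frozen=True)
class Add:
    left: object
    right: object

    @property
    def nullable(self) -> bool:
        # recomputed whenever the node is (re)constructed
        return self.left.nullable or self.right.nullable

def apply_parameter_nullability(expr, non_null_params):
    """Swap in non-nullable parameters and rebuild ancestors bottom-up."""
    if isinstance(expr, Parameter):
        if expr.name in non_null_params:
            return replace(expr, nullable=False)
        return expr
    if isinstance(expr, Add):
        return Add(apply_parameter_nullability(expr.left, non_null_params),
                   apply_parameter_nullability(expr.right, non_null_params))
    return expr

tree = Add(Parameter("@p"), Parameter("@q"))
assert tree.nullable is True            # translation-time: assume nullable
rewritten = apply_parameter_nullability(tree, {"@p", "@q"})
assert rewritten.nullable is False      # after sniffing actual values
```

Since the nodes are immutable, untouched sub-trees can be returned as-is, so only the spine above each replaced parameter is actually rebuilt.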
An additional thing to keep in mind is that any replacement of a SQL node in any visitor anywhere would in theory require recalculating nullability up the tree - unless we decide that nullability changes aren't supported. Note that we already have this problem with type mappings, at least in principle - if some visitor modifies the type mapping of some node, other nodes whose type mappings depend on it don't change, and could be incorrect. We could think of a general visitor architecture to deal with this, but that would be quite a big change.
/cc @maumar @ranma42