Skip to content

Consider having a common relational representation for OPENJSON, json_each and so on #32576

@roji

Description

@roji

The problem

Whenever a JSON collection (primitive or complex) is encountered, our current query architecture immediately (eagerly) converts it to a relation table (via OPENJSON, json_each...) by calling TransformJsonQueryToTable() on it. From that point we have a regular ShapedQueryExpression that looks like any other - over regular tables - and translation of query operators can proceed as usual.

The problem is that since SqlServerOpenJsonExpression and similar are provider-specific expressions, we lose the ability to know - in general relational code - that we're dealing with JSON data, and specifically to remove the OPENJSON and get back to the original column. Examples include:

  • Array indexer. Instead of translating these via regular relational means (LIMIT and OFFSET over the OPENJSON), we want to use dedicated JSON mechanisms (e.g. JSON_VALUE), which are must terser, more efficient and can be indexed. But the relational implementation of TranslateElementAtOrDefault can't do that, since it can't identify the SQL Server-specific SqlServerOpenJsonExpression and unwrap it to get the original column. So we implement this in SqlServerQueryableMethodExpressionvisitor.
  • SqlNullabilityProcessor has a hacky IsCollectionTable hook for unwrapping SqlServerOpenJsonExpression (see this).
  • When translating Contains over a complex JSON collection, because it's converted to a table via OPENJSON, our structural equality produces column comparisons over the the OPENJSON. We could instead unwrap the OPENJSON and do a simple, single JSON comparison (but again, that would need to happen in relational code).
  • PostgreSQL has several other cases where array unnested (equivalent to OPENJSON) is unwrapped (though since these are provider-specific anyway, it's less of a concern).

Unwrapping SelectExpression and OPENJSON

It's tempting to simply introduce a relational interface implemented by SqlServerOpenJsonExpression and co., which would allow unwrapping it and getting the JSON column. However, this isn't sufficient: when unwrapping, relational code needs to also verify that no additional LINQ operators have been composed on top of the collection; for example, if there's a Where(), the SelectExpression containing the OPENJSON must remain as a whole and we can't unwrap. In some cases, unwrapping is OK if OrderBy has been applied (e.g. Count, Contains), and in others it isn't (e.g. GetElementAt). Note also that in SQL Server we generate a SelectExpression with an ORDER BY for the key column projected from OPENJSON, to make sure that the original ordering is preserved.

In other words, any unwrapping check must happen at the level of the containing SelectExpression and not at the SqlServerOpenJsonExpression level. We could introduce such a provider-specific API on SelectExpression, but we don't currently allow providers to extend SelectExpression (and that kind of API doesn't seem like it belongs on expressions, which should mainly represent data).

Instead, some service could be introduced for performing operations on SelectExpression, and providers would provide their implementation for unwrapping SelectExpressions.

More extreme - introduce OPENJSON only when needed

A more extreme change would be to stop transforming JSON collections to tables (OPENJSON) so early/eagerly, but to do it only once we know OPENJSON is needed. This means that instead of getting a ShapedQueryExpression over SelectExpression (with the latter containing SqlServerOpenJsonExpression), we'd get a ShapedQueryExpression over e.g. a ColumnExpression representation JSON (or a JsonQueryExpression for representing a nested JSON collection).

This is a larger change - we have various places which simply assume that ShapedQueryExpression's QueryExpression always contains a SelectExpression. Every operator translation method, e.g. TranslateWhere, would have to first check if it's already composed on top of a SelectExpression, and if not, convert it to a SelectExpression at that point. Cases where we have JSON-specific mechanisms (ElementAtOrDefault, Contains...) would simply immediately get the JSON column out, without there ever being OPENJSON in the tree.

Note that we have a similar situation with set operations: these are currently always wrapped by SelectExpression, even though we don't emit these "bare wrappers" in SQL as they're not needed (see #32890). If we manage to clean up our SelectExpression design, we'd be able to just have standalone set operations, which would then need to be wrapped with SelectExpression if additional operators (e.g. Where) are composed on top. So we'd have a generic ShapedQueryExpression.GetOrCreateSelect() mechanism, which either gets an existing SelectExpression or wraps the given set operation (or JSON column) with a SelectExpression.

This more lazy design is better and more accurately represents the actual state of the tree at every point: it's always better to make a change late when we know we need to, rather than making it early and then reverting, unwrapping or ignoring it later. But there's one problematic point here: translations of Contains/Count need to disregard composed OrderBy; and if the translation of OrderBy wraps the JSON column with a SelectExpression (which it needs to), then we again need to unwrap that for Contains/Count, and are back at square one without the lazy design helping us much. A possible alternative would be to represent the OrderBy in some other way in the tree ("pending OrderBy"), but that leads to more complexity and a departure from representation the actual tree.

Metadata

Metadata

Assignees

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions