Query: Adds hybrid search query pipeline stage (#4794)
## Description

Adds hybrid search query pipeline stage. This requires the new Direct package and gateway to be available in order to light up.

Given an input SQL such as:

```sql
SELECT TOP 100 c.text, c.abstract
FROM c
ORDER BY RANK RRF(FullTextScore(c.text, ['swim', 'run']), FullTextScore(c.abstract, ['energy']))
```

The new query plan (encoded below as XML instead of JSON to help readability) is as follows:

```xml
<queryRanges>
  <Item>{"min":[],"max":"Infinity","isMinInclusive":true,"isMaxInclusive":false}</Item>
</queryRanges>
<hybridSearchQueryInfo>
  <globalStatisticsQuery><![CDATA[
SELECT
    COUNT(1) AS documentCount,
    [
        {
            totalWordCount: SUM(_FullTextWordCount(c.text)),
            hitCounts: [
                COUNTIF(FullTextContains(c.text, "swim")),
                COUNTIF(FullTextContains(c.text, "run"))
            ]
        },
        {
            totalWordCount: SUM(_FullTextWordCount(c.abstract)),
            hitCounts: [
                COUNTIF(FullTextContains(c.abstract, "energy"))
            ]
        }
    ] AS fullTextStatistics
FROM c
  ]]></globalStatisticsQuery>
  <componentQueryInfos>
    <Item>
      <distinctType>None</distinctType>
      <top>200</top>
      <orderBy>
        <Item>Descending</Item>
      </orderBy>
      <orderByExpressions>
        <Item>_FullTextScore(c.text, ["swim", "run"], {documentdb-formattablehybridsearchquery-totaldocumentcount}, {documentdb-formattablehybridsearchquery-totalwordcount-0}, {documentdb-formattablehybridsearchquery-hitcountsarray-0})</Item>
      </orderByExpressions>
      <hasSelectValue>false</hasSelectValue>
      <rewrittenQuery><![CDATA[
SELECT TOP 200
    c._rid,
    [
        {
            item: _FullTextScore(c.text, ["swim", "run"], {documentdb-formattablehybridsearchquery-totaldocumentcount}, {documentdb-formattablehybridsearchquery-totalwordcount-0}, {documentdb-formattablehybridsearchquery-hitcountsarray-0})
        }
    ] AS orderByItems,
    {
        payload: {
            text: c.text,
            abstract: c.abstract
        },
        componentScores: [
            _FullTextScore(c.text, ["swim", "run"], {documentdb-formattablehybridsearchquery-totaldocumentcount}, {documentdb-formattablehybridsearchquery-totalwordcount-0}, {documentdb-formattablehybridsearchquery-hitcountsarray-0}),
            _FullTextScore(c.abstract, ["energy"], {documentdb-formattablehybridsearchquery-totaldocumentcount}, {documentdb-formattablehybridsearchquery-totalwordcount-1}, {documentdb-formattablehybridsearchquery-hitcountsarray-1})
        ]
    } AS payload
FROM c
WHERE {documentdb-formattableorderbyquery-filter}
ORDER BY _FullTextScore(c.text, ["swim", "run"], {documentdb-formattablehybridsearchquery-totaldocumentcount}, {documentdb-formattablehybridsearchquery-totalwordcount-0}, {documentdb-formattablehybridsearchquery-hitcountsarray-0}) DESC
      ]]></rewrittenQuery>
      <hasNonStreamingOrderBy>true</hasNonStreamingOrderBy>
    </Item>
    <Item>
      <distinctType>None</distinctType>
      <top>200</top>
      <orderBy>
        <Item>Descending</Item>
      </orderBy>
      <orderByExpressions>
        <Item>_FullTextScore(c.abstract, ["energy"], {documentdb-formattablehybridsearchquery-totaldocumentcount}, {documentdb-formattablehybridsearchquery-totalwordcount-1}, {documentdb-formattablehybridsearchquery-hitcountsarray-1})</Item>
      </orderByExpressions>
      <hasSelectValue>false</hasSelectValue>
      <rewrittenQuery><![CDATA[
SELECT TOP 200
    c._rid,
    [
        {
            item: _FullTextScore(c.abstract, ["energy"], {documentdb-formattablehybridsearchquery-totaldocumentcount}, {documentdb-formattablehybridsearchquery-totalwordcount-1}, {documentdb-formattablehybridsearchquery-hitcountsarray-1})
        }
    ] AS orderByItems,
    {
        payload: {
            text: c.text,
            abstract: c.abstract
        },
        componentScores: [
            _FullTextScore(c.text, ["swim", "run"], {documentdb-formattablehybridsearchquery-totaldocumentcount}, {documentdb-formattablehybridsearchquery-totalwordcount-0}, {documentdb-formattablehybridsearchquery-hitcountsarray-0}),
            _FullTextScore(c.abstract, ["energy"], {documentdb-formattablehybridsearchquery-totaldocumentcount}, {documentdb-formattablehybridsearchquery-totalwordcount-1}, {documentdb-formattablehybridsearchquery-hitcountsarray-1})
        ]
    } AS payload
FROM c
WHERE {documentdb-formattableorderbyquery-filter}
ORDER BY _FullTextScore(c.abstract, ["energy"], {documentdb-formattablehybridsearchquery-totaldocumentcount}, {documentdb-formattablehybridsearchquery-totalwordcount-1}, {documentdb-formattablehybridsearchquery-hitcountsarray-1}) DESC
      ]]></rewrittenQuery>
      <hasNonStreamingOrderBy>true</hasNonStreamingOrderBy>
    </Item>
  </componentQueryInfos>
  <take>100</take>
  <requiresGlobalStatistics>true</requiresGlobalStatistics>
</hybridSearchQueryInfo>
```

We have a custom implementation for the global statistics inside the `HybridSearchCrossPartitionQueryPipelineStage` because it uses nested aggregates.

Each of the component queries in the hybrid search query plan is cross partition, and we run them using the existing cross partition query pipelines.

Note the use of placeholders such as `{documentdb-formattablehybridsearchquery-totaldocumentcount}` in the query plan. These need to be replaced by the global statistics.

## Type of change

- [x] New feature (non-breaking change which adds functionality)
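
For illustration only, here is a minimal Python sketch of the placeholder substitution mentioned in the description. It is not the SDK's implementation. The class and function names (`FullTextStatistics`, `GlobalStatistics`, `merge_statistics`, `format_component_query`) are hypothetical, and it assumes that per-partition statistics can be combined by summing counts element-wise and that hit counts are rendered as a JSON-style array. Only the placeholder strings themselves are taken from the query plan above; the `{documentdb-formattableorderbyquery-filter}` placeholder is deliberately left untouched.

```python
from dataclasses import dataclass
from typing import List

# Placeholder strings as they appear in the query plan.
DOC_COUNT = "{documentdb-formattablehybridsearchquery-totaldocumentcount}"
WORD_COUNT = "{{documentdb-formattablehybridsearchquery-totalwordcount-{index}}}"
HIT_COUNTS = "{{documentdb-formattablehybridsearchquery-hitcountsarray-{index}}}"


@dataclass
class FullTextStatistics:
    """Statistics for one FullTextScore component (hypothetical type)."""
    total_word_count: int
    hit_counts: List[int]


@dataclass
class GlobalStatistics:
    """Shape mirrors the globalStatisticsQuery result (hypothetical type)."""
    document_count: int
    full_text_statistics: List[FullTextStatistics]


def merge_statistics(partials: List[GlobalStatistics]) -> GlobalStatistics:
    """Combine per-partition results by summing counts element-wise.

    Assumption: every partition returns the same number of components
    and the same number of hit counts per component.
    """
    if not partials:
        raise ValueError("at least one partition result is required")
    merged = []
    for i in range(len(partials[0].full_text_statistics)):
        words = sum(p.full_text_statistics[i].total_word_count for p in partials)
        hits = [
            sum(column)
            for column in zip(*(p.full_text_statistics[i].hit_counts for p in partials))
        ]
        merged.append(FullTextStatistics(words, hits))
    return GlobalStatistics(sum(p.document_count for p in partials), merged)


def format_component_query(template: str, stats: GlobalStatistics) -> str:
    """Replace the statistics placeholders in a rewritten component query."""
    query = template.replace(DOC_COUNT, str(stats.document_count))
    for i, component in enumerate(stats.full_text_statistics):
        hit_array = "[" + ",".join(str(h) for h in component.hit_counts) + "]"
        query = query.replace(WORD_COUNT.format(index=i), str(component.total_word_count))
        query = query.replace(HIT_COUNTS.format(index=i), hit_array)
    return query
```

In this sketch the statistics are merged once and then applied to every `rewrittenQuery` and `orderByExpressions` entry, since both component queries reference the placeholders of both `_FullTextScore` calls.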