Adding agg_for_doc_count even if distinct count is present in query #29

expani · 2025-12-23T06:48:14Z

Description

we are missing this check in SQL plugin causing agg_for_doc_count to be not added as an aggregate when Distinct Cardinality Aggregate is part of the query.

Adding this the plan's look like follows

[INFO ][o.o.s.o.r.OpenSearchQueryRequest] [runTask-0] Calcite Logical Plan after Conversion
 LogicalSystemLimit(sort0=[$0], dir0=[DESC-nulls-last], fetch=[10000], type=[QUERY_SIZE_LIMIT])
  LogicalSort(sort0=[$0], dir0=[DESC-nulls-last], fetch=[10])
    LogicalProject(u=[$1], RegionID=[$0], agg_for_doc_count=[$2])
      LogicalAggregate(group=[{0}], u=[APPROX_COUNT_DISTINCT($1)], agg_for_doc_count=[COUNT()])
        LogicalProject(RegionID=[$68], UserID=[$84])
          CalciteLogicalIndexScan(table=[[OpenSearch, hits]])

doc_count is expected for all scenarios.

This is required for merging at Coordinator to work properly. See https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/search/aggregations/bucket/terms/InternalTerms.java#L395

There are multiple places that rely on doc count for instance https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/search/aggregations/bucket/terms/InternalTerms.java#L474 relies on the same to prune buckets lower that a certain doc count.

More context in opensearch-project/OpenSearch#20263

Signed-off-by: expani <anijainc@amazon.com>

Merge pull request vinaykpud#29 from expani/feature/substrait-plan Adding agg_for_doc_count even if distinct count is present in query Adding agg_for_doc_count even if distinct count is present in query Signed-off-by: expani <anijainc@amazon.com> Merge pull request vinaykpud#28 from expani/df_lucene_path_sql Skip generating and setting the Substrait plan based on Index Setting Updated to test to use new index setting Signed-off-by: expani <anijainc@amazon.com> Using Optimized index setting to not generate substrait for existing path Signed-off-by: expani <anijainc@amazon.com> Resolved Sort Pushdown and updated integration tests queries and results (vinaykpud#27) * Added sort in pushdown context Signed-off-by: Vinay Krishna Pudyodu <vinkrish.neo@gmail.com> * Removed unwanted sort pushdown in pushDownSortAggregateMeasure Signed-off-by: Vinay Krishna Pudyodu <vinkrish.neo@gmail.com> * Accessed the fullReNodeTree in CalciteEnumerableIndexScan removed the redundancy in OpenSearchRequest Also i did this since earlier way of accessing RelNode from ThreadLocal from OpenSearchRequest class is not working as it looks like the thread which sets the variable and accesses it now are different Signed-off-by: Vinay Krishna Pudyodu <vinkrish.neo@gmail.com> * Updated some integ test queries and results Signed-off-by: Vinay Krishna Pudyodu <vinkrish.neo@gmail.com> --------- Signed-off-by: Vinay Krishna Pudyodu <vinkrish.neo@gmail.com> Added substrait type conversion support for IP field (vinaykpud#26) Signed-off-by: Vinay Krishna Pudyodu <vinkrish.neo@gmail.com> Use Full RelNode tree for execution Signed-off-by: Vinay Krishna Pudyodu <vinkrish.neo@gmail.com> Fixed the sort condition for queries with multiple aggregation (vinaykpud#24) without this change if we have a PPL query with multiple aggregations and sort condition, it will sort only based on the first aggregated field. Also found similar changes done in latest mainline: https://tinyurl.com/u32kpyk9 Signed-off-by: Vinay Krishna Pudyodu <vinkrish.neo@gmail.com> Added sort after aggregarte relNode for pushdown context (vinaykpud#23) Signed-off-by: Vinay Krishna Pudyodu <vinkrish.neo@gmail.com> Added logs for reconstructPushedDownRelNodeTree (vinaykpud#22) Signed-off-by: Vinay Krishna Pudyodu <vinkrish.neo@gmail.com> Added min function for string field and fixed query 29 in tests (vinaykpud#20) * Added min function for string field Signed-off-by: Vinay Krishna Pudyodu <vinkrish.neo@gmail.com> * Fixed the query 29 PPL Signed-off-by: Vinay Krishna Pudyodu <vinkrish.neo@gmail.com> --------- Signed-off-by: Vinay Krishna Pudyodu <vinkrish.neo@gmail.com> All changes related to convert only pushed down operations as RelNodeTree and Substrait (vinaykpud#19) * Changes to convert only pushed down operations RelNode and Substrait Signed-off-by: Vinay Krishna Pudyodu <vinkrish.neo@gmail.com> * Adding the missing Project operator after the filter Signed-off-by: Vinay Krishna Pudyodu <vinkrish.neo@gmail.com> * Small refactors in PushDownContext and logs Signed-off-by: Vinay Krishna Pudyodu <vinkrish.neo@gmail.com> * Added full relNode log Signed-off-by: Vinay Krishna Pudyodu <vinkrish.neo@gmail.com> * Removed unwanted comments Signed-off-by: Vinay Krishna Pudyodu <vinkrish.neo@gmail.com> * Use Direct RelNode.copy() Instead of RelBuilder Signed-off-by: Vinay Krishna Pudyodu <vinkrish.neo@gmail.com> * Disabled CoreRules.FILTER_REDUCE_EXPRESSIONS this is done to disable the calcite optimization on timerange queries ie calcite optimizes `>=($0, TIMESTAMP('2013-07-01 00:00:00')) AND <=($0, TIMESTAMP('2013-07-31 00:00:00'))` to `SEARCH($0, Sarg[['2013-07-01 00:00:00'..'2013-07-31 00:00:00']])` Signed-off-by: Vinay Krishna Pudyodu <vinkrish.neo@gmail.com> --------- Signed-off-by: Vinay Krishna Pudyodu <vinkrish.neo@gmail.com>

Adding agg_for_doc_count even if distinct count is present in query

fa01178

Signed-off-by: expani <anijainc@amazon.com>

expani merged commit a0b649f into vinaykpud:feature/substrait-plan Dec 23, 2025
5 of 33 checks passed

expani mentioned this pull request Dec 23, 2025

Added support for groupby with String term for distinct count opensearch-project/OpenSearch#20263

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding agg_for_doc_count even if distinct count is present in query #29

Adding agg_for_doc_count even if distinct count is present in query #29

Uh oh!

expani commented Dec 23, 2025 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Adding agg_for_doc_count even if distinct count is present in query #29

Adding agg_for_doc_count even if distinct count is present in query #29

Uh oh!

Conversation

expani commented Dec 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

expani commented Dec 23, 2025 •

edited

Loading