Skip to content

[FEATURE] PPL distinct_count/dc function support for eventstats command #4052

@dai-chen

Description

@dai-chen

Is your feature request related to a problem?

Yes. Currently distinct_count/dc work in stats command, but it’s unavailable in eventstatscommand. This prevents annotating each event with the distinct count per group (or globally), which is a common log-analytics need byeventstats` command.

What solution would you like?

  1. Enable both in eventstats grammar and translate it to COUNT(DISTINCT field) OVER (...).
  2. Support general SQL window function pushdown optimization (tracked separately).

What alternatives have you considered?

N/A

Do you have any additional context?

Test result:

opensearchsql> source=accounts | stats distinct_count(age), dc(gender);
fetched rows / total rows = 1/1
+-----------------------+--------------+
| distinct_count(age)   | dc(gender)   |
|-----------------------+--------------|
| 5                     | 2            |
+-----------------------+--------------+

opensearchsql> explain source=accounts | stats distinct_count(age), dc(gender);
OpenSearchException({'calcite': {
'logical': 'LogicalSystemLimit(fetch=[10000], type=[QUERY_SIZE_LIMIT])\n  
  LogicalAggregate(group=[{}],
    distinct_count(age)=[COUNT(DISTINCT $0)], 
    dc(gender)=[COUNT(DISTINCT $1)])\n
    LogicalProject(age=[$2], gender=[$0])\n
      CalciteLogicalIndexScan(table=[[OpenSearch, accounts]])\n',
'physical': 'EnumerableLimit(fetch=[10000])\n
  CalciteEnumerableIndexScan(table=[[OpenSearch, accounts]],
    PushDownContext=[[AGGREGATION->rel#499:LogicalAggregate.NONE.[](input=RelSubset#498,group={},
      distinct_count(age)=COUNT(DISTINCT $0),dc(gender)=COUNT(DISTINCT $1))],
    OpenSearchRequestBuilder(sourceBuilder={"from":0,"size":0,"timeout":"1m",
      "aggregations":{
        "distinct_count(age)":{"cardinality":{"field":"age"}},
        "dc(gender)":{"cardinality":{"field":"gender.keyword"}}}},
      requestedTotalSize=2147483647, pageSize=null, startFrom=0)])\n'}})

opensearchsql> source=accounts | eventstats distinct_count(age);
{'reason': 'Invalid Query', 'details': "[distinct_count] is not a valid term at this part of the query:
  '...counts | eventstats distinct_count' <-- HERE. Expecting one of 22 possible tokens.
  Some examples: 'AVG', 'COUNT', 'MAX', 'MIN', 'SUM', ...", 'type': 'SyntaxCheckException'}

Metadata

Metadata

Assignees

Labels

PPLPiped processing languageenhancementNew feature or requestv3.3.0

Type

No type

Projects

Status

New

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions