[BUG] Scalar and aggregate MIN/MAX conflict when translating PPL to SQL #4774

@dai-chen

What is the bug?

PPL introduced scalar versions of the MIN/MAX functions in #4333. In PPL this is unambiguous because the scalar and aggregate versions are distinguished by the surrounding command (eval vs. stats).

However, when we translate PPL to SQL (either as SQL text or as a SqlNode), the same function names (MIN/MAX) are reused for both scalar and aggregate semantics. Depending on the engine’s function registry and resolution rules, this name collision can lead to ambiguous resolution or the wrong implementation being chosen.

How can one reproduce the bug?
Steps to reproduce the behavior:

-- Standard aggregate usage
SELECT MIN(age) FROM accounts;

-- Scalar-style usage that comes from PPL translation
SELECT MIN(age, 'test', NULL) FROM accounts;

What is the expected behavior?

One possible solution is to keep exposing scalar MIN/MAX to PPL users under the same names, but have the translator map scalar calls to internal, non-conflicting function names.
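
A minimal sketch of that mapping, assuming the translator knows which PPL command encloses each call. The internal names `SCALAR_MIN`/`SCALAR_MAX` and the helper `translate_call` are hypothetical, chosen only for illustration; the issue does not specify the actual names or where in the translator the rewrite would happen.

```python
# Hypothetical internal names; the issue does not say what they would be.
SCALAR_ALIASES = {"MIN": "SCALAR_MIN", "MAX": "SCALAR_MAX"}


def translate_call(name: str, args: list[str], command: str) -> str:
    """Render a PPL function call as SQL text.

    `command` is the enclosing PPL command ("eval" or "stats"), which is
    what disambiguates scalar from aggregate usage in PPL.
    """
    sql_name = name.upper()
    # Only scalar (eval) calls are renamed; aggregate (stats) calls keep
    # the standard SQL name so the engine's native aggregate is used.
    if command == "eval" and sql_name in SCALAR_ALIASES:
        sql_name = SCALAR_ALIASES[sql_name]
    return f"{sql_name}({', '.join(args)})"


scalar_sql = translate_call("min", ["age", "'test'", "NULL"], "eval")
aggregate_sql = translate_call("min", ["age"], "stats")
```

With this approach PPL users would still write `min(...)` in both `eval` and `stats`, while the generated SQL never uses one name for two semantics.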

What is your host/environment?

  • OS: 3.4

Do you have any screenshots?
N/A

Do you have any additional context?

This issue was discovered during the PPL function unification PoC in opensearch-project/opensearch-spark#1281 (comment). After registering the PPL scalar min/max functions in the shared function registry, SparkSQL’s native aggregate MIN implementation is effectively overridden. For example, in spark-sql:

spark-sql (default)> SELECT MIN(packets) FROM test_events;
min(packets)
60
120
60
180

Instead of returning a single aggregated value, multiple rows are returned, which indicates that the scalar variant (or a conflicting resolution) is being used.
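
The effect can be illustrated with a toy model of a registry keyed by function name only. This is a simplified sketch, not Spark's actual function registry or resolution logic; `register`, `aggregate_min`, and `scalar_min` are hypothetical names, and the input values are the four `packets` values shown in the output above.

```python
registry = {}


def register(name, fn):
    # Keyed by name only, so a later registration silently replaces an
    # earlier one that uses the same name.
    registry[name.upper()] = fn


def aggregate_min(values):
    # Aggregate semantics: one output row for the whole column.
    return [min(values)]


def scalar_min(values):
    # Scalar semantics: one output row per input row; with a single
    # argument, each row's minimum is just its own value.
    return [min((v,)) for v in values]


packets = [60, 120, 60, 180]

register("MIN", aggregate_min)
before = registry["MIN"](packets)  # aggregate: a single row

register("min", scalar_min)        # same name, overrides the aggregate
after = registry["MIN"](packets)   # scalar: one row per input row
```

In this model `before` holds a single value while `after` holds one value per input row, which mirrors the symptom reported above.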

Metadata

Labels

PPL (Piped processing language), bug (Something isn't working)
