[RFC] Support Schema-on-Read in PPL

## 1. Summary
This RFC proposes a solution for OpenSearch PPL to support schema-on-read data sources where fields and types are unknown until data reading begins. The proposal focuses on a long-term solution to support both schema-on-read and commands that generate fields dynamically, such as `spath`. The core approach introduces a field resolution phase to determine the input schema based on the AST before logical planning.

Related issues
- https://github.com/opensearch-project/sql/issues/4112 (motivation of this RFC)
- https://github.com/opensearch-project/sql/issues/4433 (This RFC will be deprecate)
- https://github.com/opensearch-project/sql/issues/4349

## 2. Motivation
The current PPL engine utilizes Apache Calcite for planning and execution, but Calcite requires a fixed schema during the planning stage. To handle the case where input schema is not known (or partially known), and also dynamic field generation commands, we need mechanism to handle query without schema.

Existing mechanisms in Calcite, such as `DynamicRecordType`, are insufficient for PPL due to several gaps:
* **Validation Stage Missing:** PPL lacks a query validation stage where `DynamicRecordType` typically collects fields.
* **Field Generation:** `DynamicRecordType` does not support commands that dynamically produce fields (e.g., `spath`, `multikv`).
* **Merge Conflicts:** PPL merges fields from multiple inputs by overwriting non-null values, whereas Calcite retains dynamic stars as separate fields.
* **Complexity:** Previous attempts to mix static and dynamic fields resulted in complicated planning logic.

## 3. Proposed Solution: Field Resolution Phase
To resolve the schema issues, we propose introducing a **Field Resolution** phase. This phase analyzes the Abstract Syntax Tree (AST) to decide the input schema before converting it to a Calcite `RelNode` tree.

### 3.1 The Process
The new execution flow is as follows:
1.  **Parse PPL:** Build AST from input PPL query.
2.  **Field Resolution (New):** Decide input schema based on AST.
    * Traverse AST from root to leaf.
    * Calculate required input fields for each data source based on output requirements.
3.  **Logical Plan:** Convert AST to Calcite `RelNode` tree using the resolved static schema.
4.  **Physical Plan & Execution:** Apply rules and execute.

### 3.2 Key Logic
* **Static Mapping:** By calculating requirements through the AST, fields used by commands are mapped as static fields in the data schema.
* **Dynamic Fields (`_MAP`):** To support wildcards (e.g., `prefix*`) and dynamic field generation (e.g., `spath`), a dynamic map field is still required to store unmapped or pattern-matched fields.
* **Field Ordering:** Since Calcite's `Map` structure does not preserve order, we will calculate the final field order based on the query AST separately and reorder the results during execution to match Splunk-like behavior (lexicographical order for unmatched fields).

## 4. Implementation Roadmap
The implementation is broken down into five steps:

* **Step 1: `spath` Limited Version:** Implement `spath` without dynamic fields. It will extract only fields identified by the field resolution phase and treat them as `STRING` types.
* **Step 2: Dynamic `spath`:** Extend `spath` to produce dynamic fields and update commands like `join` and `append` to handle dynamic field expansion.
* **Step 3: Schema-on-Read (Static Type):** Implement `LogicalIndexScan` that applies resolved fields to the schema.
* **Step 4: ANY Type Support:** Extend schema-on-read to handle `ANY` types and adjust type coercion.
* **Step 5: Performance:** Improve performance by automatically identifying and indexing JSON internal fields.

## 5. Alternative Solutions Considered

### 5.1 Store All Fields into `_MAP`
* **Description:** Store all available fields into a single `Map<String, Any>` column, similar to the Elasticsearch/MongoDB adapter implementation.
* **Pros:** Simplifies implementation as commands simply manipulate the map.
* **Cons:** High overhead for field access and copying; hard to optimize using standard column pruning.

### 5.2 Convert PPL to SQL
* **Description:** Transpile PPL queries directly to SQL and execute them via the SQL engine.
* **Pros:** Flexible generation logic. Utilize DynamicRecordType usable during SQL validation phase.
* **Cons:** Non-trivial work to convert PPL logic to SQL; requires handling dynamic field logic within the generated SQL.

### 5.3 Unresolved Logical Plan (ULP)
* **Description:** Introduce a new abstraction (ULP) between the AST and Calcite's RelNode tree to separate command logic from schema resolution.
* **Pros:** Decouples logic from Calcite, facilitating future engine migrations.
* **Cons:** Requires a full rewrite of the current Calcite `RelNodeVisitor` and a new abstraction layer.

## 6. Appendix: Field Resolution Examples

This section illustrates how the field resolution phase traverses the AST from root to leaf to identify the necessary static fields.

### 6.1 Basic Backward Propagation
In this example, requirements propagate backward from the final command to the source.

**Query:** `source idx | filter a > 1 | eval b = c * 2 | fields a, b, d`

* **`fields a, b, d`**: Requires inputs `(a, b, d)`.
* **`eval b = c * 2`**:
    * Needs `c` to produce `b`.
    * Needs `a` and `d` to satisfy the subsequent `fields` command.
    * **Result:** Requires `(a, c, d)`.
* **`filter a > 1`**: Needs `a` for filtering, plus `(a, c, d)` from the previous step.
    * **Result:** Requires `(a, c, d)`.
* **`source idx`**: The final resolved schema for the source is `(a, c, d)`.

### 6.2 Field Pruning
The `fields` command can block the propagation of unneeded fields, allowing the source to fetch fewer columns.

**Query:** `source idx | fields a, b | eval d = c * 2`

* **`eval d = c * 2`**: Needs `c`.
* **`fields a, b`**: explicitly requests only `a` and `b`. It does *not* produce or pass `c`.
* **`source idx`**: Resolved schema is `(a, b)`. Field `c` is pruned because the explicit `fields` command prevents it from reaching the `eval` step where it is needed.

### 6.3 Wildcards and Dynamic Fields
When wildcards are used, the resolution phase identifies specific static fields where possible and delegates the rest to the dynamic map.

**Query:** `source=idx1 | eval a=b+2 | fields a, c, suffix*`

* **`fields a, c, suffix*`**: Requires `a`, `c`, and the wildcard pattern `suffix*`.
* **`eval a=b+2`**:
    * Needs `b` to produce `a`.
    * `a` is generated here, so it is removed from the input requirement.
    * Passes through `c` and `suffix*`.
    * **Result:** Requires `(b, c, suffix*)`.
* **`source idx1`**: Resolved schema includes static fields `b`, `c` and a dynamic map for matching `suffix*`.

### 6.4 Multi-Input (Join)
For multi-input commands like `join`, requirements are distributed to the appropriate branches.

**Query:** `source=idx1 | join a [source=idx2 | where b > 1] | fields prefix*`

* **`fields prefix*` **: Requires `prefix*`.
* **`join a`**:
    * Requires the join key `a` from both sides.
    * Distributes `prefix*` requirement to both side.
    * **Result (Left):** `(a, prefix*)`
    * **Result (Right):** `(a, prefix*)`.
* **`source idx2` (Right Branch)**:
    * Needs `a` and `prefix*` from the join.
    * Needs `b` for the `where b > 1` clause.
    * **Result:** `(a, b, prefix*)`.
* **`source idx1` (Left Branch)**:
    * **Result:** `(a, prefix*)`.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[RFC] Support Schema-on-Read in PPL #4984

1. Summary

2. Motivation

3. Proposed Solution: Field Resolution Phase

3.1 The Process

3.2 Key Logic

4. Implementation Roadmap

5. Alternative Solutions Considered

5.1 Store All Fields into `_MAP`

5.2 Convert PPL to SQL

5.3 Unresolved Logical Plan (ULP)

6. Appendix: Field Resolution Examples

6.1 Basic Backward Propagation

6.2 Field Pruning

6.3 Wildcards and Dynamic Fields

6.4 Multi-Input (Join)

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[RFC] Support Schema-on-Read in PPL #4984

Description

1. Summary

2. Motivation

3. Proposed Solution: Field Resolution Phase

3.1 The Process

3.2 Key Logic

4. Implementation Roadmap

5. Alternative Solutions Considered

5.1 Store All Fields into _MAP

5.2 Convert PPL to SQL

5.3 Unresolved Logical Plan (ULP)

6. Appendix: Field Resolution Examples

6.1 Basic Backward Propagation

6.2 Field Pruning

6.3 Wildcards and Dynamic Fields

6.4 Multi-Input (Join)

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

5.1 Store All Fields into `_MAP`