Skip to content

[RFC] Support Schema-on-Read in PPL #4984

@ykmr1224

Description

@ykmr1224

1. Summary

This RFC proposes a solution for OpenSearch PPL to support schema-on-read data sources where fields and types are unknown until data reading begins. The proposal focuses on a long-term solution to support both schema-on-read and commands that generate fields dynamically, such as spath. The core approach introduces a field resolution phase to determine the input schema based on the AST before logical planning.

Related issues

2. Motivation

The current PPL engine utilizes Apache Calcite for planning and execution, but Calcite requires a fixed schema during the planning stage. To handle the case where input schema is not known (or partially known), and also dynamic field generation commands, we need mechanism to handle query without schema.

Existing mechanisms in Calcite, such as DynamicRecordType, are insufficient for PPL due to several gaps:

  • Validation Stage Missing: PPL lacks a query validation stage where DynamicRecordType typically collects fields.
  • Field Generation: DynamicRecordType does not support commands that dynamically produce fields (e.g., spath, multikv).
  • Merge Conflicts: PPL merges fields from multiple inputs by overwriting non-null values, whereas Calcite retains dynamic stars as separate fields.
  • Complexity: Previous attempts to mix static and dynamic fields resulted in complicated planning logic.

3. Proposed Solution: Field Resolution Phase

To resolve the schema issues, we propose introducing a Field Resolution phase. This phase analyzes the Abstract Syntax Tree (AST) to decide the input schema before converting it to a Calcite RelNode tree.

3.1 The Process

The new execution flow is as follows:

  1. Parse PPL: Build AST from input PPL query.
  2. Field Resolution (New): Decide input schema based on AST.
    • Traverse AST from root to leaf.
    • Calculate required input fields for each data source based on output requirements.
  3. Logical Plan: Convert AST to Calcite RelNode tree using the resolved static schema.
  4. Physical Plan & Execution: Apply rules and execute.

3.2 Key Logic

  • Static Mapping: By calculating requirements through the AST, fields used by commands are mapped as static fields in the data schema.
  • Dynamic Fields (_MAP): To support wildcards (e.g., prefix*) and dynamic field generation (e.g., spath), a dynamic map field is still required to store unmapped or pattern-matched fields.
  • Field Ordering: Since Calcite's Map structure does not preserve order, we will calculate the final field order based on the query AST separately and reorder the results during execution to match Splunk-like behavior (lexicographical order for unmatched fields).

4. Implementation Roadmap

The implementation is broken down into five steps:

  • Step 1: spath Limited Version: Implement spath without dynamic fields. It will extract only fields identified by the field resolution phase and treat them as STRING types.
  • Step 2: Dynamic spath: Extend spath to produce dynamic fields and update commands like join and append to handle dynamic field expansion.
  • Step 3: Schema-on-Read (Static Type): Implement LogicalIndexScan that applies resolved fields to the schema.
  • Step 4: ANY Type Support: Extend schema-on-read to handle ANY types and adjust type coercion.
  • Step 5: Performance: Improve performance by automatically identifying and indexing JSON internal fields.

5. Alternative Solutions Considered

5.1 Store All Fields into _MAP

  • Description: Store all available fields into a single Map<String, Any> column, similar to the Elasticsearch/MongoDB adapter implementation.
  • Pros: Simplifies implementation as commands simply manipulate the map.
  • Cons: High overhead for field access and copying; hard to optimize using standard column pruning.

5.2 Convert PPL to SQL

  • Description: Transpile PPL queries directly to SQL and execute them via the SQL engine.
  • Pros: Flexible generation logic. Utilize DynamicRecordType usable during SQL validation phase.
  • Cons: Non-trivial work to convert PPL logic to SQL; requires handling dynamic field logic within the generated SQL.

5.3 Unresolved Logical Plan (ULP)

  • Description: Introduce a new abstraction (ULP) between the AST and Calcite's RelNode tree to separate command logic from schema resolution.
  • Pros: Decouples logic from Calcite, facilitating future engine migrations.
  • Cons: Requires a full rewrite of the current Calcite RelNodeVisitor and a new abstraction layer.

6. Appendix: Field Resolution Examples

This section illustrates how the field resolution phase traverses the AST from root to leaf to identify the necessary static fields.

6.1 Basic Backward Propagation

In this example, requirements propagate backward from the final command to the source.

Query: source idx | filter a > 1 | eval b = c * 2 | fields a, b, d

  • fields a, b, d: Requires inputs (a, b, d).
  • eval b = c * 2:
    • Needs c to produce b.
    • Needs a and d to satisfy the subsequent fields command.
    • Result: Requires (a, c, d).
  • filter a > 1: Needs a for filtering, plus (a, c, d) from the previous step.
    • Result: Requires (a, c, d).
  • source idx: The final resolved schema for the source is (a, c, d).

6.2 Field Pruning

The fields command can block the propagation of unneeded fields, allowing the source to fetch fewer columns.

Query: source idx | fields a, b | eval d = c * 2

  • eval d = c * 2: Needs c.
  • fields a, b: explicitly requests only a and b. It does not produce or pass c.
  • source idx: Resolved schema is (a, b). Field c is pruned because the explicit fields command prevents it from reaching the eval step where it is needed.

6.3 Wildcards and Dynamic Fields

When wildcards are used, the resolution phase identifies specific static fields where possible and delegates the rest to the dynamic map.

Query: source=idx1 | eval a=b+2 | fields a, c, suffix*

  • fields a, c, suffix*: Requires a, c, and the wildcard pattern suffix*.
  • eval a=b+2:
    • Needs b to produce a.
    • a is generated here, so it is removed from the input requirement.
    • Passes through c and suffix*.
    • Result: Requires (b, c, suffix*).
  • source idx1: Resolved schema includes static fields b, c and a dynamic map for matching suffix*.

6.4 Multi-Input (Join)

For multi-input commands like join, requirements are distributed to the appropriate branches.

Query: source=idx1 | join a [source=idx2 | where b > 1] | fields prefix*

  • **fields prefix* **: Requires prefix*.
  • join a:
    • Requires the join key a from both sides.
    • Distributes prefix* requirement to both side.
    • Result (Left): (a, prefix*)
    • Result (Right): (a, prefix*).
  • source idx2 (Right Branch):
    • Needs a and prefix* from the join.
    • Needs b for the where b > 1 clause.
    • Result: (a, b, prefix*).
  • source idx1 (Left Branch):
    • Result: (a, prefix*).

Metadata

Metadata

Assignees

Labels

PPLPiped processing languageRFCRequest For Comments

Type

No type

Projects

Status

New

Status

Not Started

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions