1. Summary
This RFC proposes a solution for OpenSearch PPL to support schema-on-read data sources where fields and types are unknown until data reading begins. The proposal focuses on a long-term solution to support both schema-on-read and commands that generate fields dynamically, such as spath. The core approach introduces a field resolution phase to determine the input schema based on the AST before logical planning.
Related issues
- [FEATURE] Support dynamic columns in PPL #4112 (motivation of this RFC)
- [RFC] Dynamic Fields in PPL Calcite Implementation #4433 (that RFC will be deprecated)
- [RFC] Support permissive mode in PPL #4349
2. Motivation
The current PPL engine uses Apache Calcite for planning and execution, but Calcite requires a fixed schema during the planning stage. To handle inputs whose schema is unknown or only partially known, as well as commands that generate fields dynamically, we need a mechanism for planning queries without a complete schema.
Existing mechanisms in Calcite, such as DynamicRecordType, are insufficient for PPL due to several gaps:
- Validation Stage Missing: PPL lacks a query validation stage where `DynamicRecordType` typically collects fields.
- Field Generation: `DynamicRecordType` does not support commands that dynamically produce fields (e.g., `spath`, `multikv`).
- Merge Conflicts: PPL merges fields from multiple inputs by overwriting non-null values, whereas Calcite retains dynamic stars as separate fields.
- Complexity: Previous attempts to mix static and dynamic fields resulted in complicated planning logic.
3. Proposed Solution: Field Resolution Phase
To resolve the schema issues, we propose introducing a Field Resolution phase. This phase analyzes the Abstract Syntax Tree (AST) to decide the input schema before converting it to a Calcite RelNode tree.
3.1 The Process
The new execution flow is as follows:
- Parse PPL: Build AST from input PPL query.
- Field Resolution (New): Decide input schema based on AST.
  - Traverse AST from root to leaf.
  - Calculate required input fields for each data source based on output requirements.
- Logical Plan: Convert AST to Calcite `RelNode` tree using the resolved static schema.
- Physical Plan & Execution: Apply rules and execute.
3.2 Key Logic
- Static Mapping: By calculating requirements through the AST, fields used by commands are mapped as static fields in the data schema.
- Dynamic Fields (`_MAP`): To support wildcards (e.g., `prefix*`) and dynamic field generation (e.g., `spath`), a dynamic map field is still required to store unmapped or pattern-matched fields.
- Field Ordering: Since Calcite's `Map` structure does not preserve order, we will calculate the final field order based on the query AST separately and reorder the results during execution to match Splunk-like behavior (lexicographical order for unmatched fields).
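The field-ordering rule can be illustrated with a small sketch. It is written in Python for brevity and is not the plugin's code; the function name `order_row` and the sample field names are hypothetical. It assumes the static field order has already been derived from the query AST and that all remaining fields arrive in the `_MAP` column.

```python
def order_row(static_row, static_order, dynamic_map):
    """Return (name, value) pairs in the final output order."""
    # Static fields keep the order derived from the query AST.
    ordered = [(name, static_row.get(name)) for name in static_order]
    # Fields that exist only in the dynamic map are appended in
    # lexicographical order, because the map itself carries no order.
    for name in sorted(dynamic_map):
        if name not in static_order:
            ordered.append((name, dynamic_map[name]))
    return ordered


row = order_row(
    static_row={"a": 1, "c": 2},
    static_order=["a", "c"],
    dynamic_map={"suffix_z": 9, "suffix_b": 7},
)
print(row)
```

Here the explicitly named fields `a` and `c` come first, followed by the map entries sorted by name.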
4. Implementation Roadmap
The implementation is broken down into five steps:
- Step 1: `spath` Limited Version: Implement `spath` without dynamic fields. It will extract only fields identified by the field resolution phase and treat them as `STRING` types.
- Step 2: Dynamic `spath`: Extend `spath` to produce dynamic fields and update commands like `join` and `append` to handle dynamic field expansion.
- Step 3: Schema-on-Read (Static Type): Implement `LogicalIndexScan` that applies resolved fields to the schema.
- Step 4: ANY Type Support: Extend schema-on-read to handle `ANY` types and adjust type coercion.
- Step 5: Performance: Improve performance by automatically identifying and indexing JSON internal fields.
5. Alternative Solutions Considered
5.1 Store All Fields into _MAP
- Description: Store all available fields into a single `Map<String, Any>` column, similar to the Elasticsearch/MongoDB adapter implementation.
- Pros: Simplifies implementation as commands simply manipulate the map.
- Cons: High overhead for field access and copying; hard to optimize using standard column pruning.
5.2 Convert PPL to SQL
- Description: Transpile PPL queries directly to SQL and execute them via the SQL engine.
- Pros: Flexible generation logic. Can rely on `DynamicRecordType`, which is usable during the SQL validation phase.
- Cons: Non-trivial work to convert PPL logic to SQL; requires handling dynamic field logic within the generated SQL.
5.3 Unresolved Logical Plan (ULP)
- Description: Introduce a new abstraction (ULP) between the AST and Calcite's RelNode tree to separate command logic from schema resolution.
- Pros: Decouples logic from Calcite, facilitating future engine migrations.
- Cons: Requires a full rewrite of the current Calcite `RelNodeVisitor` and a new abstraction layer.
6. Appendix: Field Resolution Examples
This section illustrates how the field resolution phase traverses the AST from root to leaf to identify the necessary static fields.
6.1 Basic Backward Propagation
In this example, requirements propagate backward from the final command to the source.
Query: `source idx | filter a > 1 | eval b = c * 2 | fields a, b, d`
- `fields a, b, d`: Requires inputs `(a, b, d)`.
- `eval b = c * 2`:
  - Needs `c` to produce `b`.
  - Needs `a` and `d` to satisfy the subsequent `fields` command.
  - Result: Requires `(a, c, d)`.
- `filter a > 1`: Needs `a` for filtering, plus `(a, c, d)` from the previous step.
  - Result: Requires `(a, c, d)`.
- `source idx`: The final resolved schema for the source is `(a, c, d)`.
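The walk above can be sketched as a small function. This is an illustrative Python sketch, not the proposed implementation: the tuple-based command encoding and the name `required_source_fields` are assumptions made for the example, and a query without a trailing `fields` command (which would need all fields) is not modeled.

```python
def required_source_fields(commands):
    """Walk the pipeline from the last command back to the source."""
    required = set()
    for kind, *args in reversed(commands):
        if kind == "fields":
            # Only the listed fields pass through, so the requirement
            # is reset to exactly this list.
            (listed,) = args
            required = set(listed)
        elif kind == "eval":
            # The target is produced here, so it is no longer needed
            # from the input; its operands are needed instead.
            target, inputs = args
            required = (required - {target}) | set(inputs)
        elif kind == "filter":
            # Filtering adds its operands to whatever is needed downstream.
            (inputs,) = args
            required |= set(inputs)
        else:
            raise ValueError(f"unsupported command: {kind}")
    return required


# source idx | filter a > 1 | eval b = c * 2 | fields a, b, d
pipeline = [
    ("filter", ["a"]),
    ("eval", "b", ["c"]),
    ("fields", ["a", "b", "d"]),
]
print(sorted(required_source_fields(pipeline)))
```

Each branch mirrors one bullet of the walk: `fields` sets the requirement, `eval` swaps the produced field for its operands, and `filter` only adds to it.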
6.2 Field Pruning
The fields command can block the propagation of unneeded fields, allowing the source to fetch fewer columns.
Query: `source idx | fields a, b | eval d = c * 2`
- `eval d = c * 2`: Needs `c`.
- `fields a, b`: Explicitly requests only `a` and `b`. It does not produce or pass `c`.
- `source idx`: Resolved schema is `(a, b)`. Field `c` is pruned because the explicit `fields` command prevents it from reaching the `eval` step where it is needed.
6.3 Wildcards and Dynamic Fields
When wildcards are used, the resolution phase identifies specific static fields where possible and delegates the rest to the dynamic map.
Query: `source=idx1 | eval a=b+2 | fields a, c, suffix*`
- `fields a, c, suffix*`: Requires `a`, `c`, and the wildcard pattern `suffix*`.
- `eval a=b+2`:
  - Needs `b` to produce `a`.
  - `a` is generated here, so it is removed from the input requirement.
  - Passes through `c` and `suffix*`.
  - Result: Requires `(b, c, suffix*)`.
- `source idx1`: Resolved schema includes static fields `b`, `c` and a dynamic map for matching `suffix*`.
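To make the resolved schema concrete, the sketch below applies the requirement `(b, c, suffix*)` to a single source document. It is a Python illustration under stated assumptions: `read_row`, `split_requirement`, and the sample document are hypothetical, and wildcard matching is delegated to the standard `fnmatch` module, which also treats `?` and `[...]` as special characters, unlike a pattern language that only supports `*`.

```python
from fnmatch import fnmatchcase


def split_requirement(required):
    """Separate plain field names from wildcard patterns."""
    static = [name for name in required if "*" not in name]
    patterns = [name for name in required if "*" in name]
    return static, patterns


def read_row(document, required):
    """Build one row: static columns plus a `_MAP` of pattern matches."""
    static, patterns = split_requirement(required)
    # Static fields become ordinary columns (None when absent).
    row = {name: document.get(name) for name in static}
    # Fields matching a wildcard pattern go into the dynamic map;
    # everything else is not read at all.
    row["_MAP"] = {
        name: value
        for name, value in document.items()
        if name not in static
        and any(fnmatchcase(name, pattern) for pattern in patterns)
    }
    return row


doc = {"b": 1, "c": "x", "suffix_1": 10, "suffix_2": 20, "other": 99}
row = read_row(doc, ["b", "c", "suffix*"])
print(row)
```

The field `other` is dropped because it is neither a static field nor a match for `suffix*`.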
6.4 Multi-Input (Join)
For multi-input commands like join, requirements are distributed to the appropriate branches.
Query: `source=idx1 | join a [source=idx2 | where b > 1] | fields prefix*`
- `fields prefix*`: Requires `prefix*`.
- `join a`:
  - Requires the join key `a` from both sides.
  - Distributes the `prefix*` requirement to both sides.
  - Result (Left): `(a, prefix*)`
  - Result (Right): `(a, prefix*)`
- `source idx2` (Right Branch):
  - Needs `a` and `prefix*` from the join.
  - Needs `b` for the `where b > 1` clause.
  - Result: `(a, b, prefix*)`.
- `source idx1` (Left Branch):
  - Result: `(a, prefix*)`.
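The distribution step for `join` can be sketched in a few lines. This Python snippet is only an illustration of the example above; `distribute_join` is a hypothetical helper, and it assumes the downstream requirement is copied to both branches because the resolver cannot tell which side will supply a wildcard match.

```python
def distribute_join(required, join_keys):
    """Return the (left, right) requirements for a join."""
    # Both branches need the join keys plus everything requested downstream.
    side = set(required) | set(join_keys)
    return set(side), set(side)


# source=idx1 | join a [source=idx2 | where b > 1] | fields prefix*
left, right = distribute_join({"prefix*"}, {"a"})
# The right branch also contains `where b > 1`, which adds `b`.
right |= {"b"}
print(sorted(left), sorted(right))
```

Each branch then continues its own backward walk to its source with the requirement it received.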