84 changes: 81 additions & 3 deletions core/src/main/java/org/opensearch/sql/ast/tree/SPath.java
@@ -5,21 +5,27 @@

package org.opensearch.sql.ast.tree;

import static org.opensearch.sql.common.utils.StringUtils.unquoteIdentifier;

import com.google.common.collect.ImmutableList;
import java.util.List;
import lombok.AllArgsConstructor;
import lombok.EqualsAndHashCode;
import lombok.RequiredArgsConstructor;
import lombok.ToString;
import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.tools.RelBuilder;
import org.checkerframework.checker.nullness.qual.Nullable;
import org.opensearch.sql.ast.AbstractNodeVisitor;
import org.opensearch.sql.ast.dsl.AstDSL;
import org.opensearch.sql.calcite.CalcitePlanContext;

@ToString
@EqualsAndHashCode(callSuper = false)
@RequiredArgsConstructor
@AllArgsConstructor
public class SPath extends UnresolvedPlan {
private final char DOT = '.';
private UnresolvedPlan child;

private final String inField;
@@ -44,17 +50,89 @@ public <T, C> T accept(AbstractNodeVisitor<T, C> nodeVisitor, C context) {
return nodeVisitor.visitSpath(this, context);
}

public Eval rewriteAsEval() {
private String fullPath() {
return this.inField + DOT + this.path;
}

/**
* Determine whether a provided match string is better than the current best match available, for
* path matching.
*
* @param maybeMatch A field name that we're testing for path matching
* @param currentRecordMatch Our best field match so far. Should at least be as good as
* `this.inField`
* @return The better match between the two provided options
*/
private String preferredPathMatch(String maybeMatch, String currentRecordMatch) {
String path = this.fullPath();
// If the provided match isn't even a match, skip it
if (!path.startsWith(maybeMatch) || maybeMatch.length() <= currentRecordMatch.length()) {
return currentRecordMatch;
}
// Ensure the match is on a proper segment boundary (either dot-delimited, or exactly matches
// the path)
if (path.length() == maybeMatch.length() || path.charAt(maybeMatch.length()) == '.') {
return maybeMatch;
}
// We had a match, but it wasn't better than our current record
return currentRecordMatch;
}

/**
* We want input=outer, path=inner.data to match records like `{ "outer": { "inner": "{\"data\":
* 0}" }}`. To rewrite this as eval, that means we need to detect the longest prefix match in the
* fields (`outer.inner`) and parse `data` out of it. We need to match on segments, so
* `outer.inner` shouldn't match `outer.inner_other`.
*
* @return The field from the RelBuilder with the most overlap, or inField if none exists.
Comment on lines +82 to +87

Collaborator:
I am confused. Is the input parameter for specifying where we want to read the JSON from? This description looks like we are reading JSON which includes the outer attribute inside.

Collaborator (Author):
Input is the outermost field we want to start extracting the inner values from, so on something like { "outer": { "inner": "{\"data\": 0}" }}, input=outer means that Spath will be processing the document { "inner": "{\"data\": 0}" }. Then path=inner.data would access the value 0, and path=inner would access the value "{\"data\": 0}".

Collaborator:
Does that mean outer is a column? In that case I think we want to separate that from the JSON string.

Collaborator (Author, @Swiddis, Sep 16, 2025):
Yeah, I think in the future it makes sense to remove the input field and just navigate directly to the specified path. For this change, the intended behavior is to allow this type of mixing so you don't need to worry about where exactly the boundary is. I might cut a future PR to make input an empty string by default?

Collaborator:
I feel it is making things more complicated. In my opinion, it is simpler and easier to understand if input simply points to the column containing the JSON, and path specifies the path within the JSON.

But I am open if others feel the current approach is better.

*/
private String computePathField(RelBuilder builder) {
RelDataType rowType = builder.peek().getRowType();
List<String> rowFieldNames = rowType.getFieldNames();
String result = this.inField;

for (String name : rowFieldNames) {
result = this.preferredPathMatch(name, result);
}

return result;
}

/**
* Convert this `spath` expression to an equivalent `json_extract` eval.
*
* @param context The planning context for the rewrite, which has access to the available fields.
* @return The rewritten expression.
*/
public Eval rewriteAsEval(CalcitePlanContext context) {
String outField = this.outField;
if (outField == null) {
outField = this.path;
outField = unquoteIdentifier(this.path);
}

String pathField = computePathField(context.relBuilder);
String reducedPath = this.fullPath().substring(pathField.length());

String[] pathFieldParts = unquoteIdentifier(pathField).split("\\.");

if (reducedPath.isEmpty()) {
// Special case: We're spath-extracting a path that already exists in the data. This is just a
// rename.
return AstDSL.eval(
this.child,
AstDSL.let(AstDSL.field(outField), AstDSL.field(AstDSL.qualifiedName(pathFieldParts))));
}
// Since pathField must be on a segment line, there's a leftover leading dot if we didn't match
// the whole path.
reducedPath = reducedPath.substring(1);

return AstDSL.eval(
this.child,
AstDSL.let(
AstDSL.field(outField),
AstDSL.function(
"json_extract", AstDSL.field(inField), AstDSL.stringLiteral(this.path))));
"json_extract",
AstDSL.field(AstDSL.qualifiedName(pathFieldParts)),
AstDSL.stringLiteral(unquoteIdentifier(reducedPath)))));
}
}
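
The segment-boundary matching described in the Javadoc above can be illustrated with a small standalone sketch. This is not code from the PR; the class and field names are hypothetical, and the logic mirrors what `preferredPathMatch` and `computePathField` do:

    import java.util.List;

    public class SpathMatchSketch {
      /** Longest prefix of fullPath among rowFields, accepted only on '.'-segment boundaries. */
      static String bestMatch(String fullPath, List<String> rowFields, String fallback) {
        String best = fallback;
        for (String field : rowFields) {
          // Skip fields that aren't a prefix of the path, or that don't beat the current best
          if (!fullPath.startsWith(field) || field.length() <= best.length()) {
            continue;
          }
          // Accept only if the prefix ends exactly at the path end or at a '.' boundary
          if (fullPath.length() == field.length() || fullPath.charAt(field.length()) == '.') {
            best = field;
          }
        }
        return best;
      }

      public static void main(String[] args) {
        List<String> rowFields = List.of("outer", "outer.inner");

        // input=outer, path=inner.data -> fullPath = "outer.inner.data"
        String matched = bestMatch("outer.inner.data", rowFields, "outer");
        System.out.println(matched); // outer.inner
        // The remainder (minus the leading dot) becomes the json_extract path; empty means a rename.
        System.out.println("outer.inner.data".substring(matched.length() + 1)); // data

        // input=outer, path=inner_other.data: "outer.inner" is a string prefix of
        // "outer.inner_other.data" but not on a segment boundary, so it is rejected.
        System.out.println(bestMatch("outer.inner_other.data", rowFields, "outer")); // outer
      }
    }
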
@@ -670,7 +670,8 @@ public RelNode visitParse(Parse node, CalcitePlanContext context) {

@Override
public RelNode visitSpath(SPath node, CalcitePlanContext context) {
return visitEval(node.rewriteAsEval(), context);
visitChildren(node, context);
return visitEval(node.rewriteAsEval(context), context);
}

@Override
1 change: 1 addition & 0 deletions docs/category.json
@@ -50,6 +50,7 @@
"user/ppl/cmd/syntax.rst",
"user/ppl/cmd/timechart.rst",
"user/ppl/cmd/search.rst",
"user/ppl/cmd/spath.rst",
"user/ppl/functions/statistical.rst",
"user/ppl/cmd/top.rst",
"user/ppl/cmd/trendline.rst",
3 changes: 2 additions & 1 deletion docs/user/dql/metadata.rst
@@ -35,7 +35,7 @@ Example 1: Show All Indices Information
SQL query::

os> SHOW TABLES LIKE '%'
fetched rows / total rows = 18/18
fetched rows / total rows = 19/19
+----------------+-------------+------------------+------------+---------+----------+------------+-----------+---------------------------+----------------+
| TABLE_CAT | TABLE_SCHEM | TABLE_NAME | TABLE_TYPE | REMARKS | TYPE_CAT | TYPE_SCHEM | TYPE_NAME | SELF_REFERENCING_COL_NAME | REF_GENERATION |
|----------------+-------------+------------------+------------+---------+----------+------------+-----------+---------------------------+----------------|
@@ -52,6 +52,7 @@ SQL query::
| docTestCluster | null | otellogs | BASE TABLE | null | null | null | null | null | null |
| docTestCluster | null | people | BASE TABLE | null | null | null | null | null | null |
| docTestCluster | null | state_country | BASE TABLE | null | null | null | null | null | null |
| docTestCluster | null | structured | BASE TABLE | null | null | null | null | null | null |
| docTestCluster | null | time_test | BASE TABLE | null | null | null | null | null | null |
| docTestCluster | null | weblogs | BASE TABLE | null | null | null | null | null | null |
| docTestCluster | null | wildcard | BASE TABLE | null | null | null | null | null | null |
29 changes: 24 additions & 5 deletions docs/user/ppl/cmd/spath.rst
@@ -13,6 +13,8 @@ Description
============
| The `spath` command allows extracting fields from structured text data. It currently allows selecting from JSON data with JSON paths.

If the inner data cannot be extracted (malformed data, missing keys), `"null"` is returned. `spath` always returns string values, except when the input field is `null`.

Version
=======
3.3.0
@@ -37,10 +39,10 @@ The simplest spath is to extract a single field. This extracts `n` from the `doc

PPL query::

PPL> source=test_spath | spath input=doc n;
os> source=structured | spath input=doc_n n | fields doc_n n;
fetched rows / total rows = 3/3
+----------+---+
| doc | n |
| doc_n | n |
|----------+---|
| {"n": 1} | 1 |
| {"n": 2} | 2 |
@@ -54,10 +56,10 @@ These queries demonstrate more JSON path uses, like traversing nested fields and

PPL query::

PPL> source=test_spath | spath input=doc output=first_element list{0} | spath input=doc output=all_elements list{} | spath input=doc output=nested nest_out.nest_in;
os> source=structured | spath input=doc_list output=first_element list{0} | spath input=doc_list output=all_elements list{} | spath input=doc_list output=nested nest_out.nest_in | fields doc_list first_element all_elements nested;
fetched rows / total rows = 3/3
+------------------------------------------------------+---------------+--------------+--------+
| doc | first_element | all_elements | nested |
| doc_list | first_element | all_elements | nested |
|------------------------------------------------------+---------------+--------------+--------|
| {"list": [1, 2, 3, 4], "nest_out": {"nest_in": "a"}} | 1 | [1,2,3,4] | a |
| {"list": [], "nest_out": {"nest_in": "a"}} | null | [] | a |
@@ -71,10 +73,27 @@ The example shows extracting an inner field and doing statistics on it, using th

PPL query::

PPL> source=test_spath | spath input=doc n | eval n=cast(n as int) | stats sum(n);
os> source=structured | spath input=doc_n n | eval n=cast(n as int) | stats sum(n) | fields `sum(n)`;
fetched rows / total rows = 1/1
+--------+
| sum(n) |
|--------|
| 6 |
+--------+

Example 4: Field traversal
============================

SPath can also traverse plain documents, in which case it acts similarly to renaming.

PPL query::

os> source=structured | spath input=obj_field path=field | fields obj_field field;
fetched rows / total rows = 3/3
+----------------+-------+
| obj_field | field |
|----------------+-------|
| {'field': 'a'} | a |
| {'field': 'b'} | b |
| {'field': 'c'} | c |
+----------------+-------+
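
Under the hood this case hits the rename-only branch of `rewriteAsEval`: because `obj_field.field` already exists as a column, the reduced path is empty and the rewrite emits a plain field reference instead of a `json_extract` call. Roughly, the AST it builds for this example looks like the following sketch (names taken from the example above; the real code derives the parts from the matched row field):

    AstDSL.eval(
        child,
        AstDSL.let(
            AstDSL.field("field"),
            AstDSL.field(AstDSL.qualifiedName("obj_field", "field"))));
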
3 changes: 3 additions & 0 deletions doctest/test_data/structured.json
@@ -0,0 +1,3 @@
{"doc_n":"{\"n\": 1}","doc_list":"{\"list\": [1, 2, 3, 4], \"nest_out\": {\"nest_in\": \"a\"}}","obj_field":{"field": "a"}}
{"doc_n":"{\"n\": 2}","doc_list":"{\"list\": [], \"nest_out\": {\"nest_in\": \"a\"}}","obj_field":{"field": "b"}}
{"doc_n":"{\"n\": 3}","doc_list":"{\"list\": [5, 6], \"nest_out\": {\"nest_in\": \"a\"}}","obj_field":{"field": "c"}}
1 change: 1 addition & 0 deletions doctest/test_docs.py
@@ -42,6 +42,7 @@
'work_information': 'work_information.json',
'events': 'events.json',
'otellogs': 'otellogs.json',
'structured': 'structured.json',
'time_test': 'time_test.json'
}

17 changes: 17 additions & 0 deletions doctest/test_mapping/structured.json
@@ -0,0 +1,17 @@
{
"mappings": {
"properties": {
"doc_n": {
"type": "text"
},
"doc_list": {
"type": "text"
},
"obj_field": {
"properties": {
"field": { "type": "text" }
}
}
}
}
}
3 changes: 2 additions & 1 deletion ppl/src/main/antlr/OpenSearchPPLParser.g4
@@ -332,7 +332,7 @@ indexablePath
;

pathElement
: ident pathArrayAccess?
: ident pathArrayAccess*
;
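// With `pathArrayAccess*` (previously `?`), a single path element may chain several
// array accesses, e.g. `matrix{0}{1}` -- assuming pathArrayAccess covers the `{...}`
// index syntax used in the spath docs above.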

pathArrayAccess
@@ -1457,6 +1457,7 @@ searchableKeyWord
| PATH
| INPUT
| OUTPUT
| FIELD

// AGGREGATIONS AND WINDOW
| statsFunctionName