-
Notifications
You must be signed in to change notification settings - Fork 181
Starter implementation for spath command
#4120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
9e3a4c0
944e160
97d48d8
3f515f3
26cb9bb
7d978af
1a02210
1bb05ce
d65fc03
d401382
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,55 @@ | ||
| package org.opensearch.sql.ast.tree; | ||
|
|
||
| import com.google.common.collect.ImmutableList; | ||
| import java.util.List; | ||
| import lombok.AllArgsConstructor; | ||
| import lombok.EqualsAndHashCode; | ||
| import lombok.RequiredArgsConstructor; | ||
| import lombok.ToString; | ||
| import org.checkerframework.checker.nullness.qual.Nullable; | ||
| import org.opensearch.sql.ast.AbstractNodeVisitor; | ||
| import org.opensearch.sql.ast.dsl.AstDSL; | ||
|
|
||
| @ToString | ||
| @EqualsAndHashCode(callSuper = false) | ||
| @RequiredArgsConstructor | ||
| @AllArgsConstructor | ||
| public class SPath extends UnresolvedPlan { | ||
| private UnresolvedPlan child; | ||
|
|
||
| private final String inField; | ||
|
|
||
| @Nullable private final String outField; | ||
|
|
||
| private final String path; | ||
|
|
||
| @Override | ||
| public UnresolvedPlan attach(UnresolvedPlan child) { | ||
| this.child = child; | ||
| return this; | ||
| } | ||
|
|
||
| @Override | ||
| public List<UnresolvedPlan> getChild() { | ||
| return this.child == null ? ImmutableList.of() : ImmutableList.of(this.child); | ||
| } | ||
|
|
||
| @Override | ||
| public <T, C> T accept(AbstractNodeVisitor<T, C> nodeVisitor, C context) { | ||
| return nodeVisitor.visitSpath(this, context); | ||
| } | ||
|
|
||
| public Eval rewriteAsEval() { | ||
| String outField = this.outField; | ||
| if (outField == null) { | ||
| outField = this.path; | ||
| } | ||
|
|
||
| return AstDSL.eval( | ||
| this.child, | ||
| AstDSL.let( | ||
| AstDSL.field(outField), | ||
| AstDSL.function( | ||
| "json_extract", AstDSL.field(inField), AstDSL.stringLiteral(this.path)))); | ||
| } | ||
|
Comment on lines
+48
to
+54
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is the core behavior, the rest is plumbing/parsing boilerplate.
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It spath is translate to eval, does spath node still required?
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We're going to be adding more functionality to it later, so I think it's worthwhile to have the full node here. We could probably be more efficient by parsing it directly into an eval at the tree level if it stays as just this step, though. |
||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -93,7 +93,9 @@ private static boolean isScalarObject(Object obj) { | |
| } | ||
|
|
||
| private static String doJsonize(Object candidate) { | ||
| if (isScalarObject(candidate)) { | ||
| if (candidate == null) { | ||
| return "null"; // Matches isScalarObject, but not toString-able. | ||
|
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Usability patch. Without this, JSON_EXTRACT raises exceptions on any missing fields (which kills extracting from flexibly-typed inputs, one of the major reasons to rely on string types). The user-facing error without this on a missing value is java.lang.NullPointerException: Cannot invoke "Object.toString()" because "candidate" is null. |
||
| } else if (isScalarObject(candidate)) { | ||
| return candidate.toString(); | ||
| } else { | ||
| return JsonFunctions.jsonize(candidate); | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,80 @@ | ||
| ============= | ||
| spath | ||
| ============= | ||
|
|
||
| .. rubric:: Table of contents | ||
|
|
||
| .. contents:: | ||
| :local: | ||
| :depth: 2 | ||
|
|
||
|
|
||
| Description | ||
| ============ | ||
| | The `spath` command allows extracting fields from structured text data. It currently allows selecting from JSON data with JSON paths. | ||
|
|
||
| Version | ||
| ======= | ||
| 3.3.0 | ||
|
|
||
| Syntax | ||
| ============ | ||
| spath input=<field> [output=<field>] [path=]<path> | ||
|
|
||
|
|
||
| * input: mandatory. The field to scan for JSON data. | ||
| * output: optional. The destination field that the data will be loaded to. Defaults to the value of `path`. | ||
| * path: mandatory. The path of the data to load for the object. For more information on path syntax, see `json_extract <../functions/json.rst#json_extract>`_. | ||
|
|
||
| Note | ||
| ===== | ||
| The `spath` command currently does not support pushdown behavior for extraction. It will be slow on large datasets. It's generally better to index fields needed for filtering directly instead of using `spath` to filter nested fields. | ||
|
|
||
| Example 1: Simple Field Extraction | ||
| ================================== | ||
|
|
||
| The simplest spath is to extract a single field. This extracts `n` from the `doc` field of type `text`. | ||
|
|
||
| PPL query:: | ||
|
|
||
| PPL> source=test_spath | spath input=doc n; | ||
| fetched rows / total rows = 3/3 | ||
| +----------+---+ | ||
| | doc | n | | ||
| |----------+---| | ||
| | {"n": 1} | 1 | | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. results is string? "1"
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yep, returns a string -- we can't really have it dynamically choose the type until we have the ability to dynamically make schemas at planning time |
||
| | {"n": 2} | 2 | | ||
| | {"n": 3} | 3 | | ||
| +----------+---+ | ||
|
|
||
| Example 2: Lists & Nesting | ||
| ============================ | ||
|
|
||
| These queries demonstrate more JSON path uses, like traversing nested fields and extracting list elements. | ||
|
|
||
| PPL query:: | ||
|
|
||
| PPL> source=test_spath | spath input=doc output=first_element list{0} | spath input=doc output=all_elements list{} | spath input=doc output=nested nest_out.nest_in; | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What is the expectation of conflict output?
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Same as with |
||
| fetched rows / total rows = 3/3 | ||
| +------------------------------------------------------+---------------+--------------+--------+ | ||
| | doc | first_element | all_elements | nested | | ||
| |------------------------------------------------------+---------------+--------------+--------| | ||
| | {"list": [1, 2, 3, 4], "nest_out": {"nest_in": "a"}} | 1 | [1,2,3,4] | a | | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. result is string? "[1,2,3,4]" "null", and "[]"
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We could also return |
||
| | {"list": [], "nest_out": {"nest_in": "a"}} | null | [] | a | | ||
| | {"list": [5, 6], "nest_out": {"nest_in": "a"}} | 5 | [5,6] | a | | ||
| +------------------------------------------------------+---------------+--------------+--------+ | ||
|
|
||
| Example 3: Sum of inner elements | ||
| ============================ | ||
|
|
||
| The example shows extracting an inner field and doing statistics on it, using the docs from example 1. It also demonstrates that `spath` always returns strings for inner types. | ||
|
|
||
| PPL query:: | ||
|
|
||
| PPL> source=test_spath | spath input=doc n | eval n=cast(n as int) | stats sum(n); | ||
| fetched rows / total rows = 1/1 | ||
| +--------+ | ||
| | sum(n) | | ||
| |--------| | ||
| | 6 | | ||
| +--------+ | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,48 @@ | ||
| /* | ||
| * Copyright OpenSearch Contributors | ||
| * SPDX-License-Identifier: Apache-2.0 | ||
| */ | ||
|
|
||
| package org.opensearch.sql.calcite.remote; | ||
|
|
||
| import static org.opensearch.sql.util.MatcherUtils.rows; | ||
| import static org.opensearch.sql.util.MatcherUtils.schema; | ||
| import static org.opensearch.sql.util.MatcherUtils.verifyDataRows; | ||
| import static org.opensearch.sql.util.MatcherUtils.verifySchema; | ||
|
|
||
| import java.io.IOException; | ||
| import org.json.JSONObject; | ||
| import org.junit.jupiter.api.Test; | ||
| import org.opensearch.client.Request; | ||
| import org.opensearch.sql.ppl.PPLIntegTestCase; | ||
|
|
||
| public class CalcitePPLSpathCommandIT extends PPLIntegTestCase { | ||
| @Override | ||
| public void init() throws Exception { | ||
| super.init(); | ||
| enableCalcite(); | ||
|
|
||
| loadIndex(Index.BANK); | ||
|
|
||
| // Create test data for string concatenation | ||
| Request request1 = new Request("PUT", "/test_spath/_doc/1?refresh=true"); | ||
| request1.setJsonEntity("{\"doc\": \"{\\\"n\\\": 1}\"}"); | ||
| client().performRequest(request1); | ||
|
|
||
| Request request2 = new Request("PUT", "/test_spath/_doc/2?refresh=true"); | ||
| request2.setJsonEntity("{\"doc\": \"{\\\"n\\\": 2}\"}"); | ||
| client().performRequest(request2); | ||
|
|
||
| Request request3 = new Request("PUT", "/test_spath/_doc/3?refresh=true"); | ||
| request3.setJsonEntity("{\"doc\": \"{\\\"n\\\": 3}\"}"); | ||
| client().performRequest(request3); | ||
| } | ||
|
|
||
| @Test | ||
| public void testSimpleSpath() throws IOException { | ||
| JSONObject result = | ||
| executeQuery("source=test_spath | spath input=doc output=result path=n | fields result"); | ||
| verifySchema(result, schema("result", "string")); | ||
| verifyDataRows(result, rows("1"), rows("2"), rows("3")); | ||
| } | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: maybe add another error case of malformed JSON / invalid paths |
||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -80,6 +80,7 @@ | |
| import org.opensearch.sql.ast.tree.Relation; | ||
| import org.opensearch.sql.ast.tree.Rename; | ||
| import org.opensearch.sql.ast.tree.Reverse; | ||
| import org.opensearch.sql.ast.tree.SPath; | ||
| import org.opensearch.sql.ast.tree.Sort; | ||
| import org.opensearch.sql.ast.tree.SubqueryAlias; | ||
| import org.opensearch.sql.ast.tree.TableFunction; | ||
|
|
@@ -527,6 +528,34 @@ public UnresolvedPlan visitParseCommand(OpenSearchPPLParser.ParseCommandContext | |
| return new Parse(ParseMethod.REGEX, sourceField, pattern, ImmutableMap.of()); | ||
| } | ||
|
|
||
| @Override | ||
| public UnresolvedPlan visitSpathCommand(OpenSearchPPLParser.SpathCommandContext ctx) { | ||
| String inField = null; | ||
| String outField = null; | ||
| String path = null; | ||
|
|
||
| for (OpenSearchPPLParser.SpathParameterContext param : ctx.spathParameter()) { | ||
Swiddis marked this conversation as resolved.
Show resolved
Hide resolved
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If an argument appears multiple times, we overwrite it with the new value. This is consistent with how the other commands behave under the same scenario.
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Even input parameter can be multiple? I think it may be worth adding this comment in code. |
||
| if (param.input != null) { | ||
| inField = param.input.getText(); | ||
| } | ||
| if (param.output != null) { | ||
| outField = param.output.getText(); | ||
| } | ||
| if (param.path != null) { | ||
| path = param.path.getText(); | ||
| } | ||
| } | ||
|
|
||
| if (inField == null) { | ||
| throw new IllegalArgumentException("`input` parameter is required for `spath`"); | ||
| } | ||
| if (path == null) { | ||
| throw new IllegalArgumentException("`path` parameter is required for `spath`"); | ||
| } | ||
|
|
||
| return new SPath(inField, outField, path); | ||
| } | ||
|
|
||
| @Override | ||
| public UnresolvedPlan visitPatternsCommand(OpenSearchPPLParser.PatternsCommandContext ctx) { | ||
| UnresolvedExpression sourceField = internalVisitExpression(ctx.source_field); | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing license header