Support RelationSubquery PPL #775

Conversation
Signed-off-by: Lantao Jin <ltjin@amazon.com>
```
@@ -55,9 +55,9 @@ commands
    ;

searchCommand
```
@LantaoJin can you please share more details of the usage of FROM here? Maybe give some examples of when to use SEARCH vs FROM?
thanks
There is no specific difference between the keywords SEARCH and FROM from a syntax perspective. I just thought we might need an alias for the SEARCH keyword, since PPL should be more ambitious in its query semantics: PPL will gain more powerful functionality for data exploration and analytics beyond searching text or operating on time-series data. Similar things are happening in the industry, such as the pipe syntax in Google BigQuery and the KQL tabular operators.
Even in Splunk SPL2, the `from` command is separate from the `search` command.
A query can be rewritten naturally in a piped query language by users who are familiar with SQL:

ANSI SQL:
```sql
SELECT sum(bytes) AS sum, host
FROM main WHERE earliest=-5m@m
GROUP BY host
```

SPL2 `from` query:
```
| FROM main
WHERE earliest=-5m@m GROUP BY host
SELECT sum(bytes) AS sum, host
```
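For comparison, a rough PPL sketch of the same aggregation could look like the following (the field name `timestamp` and the literal cutoff are placeholders standing in for the Splunk-style relative time filter `earliest=-5m@m`, which has no direct counterpart here):

```
source = main
| where timestamp > '2024-01-01 00:00:00'
| stats sum(bytes) as sum by host
```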
But now I think I will revert this keyword alias, since it should be fully discussed before we deliver it to our customers. I hope we can expand the vision of PPL to cover more data-analytics semantics.
@LantaoJin this sounds like a great idea and a very clean and clear vision going forward.
IMO please add this as a separate PR (under the doc/planning folder) including the explanations, concepts and comparative analysis you've shown here ...
thanks again !!
Signed-off-by: Lantao Jin <ltjin@amazon.com>
```diff
@@ -32,7 +32,7 @@ The example show fetch all the document from accounts index with .

 PPL query:

-os> source=accounts account_number=1 or gender="F";
+os> SEARCH source=accounts account_number=1 or gender="F";
```
There are two example queries in this doc. One of them omits the SEARCH keyword, so keep the SEARCH keyword in this one.
```diff
 tableSourceClause
-    : tableSource (COMMA tableSource)*
+    : tableSource (COMMA tableSource)* (AS alias = qualifiedName)?
```
A table alias is useful in a query which contains a subquery, for example

```sql
select a, (
    select sum(b)
    from catalog.schema.table1 as t1
    where t1.a = t2.a
  ) sum_b
from catalog.schema.table2 as t2
```

`t1` and `t2` are table aliases used in the correlated subquery, and `sum_b` is the subquery alias.
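A rough PPL counterpart of the SQL above could look like the sketch below, assuming the table-alias syntax added here combines with a scalar subquery inside `eval` (the exact scalar-subquery form may differ from what this PR delivers):

```
source = catalog.schema.table2 as t2
| eval sum_b = [
    source = catalog.schema.table1 as t1
    | where t1.a = t2.a
    | stats sum(b)
  ]
| fields a, sum_b
```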
Thanks for the detailed review - can you also add this explanation to the ppl-subquery-command doc?
thanks
Yes, I will add more examples to the doc.
```
// One tableSourceClause will generate one Relation node, with or without an alias,
// even if the relation contains more than one table source.
// These table sources in one relation will be read one by one in OpenSearch.
```
Do we support source = tb1, tb2, tb3 as tbl?
Or should we treat tb1, tb2, tb3 as a single table and let the datasource connector handle it? For instance, source=`tb1, tb2, tb3` as tbl.
Yes. The current valid syntax is source = tb1, tb2, tb3 as tbl or source = `tb1`, `tb2`, `tb3` as tbl. Indexes tb1, tb2 and tb3 will be converted to one relation in OpenSearch, and tbl is the alias of this relation.
Spark doesn't support comma-separated tables as a relation: Catalyst will throw "Table not found" for an UnresolvedRelation with a comma-named table identifier. source=`tb1, tb2, tb3` as tbl is equivalent to source = tb1, tb2, tb3 as tbl for Spark, and it fails in resolution for the reason I just mentioned.
For an OpenSearch index, source=`tb1, tb2, tb3` as tbl cannot work either, since "tb1, tb2, tb3" is not a valid index name.
source = tb1, tb2, tb3 as tbl and source = `tb1`, `tb2`, `tb3` as tbl are valid because we never need the pattern source = tb1 as t1, tb2 as t2, tb3 as t3, since it is not meaningful: tb1, tb2 and tb3 are combined into one relation in the plan.
That's why source=`tb1, tb2, tb3` as tbl will be treated as the relation name "tb1, tb2, tb3" and should fail with table not found.
> The current valid syntax is source = tb1, tb2, tb3 as tbl or source = `tb1`, `tb2`, `tb3` as tbl

Is it a valid grammar in spark-sql? If not, does it confuse users?

> For an OpenSearch index, source=`tb1, tb2, tb3` as tbl cannot work either, since "tb1, tb2, tb3" is not a valid index name.

PPL on OpenSearch supports it; it means multiple OpenSearch indexes.
Should we let the Catalog handle table name resolution? The OpenSearch catalog can resolve a table name to multiple indexes properly, for instance `catalog`.`namespace`.`index-2024*,index-2023-12`.
> The current valid syntax is source = tb1, tb2, tb3 as tbl or source = `tb1`, `tb2`, `tb3` as tbl
> Is it a valid grammar in spark-sql? If not, does it confuse users?

Yes, it is a valid grammar in opensearch-spark. For example, search source=test1, test2 or search source=`test1`, `test2` generates a Spark plan with a Union:

```
'Union
:- 'Project [*]
:  +- 'UnresolvedRelation [spark_catalog, default, flint_ppl_test1], [], false
+- 'Project [*]
   +- 'UnresolvedRelation [spark_catalog, default, flint_ppl_test2], [], false
```

> For an OpenSearch index, source=`tb1, tb2, tb3` as tbl cannot work either, since "tb1, tb2, tb3" is not a valid index name.
> PPL on OpenSearch supports it; it means multiple OpenSearch indexes.

Oh, that is the key difference: opensearch-spark can't handle it, since "tb1, tb2, tb3" in backticks will be handled as a whole, and a name with a comma is invalid in Spark.
PPL on OpenSearch supports:
- source=accounts, account2
- source=`accounts`,`account2`
- source=`accounts, account2`
But PPL on Spark supports only the first two. I would suggest marking the third as invalid, since users will usually treat the content in backticks as a whole. `accounts, account2` seems more specific to the OpenSearch domain. For the instance you provided above, my suggestion is to treat the content in backticks as a whole. @penghuo
- √ source=`catalog`.`namespace`.`index-2024*`, `catalog`.`namespace`.`index-2023-12`
- √ source=`catalog`.`namespace`.index-2024*, index-2023-12
- × source=`catalog`.`namespace`.`index-2024*, index-2023-12`
Any different thoughts? I think it's worth opening a meta issue in the sql repo for further discussion if we can't get aligned here; the context in a closed PR could easily be lost.
Signed-off-by: Lantao Jin <ltjin@amazon.com>
@penghuo @YANG-DB In commit 1a4451a, I added an integration test and documentation for the case where the subquery appears in the search filter. It explains how a subquery can rewrite an SPL subsearch, as we discussed offline.
PPL:
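The exact query from that commit is not reproduced here; as a minimal sketch of the idea, an InSubquery expression can sit directly in the search filter (table and field names below are placeholders):

```
source = outer a in [ source = inner | fields b ]
```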
@LantaoJin excellent documentation - thanks !!
* Support RelationSubquery PPL
* fix doc
* revert the FROM alias
* add the case for subquery in search filter

Signed-off-by: Lantao Jin <ltjin@amazon.com>
Description
(Relation) Subquery usage
InSubquery, ExistsSubquery and ScalarSubquery are all subquery expressions, but RelationSubquery is not a subquery expression; it is a subquery plan, which is commonly used in a Join or Search/From clause.

```
source = table1 | join left = l right = r [ source = table2 | where d > 10 | head 5 ]
```
(subquery in join right side)

```
source = [ source = table1 | join left = l right = r [ source = table2 | where d > 10 | head 5 ] | stats count(a) by b ] as outer | head 1
```
SQL Migration examples with Subquery PPL:
tpch q13
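For reference, TPC-H Q13 (the customer distribution query) in standard SQL looks roughly like this, with the comment-filter parameters substituted with 'special' and 'requests':

```sql
select c_count, count(*) as custdist
from (
    select c_custkey, count(o_orderkey) as c_count
    from customer left outer join orders
        on c_custkey = o_custkey
        and o_comment not like '%special%requests%'
    group by c_custkey
) as c_orders
group by c_count
order by custdist desc, c_count desc
```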
Rewritten by PPL (Relation) Subquery:
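The rewritten query from the PR is not captured in this thread; a sketch of how Q13 could be expressed with a relation subquery, reusing the join and stats constructs shown above, might look like the following (the ON-condition and `like` function syntax here are assumptions and may differ from the doc delivered in this PR):

```
source = [
    source = customer
    | join left = c right = o ON c_custkey = o_custkey AND not like(o_comment, '%special%requests%') [
        source = orders
      ]
    | stats count(o_orderkey) as c_count by c_custkey
  ] as c_orders
| stats count() as custdist by c_count
| sort - custdist, - c_count
```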
Issues Resolved
Resolves #713 as a sub-task of #661.
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.