-
Notifications
You must be signed in to change notification settings - Fork 180
Support pushdown physical sort operator to speedup SortMergeJoin #3864
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Lantao Jin <ltjin@amazon.com>
…ta=true Signed-off-by: Lantao Jin <ltjin@amazon.com>
...earch/src/main/java/org/opensearch/sql/opensearch/storage/scan/AbstractCalciteIndexScan.java
Outdated
Show resolved
Hide resolved
opensearch/src/main/java/org/opensearch/sql/opensearch/data/type/OpenSearchTextType.java
Outdated
Show resolved
Hide resolved
...earch/src/main/java/org/opensearch/sql/opensearch/storage/scan/AbstractCalciteIndexScan.java
Outdated
Show resolved
Hide resolved
Signed-off-by: Lantao Jin <ltjin@amazon.com>
Signed-off-by: Lantao Jin <ltjin@amazon.com>
Signed-off-by: Lantao Jin <ltjin@amazon.com>
Signed-off-by: Lantao Jin <ltjin@amazon.com>
Signed-off-by: Lantao Jin <ltjin@amazon.com>
| Map<String, OpenSearchDataType> finalResult = new HashMap<>(); | ||
| for (Map<String, OpenSearchDataType> map : candidateMaps) { | ||
| OpenSearchDescribeIndexRequest.mergeObjectAndArrayInsideMap(finalResult, map); | ||
| MergeRuleHelper.merge(finalResult, map); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not related. just fix a compile error in benchmarks module.
| "fielddata": true, | ||
| "fields": { | ||
| "keyword": { | ||
| "type": "keyword", | ||
| "ignore_above": 256 | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gender in test sample data should contain keyword subfield. If else, we cannot sorting, aggregate on it. Sorting on a fielddata field would get unstable result. ref link
| rows(1, 20, "f", "VA"), | ||
| rows(1, 30, "f", "IN"), | ||
| rows(1, 30, "f", "PA"), | ||
| rows(1, 30, "m", "IL"), | ||
| rows(1, 30, "m", "MD"), | ||
| rows(1, 30, "m", "TN"), | ||
| rows(1, 30, "m", "WA")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this lower case values in v2 were caused by grouping by a fielddata field.
| explainQueryToString( | ||
| "source=opensearch-sql_test_index_account " | ||
| + "| sort account_number, firstname, address, balance " | ||
| + "| sort - balance, - gender, address " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not sort on text type without keyword subfield in v3. Even sorting on text type with fielddata=true will get unexpected order:
For example, if gender is a text type field with fielddata=true.
Check following queries and their outputs in v2.
index=test1:
| gender |
|---|
| M a 1 |
| M b 1 |
| F c 2 |
| F d d 1 |
source = test1 | sort gender returns
| gender |
|---|
| M a 1 |
| M b 1 |
| F d d 1 |
| F c 2 |
source = test1 | sort - gender returns
| gender |
|---|
| M a 1 |
| M b 1 |
| F c 2 |
| F d d 1 |
index=test2:
| gender |
|---|
| F d d 1 |
| M a 1 |
| F c 2 |
| M b 1 |
source=test2 | sort gender returns
| gender |
|---|
| F d d 1 |
| M a 1 |
| M b 1 |
| F c 2 |
source = test2 | sort - gender returns
| gender |
|---|
| M a 1 |
| M b 1 |
| F d d 1 |
| F c 2 |
The logic of sorting a fielddata is 1) tokenization 2) internal sort by token 3) sort by first token
ASC: "F d d 1" =tokenization=> "f", "d", "d", "1" =internal sort by token=> "1", "f", "d", "d" =sort by first token=> "1"
DESC: "F d d 1" =tokenization=> "f", "d", "d", "1" =internal sort by token=> "f", "d", "d", "1" =sort by first token=> "f"
|
Could you review this again?@qianheng-aws @penghuo |
Signed-off-by: Lantao Jin <ltjin@amazon.com>
Signed-off-by: Lantao Jin <ltjin@amazon.com>
| rows("F", "KS", 7), | ||
| rows("F", "CO", 7), | ||
| rows("F", "NV", 8), | ||
| isPushdownEnabled() ? rows("F", "AR", 8) : rows("F", "NV", 8), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we introduce this disparity between with&without pushdown?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added "keyword" subfield for the gender, agg pushdown by gender works in caclite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why pushdown enable impact results?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why pushdown enable impact results?
This result difference is not introduced by code changes in this PR. You can reproduce it in main branch when change the schema from
"gender": {
"type": "text",
"fielddata": true
},
to
"gender": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
The three results (v2 non-pushdown, v3 non-pushdown, v3 pushdown) are different.
| v2 non-pushdown | v3 non-pushdown | v3 pushdown |
|---|---|---|
| "F", "VA", 8 | "F", "NV", 8 | "F", "AR", 8 |
But they are all correct. The rare command is not a deterministic command, it result depends on the order of return from OpenSearch and implementation (for v2). When the bucket pushdown works (by gender), the fetched data order is different with data order in non-pushdown.
| rows("F", "KS", 7), | ||
| rows("F", "CO", 7), | ||
| rows("F", "NV", 8), | ||
| isPushdownEnabled() ? rows("F", "AR", 8) : rows("F", "NV", 8), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why pushdown enable impact results?
| public void onMatch(RelOptRuleCall call) { | ||
| final LogicalSort sort = call.rel(0); | ||
| final CalciteLogicalIndexScan scan = call.rel(1); | ||
| final Sort sort = call.rel(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@LantaoJin @qianheng-aws couple questions?
- Is OpenSearchSortIndexScanRule intended to be a physical tweak rule for EnumerableRel?
- Do we still need Logical plan optimization rule and CalciteLogicalIndexScan? Can all existing rules be expressed as physical tweak rules for EnumerableRel?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EnumerableSort can generate from logical sort operator and logical join operator with physical SortMerge implementation. So I changed the OpenSearchSortIndexScanRule rule from logical layer to physical layer. It equals keeping the OpenSearchSortIndexScanRule in logical layer and adding a new rule for EnumerableSort + SortMergeJoin in physical layer.
Almost pushdown rules can move to physical layer but I didn't see any benefit to do that. We need logical plan rules. some optimization should happen in logical layer instead of physical. For example a LogicalJoin operator may generate multiple physical Join operators. rule applying in logical should be more efficient.
|
The backport to To backport manually, run these commands in your terminal: # Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/sql/backport-2.19-dev 2.19-dev
# Navigate to the new working tree
pushd ../.worktrees/sql/backport-2.19-dev
# Create a new branch
git switch --create backport/backport-3864-to-2.19-dev
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 71aa9ba0490305b81284eb369f212a8fc6e3936f
# Push it to GitHub
git push --set-upstream origin backport/backport-3864-to-2.19-dev
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/sql/backport-2.19-devThen, create a pull request where the |
…nsearch-project#3864) * Support pushdown physical sort operator to speedup SortMergeJoin Signed-off-by: Lantao Jin <ltjin@amazon.com> * Enable sort pushdown when the type is TEXT without fields and fielddata=true Signed-off-by: Lantao Jin <ltjin@amazon.com> * Fix UT Signed-off-by: Lantao Jin <ltjin@amazon.com> * Add OpenSearchTextType.toKeywordSubField() Signed-off-by: Lantao Jin <ltjin@amazon.com> * Fix IT Signed-off-by: Lantao Jin <ltjin@amazon.com> * Fix IT Signed-off-by: Lantao Jin <ltjin@amazon.com> * Only push down sort when order is ASC for fielddata text Signed-off-by: Lantao Jin <ltjin@amazon.com> * revert CalciteNoPushdownIT Signed-off-by: Lantao Jin <ltjin@amazon.com> * revert the fielddata logic Signed-off-by: Lantao Jin <ltjin@amazon.com> * Fix IT Signed-off-by: Lantao Jin <ltjin@amazon.com> * 'gender' in test data should contain keyword subfield Signed-off-by: Lantao Jin <ltjin@amazon.com> * Fix IT Signed-off-by: Lantao Jin <ltjin@amazon.com> * revert the typo change in test Signed-off-by: Lantao Jin <ltjin@amazon.com> * fix no pushdown IT Signed-off-by: Lantao Jin <ltjin@amazon.com> --------- Signed-off-by: Lantao Jin <ltjin@amazon.com> (cherry picked from commit 71aa9ba)
…p SortMergeJoin (#3864) (#3889) * Support pushdown physical sort operator to speedup SortMergeJoin (#3864) * Support pushdown physical sort operator to speedup SortMergeJoin Signed-off-by: Lantao Jin <ltjin@amazon.com> * Enable sort pushdown when the type is TEXT without fields and fielddata=true Signed-off-by: Lantao Jin <ltjin@amazon.com> * Fix UT Signed-off-by: Lantao Jin <ltjin@amazon.com> * Add OpenSearchTextType.toKeywordSubField() Signed-off-by: Lantao Jin <ltjin@amazon.com> * Fix IT Signed-off-by: Lantao Jin <ltjin@amazon.com> * Fix IT Signed-off-by: Lantao Jin <ltjin@amazon.com> * Only push down sort when order is ASC for fielddata text Signed-off-by: Lantao Jin <ltjin@amazon.com> * revert CalciteNoPushdownIT Signed-off-by: Lantao Jin <ltjin@amazon.com> * revert the fielddata logic Signed-off-by: Lantao Jin <ltjin@amazon.com> * Fix IT Signed-off-by: Lantao Jin <ltjin@amazon.com> * 'gender' in test data should contain keyword subfield Signed-off-by: Lantao Jin <ltjin@amazon.com> * Fix IT Signed-off-by: Lantao Jin <ltjin@amazon.com> * revert the typo change in test Signed-off-by: Lantao Jin <ltjin@amazon.com> * fix no pushdown IT Signed-off-by: Lantao Jin <ltjin@amazon.com> --------- Signed-off-by: Lantao Jin <ltjin@amazon.com> (cherry picked from commit 71aa9ba) * Fix IT Signed-off-by: Lantao Jin <ltjin@amazon.com> * revert changes in build.gradle Signed-off-by: Lantao Jin <ltjin@amazon.com> --------- Signed-off-by: Lantao Jin <ltjin@amazon.com>
Description
In Calcite + OpenSearch, the default physical Join operator is
EnumerableMergeJoin:After this patching, the
EnumerableSortoperators could be pushed down to DSL:In v3, a string field can enable sort pushdown if
Related Issues
Resolves #3863
Check List
--signoff.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.