Translate PPL dedup Command Part 1: allowedDuplication=1 #521
Signed-off-by: Lantao Jin <ltjin@amazon.com>
Part 2
To translate
One unclarified point is what is the

Option 1
AFAIK, there should always be a time series field for log analysis, such as

Option 2
If we cannot point to an existing time series field as order_key, an option is to change the syntax of the dedup command as follows (making sure the sql repo changes first)

Option 3
Partitioning and ordering by the same columns:
In that case, the results can be non-deterministic for large datasets. The non-determinism arises because the sort has no secondary tie-breaking rule to guarantee a consistent order of rows within each partition. If multiple rows have the same values for the partition and order columns, their relative order can vary across runs.
A complex case for option 3 is when there is a sort before the dedup command
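The tie-breaking concern with option 3 can be illustrated without Spark. The following is only a minimal sketch in plain Java: the `Row` record, the `payload` column, and `keepFirstPerPartition` are hypothetical names introduced here, and the method merely emulates keeping the first row of each partition when the partition key and the order key are the same columns. It is not the translation implemented in this PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TieBreakSketch {
    // Hypothetical row shape: a and b are the dedup fields, payload is any other column.
    record Row(String a, String b, String payload) {}

    // Emulates "keep row number 1" when partitioning and ordering by the same columns (a, b).
    static List<Row> keepFirstPerPartition(List<Row> input) {
        Map<List<String>, List<Row>> partitions = new LinkedHashMap<>();
        for (Row r : input) {
            partitions.computeIfAbsent(Arrays.asList(r.a(), r.b()), k -> new ArrayList<>()).add(r);
        }
        List<Row> kept = new ArrayList<>();
        for (List<Row> rows : partitions.values()) {
            // Every row in a partition ties on the sort key, so the stable sort
            // leaves the rows in arrival order and the "winner" depends on that order.
            rows.sort(Comparator.comparing(Row::a).thenComparing(Row::b));
            kept.add(rows.get(0));
        }
        return kept;
    }

    public static void main(String[] args) {
        Row first = new Row("x", "y", "p1");
        Row second = new Row("x", "y", "p2");
        // Same data, different arrival order, different surviving row.
        System.out.println(keepFirstPerPartition(List.of(first, second)));
        System.out.println(keepFirstPerPartition(List.of(second, first)));
    }
}
```

In this sketch the first call keeps the row with payload `p1` and the second keeps `p2`. In a distributed engine the arrival order within a partition is not fixed, which is why the surviving row can differ between runs unless a real tie-breaker column is added to the ordering.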
Any thoughts? @YANG-DB @penghuo @dai-chen

FYI: In Spark, a window function requires the window to be ordered. The query fails if the ORDER BY clause is not specified. For example,
The second unclarified point is about node
Need more guidance on the purpose of this repository and how it should be used in production or in the future. For the time being, I choose
Agree on the window function. Questions

I updated the comment #521 (comment)
Summary
Discussed with @penghuo offline. Here are the conclusions:
We will choose option 3 (partitioning and ordering by the same columns). Ref #521 (comment)
Again, none of the above should block code review for this PR.
@@ -271,7 +273,105 @@ public LogicalPlan visitWindowFunction(WindowFunction node, CatalystPlanContext

    @Override
    public LogicalPlan visitDedupe(Dedupe node, CatalystPlanContext context) {
        throw new IllegalStateException("Not Supported operation : dedupe ");
        node.getChild().get(0).accept(this, context);
Not mandatory, but for the future: let's extract the concrete logic into a dedicated strategy class and inject that into the QueryPlanner.
This would reduce complexity, simplify testing, and improve separation of concerns.
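One way to picture this suggestion is the sketch below. It is an illustration only: `Node`, `Plan`, `TranslationStrategy`, `DedupStrategy`, and this `QueryPlanner` are hypothetical stand-ins written for the example, not the classes in this repository, and the string-based plan only marks where the real Catalyst plan construction would go.

```java
import java.util.HashMap;
import java.util.Map;

public class StrategySketch {
    // Hypothetical stand-ins for a PPL AST node and a logical plan.
    record Node(String command, String argument) {}
    record Plan(String description) {}

    // One strategy per command keeps translation logic out of the visitor.
    interface TranslationStrategy {
        Plan translate(Node node, Plan child);
    }

    // Hypothetical planner that receives its strategies by injection.
    static class QueryPlanner {
        private final Map<String, TranslationStrategy> strategies = new HashMap<>();

        QueryPlanner register(String command, TranslationStrategy strategy) {
            strategies.put(command, strategy);
            return this;
        }

        Plan plan(Node node, Plan child) {
            TranslationStrategy strategy = strategies.get(node.command());
            if (strategy == null) {
                throw new IllegalStateException("Not Supported operation : " + node.command());
            }
            return strategy.translate(node, child);
        }
    }

    // The dedup logic lives in its own class and can be unit tested in isolation.
    static class DedupStrategy implements TranslationStrategy {
        @Override
        public Plan translate(Node node, Plan child) {
            return new Plan("Deduplicate[" + node.argument() + "](" + child.description() + ")");
        }
    }

    public static void main(String[] args) {
        QueryPlanner planner = new QueryPlanner().register("dedup", new DedupStrategy());
        Plan plan = planner.plan(new Node("dedup", "a, b"), new Plan("Relation[t]"));
        System.out.println(plan.description());
    }
}
```

With this shape, the visitor method would only delegate to the registered strategy, and each strategy could be tested with a hand-built node and child plan.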
Signed-off-by: YANGDB <yang.db.dev@gmail.com>
* Translate PPL Dedup Command: only one duplication allowed
* add document
Signed-off-by: Lantao Jin <ltjin@amazon.com>
Signed-off-by: YANGDB <yang.db.dev@gmail.com>
Co-authored-by: YANGDB <yang.db.dev@gmail.com>
(cherry picked from commit 7c4244f)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Description
Syntax
This PR translates the following three PPL dedup commands to different Spark logical plans. Note that the allowed duplication in this PR is 1. Assuming the <field-list> is a, b:
| dedup [1] a, b [keepempty=false]
| dedup [1] a, b keepempty=true
| dedup [1] a, b keepempty=true
| dedup [1] a, b [keepempty=true] consecutive=true
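The intended behavior of the three variants can be sketched over an in-memory list. This is a hedged illustration, not the Spark translation from this PR: it assumes the semantics described in the PPL dedup documentation (with `keepempty=false`, rows with a null in any dedup field are dropped; with `keepempty=true`, they are kept; with `consecutive=true`, only adjacent duplicates are removed). The `Row` record, the `dedup` and `ids` helpers, and the choice that null rows do not affect the "previous key" in consecutive mode are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DedupSemanticsSketch {
    // Hypothetical row: a and b are the dedup fields, id only identifies the row.
    record Row(String a, String b, int id) {}

    // Emulates "dedup 1 a, b" with the keepempty and consecutive options.
    static List<Row> dedup(List<Row> rows, boolean keepEmpty, boolean consecutive) {
        List<Row> out = new ArrayList<>();
        Set<List<String>> seen = new HashSet<>();
        List<String> previous = null;
        for (Row r : rows) {
            if (r.a() == null || r.b() == null) {
                if (keepEmpty) {
                    out.add(r);
                }
                continue; // assumption: null rows never update the previous key
            }
            List<String> key = Arrays.asList(r.a(), r.b());
            // consecutive mode compares with the previous key only; otherwise with all keys seen so far
            boolean duplicate = consecutive ? key.equals(previous) : !seen.add(key);
            if (!duplicate) {
                out.add(r);
            }
            previous = key;
        }
        return out;
    }

    // Helper that lists the ids of the surviving rows.
    static List<Integer> ids(List<Row> rows) {
        List<Integer> result = new ArrayList<>();
        for (Row r : rows) {
            result.add(r.id());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
            new Row("x", "y", 1), new Row("x", "y", 2), new Row("x", null, 3),
            new Row("z", "y", 4), new Row("x", "y", 5));
        System.out.println(ids(dedup(rows, false, false))); // keepempty=false
        System.out.println(ids(dedup(rows, true, false)));  // keepempty=true
        System.out.println(ids(dedup(rows, true, true)));   // keepempty=true consecutive=true
    }
}
```

Under these assumptions the three calls keep ids `[1, 4]`, `[1, 3, 4]`, and `[1, 3, 4, 5]` respectively: row 5 survives only in consecutive mode because its key differs from the key of the row just before it.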
Issues Resolved
Resolves #523 (Subtask of #421)
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.