Pre-load lazy vectors that are referenced by multiple sub expressions #2089

Closed

@barsondei barsondei commented Jul 22, 2022

A LazyVector that is referenced multiple times may be loaded for different sets
of rows because of conditional expressions with short-circuit semantics, e.g.
f(a) AND g(b), h(b). If b is a LazyVector and the f(a) AND g(b) expression is
evaluated first, b is loaded only for rows where f(a) is true. However, the
h(b) projection needs all rows of b. The previous solution was to set the final
selection flag to false in the FilterProject::project method. When the final
selection flag is false, a LazyVector is always loaded for all rows.

However, the problem also occurs within a single expression that contains
multiple sub-expressions under a non-null-propagating expression, e.g.
array_constructor(c0 + c1, if(c1 >= 0, c1, 0)). array_constructor is the
non-null-propagating expression with two sub-expressions: c0 + c1 and
if(c1 >= 0, c1, 0). The first sub-expression is evaluated only for the subset
of rows where c0 is not null, but the second sub-expression needs to be
evaluated for all rows.
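
The failure mode described above can be illustrated with a toy model. This is only a sketch and not Velox code: `ToyLazyColumn`, `load`, `valueAt` and `countMissing` are hypothetical names, and the sketch assumes a lazy column that can be materialized only once, for whatever rows the first caller asks for.

```cpp
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a lazy vector (not the Velox LazyVector class):
// values are materialized once, and only for the rows the first caller needs.
class ToyLazyColumn {
 public:
  explicit ToyLazyColumn(std::vector<int> source)
      : source_(std::move(source)), values_(source_.size()) {}

  // Materializes the requested rows on the first call. Later calls are
  // no-ops, modelling a column that cannot be loaded a second time.
  void load(const std::vector<bool>& rows) {
    if (loaded_) {
      return;
    }
    for (std::size_t row = 0; row < source_.size(); ++row) {
      if (rows[row]) {
        values_[row] = source_[row];
      }
    }
    loaded_ = true;
  }

  // Returns std::nullopt for rows that were never materialized.
  std::optional<int> valueAt(std::size_t row) const {
    return values_[row];
  }

 private:
  std::vector<int> source_;
  std::vector<std::optional<int>> values_;
  bool loaded_ = false;
};

// Counts the rows selected in 'rows' that the column never materialized.
std::size_t countMissing(
    const ToyLazyColumn& column,
    const std::vector<bool>& rows) {
  std::size_t missing = 0;
  for (std::size_t row = 0; row < rows.size(); ++row) {
    if (rows[row] && !column.valueAt(row).has_value()) {
      ++missing;
    }
  }
  return missing;
}
```

If the first consumer (c0 + c1 in the example) loads c1 only where c0 is not null, a second consumer that needs all rows finds a hole. Loading for all rows up front avoids it.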

Current fix: 1) In Expr::computeMetadata(), identify fields that are referenced
by more than one input expression, and load them at the start of evaluation in
Expr::eval(). 2) Similarly, identify fields that are referenced by more than
one expression in ExprSet, and load them in ExprSet::eval(). 3) Remove the
existing workaround, setting isFinalSelection to false, from
FilterProject::project().
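
Step 1 of the fix can be sketched as a walk over an expression tree. The code below is an illustration under assumptions, not the code in this PR: `ToyExpr`, `makeField`, `makeCall`, `collectFields` and `multiplyReferencedFields` are hypothetical names, and constants are simply left out of the tree.

```cpp
#include <memory>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical expression node: a leaf field reference has a non-empty
// 'field'; a function call has one or more 'inputs'.
struct ToyExpr {
  std::string field;
  std::vector<std::shared_ptr<ToyExpr>> inputs;
};

std::shared_ptr<ToyExpr> makeField(const std::string& name) {
  auto expr = std::make_shared<ToyExpr>();
  expr->field = name;
  return expr;
}

std::shared_ptr<ToyExpr> makeCall(std::vector<std::shared_ptr<ToyExpr>> inputs) {
  auto expr = std::make_shared<ToyExpr>();
  expr->inputs = std::move(inputs);
  return expr;
}

// Collects the distinct fields referenced anywhere under 'expr'.
void collectFields(const ToyExpr& expr, std::set<std::string>& fields) {
  if (!expr.field.empty()) {
    fields.insert(expr.field);
  }
  for (const auto& input : expr.inputs) {
    collectFields(*input, fields);
  }
}

// Returns the fields that appear under more than one direct input of 'expr'.
// These are the candidates for loading up front, before any input narrows
// the set of rows.
std::set<std::string> multiplyReferencedFields(const ToyExpr& expr) {
  std::set<std::string> seen;
  std::set<std::string> result;
  for (const auto& input : expr.inputs) {
    std::set<std::string> fields;
    collectFields(*input, fields);
    for (const auto& field : fields) {
      if (!seen.insert(field).second) {
        result.insert(field);
      }
    }
  }
  return result;
}
```

For array_constructor(c0 + c1, if(c1 >= 0, c1, 0)), c1 appears under both inputs of array_constructor, so it is the only field this sketch would flag at the root.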

@facebook-github-bot facebook-github-bot added the CLA Signed label Jul 22, 2022
@barsondei barsondei force-pushed the lazyVectorMissingRows branch from cd8ba50 to 5781e32 Compare July 22, 2022 16:54

Yuhta commented Jul 22, 2022

Fix #2073

@barsondei barsondei changed the title Fix finalSelection when null value's evalution (#2073) Fix finalSelection when skip null value's evalution (#2073) Jul 22, 2022

@mbasmanova mbasmanova left a comment

@barsondei Thank you for reporting a problem and proposing a fix. I don't quite understand the fix and it appears in a surprising place. I would expect that a proper fix would reconcile existing workarounds in FilterProject::project and Expr::eval and, perhaps, move the logic of loading lazy vectors into ExprSet::eval.

Would you update the PR description to explain the current design, the problems with it and the new design?

@bikramSingh91

Thanks @barsondei for your contribution and thanks @mbasmanova for your valuable review comments.
I can help take a look at your patch, @barsondei. In the meantime, I'll be looking forward to your changes once you get a chance to address @mbasmanova's review comments.


barsondei commented Jul 26, 2022

@mbasmanova Thank you for your comments.
I think this PR is a fix that follows the current design, which fails to handle this case.
A bitmap specifies the rows to evaluate or load during expression evaluation. This becomes more complex when a field is referenced multiple times by expressions, e.g. f(a) AND g(b), h(b). If b is a LazyVector and the f(a) AND g(b) expression is evaluated first, b is loaded only for rows where f(a) is true. However, the h(b) projection needs all rows of b.
The bitmap changes while the expression is being evaluated:

  1. f0(a) AND f1(a) AND f2(a): f1(a)'s bitmap is a subset of f0(a)'s bitmap, and f2(a)'s bitmap is a subset of f1(a)'s bitmap.
  2. if(f0(a), f1(b), f2(b)): f0(a)'s bitmap = f1(b)'s bitmap ∪ f2(b)'s bitmap.
  3. select expr1, expr2 from table where expr_filter: expr1 and expr2 start evaluating with the bitmap passed by expr_filter, regardless of the evaluation order.
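
The narrowing in case 1 can be sketched as follows. This is an illustration only: `Rows`, `evalAndSelections` and `isSubset` are hypothetical names, and a plain `std::vector<bool>` stands in for the selection bitmap.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical selection bitmap: true means the row is still active.
using Rows = std::vector<bool>;

// Evaluates a conjunction with short-circuit semantics and records the
// bitmap each conjunct was evaluated on. Every conjunct sees only the rows
// on which all earlier conjuncts returned true.
std::vector<Rows> evalAndSelections(
    const Rows& initial,
    const std::vector<std::function<bool(std::size_t)>>& conjuncts) {
  std::vector<Rows> selections;
  Rows active = initial;
  for (const auto& conjunct : conjuncts) {
    selections.push_back(active);
    for (std::size_t row = 0; row < active.size(); ++row) {
      if (active[row] && !conjunct(row)) {
        active[row] = false;
      }
    }
  }
  return selections;
}

// Returns true if every row selected in 'a' is also selected in 'b'.
bool isSubset(const Rows& a, const Rows& b) {
  for (std::size_t row = 0; row < a.size(); ++row) {
    if (a[row] && !b[row]) {
      return false;
    }
  }
  return true;
}
```

A lazy column that is first touched by a later conjunct would therefore be loaded for a narrower bitmap than the one the conjunction started with.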

To deal with multiple references, the current design adopts a relatively simple scheme.
Expressions are evaluated in post-order DFS:

  1. If the bitmap will not grow for the unvisited exprs, as indicated by the flag isFinalSelection_, use the current expr's bitmap to load the LazyVector.
  2. Otherwise, use finalSelection_, which is stored in EvalCtx, to load the LazyVector.
    finalSelection_ is maintained to hold a superset of the unvisited exprs' bitmaps.

So there is a lot of code to maintain isFinalSelection_ and finalSelection_ when dealing with conditional expressions, e.g. AND/OR, SWITCH/IF, COALESCE. However, the case in Expr::evalAll is not handled. Null-value skipping in Expr::evalAll has short-circuit semantics similar to AND/OR: SubExpr[i]'s bitmap is a subset of SubExpr[i-1]'s bitmap.
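
The two-rule scheme above amounts to a single choice of which bitmap to load with. A minimal sketch, assuming a simplified context: `ToyEvalCtx` and `rowsToLoad` are hypothetical names and are not the actual EvalCtx interface.

```cpp
#include <vector>

using Rows = std::vector<bool>;

// Hypothetical, simplified evaluation context. The two members mirror the
// isFinalSelection_ flag and finalSelection_ bitmap described in the comment.
struct ToyEvalCtx {
  bool isFinalSelection = true;
  Rows finalSelection;
};

// Picks the rows a lazy vector should be loaded for: the current bitmap if
// no unvisited expression can need more rows, otherwise the wider bitmap
// kept in the context.
Rows rowsToLoad(const ToyEvalCtx& ctx, const Rows& currentRows) {
  return ctx.isFinalSelection ? currentRows : ctx.finalSelection;
}
```

The bug discussed here corresponds to the flag still saying "final" while the current bitmap has already been narrowed by null skipping.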

@mbasmanova

@barsondei Thank you for additional clarifications. I took a closer look into this problem and shared my findings in #2073 (comment)

Null-value skipping in Expr::evalAll has short-circuit semantics similar to AND/OR: SubExpr[i]'s bitmap is a subset of SubExpr[i-1]'s bitmap.

This is indeed the case, but only for non-null-propagating expressions. array_constructor and row_constructor expressions do not propagate nulls, i.e. a null in one of the inputs doesn't produce a null result. For null-propagating expressions, where a null in any input produces a null result, it is sufficient to evaluate inputs for the subset of rows where none of the inputs are null.
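
The distinction can be sketched as a function that decides which rows the next input is evaluated for. This is a toy illustration: `rowsForNextInput` and its parameters are hypothetical names, not Velox APIs.

```cpp
#include <cstddef>
#include <vector>

using Rows = std::vector<bool>;

// Returns the rows on which the next input of a call needs to be evaluated.
// For a null-propagating call, rows where an earlier input is null can be
// dropped because the result is null regardless. For a call that does not
// propagate nulls (such as array_constructor), every row is still needed.
Rows rowsForNextInput(
    const Rows& rows,
    const Rows& previousInputIsNull,
    bool propagatesNulls) {
  if (!propagatesNulls) {
    return rows;
  }
  Rows result = rows;
  for (std::size_t row = 0; row < result.size(); ++row) {
    if (previousInputIsNull[row]) {
      result[row] = false;
    }
  }
  return result;
}
```

In the example from this PR, c0 + c1 propagates nulls, so c1 is needed only where c0 is not null, while the enclosing array_constructor still needs its second input on every row.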

@bikramSingh91

Thanks @barsondei for posting an update. I will get back to you with a review by end of day today (5 pm PST)


barsondei commented Aug 1, 2022

As discussed in #2073 with @mbasmanova:
During Expr::computeMetadata(), find the fields that are referenced by more than one input, then load those fields before the expression is actually evaluated in Expr::eval().
If a field is referenced by more than one expression, load it for all rows in ExprSet::eval().
Remove the existing workaround from FilterProject::project().
Keep the use of finalSelection in EvalCtx::ensureFieldLoaded in order to support evaluation over a subset of rows.

Also waiting for your comments, @bikramSingh91.

@mbasmanova mbasmanova left a comment

@barsondei Thank you for iterating on this PR. The implementation has changed quite a bit, so the PR title is no longer accurate. Would you update the PR title and also provide a description to explain the original problem and the fix implemented here?

Some additional comments below.

@barsondei barsondei changed the title Fix finalSelection when skip null value's evalution (#2073) pre-load lazy vectors that is referenced by multiple expressions or expr's inputs (#2073) Aug 2, 2022
@barsondei

This PR aims to fix the bug we reported in #2073.

We evaluate an expression array_constructor(c0 + c1, if(c1 >= 0, c1, 0)) over lazy vectors. First, we evaluate c0 + c1. It so happens that c0 has a null in the last position, hence we end up loading c1 for all positions but the last. Next, we evaluate the if(c1 >= 0, c1, 0) expression over the already loaded vector and end up producing an incorrect result for the last position.
This problem is similar to evaluating multiple expressions. Different expressions may need different sets of rows, but individual expressions do not have visibility into their siblings. We have worked around this problem before by adding logic to the FilterProject::project method to set final selection to 'false'.
In this case the problem occurs within a single expression that contains multiple sub-expressions under a non-null-propagating expression. array_constructor is the non-null-propagating expression with two sub-expressions: c0 + c1 and if(c1 >= 0, c1, 0). The first sub-expression is evaluated only for the subset of rows where c0 is not null, but the second sub-expression needs to be evaluated for all rows. The two sub-expressions do not know about each other, though. If array_constructor were null-propagating, a null in the first sub-expression would make the whole expression null, and we wouldn't need to evaluate the second sub-expression on rows where the first sub-expression is null.

Current implementation:

  1. During Expr::computeMetadata(), find the fields that are referenced by more than one input expr, then load those fields before the expression is actually evaluated in Expr::eval().
  2. Similarly to 1), find the fields that are referenced by more than one expression, and pre-load them in ExprSet::eval().
  3. Remove the existing workaround, setting isFinalSelection to false, from FilterProject::project().

@mbasmanova mbasmanova left a comment

@barsondei Thank you for working on this fix. Overall it looks great, modulo a couple of small comments; the PR title and description need updating.

When code is merged, PR title and description together form a commit message. Hence, it is important that these are written clearly and explain the changes in detail. Here are some guidelines about how to write commit messages:

@barsondei barsondei changed the title pre-load lazy vectors that is referenced by multiple expressions or expr's inputs (#2073) Pre-load lazy vectors that is referenced by multiple expressions or expr's inputs (#2073) Aug 3, 2022
@barsondei barsondei requested a review from mbasmanova August 3, 2022 16:25
@barsondei

@mbasmanova I have updated the PR's description.

@mbasmanova mbasmanova changed the title Pre-load lazy vectors that is referenced by multiple expressions or expr's inputs (#2073) Pre-load lazy vectors that are referenced by multiple sub expressions Aug 3, 2022

@mbasmanova mbasmanova left a comment

@barsondei Barson, thank you for editing the PR title and writing up a description. It is clear now what the problem is and what solution is implemented here. I edited the title and description to fix typos and wrap at 80 characters.

The code changes look good to me, but, please, address any remaining comments from @bikramSingh91 and @Yuhta .

Thank you for the contribution.

@barsondei barsondei requested a review from mbasmanova August 4, 2022 03:31

@bikramSingh91 bikramSingh91 left a comment

Looks good, just one small nit

@facebook-github-bot

@bikramSingh91 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot

This pull request has been reverted by 5e2c7e3.

barsondei and others added 8 commits September 9, 2022 11:20
1) remove inessential comments
2) rename findMultiRefFields to updateAllAndMultiplyReferencedFields
3) materializing fields only for non-null propagating
4) add a test that exercises the changes at ExprSet::eval()
5) rename moreFields to fieldsToAdd
1) use distinctFields_ instead of allFields
2) rename variables, drop temp variables in ExprTest::lazyVectorAccessTwiceWithDifferentRows
1) revert inessential comments
2) combine updateMultiplyReferencedFields and mergeFields functions
3) fix typo
4) use std::unordered_set instead of std::set

@bikramSingh91

rebased and resolved conflicts

@bikramSingh91

Patch re-applied in #2501

marin-ma pushed a commit to marin-ma/velox-oap that referenced this pull request Dec 15, 2023