Projection got incorrect result with LazyVector #2073
Comments
@barsondei Barson, thank you for reporting a problem. Do you want to propose a fix? It's ok if not, we'll take a look anyway. |
@barsondei Just an FYI, there is a simpler way to create vectors for testing:
|
@barsondei This is something I ran into a while ago and worked around by setting final selection to false in the FilterProject operator.
We have 2 expressions, c0 + c1 and if (c1 >= 0, c1, 0), which are evaluated independently, i.e. evaluation of the first expression is not aware of the second expression. The first expression has default null behavior, hence, it is evaluated only for rows where both c0 and c1 are not null. This ends up loading the c1 vector for the first 3 rows only. When the second expression is evaluated, c1 is already loaded, but it is missing the 4th row. What's the use case where you are running into this problem? Can you work around it for now by setting final selection to false? A fix might be to add logic to ExprSet::eval to detect the case when multiple expressions share inputs and load vectors accordingly. In other words, we would need to move the logic of loading vectors from Expr::eval into ExprSet::eval and make it aware of all the expressions in the set.
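The failure mode described above can be reproduced with a small toy model. This is only a sketch under stated assumptions: `ToyLazyColumn` and `missingRowsForSecondExpr` are hypothetical names invented for illustration and are not part of the Velox API, and the data (c0 null at the 4th row) is assumed rather than taken from the original test case.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <set>
#include <utility>
#include <vector>

// Toy stand-in for a lazy vector (hypothetical; not the Velox LazyVector API).
// Values are materialized only for the rows requested by the first load.
struct ToyLazyColumn {
  std::vector<int> source;                 // values the reader would produce
  std::vector<std::optional<int>> loaded;  // values materialized so far
  bool isLoaded = false;

  explicit ToyLazyColumn(std::vector<int> values)
      : source(std::move(values)), loaded(source.size()) {}

  // The first call decides which rows are materialized; later calls are no-ops.
  void load(const std::set<std::size_t>& rows) {
    if (isLoaded) {
      return;
    }
    for (std::size_t row : rows) {
      assert(row < source.size());
      loaded[row] = source[row];
    }
    isLoaded = true;
  }
};

// Counts rows of c1 that are unavailable when the second expression reads it.
// Assumption: c0 is null at row 3, so c0 + c1 asks only for rows 0..2.
std::size_t missingRowsForSecondExpr(bool loadSharedInputForAllRows) {
  ToyLazyColumn c1({10, 20, 30, 40});
  const std::set<std::size_t> nonNullRows = {0, 1, 2};
  const std::set<std::size_t> allRows = {0, 1, 2, 3};

  // First expression: c0 + c1 (default null behavior).
  c1.load(loadSharedInputForAllRows ? allRows : nonNullRows);
  // Second expression: if (c1 >= 0, c1, 0). c1 is already loaded: a no-op.
  c1.load(allRows);

  std::size_t missing = 0;
  for (std::size_t row : allRows) {
    if (!c1.loaded[row].has_value()) {
      ++missing;
    }
  }
  return missing;
}
```

Calling `missingRowsForSecondExpr(false)` models the bug, and `missingRowsForSecondExpr(true)` models the effect of setting final selection to false so that the shared input is loaded for every row.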
|
You ran into this problem with 2 independent expressions: c0 + c1 and if (c1 >= 0, c1, 0). I found the bug recently while studying the expression evaluation code.
Bitmap will be updated when:
I suggest fixing the bug by setting isFinalSelection to false in Expr::evalAll, following logic similar to that in ControlExpr. |
@barsondei Would you explain how to reproduce this bug? |
I don't think I understand this statement. When an expression has default null behavior, i.e. any null in an input produces a null in the result, the expression is evaluated only on non-null values, hence, it is not necessary to load null rows. |
row_constructor(c0 + c1, if (c1 >= 0, c1, 0)) |
@barsondei I'm a bit confused. |
The bitmap has 2 benefits: evaluating the expression only on non-null values is OK, but the lazy-load logic needs to use finalSelection as the RowSet.
How about: element_at(row_constructor(c0 + c1, if (c1 >= 0, c1, 0)), 0) > 0? I constructed the expression just to demonstrate the bug. The expression does not have any business meaning. |
@barsondei I understand the general problem, but I believe this problem exists only if one uses expression evaluation standalone and runs it on lazy vectors. This is not a common use case as lazy vectors are produced by a TableScan operator and expressions on these are evaluated by FilterProject operation. While not the best solution, the workaround in FilterProject seems sufficient. You mentioned that "the bug can also occur with 1 expression in FilterProject::filter" and I'm trying to understand how would this be possible. Any chance you could clarify? |
This doesn't seem to be necessary. If the column is not used outside of the expression, there is no need to load rows that will not participate in expression evaluation. Only if the column is used somewhere else does the loading need to occur for all rows. The thinking is that the caller of the expression evaluation has the context on whether any columns are used outside of the specified expressions. If that's the case, the caller is expected to set finalSelection = false. |
I'm not clear on the logic of LazyVector creation in the TableScan operator.
|
@barsondei Thank you for an example query. I implemented it in a unit test and I'm seeing failures evaluating this query. Will investigate. BTW, let me know if you'd like to help with investigating or fixing.
The error I'm getting is:
|
OK. I'd be glad to fix it. |
@barsondei @kgpai @bikramSingh91 @oerling I looked into this problem some more. We evaluate an expression
This problem is similar to evaluating multiple expressions. Different expressions may need different sets of rows, but individual expressions do not have visibility into their siblings. We have worked around this problem before by adding logic to the FilterProject::project method to set final selection to 'false'. In this case the problem occurs within a single expression that contains multiple expressions under a non-null-propagating expression.
array_constructor is the non-null-propagating expression with two sub-expressions: c0 + c1 and if(c1 >= 0, c1, 0). The first sub-expression is evaluated for a subset of rows where c0 is not null, but the second sub-expression needs to be evaluated for all rows. The two sub-expressions do not know about each other though. If array_constructor were null-propagating, a null in the first sub-expression would make the whole expression null and we wouldn't need to evaluate the second sub-expression on rows where the first sub-expression is null.
It is important that one of the sub-expressions is a conditional (includes an IF statement). If none of the sub-expressions had a conditional, the lazy vectors for both c0 and c1 would be loaded for all rows in Expr::eval():
I see a few options for fixing this problem.
|
@mbasmanova
Would you explain it in more detail? |
In Expr::evalWithNulls, we have 2 code paths: (1) expression propagates nulls, (2) expression doesn't propagate nulls. In (1), we don't need to do anything since nulls in any input produce null results. If one sub-expression has nulls in positions 1, 5, 10, then no other sub-expression needs to be evaluated for these positions. In (2), we have to evaluate all sub-expressions independently for all positions. If one sub-expression has nulls in some positions, the other expression still needs to be evaluated for these positions. In this case, if we have a lazy vector used in multiple sub-expressions, we need to load this lazy vector for all positions. That's what I propose to do. Similarly, in ExprSet::eval, we have a set of independent expressions. Hence, if a lazy vector is used in multiple expressions, we need to load this lazy vector for all positions. Hence, the logic in ExprSet::eval and Expr::evalWithNulls is similar. Does this make sense? |
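The proposal above can be sketched as a small decision helper. This is a minimal illustration, not the actual Velox change: `fieldsToLoadForAllRows` is a hypothetical name, and each sub-expression is represented simply by the set of column names it reads.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper (not Velox code). Returns the columns that should be
// loaded for all positions before evaluating the sub-expressions.
std::set<std::string> fieldsToLoadForAllRows(
    const std::vector<std::set<std::string>>& subExprFields,
    bool parentPropagatesNulls) {
  std::set<std::string> shared;
  if (parentPropagatesNulls) {
    // Path (1): a null in any input makes the result null, so positions
    // skipped by one sub-expression are not needed by the others.
    return shared;
  }
  // Path (2): sub-expressions are evaluated independently, so a column read
  // by more than one of them must be available at every position.
  std::map<std::string, int> useCount;
  for (const auto& fields : subExprFields) {
    for (const auto& name : fields) {
      ++useCount[name];
    }
  }
  for (const auto& [name, count] : useCount) {
    if (count > 1) {
      shared.insert(name);
    }
  }
  return shared;
}
```

For array_constructor(c0 + c1, if(c1 >= 0, c1, 0)) the inputs would be {c0, c1} and {c1} with `parentPropagatesNulls = false`, so only c1 is flagged. The same helper applies to the set of independent expressions in ExprSet::eval, treating the set itself as a non-null-propagating parent.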
@mbasmanova I got it.
I think that with the workaround in FilterProject::project() and Expr::eval(), in the most common case the lazy vector is loaded for all positions passed by FilterProject::filter(). I prefer to adopt option (2) and remove the workaround in both FilterProject::project() and Expr::eval(). |
Just to clarify,
array_constructor doesn't propagate nulls, hence, array_constructor(c0 + c1, if(c1 >= 0, c1, 0)) expression would go via path #2.
This expression would go via path #1, assuming array_constructor_propagate_nulls is a hypothetical function that acts like array_constructor, but propagates nulls.
This would be my preference as well. However, we may need to keep the code in Expr::eval() as it serves another purpose.
|
It would be OK to say that final selection becomes false in contexts where a row has been dropped because of a null in a previous argument of a null-propagating expression.
Loading a row or two extra is not a big deal, most of the time. Avoiding a whole load is the prize we are going for with lazy vectors.
Something like select if(a, b, c), if(a, b, c + d), … should not load c if a were true across the whole batch.
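The point about avoiding a whole load can be illustrated with a toy conditional. This is a sketch under assumptions: `CountingLazyColumn`, `evalIf` and `elseBranchLoads` are hypothetical names, the column loads all rows at once for simplicity, and none of this reflects how Velox implements IF.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical lazy column that counts how often it is materialized.
struct CountingLazyColumn {
  std::vector<int> source;
  int loadCalls = 0;

  explicit CountingLazyColumn(std::vector<int> values)
      : source(std::move(values)) {}

  const std::vector<int>& load() {
    ++loadCalls;
    return source;
  }
};

// Evaluates if(cond, thenCol, elseCol) and loads a branch only when at least
// one row actually selects it.
std::vector<int> evalIf(
    const std::vector<bool>& cond,
    CountingLazyColumn& thenCol,
    CountingLazyColumn& elseCol) {
  assert(cond.size() == thenCol.source.size());
  assert(cond.size() == elseCol.source.size());
  const bool anyTrue =
      std::any_of(cond.begin(), cond.end(), [](bool v) { return v; });
  const bool anyFalse =
      std::any_of(cond.begin(), cond.end(), [](bool v) { return !v; });
  const std::vector<int>* thenValues = anyTrue ? &thenCol.load() : nullptr;
  const std::vector<int>* elseValues = anyFalse ? &elseCol.load() : nullptr;

  std::vector<int> result(cond.size());
  for (std::size_t i = 0; i < cond.size(); ++i) {
    result[i] = cond[i] ? (*thenValues)[i] : (*elseValues)[i];
  }
  return result;
}

// Returns how many times column c was loaded for a 3-row batch of if(a, b, c).
int elseBranchLoads(const std::vector<bool>& a) {
  CountingLazyColumn b({1, 2, 3});
  CountingLazyColumn c({7, 8, 9});
  evalIf(a, b, c);
  return c.loadCalls;
}
```

When a is true across the whole batch, c is never touched. Forcing final selection to false for every shared column would give up exactly this saving, which is why the rule is worth restricting to rows dropped by null propagation.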
|
1) use distinctFields_ instead of allFields 2) rename variables, drop temp variables in ExprTest::lazyVectorAccessTwiceWithDifferentRows
Re-opening as the fix got reverted. CC @bikramSingh91 @tanjialiang |
I added the following test case to ExprTest.cpp and got an incorrect result.