Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Projection got incorrect result with LazyVector #2073

Open
barsondei opened this issue Jul 21, 2022 · 22 comments
Open

Projection got incorrect result with LazyVector #2073

barsondei opened this issue Jul 21, 2022 · 22 comments
Labels
bug Something isn't working

Comments

@barsondei
Copy link
Contributor

barsondei commented Jul 21, 2022

image

I add the following test case to ExprTest.cpp and got incorrect result.

TEST_F(ExprTest, lazyLoadBug) {
    const vector_size_t size = 4;

    auto valueAt = [](auto row) { return row; };
    auto isNullAtColC0 = [](auto row) { return row; };
    // [1, 1, 1, null]
    auto inputC0 = makeFlatVector<int64_t>(
            size, [](auto row) { return 1; }, [](auto row) { return row == 3; });
    // [0, 1, 2, 3] if fully loaded
    std::vector<vector_size_t> loadedRows;
    VectorPtr inputC1 = std::make_shared<LazyVector>(
            pool_.get(),
            BIGINT(),
            size,
            std::make_unique<test::SimpleVectorLoader>([&](auto rows) {
                for (auto row : rows) {
                    loadedRows.push_back(row);
                }
                return makeFlatVector<int64_t>(
                        rows.back() + 1, valueAt);
            }));

    // 1) can pass test, manually load LazyVector with allRows
//    SelectivityVector allRows(size);
//    LazyVector::ensureLoadedRows(inputC1, allRows);

    // 2) can pass test, not using lazy vector
//    auto inputC1 = makeFlatVector<int64_t>(
//            size, valueAt, nullptr);

    // isFinalSelection_ == true
    auto result = evaluate(
            "row_constructor(c0 + c1, if (c1 >= 0, c1, 0))", makeRowVector({inputC0, inputC1}));

    // [1, 2, 3, null]
    auto outputCol0 = makeFlatVector<int64_t>(
            size, [](auto row) { return row + 1; }, [](auto row) { return row == 3; });
    // [0, 1, 2, 3]
    auto outputCol1 = makeFlatVector<int64_t>(
            size, [](auto row) { return row; }, nullptr);
    // [(1, 0), (2, 1), (3, 2), (null, 3)]
    auto expected = ExprTest::makeRowVector({outputCol0, outputCol1});

    assertEqualVectors(expected, result);
}
@mbasmanova mbasmanova added the bug Something isn't working label Jul 21, 2022
@mbasmanova
Copy link
Contributor

@barsondei Barson, thank you for reporting a problem. Do you want to propose a fix? It's ok if not, we'll take a look anyway.

CC: @kagamiori @laithsakka @kevinwilfong @kgpai @Yuhta

@mbasmanova
Copy link
Contributor

@barsondei Just an FYI, there is a simpler way to create vectors for testing:

  // [1, 1, 1, null]
  auto inputC0 = makeNullableFlatVector<int64_t>({1, 1, 1, std::nullopt});

@mbasmanova
Copy link
Contributor

@barsondei This is something I ran into awhile ago and worked around by setting final selection to false in FilterProject operator.

void FilterProject::project(const SelectivityVector& rows, EvalCtx* evalCtx) {
  // Make sure LazyVectors are loaded for all the "rows".
  //
  // Consider projection with 2 expressions: f(a) AND g(b), h(b)
  // If b is a LazyVector and f(a) AND g(b) expression is evaluated first, it
  // will load b only for rows where f(a) is true. However, h(b) projection
  // needs all rows for "b".
  //
  // This works, but may load more rows than necessary. E.g. if we only have
  // f(a) AND g(b) expression and b is not used anywhere else, it is sufficient
  // to load b for a subset of rows where f(a) is true.
  *evalCtx->mutableIsFinalSelection() = false;
  *evalCtx->mutableFinalSelection() = &rows;

We have 2 expressions, c0 + c1 and if (c1 >= 0, c1, 0), which are evaluated independently, e.g. evaluation of the first expression is not aware of the second expression. The first expression has default null behavior, hence, it is evaluated only for rows where both c0 and c1 are not null. This ends up loading c1 vector for first 3 rows. When second expression is evaluated, c1 is already loaded, but it is missing 4-th row.

What's the use case where you are running into this problem? Can you workaround for now by setting final selection to false?

A fix might be to add logic to ExprSet::eval to detect the case when multiple expressions share inputs and load vectors accordingly. In other words, we would need to move the logic of loading vectors from Expr::eval into ExprSet::eval and make it aware of all the expressions in the set.

  // Check if there are any IFs, ANDs or ORs. These expressions are special
  // because not all of their sub-expressions get evaluated on all the rows
  // all the time. Therefore, we should delay loading lazy vectors until we
  // know the minimum subset of rows needed to be loaded.
  //
  // If there is only one field, load it unconditionally. The very first IF,
  // AND or OR will have to load it anyway. Pre-loading enables peeling of
  // encodings at a higher level in the expression tree and avoids repeated
  // peeling and wrapping in the sub-nodes.
  //
  // TODO: Re-work the logic of deciding when to load which field.
  if (!hasConditionals_ || distinctFields_.size() == 1) {
    // Load lazy vectors if any.
    for (const auto& field : distinctFields_) {
      context.ensureFieldLoaded(field->index(context), rows);
    }
  }

@barsondei
Copy link
Contributor Author

@barsondei This is something I ran into awhile ago and worked around by setting final selection to false in FilterProject operator.

void FilterProject::project(const SelectivityVector& rows, EvalCtx* evalCtx) {
  // Make sure LazyVectors are loaded for all the "rows".
  //
  // Consider projection with 2 expressions: f(a) AND g(b), h(b)
  // If b is a LazyVector and f(a) AND g(b) expression is evaluated first, it
  // will load b only for rows where f(a) is true. However, h(b) projection
  // needs all rows for "b".
  //
  // This works, but may load more rows than necessary. E.g. if we only have
  // f(a) AND g(b) expression and b is not used anywhere else, it is sufficient
  // to load b for a subset of rows where f(a) is true.
  *evalCtx->mutableIsFinalSelection() = false;
  *evalCtx->mutableFinalSelection() = &rows;

We have 2 expressions, c0 + c1 and if (c1 >= 0, c1, 0), which are evaluated independently, e.g. evaluation of the first expression is not aware of the second expression. The first expression has default null behavior, hence, it is evaluated only for rows where both c0 and c1 are not null. This ends up loading c1 vector for first 3 rows. When second expression is evaluated, c1 is already loaded, but it is missing 4-th row.

What's the use case where you are running into this problem? Can you workaround for now by setting final selection to false?

A fix might be to add logic to ExprSet::eval to detect the case when multiple expressions share inputs and load vectors accordingly. In other words, we would need to move the logic of loading vectors from Expr::eval into ExprSet::eval and make it aware of all the expressions in the set.

  // Check if there are any IFs, ANDs or ORs. These expressions are special
  // because not all of their sub-expressions get evaluated on all the rows
  // all the time. Therefore, we should delay loading lazy vectors until we
  // know the minimum subset of rows needed to be loaded.
  //
  // If there is only one field, load it unconditionally. The very first IF,
  // AND or OR will have to load it anyway. Pre-loading enables peeling of
  // encodings at a higher level in the expression tree and avoids repeated
  // peeling and wrapping in the sub-nodes.
  //
  // TODO: Re-work the logic of deciding when to load which field.
  if (!hasConditionals_ || distinctFields_.size() == 1) {
    // Load lazy vectors if any.
    for (const auto& field : distinctFields_) {
      context.ensureFieldLoaded(field->index(context), rows);
    }
  }

You ran into this problem with 2 independent expressions:c0 + c1, if (c1 >= 0, c1, 0)。
I think this is why you work around this problem in FilterProject::project method, but not in FilterProject::filter method。
In my case, the bug can also occur with 1 expression in FilterProject::filter。

I just found the bug,when I'm learning the code of expression evaluation recently。
I saw a lot of logic dealing with lazy load in expression evaluation, mainly to solve the lazy vector shared by multiple inputs。
Current principle to deal with LazyVector:

  1. if bitmap will not grow after current Expr's evalution, use current bitmap to load lazy vector。
  2. otherwise, use finalSelection_ to load lazy vector, which is superset of Expr's bitmap。
    This principle is not the optimal solution, but it works。

Bitmap will be updated when:

  1. after FilterProject::filter method invoked
  2. contional expr:AND/OR/SWITCH...
  3. nulls skip logic in Expr::evalAll。
    While the nulls skip logic not follow the lazy load's principle,invoke child Expr will smaller bitmap without setting isFinalSelection to false。

I suggest to fix the bug by setting the isFinalSelection to false in Expr::evalAll, follow the similar logic in ControlExpr。

@mbasmanova
Copy link
Contributor

In my case, the bug can also occur with 1 expression in FilterProject::filter。

@barsondei Would you explain how to reproduce this bug?

@mbasmanova
Copy link
Contributor

While the nulls skip logic not follow the lazy load's principle,invoke child Expr will smaller bitmap without setting isFinalSelection to false。

I don't think I understand this statement. When expression has default null behavior, e.g. any null in an input produces a null in the result, the expression is evaluated only on non-null values, hence, it is not necessary to load null rows.

@barsondei
Copy link
Contributor Author

In my case, the bug can also occur with 1 expression in FilterProject::filter。

@barsondei Would you explain how to reproduce this bug?

row_constructor(c0 + c1, if (c1 >= 0, c1, 0))

@mbasmanova
Copy link
Contributor

@barsondei I'm a bit confused. row_constructor(c0 + c1, if (c1 >= 0, c1, 0)) is not a filter expression. Any chance you can share a full repro?

@barsondei
Copy link
Contributor Author

While the nulls skip logic not follow the lazy load's principle,invoke child Expr will smaller bitmap without setting isFinalSelection to false。

I don't think I understand this statement. When expression has default null behavior, e.g. any null in an input produces a null in the result, the expression is evaluated only on non-null values, hence, it is not necessary to load null rows.

bitmap has 2 benifits:
1)reduce expression evaluation with null behavior
2) load minimum row set in lazy load logic

the expression is evaluated only on non-null values is OK。but need to use finalSelection as RowSet in lazy load logic

@barsondei
Copy link
Contributor Author

@barsondei I'm a bit confused. row_constructor(c0 + c1, if (c1 >= 0, c1, 0)) is not a filter expression. Any chance you can share a full repro?

how about:element_at( row_constructor(c0 + c1, if (c1 >= 0, c1, 0)), 0) > 0 ?

I construct the expression just to prove the bug。The expression not has any business meaning。

@mbasmanova
Copy link
Contributor

@barsondei I understand the general problem, but I believe this problem exists only if one uses expression evaluation standalone and runs it on lazy vectors. This is not a common use case as lazy vectors are produced by a TableScan operator and expressions on these are evaluated by FilterProject operation. While not the best solution, the workaround in FilterProject seems sufficient.

You mentioned that "the bug can also occur with 1 expression in FilterProject::filter" and I'm trying to understand how would this be possible. Any chance you could clarify?

@mbasmanova
Copy link
Contributor

but need to use finalSelection as RowSet in lazy load logic

This doesn't seem to be necessary. If the column is not used outside of the expression, there is no need to load rows that will be participate in expression evaluation. Only if the column is used somewhere else, then the loading needs to occur for all rows. The thinking is that the caller of the expression evaluation has the context on whether any columns are used outside of the specified expressions. If that's the case, the caller is expected to set finalSelection = false.

@barsondei
Copy link
Contributor Author

element_at( row_constructor(c0 + c1, if (c1 >= 0, c1, 0)), 0) > 0

I’m not clear with the logic of LazyVector's creation in TableScan operator。
If it only load rows specified with bitmap,the bug will occur when run SQL in end to end environment:
SQL:

select * from 
  table
  where
  element_at( row_constructor(c0 + c1, if (c1 >= 0, c1, 0)), 0) > 0 

@mbasmanova
Copy link
Contributor

mbasmanova commented Jul 21, 2022

@barsondei Thank you for an example query. I implemented it in a unit test and I'm seeing failures evaluating this query. Will investigate. BTW, let me know if you'd like to help with investigating or fixing.


TEST_F(TableScanTest, bug) {
  auto data = makeRowVector({
      makeNullableFlatVector<int64_t>({1, 1, 1, std::nullopt}),
      makeNullableFlatVector<int64_t>({0, 1, 2, 3}),
  });

  auto filePath = TempFilePath::create();
  writeToFile(filePath->path, {data});
  createDuckDbTable({data});

  auto plan =
      PlanBuilder()
          .tableScan(asRowType(data->type()))
          .filter(
              "element_at(array_constructor(c0 + c1, if(c1 >= 0, c1, 0)), 0) > 0")
          .planNode();
  assertQuery(plan, {filePath}, "SELECT null, null");
}

The error I'm getting is:

C++ exception with description "Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: (4 vs. 3)
Retriable: False
Expression: rows.end() <= vector.size()
Context: switch(gte(c1, 0:BIGINT), c1, 0:BIGINT)
Top-Level Context: gt(element_at(array_constructor(plus(c0, c1), switch(gte(c1, 0:BIGINT), c1, 0:BIGINT)), 0:BIGINT), 0:BIGINT)
Function: makeIndices
File: /Users/mbasmanova/cpp/velox-1/velox/vector/DecodedVector.cpp
Line: 98

@barsondei
Copy link
Contributor Author

@barsondei Thank you for an example query. I implemented it in a unit test and I'm seeing failures evaluating this query. Will investigate. BTW, let me know if you'd like to help with investigating or fixing.


TEST_F(TableScanTest, bug) {
  auto data = makeRowVector({
      makeNullableFlatVector<int64_t>({1, 1, 1, std::nullopt}),
      makeNullableFlatVector<int64_t>({0, 1, 2, 3}),
  });

  auto filePath = TempFilePath::create();
  writeToFile(filePath->path, {data});
  createDuckDbTable({data});

  auto plan =
      PlanBuilder()
          .tableScan(asRowType(data->type()))
          .filter(
              "element_at(array_constructor(c0 + c1, if(c1 >= 0, c1, 0)), 0) > 0")
          .planNode();
  assertQuery(plan, {filePath}, "SELECT null, null");
}

The error I'm getting is:

C++ exception with description "Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: (4 vs. 3)
Retriable: False
Expression: rows.end() <= vector.size()
Context: switch(gte(c1, 0:BIGINT), c1, 0:BIGINT)
Top-Level Context: gt(element_at(array_constructor(plus(c0, c1), switch(gte(c1, 0:BIGINT), c1, 0:BIGINT)), 0:BIGINT), 0:BIGINT)
Function: makeIndices
File: /Users/mbasmanova/cpp/velox-1/velox/vector/DecodedVector.cpp
Line: 98

OK。I‘m pleasure to fix it。

@mbasmanova
Copy link
Contributor

@barsondei @kgpai @bikramSingh91 @oerling

I looked into this problem some more.

We evaluate an expression array_constructor(c0 + c1, if(c1 >= 0, c1, 0)) over lazy vectors. First, we evaluate c0 + c1. It happens so that c0 has a null in the last position, hence, we end up loading c1 for all positions but last. Next, we evaluate if(c1 >= 0, c1, 0) expression over already loaded vector and end up producing incorrect result for the last position.

This problem is similar to evaluating multiple expressions. Different expressions may need different sets of rows, but individual expressions do not have visibility into their siblings. We have worked around this problem before by adding logic to FilterProject::project method to set final selection to 'false'.

In this case the problem occurs within a single expression that contains multiple expressions under a non-null-propagating expression. array_constuctor is the non-null-propagating expression with two sub-expressions: c0 + c1 and if(c1 >= 0, c1, 0). First sub-expression is evaluated for a subset of rows when c0 is not null, but second sub-expression needs to be evaluated for all rows. The two sub-expressions do not know about each other though. If array_constuctor was null-propagating, a null in first sub-expression would make the whole expression null and we won't need to evaluate the second sub-expression on rows where first sub-expression is null.

It is important that one of the sub-expressions is a conditional (includes an IF statement). If none of the sub-expressions had a conditional, the lazy vectors for both c0 and c1 will be loaded for all rows in the Expr::eval():

  if (!hasConditionals_ || distinctFields_.size() == 1) {
    // Load lazy vectors if any.
    for (const auto& field : distinctFields_) {
      context.ensureFieldLoaded(field->index(context), rows);
    }
  }

I see a few options for fixing this problem.

  1. Extend the existing workaround in FilterProject::project to FilterProject::filter. This could work for task-based queries, but won't help for cases when expression evaluation is used standalone.
  2. Remove the existing workaround from FilterProject::project. Modify ExprSet::eval() to identify columns used in more than one expression and proactively load these for all rows. Modify Expr::evalWithNulls to add similar logic for the case when expression doesn’t propagate nulls. This logic would ensure that lazy vectors that are used as inputs to multiple expressions are loaded for all necessary rows.
  3. Same as (2), but instead of identifying columns used in multiple sub-expressions simply set final selection to false. This would work, but would result in unnecessary loading in some cases.

@barsondei
Copy link
Contributor Author

@mbasmanova
I don't quite understand the description:

Modify Expr::evalWithNulls to add similar logic for the case when expression doesn’t propagate nulls.

Would you explain it in more detail?

@mbasmanova
Copy link
Contributor

@barsondei

Modify Expr::evalWithNulls to add similar logic for the case when expression doesn’t propagate nulls.

In Expr::evalWithNulls, we have 2 code paths: (1) expression propagates nulls, (2) expression doesn't propagate nulls.

In (1), we don't need to do anything since nulls in any input produce null results. If one sub-expression has nulls in positions 1, 5, 10, then no other sub-expression needs to be evaluated for these positions.

In (2), we have to evaluate all sub-expressions independently for all positions. If one sub-expression has nulls in some positions, the other expression still needs to be evaluated for these positions. In this case, if we have a lazy vector used in multiple sub-expressions, we need to load this lazy vector for all positions. That's what I propose to do. Similarly, in ExprSet::eval, we have a set of independent expressions. Hence, if a lazy vector is used in multiple expressions, we need to load this lazy vector for all positions. Hence, the logic in ExprSet::eval and Expr::evalWithNulls is similar.

Does this make sense?

@barsondei
Copy link
Contributor Author

barsondei commented Jul 26, 2022

@mbasmanova I got it.
The major difference between the option 2)、3) you proposed above and my PR#2089 is that it's no need to set finalSelection to false when evaluating with null-propagating expression. It make sense due to most function propagates nulls.
With the implement of option 2, expression1 array_constructor(c0 + c1, if(c1 >= 0, c1, 0)) will go through code path(1), expression2 array_constructor_propagate_nulls(c0 + c1, c1) will go through code path(2). But, c1 will still be loaded for all positions when visit Expr array_constructor_propagate_nulls before sub-expression is evaluated. For there's no conditionals in expression2.

  if (!hasConditionals_ || distinctFields_.size() == 1) {
    // Load lazy vectors if any.
    for (const auto& field : distinctFields_) {
      context.ensureFieldLoaded(field->index(context), rows);
    }
  }

I think that with the workaround in FilterProject::project() and Expr::eval(), most common case load lazy vector for all positions passed by FilterProject::filter().

I prefer to adopt option(2) and remove the workaround both in FilterProject::project() and Expr::eval().

@mbasmanova
Copy link
Contributor

Just to clarify,

With the implement of option 2, expression1 array_constructor(c0 + c1, if(c1 >= 0, c1, 0)) will go through code path(1),

array_constructor doesn't propagate nulls, hence, array_constructor(c0 + c1, if(c1 >= 0, c1, 0)) expression would go via path #2.

expression2 array_constructor_propagate_nulls(c0 + c1, c1) will go through code path(2)

This expression would go via path #1, assuming array_constructor_propagate_nulls is a hypothetical function that acts like array_constructor, but propagates nulls.

I prefer to adopt option(2) and remove the workaround both in FilterProject::project() and Expr::eval().

This would be my preference as well. However, we may need to keep the code in Expr::eval() as it serves another purpose.

  // ...Pre-loading enables peeling of
  // encodings at a higher level in the expression tree and avoids repeated
  // peeling and wrapping in the sub-nodes.

@oerling
Copy link
Contributor

oerling commented Jul 29, 2022 via email

barsondei added a commit to barsondei/velox that referenced this issue Aug 2, 2022
1) use distinctFields_ instead of allFields
2) rename variables, drop temp variables in ExprTest::lazyVectorAccessTwiceWithDifferentRows
@mbasmanova
Copy link
Contributor

Re-opening as the fix got reverted. CC @bikramSingh91 @tanjialiang

@mbasmanova mbasmanova reopened this Aug 22, 2022
bikramSingh91 pushed a commit to barsondei/velox that referenced this issue Sep 9, 2022
1) use distinctFields_ instead of allFields
2) rename variables, drop temp variables in ExprTest::lazyVectorAccessTwiceWithDifferentRows
marin-ma pushed a commit to marin-ma/velox-oap that referenced this issue Dec 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants