Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TiFlash crash when Aggregation FailPoints in createAggregateStates triggers #5356

Closed
yibin87 opened this issue Jul 12, 2022 · 2 comments · Fixed by #5433
Closed

TiFlash crash when Aggregation FailPoints in createAggregateStates triggers #5356

yibin87 opened this issue Jul 12, 2022 · 2 comments · Fixed by #5433
Assignees
Labels
affects-5.0 affects-5.1 affects-5.2 affects-5.3 affects-5.4 This bug affects the 5.4.x(LTS) versions. affects-6.0 affects-6.1 This bug affects the 6.1.x(LTS) versions. component/compute severity/major type/bug The issue is confirmed as a bug.

Comments

@yibin87
Copy link
Contributor

yibin87 commented Jul 12, 2022

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

  1. Turns on Aggregation RandomFail point
  2. Prepare TPCH database, SF=10, run query_13.sql parrelly, you'll get core dump, and stack trace like this:
    image

2. What did you expect to see? (Required)

TiFlash no crash.

3. What did you see instead (Required)

4. What is your TiFlash version? (Required)

v6.2.0-alpha-nightly-20220710

@gengliqi
Copy link
Contributor

gengliqi commented Jul 20, 2022

I try this failpoint by using master code and find there is no crash.

The bug is very likely caused by the aggregation of non-joined data after an exception happened.

In my previous attempt, I change the code line of 201 to parent.cancel(false);.

void ParallelAggregatingBlockInputStream::Handler::onException(std::exception_ptr & exception, size_t thread_num)
{
parent.exceptions[thread_num] = exception;
Int32 old_value = -1;
parent.first_exception_index.compare_exchange_strong(old_value, static_cast<Int32>(thread_num), std::memory_order_seq_cst, std::memory_order_relaxed);
/// can not cancel parent inputStream or the exception might be lost
if (!parent.executed)
/// use cancel instead of kill to avoid too many useless error message
parent.processor.cancel(false);
}

Then the crash doesn't happen anymore because the aggregation code of non-joined data isn't triggered. (see the code below)
bool Aggregator::executeOnBlock(
const Block & block,
AggregatedDataVariants & result,
const FileProviderPtr & file_provider,
ColumnRawPtrs & key_columns,
AggregateColumns & aggregate_columns,
Int64 & local_delta_memory,
bool & no_more_keys)
{
if (isCancelled())
return true;

However, this line of code doesn't change at present.
Why does the crash not happen? It's due to #5274.

In this PR, if an exception happens, the non-join data will not be fetched anymore so the aggregation code of them will not be triggered as well.

By the way, I think the code line of 201 still needs to be changed though it seems no harm for now.

void ParallelAggregatingBlockInputStream::Handler::onException(std::exception_ptr & exception, size_t thread_num)
{
parent.exceptions[thread_num] = exception;
Int32 old_value = -1;
parent.first_exception_index.compare_exchange_strong(old_value, static_cast<Int32>(thread_num), std::memory_order_seq_cst, std::memory_order_relaxed);
/// can not cancel parent inputStream or the exception might be lost
if (!parent.executed)
/// use cancel instead of kill to avoid too many useless error message
parent.processor.cancel(false);
}

@gengliqi gengliqi added affects-5.4 This bug affects the 5.4.x(LTS) versions. affects-6.0 affects-6.1 This bug affects the 6.1.x(LTS) versions. labels Jul 21, 2022
@gengliqi
Copy link
Contributor

gengliqi commented Jul 21, 2022

The bug is introduced by #3434. So it should be fixed in all versions after 5.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects-5.0 affects-5.1 affects-5.2 affects-5.3 affects-5.4 This bug affects the 5.4.x(LTS) versions. affects-6.0 affects-6.1 This bug affects the 6.1.x(LTS) versions. component/compute severity/major type/bug The issue is confirmed as a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants