bug: segfault in k_way_merge_sort_partition #16923

Open
1 of 2 tasks
maxjustus opened this issue Nov 24, 2024 · 4 comments
@maxjustus
Contributor

maxjustus commented Nov 24, 2024

Search before asking

  • I had searched in the issues and found no similar issues.

Version

Version: v1.2.660-nightly-55cef11019(rust-1.81.0-nightly-2024-11-17T22:08:20.457627205Z)

What's Wrong?

Running create table as select or merge into queries with large result sets intermittently causes a segfault in k_way_merge_sort_partition. Databend crashes and becomes unresponsive, and when I restart it all data is erased. I seem to be able to work around it with:

set global enable_loser_tree_merge_sort=0;
set global enable_parallel_multi_merge_sort=0;

I'm not sure which setting resolves it, because the pipeline that triggers the segfault needs ~30 minutes of database setup before the failing query can run, and, as mentioned above, the segfault erases all data when it occurs.

############################### Crash fault info ###############################
PID: 35
Version: v1.2.660-nightly-55cef11019(rust-1.81.0-nightly-2024-11-17T22:08:20.457627205Z)
Timestamp(UTC): 2024-11-19 19:59:07.596544683 UTC
Timestamp(Local): 2024-11-19 19:59:07.596563975 +00:00
QueryId: "76bbd6bb-c6df-481f-a168-09caad581d70"
Signal 11 (SIGSEGV), si_code 1 (Unknown), Address 0x33665f7c228877

Backtrace:
    0: backtrace::backtrace::libunwind::trace[inlined]
             at /opt/rust/cargo/git/checkouts/backtrace-rs-fb1f822361417489-shallow/72265be/src/backtrace/libunwind.rs:116:5
   1: backtrace::backtrace::trace_unsynchronized[inlined]
             at /opt/rust/cargo/git/checkouts/backtrace-rs-fb1f822361417489-shallow/72265be/src/backtrace/mod.rs:66:5
   2: databend_common_tracing::crash_hook::CrashHandler::recv_signal[inlined]
             at /workspace/src/common/tracing/src/crash_hook.rs:101:13
   3: databend_common_tracing::crash_hook::signal_handler@7c4a824
             at /workspace/src/common/tracing/src/crash_hook.rs:272:9
   4: <unknown>
   5: <unknown>@92b8c
   6: <u8 as core::slice::cmp::SliceOrd>::compare[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/core/src/slice/cmp.rs:199:34
   7: <A as core::slice::cmp::SlicePartialOrd>::partial_compare[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/core/src/slice/cmp.rs:138:14
   8: core::slice::cmp::<impl core::cmp::PartialOrd for [T]>::partial_cmp[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/core/src/slice/cmp.rs:39:9
   9: core::cmp::PartialOrd::ge[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/core/src/cmp.rs:1233:18
  10: core::cmp::impls::<impl core::cmp::PartialOrd<&B> for &A>::ge[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/core/src/cmp.rs:1691:13
  11: databend_common_pipeline_transforms::processors::transforms::sort::list_domain::Candidate<T>::find_target::{{closure}}[inlined]
             at /workspace/src/query/pipeline/transforms/src/processors/transforms/sort/list_domain.rs:300:20
  12: core::ops::function::impls::<impl core::ops::function::FnMut<A> for &mut F>::call_mut@6829a34
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/core/src/ops/function.rs:294:13
  13: core::iter::traits::iterator::Iterator::find_map::check::{{closure}}[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/core/src/iter/traits/iterator.rs:2907:32
  14: <alloc::vec::into_iter::IntoIter<T,A> as core::iter::traits::iterator::Iterator>::try_fold[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/alloc/src/vec/into_iter.rs:340:25
  15: core::iter::traits::iterator::Iterator::find_map[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/core/src/iter/traits/iterator.rs:2913:9
  16: <core::iter::adapters::filter_map::FilterMap<I,F> as core::iter::traits::iterator::Iterator>::next[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/core/src/iter/adapters/filter_map.rs:64:9
  17: databend_common_pipeline_transforms::processors::transforms::sort::list_domain::Candidate<T>::find_target@6839bf8
             at /workspace/src/query/pipeline/transforms/src/processors/transforms/sort/list_domain.rs:298:22
  18: databend_common_pipeline_transforms::processors::transforms::sort::list_domain::Candidate<T>::calc_partition@6832604
             at /workspace/src/query/pipeline/transforms/src/processors/transforms/sort/list_domain.rs:213:43
  19: databend_common_pipeline_transforms::processors::transforms::sort::k_way_merge_sort_partition::KWaySortPartitioner<R,S>::calc_partition_point@68ff028
             at /workspace/src/query/pipeline/transforms/src/processors/transforms/sort/k_way_merge_sort_partition.rs:163:9
  20: databend_common_pipeline_transforms::processors::transforms::sort::k_way_merge_sort_partition::KWaySortPartitioner<R,S>::build_task@68ffa3c
             at /workspace/src/query/pipeline/transforms/src/processors/transforms/sort/k_way_merge_sort_partition.rs:167:25
  21: databend_common_pipeline_transforms::processors::transforms::sort::k_way_merge_sort_partition::KWaySortPartitioner<R,S>::next_task@68fe9dc
             at /workspace/src/query/pipeline/transforms/src/processors/transforms/sort/k_way_merge_sort_partition.rs:149:12
  22: <databend_common_pipeline_transforms::processors::transforms::transform_k_way_merge_sort::KWayMergePartitionerProcessor<R> as databend_common_pipeline_core::processors::processor::Processor>::process@67f9b08
             at /workspace/src/query/pipeline/transforms/src/processors/transforms/transform_k_way_merge_sort.rs:345:20
  23: databend_common_pipeline_core::processors::processor::ProcessorPtr::process@67534b8
             at /workspace/src/query/pipeline/core/src/processors/processor.rs:169:9
  24: databend_query::pipelines::executor::executor_worker_context::ExecutorWorkerContext::execute_sync_task[inlined]
             at /workspace/src/query/service/src/pipelines/executor/executor_worker_context.rs:169:9
  25: databend_query::pipelines::executor::executor_worker_context::ExecutorWorkerContext::execute_task@8a9a4b4
             at /workspace/src/query/service/src/pipelines/executor/executor_worker_context.rs:132:52
  26: databend_query::pipelines::executor::query_pipeline_executor::QueryPipelineExecutor::execute_single_thread@8a960d4
             at /workspace/src/query/service/src/pipelines/executor/query_pipeline_executor.rs:406:35
  27: databend_query::pipelines::executor::query_pipeline_executor::QueryPipelineExecutor::execute_threads::{{closure}}::{{closure}}[inlined]
             at /workspace/src/query/service/src/pipelines/executor/query_pipeline_executor.rs:378:50
  28: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/core/src/panic/unwind_safe.rs:272:9
  29: std::panicking::try::do_call[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/std/src/panicking.rs:553:40
  30: std::panicking::try[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/std/src/panicking.rs:517:19
  31: std::panic::catch_unwind@8bf2b14
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/std/src/panic.rs:350:14
  32: databend_common_base::runtime::catch_unwind::catch_unwind@8714a6c
             at /workspace/src/common/base/src/runtime/catch_unwind.rs:47:11
  33: databend_query::pipelines::executor::query_pipeline_executor::QueryPipelineExecutor::execute_threads::{{closure}}[inlined]
             at /workspace/src/query/service/src/pipelines/executor/query_pipeline_executor.rs:378:34
  34: databend_common_base::runtime::runtime_tracker::ThreadTracker::tracking_function::{{closure}}::{{closure}}[inlined]
             at /workspace/src/common/base/src/runtime/runtime_tracker.rs:208:17
  35: databend_common_base::runtime::thread::Thread::named_spawn::{{closure}}[inlined]
             at /workspace/src/common/base/src/runtime/thread.rs:78:21
  36: std::sys::backtrace::__rust_begin_short_backtrace@80d2be0
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/std/src/sys/backtrace.rs:155:18
  37: std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/std/src/thread/mod.rs:542:17
  38: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/core/src/panic/unwind_safe.rs:272:9
  39: std::panicking::try::do_call[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/std/src/panicking.rs:553:40
  40: std::panicking::try[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/std/src/panicking.rs:517:19
  41: std::panic::catch_unwind[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/std/src/panic.rs:350:14
  42: std::thread::Builder::spawn_unchecked_::{{closure}}[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/std/src/thread/mod.rs:541:30
  43: core::ops::function::FnOnce::call_once{{vtable.shim}}@80d5bc8
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/core/src/ops/function.rs:250:5
  44: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/alloc/src/boxed.rs:2064:9
  45: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/alloc/src/boxed.rs:2064:9
  46: std::sys::pal::unix::thread::Thread::new::thread_start@a358cc4
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/std/src/sys/pal/unix/thread.rs:108:17
  47: <unknown>@7ee90
  48: <unknown>@e7b1c
  49: <unknown>

How to Reproduce?

A general scenario that seems to cause this: enable disk spilling, populate a wide table with ~50m rows, then run create table x as select * from large_table with enough joins to cause disk spilling. I triggered the segfault running locally in a non-clustered configuration on an M1 Max MacBook Pro with 64 GB of RAM. I also managed to crash a 3-node Databend cluster with the same query and dataset that caused the segfault locally, so I suspect that crash has the same cause.
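As an illustration only, the query shape described above might look like the sketch below. The table and column names (large_table, dim_a, dim_b, a_id, b_id, id) are hypothetical placeholders, not the actual schema, and the spill configuration is left out because the exact settings used are not stated here.

```sql
-- Hypothetical sketch of the query shape; every table and column name is a placeholder.
-- Assumes disk spilling is already enabled and large_table is a wide table with ~50m rows.
create table x as
select t.*
from large_table t
join dim_a a on t.a_id = a.id   -- in practice, enough joins to force spilling to disk
join dim_b b on t.b_id = b.id;
```

The real queries had more joins than this; the sketch only shows the create table as select over a large joined result.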

I'm sorry I don't have more specific reproduction steps. It was very difficult to reproduce: most of the time Databend would just crash with no stack trace, but I managed to capture one twice. Happy to answer any additional questions or send more specific repro queries privately if that's helpful.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
@maxjustus maxjustus added the C-bug Category: something isn't working label Nov 24, 2024
@sundy-li
Member

sundy-li commented Nov 24, 2024

Hi, could you please trigger the issue with the following settings?

set global enable_loser_tree_merge_sort=0;
set global enable_parallel_multi_merge_sort=1;

I need to determine which setting is causing the problem.

@maxjustus
Contributor Author

OK, I just ran with those settings and got the segfault!

############################### Crash fault info ###############################
PID: 34
Version: v1.2.660-nightly-55cef11019(rust-1.81.0-nightly-2024-11-17T22:08:20.457627205Z)
Timestamp(UTC): 2024-11-24 16:23:16.574301956 UTC
Timestamp(Local): 2024-11-24 16:23:16.574321331 +00:00
QueryId: "9da37a16-827a-45dd-8820-6fe18ed5be68"
Signal 11 (SIGSEGV), si_code 1 (Unknown), Address 0x64616f6d170069

Backtrace:
    0: backtrace::backtrace::libunwind::trace[inlined]
             at /opt/rust/cargo/git/checkouts/backtrace-rs-fb1f822361417489-shallow/72265be/src/backtrace/libunwind.rs:116:5
   1: backtrace::backtrace::trace_unsynchronized[inlined]
             at /opt/rust/cargo/git/checkouts/backtrace-rs-fb1f822361417489-shallow/72265be/src/backtrace/mod.rs:66:5
   2: databend_common_tracing::crash_hook::CrashHandler::recv_signal[inlined]
             at /workspace/src/common/tracing/src/crash_hook.rs:101:13
   3: databend_common_tracing::crash_hook::signal_handler@7c4a824
             at /workspace/src/common/tracing/src/crash_hook.rs:272:9
   4: <unknown>
   5: <unknown>@92b8c
   6: <u8 as core::slice::cmp::SliceOrd>::compare[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/core/src/slice/cmp.rs:199:34
   7: <A as core::slice::cmp::SlicePartialOrd>::partial_compare[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/core/src/slice/cmp.rs:138:14
   8: core::slice::cmp::<impl core::cmp::PartialOrd for [T]>::partial_cmp[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/core/src/slice/cmp.rs:39:9
   9: core::cmp::PartialOrd::ge[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/core/src/cmp.rs:1233:18
  10: core::cmp::impls::<impl core::cmp::PartialOrd<&B> for &A>::ge[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/core/src/cmp.rs:1691:13
  11: databend_common_pipeline_transforms::processors::transforms::sort::list_domain::Candidate<T>::find_target::{{closure}}[inlined]
             at /workspace/src/query/pipeline/transforms/src/processors/transforms/sort/list_domain.rs:300:20
  12: core::ops::function::impls::<impl core::ops::function::FnMut<A> for &mut F>::call_mut@6829a34
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/core/src/ops/function.rs:294:13
  13: core::iter::traits::iterator::Iterator::find_map::check::{{closure}}[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/core/src/iter/traits/iterator.rs:2907:32
  14: <alloc::vec::into_iter::IntoIter<T,A> as core::iter::traits::iterator::Iterator>::try_fold[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/alloc/src/vec/into_iter.rs:340:25
  15: core::iter::traits::iterator::Iterator::find_map[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/core/src/iter/traits/iterator.rs:2913:9
  16: <core::iter::adapters::filter_map::FilterMap<I,F> as core::iter::traits::iterator::Iterator>::next[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/core/src/iter/adapters/filter_map.rs:64:9
  17: databend_common_pipeline_transforms::processors::transforms::sort::list_domain::Candidate<T>::find_target@6839bf8
             at /workspace/src/query/pipeline/transforms/src/processors/transforms/sort/list_domain.rs:298:22
  18: databend_common_pipeline_transforms::processors::transforms::sort::list_domain::Candidate<T>::calc_partition@6832604
             at /workspace/src/query/pipeline/transforms/src/processors/transforms/sort/list_domain.rs:213:43
  19: databend_common_pipeline_transforms::processors::transforms::sort::k_way_merge_sort_partition::KWaySortPartitioner<R,S>::calc_partition_point@68ff028
             at /workspace/src/query/pipeline/transforms/src/processors/transforms/sort/k_way_merge_sort_partition.rs:163:9
  20: databend_common_pipeline_transforms::processors::transforms::sort::k_way_merge_sort_partition::KWaySortPartitioner<R,S>::build_task@68ffa3c
             at /workspace/src/query/pipeline/transforms/src/processors/transforms/sort/k_way_merge_sort_partition.rs:167:25
  21: databend_common_pipeline_transforms::processors::transforms::sort::k_way_merge_sort_partition::KWaySortPartitioner<R,S>::next_task@68fe9dc
             at /workspace/src/query/pipeline/transforms/src/processors/transforms/sort/k_way_merge_sort_partition.rs:149:12
  22: <databend_common_pipeline_transforms::processors::transforms::transform_k_way_merge_sort::KWayMergePartitionerProcessor<R> as databend_common_pipeline_core::processors::processor::Processor>::process@67f9b08
             at /workspace/src/query/pipeline/transforms/src/processors/transforms/transform_k_way_merge_sort.rs:345:20
  23: databend_common_pipeline_core::processors::processor::ProcessorPtr::process@67534b8
             at /workspace/src/query/pipeline/core/src/processors/processor.rs:169:9
  24: databend_query::pipelines::executor::executor_worker_context::ExecutorWorkerContext::execute_sync_task[inlined]
             at /workspace/src/query/service/src/pipelines/executor/executor_worker_context.rs:169:9
  25: databend_query::pipelines::executor::executor_worker_context::ExecutorWorkerContext::execute_task@8a9a4b4
             at /workspace/src/query/service/src/pipelines/executor/executor_worker_context.rs:132:52
  26: databend_query::pipelines::executor::query_pipeline_executor::QueryPipelineExecutor::execute_single_thread@8a960d4
             at /workspace/src/query/service/src/pipelines/executor/query_pipeline_executor.rs:406:35
  27: databend_query::pipelines::executor::query_pipeline_executor::QueryPipelineExecutor::execute_threads::{{closure}}::{{closure}}[inlined]
             at /workspace/src/query/service/src/pipelines/executor/query_pipeline_executor.rs:378:50
  28: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/core/src/panic/unwind_safe.rs:272:9
  29: std::panicking::try::do_call[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/std/src/panicking.rs:553:40
  30: std::panicking::try[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/std/src/panicking.rs:517:19
  31: std::panic::catch_unwind@8bf2b14
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/std/src/panic.rs:350:14
  32: databend_common_base::runtime::catch_unwind::catch_unwind@8714a6c
             at /workspace/src/common/base/src/runtime/catch_unwind.rs:47:11
  33: databend_query::pipelines::executor::query_pipeline_executor::QueryPipelineExecutor::execute_threads::{{closure}}[inlined]
             at /workspace/src/query/service/src/pipelines/executor/query_pipeline_executor.rs:378:34
  34: databend_common_base::runtime::runtime_tracker::ThreadTracker::tracking_function::{{closure}}::{{closure}}[inlined]
             at /workspace/src/common/base/src/runtime/runtime_tracker.rs:208:17
  35: databend_common_base::runtime::thread::Thread::named_spawn::{{closure}}[inlined]
             at /workspace/src/common/base/src/runtime/thread.rs:78:21
  36: std::sys::backtrace::__rust_begin_short_backtrace@80d2be0
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/std/src/sys/backtrace.rs:155:18
  37: std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/std/src/thread/mod.rs:542:17
  38: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/core/src/panic/unwind_safe.rs:272:9
  39: std::panicking::try::do_call[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/std/src/panicking.rs:553:40
  40: std::panicking::try[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/std/src/panicking.rs:517:19
  41: std::panic::catch_unwind[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/std/src/panic.rs:350:14
  42: std::thread::Builder::spawn_unchecked_::{{closure}}[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/std/src/thread/mod.rs:541:30
  43: core::ops::function::FnOnce::call_once{{vtable.shim}}@80d5bc8
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/core/src/ops/function.rs:250:5
  44: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/alloc/src/boxed.rs:2064:9
  45: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once[inlined]
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/alloc/src/boxed.rs:2064:9
  46: std::sys::pal::unix::thread::Thread::new::thread_start@a358cc4
             at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/std/src/sys/pal/unix/thread.rs:108:17
  47: <unknown>@7ee90
  48: <unknown>@e7b1c
  49: <unknown>

@sundy-li
Member

@maxjustus Hi, can you trigger the issue with the latest code after #16934 is merged?

If you can build databend-query manually, you can use the ci profile to test the issue.

cargo build --profile ci --bin databend-query

It will give us more debug logs for the out-of-bounds checks.

@maxjustus
Contributor Author

Will do!
