bug: segfault in k_way_merge_sort_partition #16923
Comments
Hi, could you please trigger the issue with the following settings?
I need to determine which setting is causing the problem.
ok just ran with those settings and got the segfault!
@maxjustus Hi, can you trigger the issue with the latest code after #16934 is merged? If you can manually build the
It will give us more debug logs for the out-of-bounds check.
Will do!
Version
Version: v1.2.660-nightly-55cef11019(rust-1.81.0-nightly-2024-11-17T22:08:20.457627205Z)
What's Wrong?
Running create table as select or merge into queries with large result sets randomly causes segfaults in k_way_merge_sort_partition. Databend then crashes and becomes unresponsive, and when I restart it all data is erased. I am seemingly able to work around it with:
I'm not sure which setting resolves it, because the pipeline that triggers the segfault takes ~30 minutes to set up the database before the failing query can run, and, as mentioned above, the segfault erases all data when it occurs.
How to Reproduce?
A general scenario that seems to cause this is enabling disk spilling, populating a wide table with ~50m rows, then doing a
create table x as select * from large_table
with enough joins to cause disk spilling. I triggered the segfault running locally in a non-clustered configuration on an M1 Max MacBook Pro with 64 GB of RAM. I also managed to crash a 3-node Databend cluster by running the same query against the same dataset that caused the segfault locally, so I suspect the crash is due to the same issue.
I'm sorry I don't have more specific reproduction steps. It was very difficult to reproduce: most of the time Databend would just crash with no stack trace, but I managed to catch one twice. Happy to answer any additional questions or send more specific repro queries privately if it's helpful.