Coredump when reading type Array(String) by OptimizedParquetInputFormat #166

Closed
taiyang-li opened this issue Oct 20, 2022 · 1 comment

taiyang-li commented Oct 20, 2022

To reproduce, use this PR: #165

Note: if the previous ParquetBlockInputFormat is used instead of OptimizedParquetBlockInputFormat, the coredump is not reproduced.

Steps to reproduce:

> ./build_gcc/utils/local-engine/tests/unit_tests_local_engine  --gtest_filter="ParquetRead.ReadData"
./contrib/arrow/cpp/src/arrow/array/array_nested.cc:193:  Check failed: (self->list_type_->value_type()->id()) == (data->child_data[0]->type->id()) 
[1]    22000 abort (core dumped)  ./build_gcc/utils/local-engine/tests/unit_tests_local_engine 
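
The failed check in the output above compares the type id of the list type's declared value type with the type id of the first child array. A minimal sketch of that invariant follows, assuming simplified stand-in types: `TypeId`, `DataType`, `ArrayData`, `listChildTypeMatches` and `makeList` are hypothetical names for illustration, not Arrow's API. The STRING/BINARY pairing is only an example of a mismatch, not a diagnosis of this bug.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical, simplified stand-ins for Arrow's type and array-data objects.
enum class TypeId { INT32, STRING, BINARY, LIST };

struct DataType
{
    TypeId id;
    std::shared_ptr<DataType> value_type; // only set when id == LIST
};

struct ArrayData
{
    std::shared_ptr<DataType> type;
    std::vector<std::shared_ptr<ArrayData>> child_data;
};

// True when the list's declared value type id equals the type id of child 0.
// This mirrors the condition printed in the "Check failed" message.
bool listChildTypeMatches(const ArrayData & list)
{
    if (!list.type || list.type->id != TypeId::LIST || !list.type->value_type)
        return false;
    if (list.child_data.empty() || !list.child_data[0] || !list.child_data[0]->type)
        return false;
    return list.type->value_type->id == list.child_data[0]->type->id;
}

// Builds list data whose declared value type may differ from the actual child type.
std::shared_ptr<ArrayData> makeList(TypeId declared_value, TypeId actual_child)
{
    auto value_type = std::make_shared<DataType>(DataType{declared_value, nullptr});
    auto list_type = std::make_shared<DataType>(DataType{TypeId::LIST, value_type});
    auto child_type = std::make_shared<DataType>(DataType{actual_child, nullptr});
    auto child = std::make_shared<ArrayData>(ArrayData{child_type, {}});
    return std::make_shared<ArrayData>(ArrayData{list_type, {child}});
}

// Small self-check showing the two outcomes.
void demo()
{
    assert(listChildTypeMatches(*makeList(TypeId::STRING, TypeId::STRING)));
    assert(!listChildTypeMatches(*makeList(TypeId::STRING, TypeId::BINARY)));
}
```

In this sketch a mismatch only returns `false`; in the run above the equivalent check aborts the process, which produces the core dump.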

Core dump stack trace:

(gdb) 
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f37b810f859 in __GI_abort () at abort.c:79
#2  0x00007f37cd3176d2 in arrow::util::CerrLog::~CerrLog() () from /data1/liyang/cppproject/kyli/ClickHouse/build_gcc/utils/local-engine/libch.so
#3  0x00007f37cd3175a1 in arrow::util::ArrowLog::~ArrowLog() () from /data1/liyang/cppproject/kyli/ClickHouse/build_gcc/utils/local-engine/libch.so
#4  0x00007f37cd09d8ce in void arrow::internal::SetListData<arrow::ListType>(arrow::BaseListArray<arrow::ListType>*, std::__1::shared_ptr<arrow::ArrayData> const&, arrow::Type::type) () from /data1/liyang/cppproject/kyli/ClickHouse/build_gcc/utils/local-engine/libch.so
#5  0x00007f37cd093b88 in arrow::ListArray::ListArray(std::__1::shared_ptr<arrow::ArrayData>) () from /data1/liyang/cppproject/kyli/ClickHouse/build_gcc/utils/local-engine/libch.so
#6  0x00007f37cd0c7a4c in arrow::MakeArray(std::__1::shared_ptr<arrow::ArrayData> const&) () from /data1/liyang/cppproject/kyli/ClickHouse/build_gcc/utils/local-engine/libch.so
#7  0x00007f37c8855c75 in ch_parquet::arrow::(anonymous namespace)::ListReader<int>::AssembleArray (this=<optimized out>, data=...) at ./utils/local-engine/Storages/ch_parquet/arrow/reader.cc:580
#8  0x00007f37c8855620 in ch_parquet::arrow::(anonymous namespace)::ListReader<int>::BuildArray (this=0x7f37b73479c0, length_upper_bound=<optimized out>, out=0x7fff12ce31b0) at ./utils/local-engine/Storages/ch_parquet/arrow/reader.cc:638
#9  0x00007f37c8860c60 in ch_parquet::arrow::ColumnReaderImpl::NextBatch (this=0x7f37b73479c0, batch_size=1, out=0x7fff12ce31b0) at ./utils/local-engine/Storages/ch_parquet/arrow/reader.cc:119
#10 0x00007f37c8858dae in ch_parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadColumn (this=0x7f37b7219900, i=0, row_groups=..., reader=0x7f37b73479c0, out=0x7fff12ce31b0) at ./utils/local-engine/Storages/ch_parquet/arrow/reader.cc:283
#11 0x00007f37c885de40 in ch_parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::__1::shared_ptr<ch_parquet::arrow::(anonymous namespace)::FileReaderImpl>, std::__1::vector<int, std::__1::allocator<int> > const&, std::__1::vector<int, std::__1::allocator<int> > const&, arrow::internal::Executor*)::$_4::operator()(unsigned long, std::__1::shared_ptr<ch_parquet::arrow::ColumnReaderImpl>) const (this=this@entry=0x7fff12ce3420, i=i@entry=0, reader=...) at ./utils/local-engine/Storages/ch_parquet/arrow/reader.cc:1203
#12 0x00007f37c885c793 in arrow::internal::OptionalParallelForAsync<ch_parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::__1::shared_ptr<ch_parquet::arrow::(anonymous namespace)::FileReaderImpl>, std::__1::vector<int, std::__1::allocator<int> > const&, std::__1::vector<int, std::__1::allocator<int> > const&, arrow::internal::Executor*)::$_4&, std::__1::shared_ptr<ch_parquet::arrow::ColumnReaderImpl>, std::__1::shared_ptr<arrow::ChunkedArray> >(bool, std::__1::vector<std::__1::shared_ptr<ch_parquet::arrow::ColumnReaderImpl>, std::__1::allocator<std::__1::shared_ptr<ch_parquet::arrow::ColumnReaderImpl> > >, ch_parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::__1::shared_ptr<ch_parquet::arrow::(anonymous namespace)::FileReaderImpl>, std::__1::vector<int, std::__1::allocator<int> > const&, std::__1::vector<int, std::__1::allocator<int> > const&, arrow::internal::Executor*)::$_4&, arrow::internal::Executor*) (inputs=..., func=..., executor=0x7f37b7364660, use_threads=<optimized out>) at ./contrib/arrow/cpp/src/arrow/util/parallel.h:95
#13 ch_parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups (this=<optimized out>, this@entry=0x7f37b7219900, self=..., row_groups=..., column_indices=..., cpu_executor=0x7f37b7364660, cpu_executor@entry=0x0)
    at ./utils/local-engine/Storages/ch_parquet/arrow/reader.cc:1221
#14 0x00007f37c885234b in ch_parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadRowGroups (this=0x7f37b7219900, row_groups=..., column_indices=..., out=0x7fff12ce36a0) at ./utils/local-engine/Storages/ch_parquet/arrow/reader.cc:1182
#15 0x00007f37c8851f30 in ch_parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadRowGroup (this=0x7f37b7219900, row_group_index=0, column_indices=..., out=0x0) at ./utils/local-engine/Storages/ch_parquet/arrow/reader.cc:320
#16 0x00007f37c884b1de in DB::OptimizedParquetBlockInputFormat::generate (this=0x7f37b7305618) at ./utils/local-engine/Storages/ch_parquet/OptimizedParquetBlockInputFormat.cpp:54
#17 0x00007f37cb385e35 in DB::ISource::tryGenerate (this=0x7fff12ce2be0) at ./src/Processors/ISource.cpp:79
#18 0x00007f37cb385ada in DB::ISource::work (this=0x7f37b7305618) at ./src/Processors/ISource.cpp:53
#19 0x00007f37cb3a0e83 in DB::executeJob (processor=0x7f37b7305618) at ./src/Processors/Executors/ExecutionThreadContext.cpp:45
#20 DB::ExecutionThreadContext::executeTask (this=0x7f37b7207400) at ./src/Processors/Executors/ExecutionThreadContext.cpp:63
#21 0x00007f37cb397d00 in DB::PipelineExecutor::executeStepImpl (this=<optimized out>, this@entry=0x7f37b7267c18, thread_num=<optimized out>, thread_num@entry=0, yield_flag=yield_flag@entry=0x7f37b72f0250) at ./src/Processors/Executors/PipelineExecutor.cpp:213
#22 0x00007f37cb397a40 in DB::PipelineExecutor::executeStep (this=0x7f37b7267c18, yield_flag=0x7f37b72f0250) at ./src/Processors/Executors/PipelineExecutor.cpp:115
#23 0x00007f37cb3a44f8 in DB::PullingPipelineExecutor::pull (this=0x7f37b72f0250, chunk=...) at ./src/Processors/Executors/PullingPipelineExecutor.cpp:50
#24 0x00007f37cb3a478c in DB::PullingPipelineExecutor::pull (this=0x7f37b72f0250, block=...) at ./src/Processors/Executors/PullingPipelineExecutor.cpp:61
#25 0x00000000002e6ae3 in ParquetRead_ReadData_Test::TestBody (this=<optimized out>) at ./utils/local-engine/tests/gtest_parquet_read.cpp:212
#26 0x000000000035aac6 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (method=<optimized out>, location=0x27d280 "the test body", object=<optimized out>) at ./contrib/googletest/googletest/src/gtest.cc:2589
#27 testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=<optimized out>, method=<optimized out>, location=0x27d280 "the test body") at ./contrib/googletest/googletest/src/gtest.cc:2625
#28 0x000000000033336a in testing::Test::Run (this=0x7f37b7201250) at ./contrib/googletest/googletest/src/gtest.cc:2664
#29 0x0000000000334bf1 in testing::TestInfo::Run (this=0x7f37b7242700) at ./contrib/googletest/googletest/src/gtest.cc:2842
#30 0x0000000000335639 in testing::TestSuite::Run (this=0x7f37b7242600) at ./contrib/googletest/googletest/src/gtest.cc:2996
#31 0x0000000000345c3d in testing::internal::UnitTestImpl::RunAllTests (this=<optimized out>) at ./contrib/googletest/googletest/src/gtest.cc:5708
#32 0x000000000035b9a6 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (method=<optimized out>, location=0x2799d6 "auxiliary test code (environments or event listeners)", object=<optimized out>)
    at ./contrib/googletest/googletest/src/gtest.cc:2589
#33 testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=<optimized out>, method=<optimized out>, location=0x2799d6 "auxiliary test code (environments or event listeners)") at ./contrib/googletest/googletest/src/gtest.cc:2625
#34 0x000000000034511b in testing::UnitTest::Run (this=0x4e4308 <testing::UnitTest::GetInstance()::instance>) at ./contrib/googletest/googletest/src/gtest.cc:5291
#35 0x00000000002d92f0 in RUN_ALL_TESTS () at ./contrib/googletest/googletest/include/gtest/gtest.h:2471
#36 main (argc=<optimized out>, argv=0x7fff12ce46d8) at ./utils/local-engine/tests/gtest_local_engine.cpp:333

taiyang-li changed the title from "coredump when reading type Array(String) from parquet file" to "Coredump when reading type Array(String) from parquet file" Oct 20, 2022
taiyang-li changed the title from "Coredump when reading type Array(String) from parquet file" to "Coredump when reading type Array(String) by OptimizedParquetInputFormat" Oct 20, 2022
taiyang-li added a commit to bigo-sg/ClickHouse that referenced this issue Oct 20, 2022

taiyang-li (Author) commented

This is fixed in #163.

liuneng1994 pushed a commit that referenced this issue Nov 4, 2022
* support calculate backing length of different types

* remove comment

* rename symbols

* apply BackingDataLengthCalculator

* support decimal from ch column to spark row

* fix decimal issue in ch column to spark row

* refactor SparkRowInfo

* fix building error

* wip

* implement demo

* dev map

* finish map and tuple

* fix building error

* finish writer dev

* fix code style

* ready to improve spark row to ch column

* wip

* finish array/map/tuple reader

* fix building error

* add some uts

* finish debug

* commit again

* finish plan convert

* add benchmark

* improve performance

* try to optimize spark row to ch column

* continue

* optimize SparkRowInfo::SparkRowInfo

* wrap functions

* improve performance

* improve from 360ms to 240 ms

* finish optimizeing performance

* add benchmark for BM_SparkRowTOCHColumn_Lineitem

* refactor spark row reader

* finish tests

* revert cmake

* fix code style

* fix code style

* fix memory leak

* fix build error

* fix building error in debug mode

* add test data file

* add build type, convert ch type to substrait type

* refactor jni interface: native column type

* fixbug of decimal

* replace decimal.parquet

* add data array.parquet

* add test data map.parquet

* add test data file

* finish debug

* wip

* fix logging

* fix address problem

* fix core dump

* fix code style

* throw exception when complex types in substrait plan is in nullable

* make ch complex type nullable

* support nullable complex types

* add tests for parquet nullable

* add uts for all types

* debug gtest_parquet_read

* fix issue: #166

* remove stdout log

* fix bug of binary null

* remove logs

* remove useless files