GH-33655: [C++][Parquet] Write parquet columns in parallel #33656
Conversation
@wjones127 @emkornfield @pitrou Please take a look when you have time. Thanks!
- Add use_threads and executor options to ArrowWriterProperties.
- Write columns in parallel when buffered row group is enabled.
- Only WriteRecordBatch() is supported.
cpp/src/parquet/arrow/writer.cc
Outdated
@@ -333,7 +340,7 @@ class FileWriterImpl : public FileWriter {
        std::unique_ptr<ArrowColumnWriterV2> writer,
        ArrowColumnWriterV2::Make(*data, offset, size, schema_manifest_,
                                  row_group_writer_));
-   return writer->Write(&column_write_context_);
+   return writer->Write(&column_write_context_.back());
Could a writer have no columns at all?
Then it is invalid to call WriteColumnChunk. Maybe I should add a check to guard against that.
cpp/src/parquet/arrow/writer.cc
Outdated
    int column_index_start = 0;
    for (int i = 0; i < batch.num_columns(); i++) {
-     ChunkedArray chunkedArray(batch.column(i));
+     ChunkedArray chunkedArray{batch.column(i)};
Here chunkedArray is a stack variable that will be destructed when it goes out of scope; could the writer outlive it?
I have checked that the created writer only holds a std::shared_ptr to the array wrapped in the chunkedArray, so the lifecycle looks good.
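For what it's worth, the lifetime argument can be sketched with plain std::shared_ptr. The types below are illustrative stand-ins, not Arrow's actual ChunkedArray or column-writer API: the point is only that a consumer copying the shared_ptr out of a short-lived wrapper keeps the underlying data alive after the wrapper is destroyed.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Stand-in for the underlying array data.
struct Array {
  std::vector<int> values;
};

// Like ChunkedArray here: a cheap stack wrapper sharing ownership of the array.
struct ChunkedArrayLike {
  explicit ChunkedArrayLike(std::shared_ptr<Array> a) : chunk(std::move(a)) {}
  std::shared_ptr<Array> chunk;
};

// Like the column writer: copies the shared_ptr out of the wrapper, so it
// keeps the array alive even after the wrapper itself is gone.
struct WriterLike {
  explicit WriterLike(const ChunkedArrayLike& ca) : data(ca.chunk) {}
  std::shared_ptr<Array> data;
};

WriterLike MakeWriter(std::shared_ptr<Array> arr) {
  ChunkedArrayLike wrapper(std::move(arr));  // stack-allocated, like chunkedArray
  return WriterLike(wrapper);                // wrapper dies here; the data survives
}
```

So destructing the wrapper on the stack is fine as long as the writer copied the shared_ptr rather than a reference to the wrapper.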
cpp/src/parquet/arrow/writer.cc
Outdated
      column_index_start += writer->leaf_count();
      writers.emplace_back(std::move(writer));
I understand this unifies the threaded and non-threaded code paths, but it does not look ideal to always create such a vector in the non-threaded case.
Agreed. Let me refactor it to share common logic and avoid unnecessary vector creation.
C++ / AMD64 Ubuntu 20.04 C++ ASAN UBSAN (link) is failing with the following error:
@kou Do you have any idea?
It's not related to this pull request. I've opened a new issue for it: #33667
This could, potentially, lead to a nested parallelism deadlock if multiple parquet files are being written at the same time.
However, the only place that could happen today is the dataset writer, and since use_threads defaults to false, I think we are OK for now.
The reason this happens is:
The thread that calls OptionalParallelFor will block waiting for the column writers. If this thread is a user thread that is fine. If this thread is a thread pool thread then that essentially becomes a "wasted thread" that can't be used in the pool. If the number of files being written is equal to or greater than the number of threads in the thread pool then there is a potential that all threads become wasted threads and no work can be done.
The fix would be to add a WriteRecordBatchAsync method that calls ParallelForAsync and returns the future. This can then be safely called in parallel, even by thread pool threads (assuming they don't block on that future but wrap it up into a higher-level AllComplete call later). The WriteRecordBatch method could then just return WriteRecordBatchAsync(...).status().
For now, could you potentially add a comment that this must be false if the user is writing multiple files in parallel?
The comment could go either on the use_threads property or on the WriteRecordBatch method. My preference would be WriteRecordBatch.
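As a rough stdlib-only illustration of that suggestion (std::async standing in for Arrow's thread pool and ParallelForAsync, and all names below hypothetical): the async variant fans out per-column work and returns a future, so a caller can gather many such futures instead of blocking a pool thread inside the call.

```cpp
#include <future>
#include <utility>
#include <vector>

// Pretend "writing" a column produces its byte count.
int WriteColumn(int column_size) { return column_size; }

// Async variant: returns a future that completes once every column is
// written, instead of blocking the calling thread on the column tasks.
std::future<int> WriteRecordBatchAsync(std::vector<int> column_sizes) {
  return std::async(std::launch::async, [column_sizes = std::move(column_sizes)] {
    std::vector<std::future<int>> columns;
    for (int size : column_sizes) {
      columns.push_back(std::async(std::launch::async, WriteColumn, size));
    }
    int total = 0;
    for (auto& f : columns) total += f.get();  // gather, like an AllComplete
    return total;
  });
}

// The blocking API is then just a thin wrapper over the async one.
int WriteRecordBatch(std::vector<int> column_sizes) {
  return WriteRecordBatchAsync(std::move(column_sizes)).get();
}
```

The key property is that only the top-level caller ever calls get(); intermediate layers pass futures upward, so no pool thread is parked waiting on work queued behind it.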
cpp/src/parquet/arrow/writer.cc
Outdated
    if (arrow_properties_->use_threads()) {
      DCHECK_EQ(parallel_column_write_contexts_.size(), writers.size());
      RETURN_NOT_OK(::arrow::internal::OptionalParallelFor(
          /*use_threads=*/true, static_cast<int>(writers.size()),
You could just call ParallelFor instead of OptionalParallelFor if you are always going to have use_threads=true.
Thanks for your review! I have addressed your comments. Please check. @westonpace
Thanks, those comments look good!
Would you mind filing a ticket for the async version Weston suggested?
Thanks @lidavidm and @westonpace for the review! I have opened the issue to track the progress.
Thanks @wgtmac !
Benchmark runs are scheduled for baseline = 444dcb6 and contender = c8d6110. c8d6110 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Closes #33655
What changes are included in this PR?
- Add use_threads and executor options to ArrowWriterProperties.
- parquet::arrow::FileWriter writes columns in parallel when buffered row group is enabled.
- Only WriteRecordBatch() is supported.
Are these changes tested?
Added TEST(TestArrowReadWrite, MultithreadedWrite) in the arrow_reader_writer_test.cc.