GH-33655: [C++][Parquet] Write parquet columns in parallel #33656

Merged: 5 commits into apache:master on Jan 18, 2023

Conversation

@wgtmac (Member) commented Jan 13, 2023:

Which issue does this PR close?

Closes #33655

What changes are included in this PR?

  • Add use_threads and executor options to ArrowWriterProperties.
  • parquet::arrow::FileWriter writes columns in parallel when buffered row group is enabled.
  • Only WriteRecordBatch() is supported.

Are these changes tested?

Added TEST(TestArrowReadWrite, MultithreadedWrite) in arrow_reader_writer_test.cc.
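For context, a minimal usage sketch of the new option (hedged: it assumes the Builder setter added by this PR is named set_use_threads and chains like the other ArrowWriterProperties::Builder setters, and that writer/batch stand in for an already-opened parquet::arrow::FileWriter and an arrow::RecordBatch):

    // Sketch only: opt in to parallel column writes (off by default).
    std::shared_ptr<parquet::ArrowWriterProperties> arrow_props =
        parquet::ArrowWriterProperties::Builder().set_use_threads(true)->build();
    // ... open a parquet::arrow::FileWriter with arrow_props ...
    // WriteRecordBatch() goes through a buffered row group, so with
    // use_threads enabled the batch's columns are encoded in parallel.
    ARROW_RETURN_NOT_OK(writer->WriteRecordBatch(*batch));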

@github-actions bot commented:

⚠️ GitHub issue #33655 has been automatically assigned to the PR creator.

@wgtmac (Member, Author) commented Jan 13, 2023:

@wjones127 @emkornfield @pitrou Please take a look when you have time. Thanks!

@@ -333,7 +340,7 @@ class FileWriterImpl : public FileWriter {
   std::unique_ptr<ArrowColumnWriterV2> writer,
   ArrowColumnWriterV2::Make(*data, offset, size, schema_manifest_,
                             row_group_writer_));
-  return writer->Write(&column_write_context_);
+  return writer->Write(&column_write_context_.back());
Review comment (Member):

Would a writer ever have no columns?

@wgtmac (Member, Author) replied:

Then it would be invalid to call WriteColumnChunk. Maybe I should add a check to guard against that.
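For illustration, such a guard might look like this (a sketch, not the PR's actual code):

    // Sketch: reject an empty batch up front so WriteColumnChunk is never
    // reached without at least one column writer and write context.
    if (batch.num_columns() == 0) {
      return ::arrow::Status::Invalid("Cannot write a record batch with no columns");
    }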

  int column_index_start = 0;
  for (int i = 0; i < batch.num_columns(); i++) {
-   ChunkedArray chunkedArray(batch.column(i));
+   ChunkedArray chunkedArray{batch.column(i)};
Review comment (Member):

chunkedArray is destroyed here when it goes out of scope on the stack; is that a problem if the writer outlives it?

@wgtmac (Member, Author) replied Jan 14, 2023:

I have checked: the created writer only holds a std::shared_ptr to the array wrapped in the chunkedArray, so the lifetime is fine.
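In other words, the ownership looks roughly like this (a sketch restating the reply above, not code from the PR):

    // The stack ChunkedArray only wraps the shared_ptr<Array> it is given,
    // and the column writer copies that shared_ptr out of it. Destroying
    // the wrapper at end of scope therefore does not free the array data.
    std::shared_ptr<::arrow::Array> column = batch.column(i);  // owner #1
    ::arrow::ChunkedArray chunkedArray{column};                // refcount bumped
    // ArrowColumnWriterV2::Make(...) keeps its own shared_ptr: owner #2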

(Resolved thread on cpp/src/parquet/properties.h)
column_index_start += writer->leaf_count();
writers.emplace_back(std::move(writer));
Review comment (Contributor):

I understand this unifies the threaded and non-threaded code paths, but it does not look ideal to always create such a vector in the non-threaded case.

@wgtmac (Member, Author) replied:
Agreed. Let me refactor it to share common logic and avoid unnecessary vector creation.

(Resolved thread on cpp/src/parquet/arrow/writer.cc, now outdated)
@wgtmac (Member, Author) left a review comment:
Thanks for the review @cyb70289 @mapleFU

@wgtmac (Member, Author) commented Jan 14, 2023:

The C++ / AMD64 Ubuntu 20.04 C++ ASAN UBSAN job (link) is failing with the following error:

[90/890] Building CXX object CMakeFiles/substrait.dir/substrait_ep-generated/substrait/extensions/extensions.pb.cc.o
FAILED: CMakeFiles/substrait.dir/substrait_ep-generated/substrait/extensions/extensions.pb.cc.o 
/usr/bin/ccache /usr/lib/ccache/clang++-14  -DADDRESS_SANITIZER -DARROW_HAVE_RUNTIME_AVX2 -DARROW_HAVE_RUNTIME_AVX512 -DARROW_HAVE_RUNTIME_BMI2 -DARROW_HAVE_RUNTIME_SSE4_2 -DARROW_HAVE_SSE4_2 -DARROW_NO_DEPRECATED_API -DARROW_UBSAN -DARROW_WITH_RE2 -DARROW_WITH_UTF8PROC -Isubstrait_ep-generated -Iprotobuf_ep-install/include -Isrc -I/arrow/cpp/src -I/arrow/cpp/src/generated -Qunused-arguments -fcolor-diagnostics  -Wall -Wextra -Wdocumentation -Wshorten-64-to-32 -Wno-missing-braces -Wno-unused-parameter -Wno-constant-logical-operand -Wno-return-stack-address -Wno-unknown-warning-option -Wno-pass-failed -msse4.2  -fsanitize=address -DADDRESS_SANITIZER -fsanitize=undefined -fno-sanitize=alignment,vptr,function,float-divide-by-zero -fno-sanitize-recover=all -fsanitize-coverage=pc-table,inline-8bit-counters,edge,no-prune,trace-cmp,trace-div,trace-gep -fsanitize-blacklist=/arrow/cpp/build-support/sanitizer-disallowed-entries.txt -g -Werror -O0 -ggdb -fPIC   -fsanitize-coverage=pc-table,inline-8bit-counters,edge,no-prune,trace-cmp,trace-div,trace-gep -std=c++17 -Wno-error=shorten-64-to-32 -MD -MT CMakeFiles/substrait.dir/substrait_ep-generated/substrait/extensions/extensions.pb.cc.o -MF CMakeFiles/substrait.dir/substrait_ep-generated/substrait/extensions/extensions.pb.cc.o.d -o CMakeFiles/substrait.dir/substrait_ep-generated/substrait/extensions/extensions.pb.cc.o -c substrait_ep-generated/substrait/extensions/extensions.pb.cc
In file included from substrait_ep-generated/substrait/extensions/extensions.pb.cc:4:
In file included from substrait_ep-generated/substrait/extensions/extensions.pb.h:24:
In file included from protobuf_ep-install/include/google/protobuf/arena.h:52:
protobuf_ep-install/include/google/protobuf/arena_impl.h:45:10: fatal error: 'sanitizer/asan_interface.h' file not found
#include <sanitizer/asan_interface.h>
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.

@kou Do you have any idea?

@kou (Member) commented Jan 15, 2023:

It's not related to this pull request. I've opened a new issue for it: #33667

@pitrou requested review from @westonpace and @lidavidm on January 16, 2023 14:28
@westonpace (Member) left a review comment:

This could, potentially, lead to a nested parallelism deadlock if multiple parquet files are being written at the same time.

However, the only place that could happen today is the dataset writer and since use_threads defaults to false I think we are ok for now.

The reason this happens is:

The thread that calls OptionalParallelFor will block waiting for the column writers. If that thread is a user thread, this is fine. If it is a thread-pool thread, it essentially becomes a "wasted thread" that cannot be used by the pool. If the number of files being written is equal to or greater than the number of threads in the pool, then potentially all threads become wasted threads and no work can be done.

The fix would be to add a WriteRecordBatchAsync method that calls ParallelForAsync and returns the future. This can then be safely called in parallel, even by thread pool threads (assuming they don't block on that future but wrap it up into a higher level AllComplete call later). The WriteRecordBatch method could then just return WriteRecordBatchAsync(...).status().

For now, could you potentially add a comment that this must be false if the user is writing multiple files in parallel?

The comment could either go on the use_threads property or the WriteRecordBatch method. My preference would be WriteRecordBatch.
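To make that shape concrete, here is a minimal sketch of a non-blocking variant (hypothetical names WriteColumnsAsync and write_column; it assumes Executor::Submit and arrow::AllComplete behave as in arrow/util/thread_pool.h and arrow/util/future.h of this era, so treat the exact signatures as assumptions):

    #include <functional>
    #include <vector>
    #include "arrow/status.h"
    #include "arrow/util/future.h"
    #include "arrow/util/thread_pool.h"

    // Sketch: submit one task per column and return a Future instead of
    // blocking the calling thread until every column is written.
    ::arrow::Future<> WriteColumnsAsync(
        int num_columns, std::function<::arrow::Status(int)> write_column,
        ::arrow::internal::Executor* executor) {
      std::vector<::arrow::Future<>> futures;
      futures.reserve(num_columns);
      for (int i = 0; i < num_columns; ++i) {
        // Submit() returns Result<Future<>>; surface a failed submission as
        // an already-finished future rather than blocking.
        auto maybe_future = executor->Submit(write_column, i);
        if (!maybe_future.ok()) {
          return ::arrow::Future<>::MakeFinished(maybe_future.status());
        }
        futures.push_back(*std::move(maybe_future));
      }
      // Completes when all column writes finish; a caller writing many files
      // can gather these futures instead of tying up pool threads.
      return ::arrow::AllComplete(futures);
    }

A blocking WriteRecordBatch would then reduce to WriteColumnsAsync(...).status(), which is exactly the wrapper shape described above.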

if (arrow_properties_->use_threads()) {
DCHECK_EQ(parallel_column_write_contexts_.size(), writers.size());
RETURN_NOT_OK(::arrow::internal::OptionalParallelFor(
/*use_threads=*/true, static_cast<int>(writers.size()),
Review comment (Member):

You could just call ParallelFor instead of OptionalParallelFor if you are going to always have use_threads=true.
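That is, inside the use_threads() branch something like the following would do (a sketch; ParallelFor's (num_tasks, func, executor) shape from arrow/util/parallel.h is assumed, with writers and parallel_column_write_contexts_ taken from the diff above):

    // Sketch: this branch already guarantees threading, so the optional
    // wrapper adds nothing here.
    RETURN_NOT_OK(::arrow::internal::ParallelFor(
        static_cast<int>(writers.size()), [&](int i) {
          return writers[i]->Write(&parallel_column_write_contexts_[i]);
        }));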

@wgtmac (Member, Author) replied:
Thanks for your review! I have addressed your comments. Please check. @westonpace

Reply (Member):
Thanks, those comments look good!

@lidavidm (Member) left a review comment:

Would you mind filing a ticket for the async version Weston suggested?

@wgtmac (Member, Author) commented Jan 17, 2023:

> Would you mind filing a ticket for the async version Weston suggested?

Thanks @lidavidm and @westonpace for the review! I have opened an issue to track the progress.

@wgtmac requested a review from @wjones127 as a code owner on January 17, 2023 03:24
@cyb70289 merged commit c8d6110 into apache:master on Jan 18, 2023
@cyb70289 (Contributor) commented:
Thanks @wgtmac !

@ursabot commented Jan 19, 2023:

Benchmark runs are scheduled for baseline = 444dcb6 and contender = c8d6110. c8d6110 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.63% ⬆️0.15%] test-mac-arm
[Finished ⬇️1.28% ⬆️1.53%] ursa-i9-9960x
[Finished ⬇️0.87% ⬆️0.37%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] c8d6110a ec2-t3-xlarge-us-east-2
[Failed] c8d6110a test-mac-arm
[Finished] c8d6110a ursa-i9-9960x
[Finished] c8d6110a ursa-thinkcentre-m75q
[Finished] 444dcb67 ec2-t3-xlarge-us-east-2
[Finished] 444dcb67 test-mac-arm
[Finished] 444dcb67 ursa-i9-9960x
[Finished] 444dcb67 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Merging this pull request closed issue #33655: [C++][Parquet] Write columns in parallel for parquet writer.