
feat(c/driver/postgresql): Use COPY for writes #1093

Merged · 17 commits into apache:main · Oct 13, 2023

Conversation

WillAyd (Contributor) commented Sep 22, 2023

closes #1037

PGresult* result = PQprepare(conn, /*stmtName=*/"", query.c_str(),
/*nParams=*/bind_schema->n_children, param_types.data());
if (PQresultStatus(result) != PGRES_COMMAND_OK) {
const char* temp = "COPY bulk_ingest FROM STDIN WITH (FORMAT binary);";
WillAyd (Contributor, Author):

I placed this in Prepare because it kept the diff small, but I'm not sure the Prepare/Execute pattern is the same concept as a COPY. Maybe COPY should have a dedicated ADBC method?
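
For context, a minimal sketch of the standard libpq COPY protocol flow this PR builds on (documented libpq calls; error handling abbreviated, and data/data_len stand in for whatever serialized chunk the driver produces):

// Enter COPY-in mode with a direct query rather than a prepared statement.
PGresult* result = PQexec(conn, "COPY bulk_ingest FROM STDIN WITH (FORMAT binary)");
if (PQresultStatus(result) != PGRES_COPY_IN) {
  // Not in COPY mode; inspect PQerrorMessage(conn)
}
PQclear(result);

// Stream one or more chunks of the binary COPY payload.
if (PQputCopyData(conn, data, static_cast<int>(data_len)) <= 0) { /* handle error */ }

// Signal end-of-data; a null errormsg means "finish normally".
if (PQputCopyEnd(conn, /*errormsg=*/nullptr) <= 0) { /* handle error */ }

// Collect the final command result (e.g. "COPY 100").
result = PQgetResult(conn);
if (PQresultStatus(result) != PGRES_COMMAND_OK) { /* handle error */ }
PQclear(result);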

@@ -353,10 +360,30 @@ struct BindStream {
CHECK_NA(INTERNAL, ArrowArrayViewSetArray(&array_view.value, &array.value, nullptr),
error);

struct ArrowBuffer buf;
ArrowBufferInit(&buf);
ArrowBufferAppend(&buf, kPgCopyBinarySignature, sizeof(kPgCopyBinarySignature));
WillAyd (Contributor, Author):

We could probably reserve the number of bytes needed up front
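
A minimal sketch of that idea using nanoarrow's ArrowBufferReserve; estimated_size_bytes is a hypothetical precomputed upper bound, not something the PR defines:

struct ArrowBuffer buf;
ArrowBufferInit(&buf);
// Reserve an estimated upper bound once so the subsequent per-field
// ArrowBufferAppend calls don't each have to grow the allocation.
if (ArrowBufferReserve(&buf, estimated_size_bytes) != NANOARROW_OK) {
  // handle allocation failure
}
ArrowBufferAppend(&buf, kPgCopyBinarySignature, sizeof(kPgCopyBinarySignature));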


if (PQputCopyData(conn, reinterpret_cast<char*>(buf.data), buf.size_bytes) <= 0) {
WillAyd (Contributor, Author):

I am not sure if the reinterpret_cast here is appropriate, or if there is a way to construct an ArrowBufferView from the ArrowBuffer and use the union approach.
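
One possible shape for the union approach, assuming nanoarrow's ArrowBufferView layout; note it still relies on union type punning, which is well-defined in C but only conditionally supported in C++:

// Wrap the ArrowBuffer in a view and read back through the union member
// instead of a reinterpret_cast.
struct ArrowBufferView view;
view.data.as_uint8 = buf.data;
view.size_bytes = buf.size_bytes;

if (PQputCopyData(conn, view.data.as_char, view.size_bytes) <= 0) {
  // handle error
}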

WillAyd (Contributor, Author) commented Sep 22, 2023

I think the remaining test failures are just a matter of the COPY semantics not aligning well with expectations around the Prepare/Execute method currently in use. Do we need to add an option for USE_COPY or something to that effect in ADBC? I assume if we do that we can set up dedicated Copy methods on drivers that support it to make this cleaner. There might be an easier way I am overlooking.

lidavidm (Member):

Sorry, what exactly is the problem? Presumably we'd only activate COPY for bulk ingest?

WillAyd (Contributor, Author) commented Sep 22, 2023

That helps clarify things - I wasn't sure if we wanted to totally replace the prepared statement with COPY, but it sounds like no.

I'll have to dig through the code a little more to see how we should dispatch to COPY. I don't think any driver does that right now.

lidavidm (Member):

Yeah, so we'd have to implement a separate path that only bulk ingestion uses.

For prepared statements, my understanding is that you can't use them with COPY (at least if you want bind parameters), so this only works for ingest, unfortunately.

WillAyd (Contributor, Author) commented Sep 22, 2023

> Yeah, so we'd have to implement a separate path that only bulk ingestion uses.

When I look at adbc_ingest in the Python implementation, I see it calls AdbcStatementExecuteQuery - so are you thinking we introduce something like AdbcStatementCopy, or do we want the postgresql driver to handle that internally? I'm not sure if the concept of Execute is generic or tied to the SQL EXECUTE statement.

lidavidm (Member):

So ExecuteQuery is just an entrypoint - internally the driver knows that it's "in ingest mode" and can do something different
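
A rough sketch of that dispatch; the member and helper names here (ingest_target_, ExecuteIngestViaCopy, ExecuteQueryInternal) are illustrative, not the driver's actual code:

// Hypothetical: ExecuteQuery branches on "ingest mode", which would have been
// set earlier via the ADBC_INGEST_OPTION_TARGET_TABLE statement option.
AdbcStatusCode PostgresStatement::ExecuteQuery(struct ArrowArrayStream* stream,
                                               int64_t* rows_affected,
                                               struct AdbcError* error) {
  if (!ingest_target_.empty()) {
    // Bulk-ingest path: COPY <target> FROM STDIN WITH (FORMAT binary)
    return ExecuteIngestViaCopy(rows_affected, error);
  }
  // Regular path: direct or prepared execution.
  return ExecuteQueryInternal(stream, rows_affected, error);
}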

WillAyd (Contributor, Author) commented Sep 22, 2023

Ah OK, great - thanks for all that info. Very helpful; it cleared up some misconceptions I had about how this should work.

So for now I set up a dedicated ExecuteCopy method; I need to take another pass on a refactor (I would love to have this more closely resemble the writer setup), but I hope this is heading in the right direction.

lidavidm pushed a commit that referenced this pull request Sep 28, 2023
This is a precursor to #1093; figured it would be easier to work piecewise rather than all at once.

This does not try to actually connect the statement.cc code to use this, but just gets the test case / general structure set up.
@@ -2112,7 +2112,8 @@ void StatementTest::TestSqlIngestErrors() {
{"coltwo", NANOARROW_TYPE_INT64}}),
IsOkErrno());
ASSERT_THAT(
(MakeBatch<int64_t, int64_t>(&schema.value, &array.value, &na_error, {}, {})),
(MakeBatch<int64_t, int64_t>(&schema.value, &array.value, &na_error,
WillAyd (Contributor, Author):

Looks like Postgres COPY only raises the column-count error if there is actually data:

postgres=# COPY (SELECT int64s, int64s FROM bulk_ingest WHERE 1=0) TO '/tmp/pgout.data' WITH (FORMAT BINARY);
postgres=# COPY bulk_ingest FROM '/tmp/pgout.data' WITH (FORMAT BINARY);
COPY 0
postgres=# COPY (SELECT int64s, int64s FROM bulk_ingest) TO '/tmp/pgout.data' WITH (FORMAT BINARY);
postgres=# COPY bulk_ingest FROM '/tmp/pgout.data' WITH (FORMAT BINARY);
ERROR:  row field count is 2, expected 1
CONTEXT:  COPY bulk_ingest, line 1

Given that the format itself sends the expected number of columns row by row, in the empty-result case no such communication occurs, so there is nothing for the server to validate.
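
For reference, the per-row framing being described, sketched in the same ArrowBufferAppend style the PR uses (a format illustration with htons/htonl from <arpa/inet.h>, not the PR's writer code):

// Each tuple in binary COPY: a 16-bit big-endian field count, then for every
// field a 32-bit big-endian byte length (-1 for NULL) followed by the value
// bytes. A field count of -1 terminates the stream, so an empty result never
// sends a count the server could check against the target table.
int16_t field_count = htons(2);             // this row has two fields
ArrowBufferAppend(&buf, &field_count, sizeof(field_count));

int32_t field_len = htonl(sizeof(int64_t)); // length prefix for one INT8 value
ArrowBufferAppend(&buf, &field_len, sizeof(field_len));
// ...append the 8 value bytes in big-endian order...

int16_t trailer = htons(-1);                // end-of-data marker
ArrowBufferAppend(&buf, &trailer, sizeof(trailer));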

WillAyd (Contributor, Author) commented Oct 12, 2023

OK, after implementing the CopyWriter separately this should work; I need to figure out the CI build failure, which I can't reproduce locally, but I think this will be green after that.

From basic benchmarking in #1189 I don't know that this is any faster than what we have now, but this is also a pretty inefficient implementation. I assume we can speed things up a lot by pre-allocating the write buffer or reserving space rather than leaving it to the individual ArrowBufferAppend calls to do that.

WillAyd marked this pull request as ready for review on October 13, 2023 18:57
WillAyd (Contributor, Author) commented Oct 13, 2023

I must have been misreading the benchmarks before. They do show this as a pretty sizable perf boost. Here is main:

2023-10-13T14:45:00-04:00
Running ./driver/postgresql/postgresql-benchmark
Run on (12 X 4700 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 1280 KiB (x6)
  L3 Unified 12288 KiB (x1)
Load Average: 1.02, 0.90, 0.88
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
BM_PostgresqlExecute 4605464781 ns    252249371 ns            1

versus this PR:

2023-10-13T14:46:31-04:00
Running ./driver/postgresql/postgresql-benchmark
Run on (12 X 4700 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 1280 KiB (x6)
  L3 Unified 12288 KiB (x1)
Load Average: 1.21, 0.93, 0.89
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
----------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations
----------------------------------------------------------------------------
BM_PostgresqlExecute/iterations:1    5619839 ns       402000 ns            1

@@ -24,47 +24,56 @@

static void BM_PostgresqlExecute(benchmark::State& state) {
const char* uri = std::getenv("ADBC_POSTGRESQL_TEST_URI");
if (!uri) {
if (!uri || !strcmp(uri, "")) {
WillAyd (Contributor, Author), Oct 13, 2023:

These are orthogonal, but I figured it was OK to lump them in here given this is the first PR using these benchmarks. Happy to split them off if you prefer.

I noticed LSAN is not happy if we actually return now, for example if you provide an invalid test_uri. Am I missing something extra I should be releasing before these returns?

==96786==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 1024 byte(s) in 1 object(s) allocated from:
    #0 0x7f0f1a0defef in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:69
    #1 0x7f0f19b28aaa in SetErrorVariadic /home/willayd/clones/arrow-adbc/c/driver/common/utils.c:141
    #2 0x7f0f19b283ac in SetError /home/willayd/clones/arrow-adbc/c/driver/common/utils.c:110
    #3 0x7f0f19a7ffcc in adbcpq::PostgresDatabase::Connect(pg_conn**, AdbcError*) /home/willayd/clones/arrow-adbc/c/driver/postgresql/database.cc:107
    #4 0x7f0f19a803d6 in adbcpq::PostgresDatabase::RebuildTypeResolver(AdbcError*) /home/willayd/clones/arrow-adbc/c/driver/postgresql/database.cc:135
    #5 0x7f0f19a7fb23 in adbcpq::PostgresDatabase::Init(AdbcError*) /home/willayd/clones/arrow-adbc/c/driver/postgresql/database.cc:58
    #6 0x7f0f19ab59b1 in PostgresDatabaseInit /home/willayd/clones/arrow-adbc/c/driver/postgresql/postgresql.cc:82
    #7 0x7f0f19ab6766 in AdbcDatabaseInit /home/willayd/clones/arrow-adbc/c/driver/postgresql/postgresql.cc:197
    #8 0x560878af3271 in BM_PostgresqlExecute /home/willayd/clones/arrow-adbc/c/driver/postgresql/postgresql_benchmark.cc:44
    #9 0x560878b898bf in benchmark::internal::BenchmarkInstance::Run(long, int, benchmark::internal::ThreadTimer*, benchmark::internal::ThreadManager*, benchmark::internal::PerfCountersMeasurement*) const (/home/willayd/clones/arrow-adbc/build/driver/postgresql/postgresql-benchmark+0x75a8bf) (BuildId: 9c2b1c1be992ad8c4106282527e34c09b72aeb27)
    #10 0x560878b673e3 in benchmark::internal::(anonymous namespace)::RunInThread(benchmark::internal::BenchmarkInstance const*, long, int, benchmark::internal::ThreadManager*, benchmark::internal::PerfCountersMeasurement*) (/home/willayd/clones/arrow-adbc/build/driver/postgresql/postgresql-benchmark+0x7383e3) (BuildId: 9c2b1c1be992ad8c4106282527e34c09b72aeb27)

SUMMARY: AddressSanitizer: 1024 byte(s) leaked in 1 allocation(s).

lidavidm (Member):

That's because the error needs to be released.

If we're not going to use the error, we could just not have it in the first place (it should always be OK to pass nullptr to an error argument)
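
For reference, minimal sketches of both options against the public ADBC API (the surrounding benchmark variables are assumed):

// Option 1: skip error collection entirely; ADBC accepts nullptr here.
AdbcDatabaseInit(&database, nullptr);

// Option 2: collect the error, but release it before returning on failure;
// the release callback frees the message buffer that LSAN flagged.
struct AdbcError error = {};
if (AdbcDatabaseInit(&database, &error) != ADBC_STATUS_OK) {
  state.SkipWithError(error.message);
  if (error.release) error.release(&error);
  return;
}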

WillAyd (Contributor, Author):

Ah OK. It probably makes sense to just use the error message instead. I've done that simplistically for now - can make it into a macro in a follow-up.

lidavidm (Member):

Yeah, Dewey's original suggestion of factoring it out into a single function would also work.

lidavidm (Member) left a review:

What is performance like before/after?

lidavidm (Member):

Oh, you posted above

lidavidm (Member):

That is very impressive (I assume Postgres was running locally?). So ~4.5 seconds down to ~5.5 milliseconds?

It might be good to figure out how to run this test with multiple iterations (I think you can tell Google Benchmark, inside the benchmark loop, "this is setup, don't measure it").

WillAyd (Contributor, Author) commented Oct 13, 2023

The current limitation with multiple iterations is that the call to AdbcStatementBind will release the array, so I guess we would either need to take Bind out of the loop, create/copy the array in the loop, or figure out some way to prevent AdbcStatementBind from releasing the array (I don't think that exists).
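
A sketch of the loop shape being discussed, using Google Benchmark's pause/resume to keep setup out of the measurement; RebuildBatch is a hypothetical helper that recreates the schema/array each iteration, since Bind takes ownership:

static void BM_PostgresqlExecute(benchmark::State& state) {
  // ...one-time database/connection/statement setup elided...
  for (auto _ : state) {
    state.PauseTiming();
    // AdbcStatementBind releases the bound array when the statement executes,
    // so rebuild it every iteration outside the measured region.
    RebuildBatch(&schema, &array);
    AdbcStatementBind(&statement, &array.value, &schema.value, nullptr);
    state.ResumeTiming();

    AdbcStatementExecuteQuery(&statement, nullptr, nullptr, nullptr);
  }
}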

lidavidm merged commit 792f3d2 into apache:main on Oct 13, 2023 · 45 checks passed
lidavidm (Member):

Thanks for working through this!
