feat(c/driver/postgresql): Inital COPY Writer design #1110

WillAyd · 2023-09-26T20:51:24Z

This is a precursor to #1093; I figured it would be easier to work piecewise rather than all at once.

This does not try to connect the statement.cc code to the writer yet; it just sets up the test case and general structure.

WillAyd · 2023-09-26T20:51:58Z

c/driver/postgresql/postgres_copy_reader.h

@@ -117,6 +117,53 @@ ArrowErrorCode ReadChecked(ArrowBufferView* data, T* out, ArrowError* error) {
  return NANOARROW_OK;
 }

+// Write a value to a buffer without checking the buffer size. Advances


I put all of this code into postgres_copy_reader.h because it re-uses a lot of the same patterns and constants. Maybe we should rename this postgres_copy_io.h?

We could also factor out the headers into shared/nonshared parts at some point, I don't think it's a big deal either way.

WillAyd · 2023-09-26T20:55:01Z

c/driver/postgresql/postgres_copy_reader.h

+    children_[child_i]->Init(array_view_->children[child_i]);
+  }
+
+  ArrowErrorCode Write(ArrowBuffer* buffer, int64_t index, ArrowError* error) override {


The Write method calls accept an index argument, which is a little different from the Reader setup. Instead of accessing by index, the Readers always call ArrowBufferAppend on the array they are building.

I think we could still do that here; it's just a little more complicated because there is no generator such as ArrowBufferGetNext or similar, so I figured index access was easier to start with. I could be overlooking something larger.

I think indices are fine, reading data is different than writing it

Sounds good. I think it's just kind of confusing that we advance the buffer in the WriteUnchecked calls but also mix in index access here. This may be the best we can do to start.
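
To illustrate the index-based approach discussed above, here is a minimal sketch of a field writer that reads one value by index and appends its bytes to an output buffer. It uses `std::vector` instead of nanoarrow's `ArrowBuffer` and `ArrowArrayView`, and the names `BoolColumn`, `AppendInt32BE`, and `WriteBoolField` are hypothetical, not taken from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a boolean column with a per-row validity flag.
struct BoolColumn {
  std::vector<bool> values;
  std::vector<bool> valid;
};

// Append a 32-bit integer in network (big-endian) byte order.
inline void AppendInt32BE(std::vector<uint8_t>* out, int32_t value) {
  const uint32_t u = static_cast<uint32_t>(value);
  out->push_back(static_cast<uint8_t>(u >> 24));
  out->push_back(static_cast<uint8_t>(u >> 16));
  out->push_back(static_cast<uint8_t>(u >> 8));
  out->push_back(static_cast<uint8_t>(u));
}

// Write the field at `index`: a 4-byte length followed by the value byte.
// A null field is written as length -1 with no value bytes.
inline void WriteBoolField(const BoolColumn& column, int64_t index,
                           std::vector<uint8_t>* out) {
  assert(index >= 0 && static_cast<size_t>(index) < column.values.size());
  if (!column.valid[index]) {
    AppendInt32BE(out, -1);
    return;
  }
  AppendInt32BE(out, 1);
  out->push_back(column.values[index] ? 1 : 0);
}
```

The caller owns the loop over rows, so the input side is addressed by index while the output side only ever grows at the end.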

WillAyd · 2023-09-26T20:55:42Z

c/driver/postgresql/postgres_copy_reader.h

+    return NANOARROW_OK;
+  }
+
+  int64_t array_size_approx_bytes() const { return array_size_approx_bytes_; }


Not sure if this is necessary; I just copied it from the reader design.

Probably not? That was to let us build partial results fitting roughly within some bound.

WillAyd · 2023-09-26T20:59:47Z

c/driver/postgresql/postgres_copy_reader.h

+  int64_t array_size_approx_bytes() const { return array_size_approx_bytes_; }
+
+  ArrowErrorCode WriteHeader(ArrowBuffer* buffer, ArrowError* error) {
+    ArrowBufferAppend(buffer, kPgCopyBinarySignature, sizeof(kPgCopyBinarySignature));


I think we can move towards ArrowBufferAppendUnsafe if we ensure a proper buffer size up front. I think in the current protocol you need 19 bytes for the header, 2 bytes for the number of columns in each row, 4 bytes for each record to indicate the record length, n bytes for every non-null record to contain its actual data, and finally 4 bytes for the end message.

Something to investigate further.

ah, so one pass to figure out the buffer size, then another pass to actually copy data?

Have to think more about this. I don't think you need 2 passes though? I think this could be calculated up front
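
As a rough sketch of computing the size up front, the arithmetic from the comment above can be written as a single function. It reads "record" as one field value, and it assumes the total number of value bytes is already known (for example from the Arrow buffers). The name `EstimateCopyBufferSize` and its parameters are hypothetical; the trailer size is passed in rather than hard-coded because the thread only estimates it.

```cpp
#include <cassert>
#include <cstdint>

// Sizes quoted in the comment above.
constexpr int64_t kHeaderBytes = 19;      // COPY binary header
constexpr int64_t kFieldCountBytes = 2;   // per row: number of columns
constexpr int64_t kFieldLengthBytes = 4;  // per field: length prefix

// Estimate the bytes needed for a whole COPY message.
// total_value_bytes counts only the data of non-null fields;
// null fields still contribute their 4-byte length prefix.
inline int64_t EstimateCopyBufferSize(int64_t num_rows, int64_t num_columns,
                                      int64_t total_value_bytes,
                                      int64_t trailer_bytes) {
  assert(num_rows >= 0 && num_columns >= 0);
  assert(total_value_bytes >= 0 && trailer_bytes >= 0);
  return kHeaderBytes +
         num_rows * (kFieldCountBytes + num_columns * kFieldLengthBytes) +
         total_value_bytes + trailer_bytes;
}
```

For fixed-width types `total_value_bytes` follows directly from the row count and null count, so no separate pass over the data would be needed; variable-length types would need the offsets buffer.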

WillAyd · 2023-09-26T21:00:46Z

c/driver/postgresql/postgres_copy_reader.h

+ private:
+  PostgresCopyFieldTupleWriter root_writer_;
+  struct ArrowSchema* schema_;
+  std::unique_ptr<struct ArrowArrayView> array_view_{new struct ArrowArrayView};


I'm a bit iffy on C++ constructs, but I think this is the easiest way to declare a pointer that owns data as a class member with C++11 compat. Apologies if I'm missing something easier.

I believe Handle<ArrowArrayView> should work. I should really find the time to go write that adbc++ library.

Cool, that is a great idea. I think we would have to move the Handle from statement.cc to postgres_util.h, so I will tackle that in another PR.
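
For readers unfamiliar with the pattern, here is a minimal sketch of an RAII wrapper in the spirit of the Handle mentioned above. It holds the resource by value and releases it in the destructor, which avoids the heap allocation that `std::unique_ptr` requires. `FakeResource` and this `Releaser` are illustrative only; the actual Handle in statement.cc may differ.

```cpp
#include <cassert>

// Hypothetical C-style resource with a release callback. Not a nanoarrow type.
struct FakeResource {
  int payload;
  void (*release)(FakeResource*);
};

// Describes how to release a resource. Could be specialized per type,
// for example to call a dedicated reset function instead of a callback.
template <typename T>
struct Releaser {
  static void Release(T* value) {
    assert(value != nullptr);
    if (value->release != nullptr) value->release(value);
  }
};

// Owns a zero-initialized T by value and releases it on destruction.
template <typename T>
struct Handle {
  T value;

  Handle() : value{} {}
  ~Handle() { Releaser<T>::Release(&value); }

  // Non-copyable: two owners would release the resource twice.
  Handle(const Handle&) = delete;
  Handle& operator=(const Handle&) = delete;

  T* operator->() { return &value; }
};
```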

WillAyd · 2023-09-26T21:04:35Z

c/driver/postgresql/postgres_copy_reader_test.cc

+      result = writer_.WriteRecord(buffer, error);
+    } while (result == NANOARROW_OK);
+
+    // TODO: don't think we should do this here; the reader equivalent does


AFAICT the reader implementation just moves through the message buffer. I mirrored that for the writer, but that means reading the buffer after the fact requires knowing how many bytes were traversed and moving back to the start. There is probably a better way to do this.
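
A small sketch of the bookkeeping described above: if every write advances the data pointer and also increments a byte count, the start of the written region can be recovered by subtracting the count. `WriteCursor`, `WriteByteUnsafe`, and `WrittenStart` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical cursor: `data` points at the next byte to write and
// `size_bytes` counts how many bytes have been written so far.
struct WriteCursor {
  uint8_t* data;
  int64_t size_bytes;
};

// Write one byte without any bounds check and advance the cursor.
inline void WriteByteUnsafe(WriteCursor* cursor, uint8_t value) {
  assert(cursor != nullptr);
  cursor->data[0] = value;
  cursor->data += 1;
  cursor->size_bytes += 1;
}

// Recover the address of the first written byte.
inline uint8_t* WrittenStart(const WriteCursor& cursor) {
  return cursor.data - cursor.size_bytes;
}
```

Keeping the original pointer separately and advancing only an offset would avoid the subtraction, at the cost of changing the write helpers.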

WillAyd · 2023-09-26T21:07:47Z

c/driver/postgresql/postgres_copy_reader_test.cc

+
+  // The last 4 bytes of a message can be transmitted via PQputCopyData
+  // so no need to test those bytes from the Writer
+  for (size_t i = 0; i < sizeof(kTestPgCopyBoolean) - 4; i++) {


Ultimately, when we implement this in statement.cc, I imagine we will build the buffer (maybe even in chunks) and send that via PQputCopyData. When all is said and done, we would then call PQputCopyEnd to send the last 4 bytes. Maybe we should make the end message a constant in the tests so it is clear what is part of the "data" versus the sentinel signaling the end of the buffer.

lidavidm

This looks like a great start

lidavidm · 2023-09-27T18:45:12Z

c/driver/postgresql/postgres_copy_reader.h

+ private:
+  PostgresCopyFieldTupleWriter root_writer_;
+  struct ArrowSchema* schema_;
+  std::unique_ptr<struct ArrowArrayView> array_view_{new struct ArrowArrayView};


I believe Handle<ArrowArrayView> should work. I should really find the time to go write that adbc++ library.

lidavidm · 2023-09-27T18:45:37Z

c/driver/postgresql/postgres_copy_reader.h

+    return NANOARROW_OK;
+  }
+
+  int64_t array_size_approx_bytes() const { return array_size_approx_bytes_; }


Probably not? That was to let us build partial results fitting roughly within some bound.

lidavidm · 2023-09-27T18:46:02Z

c/driver/postgresql/postgres_copy_reader.h

+    children_[child_i]->Init(array_view_->children[child_i]);
+  }
+
+  ArrowErrorCode Write(ArrowBuffer* buffer, int64_t index, ArrowError* error) override {


I think indices are fine, reading data is different than writing it

lidavidm · 2023-09-27T18:46:26Z

c/driver/postgresql/postgres_copy_reader.h

@@ -117,6 +117,53 @@ ArrowErrorCode ReadChecked(ArrowBufferView* data, T* out, ArrowError* error) {
  return NANOARROW_OK;
 }

+// Write a value to a buffer without checking the buffer size. Advances


We could also factor out the headers into shared/nonshared parts at some point, I don't think it's a big deal either way.

lidavidm · 2023-09-27T18:47:31Z

c/driver/postgresql/postgres_copy_reader.h

+template <>
+inline void WriteUnsafe(ArrowBuffer* buffer, int8_t in) {
+  buffer->data[0] = in;
+  buffer->data += sizeof(int8_t);
+  buffer->size_bytes += sizeof(int8_t);
+}


FWIW, I'm not sure this is necessary. Compilers understand memcpy, and I would guess that they optimize the generic above to the same as these specializations.

The specializations exist because SwapNetworkToHost requires an unsigned argument, although that makes me realize these are incorrect as is.
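
One way to handle the unsigned requirement without per-type specializations is sketched below: a single template reinterprets the value as its unsigned counterpart with `memcpy`, then emits the bytes most significant first. This is only an illustration; `WriteBigEndian` is a hypothetical name, and it uses shifts rather than the SwapNetworkToHost helper from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Write an integer to `out` in network (big-endian) byte order.
// The caller must provide at least sizeof(T) writable bytes.
template <typename T>
inline void WriteBigEndian(uint8_t* out, T in) {
  static_assert(std::is_integral<T>::value && !std::is_same<T, bool>::value,
                "WriteBigEndian supports non-bool integral types only");
  assert(out != nullptr);

  // Reinterpret the bits as the unsigned type of the same width so that
  // right shifts are well defined for negative inputs.
  using U = typename std::make_unsigned<T>::type;
  U value;
  std::memcpy(&value, &in, sizeof(T));

  // Emit the most significant byte first.
  for (size_t i = 0; i < sizeof(T); i++) {
    out[i] = static_cast<uint8_t>(value >> (8 * (sizeof(T) - 1 - i)));
  }
}
```

Because the shifts operate on the value rather than on memory layout, the output is the same on little-endian and big-endian hosts.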

lidavidm · 2023-09-27T18:49:14Z

c/driver/postgresql/postgres_copy_reader.h

+  int64_t array_size_approx_bytes() const { return array_size_approx_bytes_; }
+
+  ArrowErrorCode WriteHeader(ArrowBuffer* buffer, ArrowError* error) {
+    ArrowBufferAppend(buffer, kPgCopyBinarySignature, sizeof(kPgCopyBinarySignature));


ah, so one pass to figure out the buffer size, then another pass to actually copy data?

WillAyd · 2023-09-28T00:32:19Z

I don't think the CI failures are related. Happy to have this merged now and work on more writers in follow-ups if you'd like.

lidavidm · 2023-09-28T16:55:57Z

Yeah, the CI failures are #1088

feat(c/driver/postgresql): Inital COPY Writer design

b2ff1ef

WillAyd requested a review from lidavidm as a code owner September 26, 2023 20:51

WillAyd added 8 commits September 26, 2023 17:10

destructor

2d8c4b3

try cast

b6c2492

size_t

a6b04fe

cursor

9b32b2d

cursor assignment

a4a4834

update test

a24ca5d

add bytes to buffer

dbdc4f2

use expect instead of assert

897248e

lidavidm approved these changes Sep 27, 2023

WillAyd added 5 commits September 27, 2023 15:09

fix specializations

55b0e60

remove array_size_approx_bytes from Writer

2a82645

add expect message

575b00f

initialize records_written

8548161

remove verbose EXPECT_EQ

b065b45

lidavidm approved these changes Sep 28, 2023

lidavidm added this to the ADBC Libraries 0.8.0 milestone Sep 28, 2023

lidavidm merged commit 7226c0e into apache:main Sep 28, 2023
60 of 64 checks passed

WillAyd deleted the postgres-copy-writer branch September 28, 2023 17:34
