Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(c/driver/postgresql): Support for writing DECIMAL types #1288

Merged
merged 35 commits into from
Jan 3, 2024

Conversation

WillAyd
Copy link
Contributor

@WillAyd WillAyd commented Nov 13, 2023

No description provided.

constexpr int kDecDigits = 4;

// TODO: need some kind of bounds check on this
int64_t decimal_int = ArrowDecimalGetIntUnsafe(&decimal);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is ultimately correct. I think ideally we would just use the bytes backing the Decimal object, but I haven't yet figured out how that all gets managed when multiple words are required

@WillAyd
Copy link
Contributor Author

WillAyd commented Nov 13, 2023

Going to require some more time and passes at this, but sharing in case there is any overall thoughts about the architecture.

Once the writer is done, I think it would also be nice to go back and have the COPY reader return decimals directly (right now they get mapped to strings)

@lidavidm
Copy link
Member

Once the writer is done, I think it would also be nice to go back and have the COPY reader return decimals directly (right now they get mapped to strings)

I think the problem is Postgres decimals are unconstrained in terms of precision/scale and support infinity, NaN, etc. while Arrow decimals are strictly semantics layered on top of 128-bit or 256-bit integers, so the mapping would be a bit wonky

if (decimal_int < 0) {
decimal_int = -decimal_int;
}
std::vector<int16_t> pg_digits;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably too small of an optimization to matter, but in principle you should be able to put an upper bound on the number of digits needed to represent an Arrow decimal, and then just stack-allocate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea that's true and actually how postgres does it internally.

https://github.com/postgres/postgres/blob/8680bae8463a0b213893ca6a1c5bb2c2530e823c/src/backend/utils/adt/numeric.c#L8026

If we wanted to stack allocate I guess would just expand that out to whatever is required to store up to 4 decimal words?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, decimal128 would be 38 digits and decimal256 would be 76

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK cool. Shouldn't be too hard to switch to that - just need to figure out how to handle once I get multi-word decimals supported

@WillAyd
Copy link
Contributor Author

WillAyd commented Nov 13, 2023

I think the problem is Postgres decimals are unconstrained in terms of precision/scale and support infinity, NaN, etc. while Arrow

Yea that makes sense. So I guess it is standard in the Arrow ecosystem to use a string here?

@lidavidm
Copy link
Member

I'm not sure about standard, but I don't think we have many choices here :/

@WillAyd
Copy link
Contributor Author

WillAyd commented Nov 13, 2023

Would it make sense to have a driver option to toggle that? I can see some users not caring much about inf/nan, so pretty unfortunate to not be able to read back in Decimal objects for subsequent computations

@lidavidm
Copy link
Member

Yeah, I think we're going to end up with some layer of type mapping shenanigans for every driver (SQLite and Snowflake also have to deal with this). We could attempt to convert to an Arrow decimal with given precision/scale based on the column type, and fail (or maybe insert NULL) for invalid values.

@lidavidm
Copy link
Member

Depending on what people need, we may also want to pursue an extension type (Python decimals behave more like Postgres decimals so there may be some demand, not sure)

@jorisvandenbossche
Copy link
Member

@WillAyd FYI some previous discussion about support for decimals on the read side: #767

@WillAyd
Copy link
Contributor Author

WillAyd commented Nov 14, 2023

That is great info thank you @jorisvandenbossche I'll post any more thoughts on the reader there going forward to keep the conversation in on thread. Thanks!

@WillAyd
Copy link
Contributor Author

WillAyd commented Nov 14, 2023

@lidavidm just want to make sure I have the right conceptual model on how decimal numbers get stored across multiple words. Assuming I had a really long sequence like 12345678901234567890 is this the proper way to construct and populate that decimal?

  const uint64_t large_decimal[2] = {1, 2345678901234567890};
  uint8_t large_decimal_bytes[16];
  std::memcpy(&large_decimal_bytes, large_decimal, sizeof(large_decimal_bytes));
  ArrowDecimalSetBytes(&decimal6, large_decimal_bytes);

Or do I need to worry about the endianness of the platform when defining large_decimal?

@lidavidm
Copy link
Member

Endianness does matter

/// Exact decimal value represented as an integer value in two's
/// complement. Currently only 128-bit (16-byte) and 256-bit (32-byte) integers
/// are used. The representation uses the endianness indicated
/// in the Schema.

Otherwise I think yes but it might be clearer to just write out the bytes

@WillAyd
Copy link
Contributor Author

WillAyd commented Nov 17, 2023

When it comes to spanning multiple words I see that the Arrow implementation uses a uint128_t to convert the bytes that span multiple words into a decimal-based value. Though I also see uint128_t comes from boost, which I don't think we want to take on as a dependency.

Do you have any high level guidance on how I should be looking at that conversion from multiple-word bytes into a decimal-based value?

@lidavidm
Copy link
Member

If we actually need arithmetic on 128 bit integers we could either detect __int128 or vendor an implementation

@github-actions github-actions bot added this to the ADBC Libraries 0.9.0 milestone Nov 29, 2023
@WillAyd WillAyd changed the title feat(c/driver/postgresql): Support for writing DECIMAL128 feat(c/driver/postgresql): Support for writing DECIMAL types Nov 30, 2023
@lidavidm
Copy link
Member

lidavidm commented Dec 7, 2023

For point 1 above I am still trying to figure out how to bolt on a parameterized suite of tests into the existing fixture without making it too complicated, although maybe we have to live with Decimal ingestions tests being different from the rest. Open to ideas

I don't think Googletest is that flexible. Either "parametrize" yourself by looping through a list of cases inside a single actual test case, or create a separate fixture that is parametrized. (Or handwrite a few selected cases.)

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM generally, I'll trust you on the algorithm

bool seen_decimal = scale_ == 0;
bool truncating_trailing_zeros = true;

const std::string decimal_string = DecimalToString<bitwidth_>(&decimal);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another micro-optimization might be to have a stack-allocated char array here and have DecimalToString just fill the char array and return the index of the start; avoids allocating a string in each iteration

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(though possibly, it gets all inlined and optimized away anyways)

const int start_pos = digits_remaining < kDecDigits ?
0 : digits_remaining - kDecDigits;
const size_t len = digits_remaining < 4 ? digits_remaining : kDecDigits;
std::string substr{decimal_string.substr(start_pos, len)};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

c++17 would let us use string_view to avoid the extra allocation; we could track indices manually here to avoid it explicitly for now

NANOARROW_RETURN_NOT_OK(WriteChecked<int16_t>(buffer, dscale, error));

for (auto pg_digit : pg_digits) {
NANOARROW_RETURN_NOT_OK(WriteChecked<int16_t>(buffer, pg_digit, error));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

presumably you could check once then memcpy the digits over

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(again, possibly the compiler already does this)

@WillAyd
Copy link
Contributor Author

WillAyd commented Dec 18, 2023

I don't think Googletest is that flexible. Either "parametrize" yourself by looping through a list of cases inside a single actual test case, or create a separate fixture that is parametrized. (Or handwrite a few selected cases.)

Thanks for that guidance. Went with the separate class to parametrize which I think is better for coverage. For now it is self-contained to the postgres tests

@lidavidm lidavidm removed this from the ADBC Libraries 0.9.0 milestone Dec 19, 2023
@github-actions github-actions bot added this to the ADBC Libraries 0.9.0 milestone Dec 22, 2023
@WillAyd WillAyd marked this pull request as ready for review December 22, 2023 17:20
@WillAyd
Copy link
Contributor Author

WillAyd commented Dec 22, 2023

OK I think this is reviewable now. Addressed some of the comments and improved test coverage.

The algorithm to convert from Decimal to string is definitely slow. Here is what the benchmark looks like:

2023-12-22T12:20:26-05:00
Running ./release/driver/postgresql/postgresql-benchmark
Run on (12 X 4700 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 1280 KiB (x6)
  L3 Unified 12288 KiB (x1)
Load Average: 1.34, 0.70, 0.51
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
---------------------------------------------------------------------------------
Benchmark                                       Time             CPU   Iterations
---------------------------------------------------------------------------------
BM_PostgresqlExecute/iterations:1         3757964 ns       277292 ns            1
BM_PostgresqlDecimalWrite/iterations:1   28636835 ns     24109617 ns            1

From some light research I think Arrow uses a "powers-of-ten" approach to determine what the digits should be. Happy to take a look at that in a follow up and see if we can make this faster

const std::vector<std::optional<ArrowDecimal*>> values = {
std::nullopt, &decimal1, &decimal2, &decimal3, &decimal4, &decimal5};

ArrowSchemaInit(&schema.value);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These 4 lines are essentially what adbc_validation::MakeSchema does. Though that function is templated by type, I wasn't sure if there was a way to make the template type and optionally precision / scale. There certainly could be a more graceful way of handling this in C++ that I am unaware of

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been meaning to overhaul this for a while now. (Or possibly give in and depend on arrow-cpp...) The current approach really only works for primitive types.

Comment on lines +1707 to +1716
// this is a bit of a hack to make std::vector play nicely with
// a dynamic number of stack-allocated ArrowDecimal objects
constexpr size_t max_decimals = 10;
struct ArrowDecimal decimals[max_decimals];
if (nrecords > max_decimals) {
FAIL() <<
" max_decimals exceeded for test case - please change parametrization";
}

std::vector<std::optional<ArrowDecimal*>> values;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's wrong with vector<optional<ArrowDecimal>>?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to change the MakeBatch implementation to use that type if you prefer; I think just went with the pointer as we use that same pattern with ArrowInterval * in MakeBatchImpl

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, no problem then.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I'd really like to overhaul these helpers now that we've gotten more usage out of them 😅

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea agreed these get a bit tough to use. If you have any high level ideas on what you'd like to see I would be happy to try and help as time permits. Think this would be a great exercise to continually improve my C++

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I need to think through things more but probably we will end up with something that looks more like Arrow C++. And I'd like to take a look at driver implementations and see if we can factor out a nanoarrow++ of sorts.

@lidavidm lidavidm merged commit 2116cff into apache:main Jan 3, 2024
53 of 54 checks passed
@WillAyd WillAyd deleted the copy-decimal branch June 28, 2024 14:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants