-
Notifications
You must be signed in to change notification settings - Fork 97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(c/driver/postgresql): Support for writing DECIMAL types #1288
Conversation
constexpr int kDecDigits = 4; | ||
|
||
// TODO: need some kind of bounds check on this | ||
int64_t decimal_int = ArrowDecimalGetIntUnsafe(&decimal); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is ultimately correct. I think ideally we would just use the bytes backing the Decimal object, but I haven't yet figured out how that all gets managed when multiple words are required
Going to require some more time and passes at this, but sharing in case there is any overall thoughts about the architecture. Once the writer is done, I think it would also be nice to go back and have the COPY reader return decimals directly (right now they get mapped to strings) |
I think the problem is Postgres decimals are unconstrained in terms of precision/scale and support infinity, NaN, etc. while Arrow decimals are strictly semantics layered on top of 128-bit or 256-bit integers, so the mapping would be a bit wonky |
if (decimal_int < 0) { | ||
decimal_int = -decimal_int; | ||
} | ||
std::vector<int16_t> pg_digits; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably too small of an optimization to matter, but in principle you should be able to put an upper bound on the number of digits needed to represent an Arrow decimal, and then just stack-allocate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea that's true and actually how postgres does it internally.
If we wanted to stack allocate I guess would just expand that out to whatever is required to store up to 4 decimal words?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, decimal128 would be 38 digits and decimal256 would be 76
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK cool. Shouldn't be too hard to switch to that - just need to figure out how to handle once I get multi-word decimals supported
Yea that makes sense. So I guess it is standard in the Arrow ecosystem to use a string here? |
I'm not sure about standard, but I don't think we have many choices here :/ |
Would it make sense to have a driver option to toggle that? I can see some users not caring much about inf/nan, so pretty unfortunate to not be able to read back in Decimal objects for subsequent computations |
Yeah, I think we're going to end up with some layer of type mapping shenanigans for every driver (SQLite and Snowflake also have to deal with this). We could attempt to convert to an Arrow decimal with given precision/scale based on the column type, and fail (or maybe insert NULL) for invalid values. |
Depending on what people need, we may also want to pursue an extension type (Python decimals behave more like Postgres decimals so there may be some demand, not sure) |
That is great info thank you @jorisvandenbossche I'll post any more thoughts on the reader there going forward to keep the conversation in on thread. Thanks! |
@lidavidm just want to make sure I have the right conceptual model on how decimal numbers get stored across multiple words. Assuming I had a really long sequence like const uint64_t large_decimal[2] = {1, 2345678901234567890};
uint8_t large_decimal_bytes[16];
std::memcpy(&large_decimal_bytes, large_decimal, sizeof(large_decimal_bytes));
ArrowDecimalSetBytes(&decimal6, large_decimal_bytes); Or do I need to worry about the endianness of the platform when defining |
Endianness does matter
Otherwise I think yes but it might be clearer to just write out the bytes |
When it comes to spanning multiple words I see that the Arrow implementation uses a uint128_t to convert the bytes that span multiple words into a decimal-based value. Though I also see uint128_t comes from boost, which I don't think we want to take on as a dependency. Do you have any high level guidance on how I should be looking at that conversion from multiple-word bytes into a decimal-based value? |
If we actually need arithmetic on 128 bit integers we could either detect |
I don't think Googletest is that flexible. Either "parametrize" yourself by looping through a list of cases inside a single actual test case, or create a separate fixture that is parametrized. (Or handwrite a few selected cases.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM generally, I'll trust you on the algorithm
bool seen_decimal = scale_ == 0; | ||
bool truncating_trailing_zeros = true; | ||
|
||
const std::string decimal_string = DecimalToString<bitwidth_>(&decimal); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another micro-optimization might be to have a stack-allocated char array here and have DecimalToString just fill the char array and return the index of the start; avoids allocating a string in each iteration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(though possibly, it gets all inlined and optimized away anyways)
const int start_pos = digits_remaining < kDecDigits ? | ||
0 : digits_remaining - kDecDigits; | ||
const size_t len = digits_remaining < 4 ? digits_remaining : kDecDigits; | ||
std::string substr{decimal_string.substr(start_pos, len)}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
c++17 would let us use string_view to avoid the extra allocation; we could track indices manually here to avoid it explicitly for now
NANOARROW_RETURN_NOT_OK(WriteChecked<int16_t>(buffer, dscale, error)); | ||
|
||
for (auto pg_digit : pg_digits) { | ||
NANOARROW_RETURN_NOT_OK(WriteChecked<int16_t>(buffer, pg_digit, error)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
presumably you could check once then memcpy the digits over
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(again, possibly the compiler already does this)
Thanks for that guidance. Went with the separate class to parametrize which I think is better for coverage. For now it is self-contained to the postgres tests |
OK I think this is reviewable now. Addressed some of the comments and improved test coverage. The algorithm to convert from Decimal to string is definitely slow. Here is what the benchmark looks like:
From some light research I think Arrow uses a "powers-of-ten" approach to determine what the digits should be. Happy to take a look at that in a follow up and see if we can make this faster |
const std::vector<std::optional<ArrowDecimal*>> values = { | ||
std::nullopt, &decimal1, &decimal2, &decimal3, &decimal4, &decimal5}; | ||
|
||
ArrowSchemaInit(&schema.value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These 4 lines are essentially what adbc_validation::MakeSchema
does. Though that function is templated by type, I wasn't sure if there was a way to make the template type and optionally precision / scale. There certainly could be a more graceful way of handling this in C++ that I am unaware of
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been meaning to overhaul this for a while now. (Or possibly give in and depend on arrow-cpp...) The current approach really only works for primitive types.
// this is a bit of a hack to make std::vector play nicely with | ||
// a dynamic number of stack-allocated ArrowDecimal objects | ||
constexpr size_t max_decimals = 10; | ||
struct ArrowDecimal decimals[max_decimals]; | ||
if (nrecords > max_decimals) { | ||
FAIL() << | ||
" max_decimals exceeded for test case - please change parametrization"; | ||
} | ||
|
||
std::vector<std::optional<ArrowDecimal*>> values; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's wrong with vector<optional<ArrowDecimal>>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to change the MakeBatch implementation to use that type if you prefer; I think just went with the pointer as we use that same pattern with ArrowInterval *
in MakeBatchImpl
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok, no problem then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, I'd really like to overhaul these helpers now that we've gotten more usage out of them 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea agreed these get a bit tough to use. If you have any high level ideas on what you'd like to see I would be happy to try and help as time permits. Think this would be a great exercise to continually improve my C++
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I need to think through things more but probably we will end up with something that looks more like Arrow C++. And I'd like to take a look at driver implementations and see if we can factor out a nanoarrow++ of sorts.
No description provided.