
Conversation

@curioustien (Contributor) commented Jan 25, 2025

Rationale for this change

As described in #44345, Parquet does not yet fully support Arrow decimal32/64. This change adds full support by performing the correct conversion between Arrow decimal32/64 and the Parquet decimal logical type.

What changes are included in this PR?

A few changes in this PR:

  • Support correct schema conversion between Parquet and arrow decimal32/64/128/256
  • Support writing arrow decimal32/64 to Parquet
  • Support reading Parquet decimal to arrow decimal32/64
  • Enforce the right decimal conversion based on the precision value
  • Allow decimal32/64 in the Arrow compute vector hash kernels, which is needed for some of the existing Parquet tests
  • Support converting pyarrow parquet decimal32/64 to pandas

Are these changes tested?

Yes

Are there any user-facing changes?

Yes. After this change, any decimal read from Parquet is converted to the corresponding Arrow decimal type based on its precision.
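
For illustration, here is a minimal sketch of that precision-based mapping. It is not the exact code in this PR; the helper name is hypothetical, and Arrow's own smallest_decimal(precision, scale) factory (mentioned later in this thread) provides essentially the same behavior. The kMaxPrecision constants are 9, 18, 38, and 76 for Decimal32/64/128/256 respectively.

    #include <cstdint>
    #include <memory>

    #include "arrow/type.h"

    // Hypothetical helper: pick the narrowest Arrow decimal type that can hold
    // `precision` digits, mirroring how a Parquet decimal(precision, scale)
    // logical type is mapped back to Arrow.
    std::shared_ptr<arrow::DataType> SmallestDecimalFor(int32_t precision, int32_t scale) {
      if (precision <= arrow::Decimal32Type::kMaxPrecision) {   // <= 9
        return arrow::decimal32(precision, scale);
      }
      if (precision <= arrow::Decimal64Type::kMaxPrecision) {   // <= 18
        return arrow::decimal64(precision, scale);
      }
      if (precision <= arrow::Decimal128Type::kMaxPrecision) {  // <= 38
        return arrow::decimal128(precision, scale);
      }
      return arrow::decimal256(precision, scale);
    }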

@mapleFU (Member) left a comment:

Thanks! I think the legacy Decimal128/Decimal256 data-writing behavior should not be changed.

if (maybe_type.ok()) {
arrow_type = maybe_type.ValueOrDie();

if (precision <= Decimal32Type::kMaxPrecision) {
Member:

Can we add a comment noting that the literal can be cast to the corresponding type if the real reader type is a wider decimal type?

Comment on lines 2051 to 2052
WRITE_SERIALIZE_CASE(DECIMAL128, Decimal128Type, Int32Type)
WRITE_SERIALIZE_CASE(DECIMAL256, Decimal256Type, Int32Type)
Member:

why not

      WRITE_SERIALIZE_CASE(DECIMAL32, Decimal32Type, Int32Type)
      WRITE_SERIALIZE_CASE(DECIMAL64, Decimal64Type, Int32Type)
      WRITE_SERIALIZE_CASE(DECIMAL128, Decimal128Type, Int32Type)
      WRITE_SERIALIZE_CASE(DECIMAL256, Decimal256Type, Int32Type)

Member:

Perhaps we don't need WRITE_SERIALIZE_CASE(DECIMAL64, Decimal64Type, Int32Type)?

Member:

Perhaps we don't need

I think we do need it; decimal64 is just a decimal type, it doesn't limit the precision?

Member:

I'm open to this but skeptical of its value.

WRITE_ZERO_COPY_CASE(DURATION, DurationType, Int64Type)
WRITE_SERIALIZE_CASE(DECIMAL128, Decimal128Type, Int64Type)
WRITE_SERIALIZE_CASE(DECIMAL256, Decimal256Type, Int64Type)
WRITE_SERIALIZE_CASE(DECIMAL64, Decimal64Type, Int64Type)
Member:

ditto

Comment on lines +2394 to +2400
scratch_i32 = reinterpret_cast<int32_t*>(scratch_buffer->mutable_data());
scratch_i64 = reinterpret_cast<int64_t*>(scratch_buffer->mutable_data());
Member:

why split this into two parts?

Contributor Author:

I'm not sure how to handle this code in a clean way. The main reason I had to do this is the int64_t* scratch pointer we're using. IIUC, the current code constructs the big-endian decimal representation in this scratch space, advancing the scratch pointer through memory as it goes.

If you look at the current logic, it uses byte_width to determine how many words of scratch space each input value needs. The int64_t* scratch pointer works for decimal64, decimal128, and decimal256. However, it doesn't work for decimal32, whose values are only 32 bits wide, so I had to create a second pointer of type int32_t*.

I may be misunderstanding how this code works, so feel free to correct me here.

Member:

Perhaps we can remove scratch_i32 and scratch_i64 and delay the casting until we use the scratch buffer?

Contributor Author:

I tried to delay the casting in this commit f279349 and got some test failures. After some debugging, I realized that we have to initialize the scratch pointer values here. If you look at the current Serialize function and FixDecimalEndianness function, we first allocate the scratch buffer and capture the scratch pointer, then move along that scratch pointer until we finish the output. Therefore, we need to do the initialization and casting in this method.

Member:

Can't we perform the reinterpret_cast right before calling ::arrow::bit_util::ToBigEndian to set it? At that time, you already know the byte_width.
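
If I follow the suggestion, a minimal sketch of that shape could look like the following. The names scratch, byte_width, and low_bits are placeholders rather than the actual writer variables; the point is only that the reinterpret_cast happens at the call site, once byte_width is known, right before the big-endian conversion.

    #include <cstdint>

    #include "arrow/util/endian.h"  // assumed header for ::arrow::bit_util::ToBigEndian

    // Sketch only: write one value's low word into the scratch buffer in
    // big-endian order, choosing the pointer type from byte_width instead of
    // keeping separate scratch_i32 / scratch_i64 pointers.
    void AppendBigEndianWord(uint8_t*& scratch, int byte_width, int64_t low_bits) {
      if (byte_width == 4) {
        *reinterpret_cast<int32_t*>(scratch) =
            ::arrow::bit_util::ToBigEndian(static_cast<int32_t>(low_bits));
        scratch += sizeof(int32_t);
      } else {
        *reinterpret_cast<int64_t*>(scratch) = ::arrow::bit_util::ToBigEndian(low_bits);
        scratch += sizeof(int64_t);
      }
    }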

def test_store_decimal_as_integer(tempdir):
arr_decimal_1_9 = pa.array(list(map(Decimal, range(100))),
type=pa.decimal128(5, 2))
type=pa.decimal32(5, 2))
Member:

This should not be changed; instead, we should add new tests here.

Contributor Author:

It's a little tricky to keep the test the same if we don't cast the type to a wider decimal. On the writer side, we can keep the same behavior from Arrow to Parquet, with additional support for decimal32/64.

However, on the reader side from Parquet to Arrow, the Parquet decimal format only contains precision and scale, without any knowledge of the different Arrow types (which is the correct behavior here). Therefore, in order to do the conversion, we look at the precision to convert it to decimal32/64/128/256 correspondingly.

For this test, which round-trips by writing to Parquet and reading back, the correct end result should be decimal32 when we read the data. I can modify this test case to cast the returned decimal to a wider decimal if that's what you meant.

Member:

So can we provide a type function, and test all possible types here?

case Type::DATE32:
case Type::TIME32:
case Type::INTERVAL_MONTHS:
case Type::DECIMAL32:
Member:

Are the changes to the compute kernels required to support Parquet? I can't see why, but I might be missing something. Otherwise, we should move adding decimal32/decimal64 support to those compute kernels to a different PR and leave this one with only the required Parquet changes.

Member:

OK, I see now; the description says this is required for some tests:
"Allow decimal32/64 in the Arrow compute vector hash kernels, which is needed for some of the existing Parquet tests"

Contributor Author:

I'm happy to split this change into another PR that can cover this support with more tests on the Arrow compute side. But yes, there are a few Parquet tests that hit the Arrow vector kernel code path.

@curioustien (Contributor Author):

Quick update:
Finding a good way to convert these decimal types and getting all the tests to pass is taking longer than I thought. I'll probably need a few more days.

@curioustien (Contributor Author):

@wgtmac @mapleFU I'm facing an implementation blocker on keeping the same writing/reading behaviors for decimal128/256 between Arrow and Parquet because I can't find a clean way to do it. As I mentioned in this previous comment #45351 (comment) on my original implementation, the current Parquet reading logic looks at the decimal precision to determine how to convert Parquet decimal logical type to Arrow decimal type. Since we introduced decimal32/64 in arrow, I had to change this logic to include these types based on the precision.

Therefore, whenever we want to cast a decimal32 to decimal128, we need to force the schema to convert to a bigger decimal. I found the Arrow Field.MergeWith method, which could do the job (I used it in one of the schema tests here). However, when I moved on to the reader/writer tests, I found that these schema fields can only be accessed from the manifest within FileReader, and it's read-only. Therefore, if I want to force schema conversion, I'll have to either:

  1. Change this manifest() method so that we can manipulate the schema fields
  2. Add a new property in ArrowReaderProperties so that we can propagate this schema conversion logic

I can't really come up with other options here. Both of these options require changes in some important classes, so I want to get some alignment before I proceed. Option 2 probably makes the most sense here. I'd also need to propagate this new property to pyarrow if we want the same behavior in Python.

What are your thoughts here on this problem? Any other alternatives that I should try?

@mapleFU (Member) commented Feb 6, 2025

This is really a problem. Currently Arrow has an option, store_schema(), which stores the Arrow schema in the Parquet file.

What about this: newly written decimal32/decimal64 can be read back as decimal32/decimal64, and legacy data goes through the legacy path?

@curioustien (Contributor Author):

@mapleFU Thanks for the pointer on store_schema(). I think we can leverage this option.

What about this: newly written decimal32/decimal64 can be read back as decimal32/decimal64, and legacy data goes through the legacy path?

I'm quite confused about this comment. Could you elaborate? Did you mean that if store_schema() is enabled then we'll convert things correctly, and otherwise we keep the legacy code and always convert to either decimal128 or decimal256? That kind of defeats the purpose of having decimal32 and decimal64 in Arrow, though, unless users know to enable the store_schema() flag.

@wgtmac (Member) commented Feb 7, 2025

If store_schema() is enabled, reading the Parquet file should just use the restored Arrow type. This is simple. However, if it is not used, I prefer to add a new option to ArrowReaderProperties to advise the reader that we need to use decimal type created by smallest_decimal(int32_t precision, int32_t scale).
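
A rough sketch of how such an option could be consumed; the setter name follows the one proposed in this PR but is illustrative, not a settled API.

    #include "parquet/properties.h"

    // Illustrative only: ask the Arrow reader to map Parquet decimals to the
    // smallest Arrow decimal type that fits the precision (instead of always
    // decimal128/decimal256) when no stored Arrow schema overrides it.
    parquet::ArrowReaderProperties MakeSmallestDecimalReaderProperties() {
      parquet::ArrowReaderProperties properties;
      properties.set_smallest_decimal_enabled(true);  // hypothetical setter from this PR
      return properties;
    }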

@curioustien (Contributor Author):

If store_schema() is enabled, reading the Parquet file should just use the restored Arrow type. This is simple. However, if it is not used, I prefer to add a new option to ArrowReaderProperties to advise the reader that we need to use decimal type created by smallest_decimal(int32_t precision, int32_t scale).

This makes sense to me. I'll proceed with this implementation then. Thanks for the discussion

@curioustien (Contributor Author):

@mapleFU @wgtmac I see that we already have a kinda similar flag called arrow_extensions_enabled from the JSON support. Should we just use this flag instead? It kinda makes sense to me to reuse this flag instead of introducing a new flag just for decimal type. I don't mind introducing a new flag like smallest_decimal_enabled, but I think it may make the API a little more complicated.

@curioustien force-pushed the parquet-decimal-test branch 2 times, most recently from 2ba19e9 to 6179dff (February 17, 2025)
Comment on lines 303 to 306

Status Init() {
Status Init(const ArrowReaderProperties& schema_arrow_reader_properties) {
return SchemaManifest::Make(writer_->schema(), /*schema_metadata=*/nullptr,
default_arrow_reader_properties(), &schema_manifest_);
schema_arrow_reader_properties, &schema_manifest_);
}
Contributor Author:

This is a bug in the current code: the Parquet writer always uses the default Arrow reader properties for its schema manifest. Instead, it should allow callers to pass in custom Arrow reader properties if needed.

Member:

I think you have linked the wrong file. The Init() function from parquet/arrow/reader.cc does not have this issue, right?

Contributor Author:

@wgtmac sorry for the late response! I think you're right. I probably misread the logic somehow. Reverting this change now.

@wgtmac (Member) commented Feb 19, 2025

We cannot reuse arrow_extensions_enabled because JSON is a canonical extension type but decimal is a native type.

The github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label (Mar 11, 2025).
@curioustien (Contributor Author):

@wgtmac @mapleFU Wanna take another look at this PR?

Result<std::shared_ptr<::arrow::DataType>> FromInt64(const LogicalType& logical_type);
Result<std::shared_ptr<::arrow::DataType>> FromByteArray(
const LogicalType& logical_type, bool arrow_extensions_enabled = false,
bool smallest_decimal_enabled = false);
Member:

Can we avoid these default parameters?

@curioustien (Contributor Author), Mar 29, 2025:

I can remove them for FromByteArray and FromFLBA and use ArrowReaderProperties instead. However, FromInt32 and FromInt64 are also used in FromInt32Statistics and FromInt64Statistics from StatisticsAsScalars. It doesn't look like I can thread the ArrowReaderProperties through from there, so I have to use default parameters.



Result<std::shared_ptr<ArrowType>> FromByteArray(
const LogicalType& logical_type, const ArrowReaderProperties& reader_properties) {
Result<std::shared_ptr<ArrowType>> FromByteArray(const LogicalType& logical_type,
Member:

I'm just wondering: why not simply pass const ArrowReaderProperties& reader_properties to all these functions? If we add a third parameter that also comes from the properties, then we should use the properties directly.

virtual void set_batch_size(int64_t batch_size) = 0;

/// Set whether to enable smallest decimal arrow type
virtual void set_smallest_decimal_enabled(bool smallest_decimal_enabled) = 0;
Member:

This looks weird. We cannot change it after the reader has been created. Isn't it accessible via the ArrowReaderProperties?

PARQUET_EXPORT
::arrow::Result<std::unique_ptr<FileReader>> OpenFile(
std::shared_ptr<::arrow::io::RandomAccessFile>, ::arrow::MemoryPool* allocator);
std::shared_ptr<::arrow::io::RandomAccessFile>, ::arrow::MemoryPool* pool,
Member:

Please also revert this change.

static constexpr ::arrow::Type::type type_id = ::arrow::Decimal32Type::type_id;
static constexpr int32_t precision = PRECISION;
static constexpr int32_t scale = PRECISION - 1;
static constexpr bool smallest_decimal_enabled = SMALLEST_DECIMAL_ENABLED;
Member:

Why do we need to add smallest_decimal_enabled to it?

Contributor Author:

This is mainly used for the tests in arrow_reader_writer_test.cc, in TestParquetIO.TestTypes from lines 984 to 997, where I use this variable to control whether the test type requires the reader to use the smallest decimal or not.

@curioustien requested a review from wgtmac (April 4, 2025)
@curioustien (Contributor Author):

@wgtmac Wanna give another round of review? It looks like the CI error isn't related to this change.

static Status DecimalIntegerTransfer(RecordReader* reader, MemoryPool* pool,
const std::shared_ptr<Field>& field, Datum* out) {
// Decimal128 and Decimal256 are only Arrow constructs. Parquet does not
// Decimal32 and Decimal64 are only Arrow constructs. Parquet does not
Member:

The comment doesn't seem correct?


#include "arrow/result.h"
#include "arrow/type_fwd.h"
#include "parquet/properties.h"
Member:

Please use a forward declaration instead of adding a new include.


@wgtmac (Member) commented May 22, 2025

Gentle ping @curioustien, do you have time to rebase to resolve the conflicts and address the comments?

@curioustien (Contributor Author):

@wgtmac thanks for the reminder. I've been quite busy at work lately. I'll try to pick it up this weekend when I have some free time. Otherwise, please feel free to take over if you have time

@HuaHuaY (Contributor) commented Aug 25, 2025

@wgtmac thanks for the reminder. I've been quite busy at work lately. I'll try to pick it up this weekend when I have some free time. Otherwise, please feel free to take over if you have time

I would like to take over the read/write path in C++ code.

@wgtmac (Member) commented Aug 25, 2025

@HuaHuaY Feel free to take it over 👍

@HuaHuaY (Contributor) commented Aug 26, 2025

I have opened a new PR #47427 and only modified the C++ part.

@wgtmac (Member) commented Aug 26, 2025

Let me close this first. Thanks @curioustien!

@wgtmac closed this on Aug 26, 2025