
Conversation

@gkpanda4 (Contributor) commented Aug 18, 2025

Which issue does this PR close?

What changes are included in this PR?

Bug Fixes

  • Fixed crash when ArrowSchemaConverter encounters unsigned datatypes
  • Resolved "Unsupported Arrow data type" errors for UInt8/16/32/64

Features

  • Added casting support for unsigned Arrow types (sketched after this list)
  • UInt8/16 → Int32 (safe casting to larger signed type)
  • UInt32 → Int64 (safe casting to larger signed type)
  • UInt64 → Error (no safe casting option, explicit error with guidance)
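
For reference, a minimal sketch of the mapping described above (illustrative only, assuming the iceberg and arrow_schema crates; the helper name map_unsigned and the exact error text are hypothetical, not the PR's actual code):

    use arrow_schema::DataType;
    use iceberg::spec::{PrimitiveType, Type};
    use iceberg::{Error, ErrorKind, Result};

    // Illustrative mapping only; the real logic lives in ArrowSchemaConverter's primitive handling.
    fn map_unsigned(dt: &DataType) -> Result<Type> {
        match dt {
            // UInt8/UInt16 fit losslessly in Iceberg's signed 32-bit int.
            DataType::UInt8 | DataType::UInt16 => Ok(Type::Primitive(PrimitiveType::Int)),
            // UInt32 fits losslessly in Iceberg's signed 64-bit long.
            DataType::UInt32 => Ok(Type::Primitive(PrimitiveType::Long)),
            // UInt64 has no lossless signed counterpart, so fail loudly (error text is illustrative).
            DataType::UInt64 => Err(Error::new(
                ErrorKind::DataInvalid,
                "Unsupported Arrow data type: UInt64 is not supported",
            )),
            other => Err(Error::new(
                ErrorKind::DataInvalid,
                format!("not covered by this sketch: {other}"),
            )),
        }
    }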

Code Changes

  • Enhanced ArrowSchemaConverter primitive() method with unsigned type handling
  • Added comprehensive test: test_unsigned_type_casting() for all unsigned variants

Files Modified

  • crates/iceberg/src/arrow/schema.rs

Impact

✅ No breaking changes - existing functionality preserved
✅ Safe type casting prevents overflow issues
✅ Clear error messages for unsupported UInt64 with alternatives
✅ Follows proven PyIceberg implementation approach

Are these changes tested?

  • All existing schema tests pass
  • New comprehensive test covers UInt8, UInt16, UInt32, UInt64 conversion behavior
  • Test verifies proper casting: UInt8/16→Int32, UInt32→Int64, UInt64→Error

@emkornfield (Contributor):

Sorry, I'm new to reviewing (and mostly new to the code base), so take these comments with a grain of salt, but this approach seems brittle:

  1. What happens if someone updates the doc field, removing the type information, and arrow-rs tries to read the data back?
  2. What happens if a non-arrow-rs reader tries to read data written from Arrow with these fields (in particular int32 and int64)?

It seems a more robust solution would be to:

  1. Convert uint32->int64
  2. Either still block uint64, convert uint64 to a Decimal with an appropriate precision to represent the full range (sketched below), or use int64 and validate that no values written are outside the appropriate range.
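
For reference, a minimal sketch of the Decimal option from point 2 (hypothetical helper, not part of this PR): u64::MAX is 18_446_744_073_709_551_615, i.e. 20 decimal digits, so Decimal(20, 0) can represent the full uint64 range without loss.

    use arrow_schema::DataType;
    use iceberg::spec::{PrimitiveType, Type};

    // Hypothetical mapping for the "uint64 -> decimal" alternative discussed above.
    fn uint64_as_decimal(dt: &DataType) -> Option<Type> {
        match dt {
            // Precision 20, scale 0 covers every possible u64 value.
            DataType::UInt64 => Some(Type::Primitive(PrimitiveType::Decimal {
                precision: 20,
                scale: 0,
            })),
            _ => None,
        }
    }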

@CTTY (Contributor) commented Aug 20, 2025

I have the same concern as @emkornfield: using doc to determine the field type seems unsafe to me. I think casting the type should be fine. This way there would be some type loss when converting the Iceberg schema back to an arrow schema, but that should be ok.

Also, the Python implementation can serve as a good reference. Note that PyIceberg uses bit width, while arrow-rs only provides primitive_width(), which returns the width in bytes.

@gkpanda4 (Contributor Author):

@emkornfield Right, the current approach risks silent data corruption because of its dependency on Arrow's doc field. @CTTY Thanks for the references, I will use them for the casting.

My updated approach uses safe bit-width casting for unsigned integer types, following the proven iceberg-python implementation:

  • uint8/uint16 → int32: safe upcast with no overflow risk
  • uint32 → int64: safe upcast preserving the full uint32 range
  • uint64 → explicit error: rather than risk data loss through an unsafe conversion, provide clear guidance directing users to choose between int64 (with range validation) or decimal (with full precision) based on their requirements

Let me know if there are any concerns; otherwise I will have the changes out.

// Cast unsigned types based on bit width (following Python implementation)
DataType::UInt8 | DataType::UInt16 | DataType::UInt32 => {
    // Cast to next larger signed type to prevent overflow
    let bit_width = p.primitive_width().unwrap_or(0) * 8; // Convert bytes to bits
Contributor:

This seems superfluous; can't you just match on the data types and map them directly?

DataType::UInt8 | DataType::UInt16 => Ok(Type::Primitive(PrimitiveType::Int)),
DataType::UInt32 => Ok(Type::Primitive(PrimitiveType::Long)),

Contributor Author:

Yes, I will simplify this logic.

DataType::Int8 | DataType::Int16 | DataType::Int32 => {
    Ok(Type::Primitive(PrimitiveType::Int))
}
// Cast unsigned types based on bit width (following Python implementation)
Contributor:

Suggested change:
- // Cast unsigned types based on bit width (following Python implementation)
+ // Cast unsigned types based on bit width to allow for no data loss

I'm not sure Python compatibility is a direct goal here?

Contributor Author:

Will incorporate this


// Test UInt8/UInt16 → Int32 casting
{
    let arrow_field = Field::new("test", DataType::UInt8, false).with_metadata(
Contributor:

nit: it doesn't look like this is testing UInt16?

Contributor Author:

I have added this scenario


#[test]
fn test_unsigned_type_casting() {
    // Test UInt32 → Int64 casting
Contributor:

Is it possible to parameterize at least the non-error cases with expected input/output pairs, to avoid boilerplate?

Contributor Author:

Done
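
For context, a rough sketch of what such a table-driven test could look like (hypothetical; the merged test may be structured differently, and the "PARQUET:field_id" metadata key plus the field_by_id lookup are assumptions about the converter's requirements):

    #[test]
    fn test_unsigned_type_casting_parameterized() {
        use std::collections::HashMap;
        use arrow_schema::{DataType, Field, Schema as ArrowSchema};
        use iceberg::arrow::arrow_schema_to_schema;
        use iceberg::spec::{PrimitiveType, Type};

        // (input arrow type, expected iceberg primitive) pairs for the non-error cases.
        let cases = [
            (DataType::UInt8, PrimitiveType::Int),
            (DataType::UInt16, PrimitiveType::Int),
            (DataType::UInt32, PrimitiveType::Long),
        ];

        for (arrow_type, expected) in cases {
            // The converter needs a field id on each field; "PARQUET:field_id" is assumed here.
            let metadata = HashMap::from([("PARQUET:field_id".to_string(), "1".to_string())]);
            let field = Field::new("test", arrow_type, false).with_metadata(metadata);
            let arrow_schema = ArrowSchema::new(vec![field]);

            let schema = arrow_schema_to_schema(&arrow_schema).unwrap();
            let converted = &*schema.field_by_id(1).unwrap().field_type;
            assert_eq!(converted, &Type::Primitive(expected));
        }
    }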

}
}

#[test]
Contributor:

This probably isn't the right module for it, but it would be nice to have a test that actually exercises writing these types and then reading them back?

Contributor Author:

I implemented an integration test for an unsigned-type roundtrip, but discovered that ParquetWriter would also need modification to handle unsigned data conversion. The issue stems from a type mismatch between the schema and the data.

The problem occurs because schema conversion (arrow_schema_to_schema) transforms the schema but leaves the actual data unchanged. When writing, Arrow validation fails due to this mismatch.
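
To illustrate the mismatch (a minimal sketch, not part of this PR; the column and field names are made up): the record batch still carries unsigned arrays while the converted arrow schema only has signed types, so the columns would have to be cast (e.g. with arrow's compute::cast) before the arrow writer would accept the batch.

    use std::sync::Arc;

    use arrow::array::{ArrayRef, UInt32Array};
    use arrow::compute::cast;
    use arrow::datatypes::{DataType, Field, Schema};
    use arrow::record_batch::RecordBatch;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Column as the user produced it: unsigned 32-bit values.
        let unsigned: ArrayRef = Arc::new(UInt32Array::from(vec![1, 2, u32::MAX]));

        // Schema the writer expects after the Iceberg -> arrow conversion: no unsigned types.
        let target = Arc::new(Schema::new(vec![Field::new("v", DataType::Int64, false)]));

        // Cast the column to the writer's type first; building the batch with the
        // original UInt32 array against this schema would fail with a type mismatch.
        let signed = cast(&unsigned, &DataType::Int64)?;
        let batch = RecordBatch::try_new(target, vec![signed])?;
        assert_eq!(batch.schema().field(0).data_type(), &DataType::Int64);
        Ok(())
    }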

@CTTY (Contributor) Aug 22, 2025:

I think writing record batches that contain unsigned types is out of the scope of the original issue and can be tricky:

  • ParquetWriter uses AsyncArrowWriter under the hood
  • AsyncArrowWriter uses an arrow schema that was converted from the Iceberg table schema
  • When converting an Iceberg schema to an arrow schema, the arrow schema won't have any unsigned types (and I don't think it makes sense to add them unless there is a valid use case)
  • Because of the schema mismatch between the record batches and that arrow schema, the arrow writer will fail

Contributor:

Thanks, from the original issue the scope seems ambiguous. It seems like this change makes it possible to create a schema from arrow with unsigned types, which might be helpful by itself, but I imagine the next thing a user would want to do is actually write the data?

It seems fine to check this in separately as long as there is a clean failure for the unsigned types (i.e. we don't silently lose data).

@emkornfield (Contributor) left a comment:

A few more comments, mostly nits. The additional test coverage would be my primary concern; the rest are style nits.

@gkpanda4 gkpanda4 requested review from emkornfield and CTTY August 22, 2025 20:37
@emkornfield (Contributor) left a comment:

Seems like we should figure out if this actually closes out the original issue or if we should keep it open for write support, but with my limited knowledge these changes seem reasonable.

@CTTY (Contributor) left a comment:

LGTM!

.to_string()
.contains("UInt64 is not supported")
);
}
Contributor:

nit: I think the brackets here and L1755 are excessive

@Xuanwo (Member) left a comment:

Thank you for working on this!

@Xuanwo Xuanwo merged commit 8bc44a7 into apache:main Sep 3, 2025
17 checks passed
Development

Successfully merging this pull request may close these issues.

bug: ArrowSchemaConverter can't handle unsigned datatypes from arrow