
Conversation

@gkpanda4 (Contributor) commented Aug 18, 2025

Which issue does this PR close?

What changes are included in this PR?

Bug Fixes

  • Fixed crash when ArrowSchemaConverter encounters unsigned datatypes
  • Resolved "Unsupported Arrow data type" errors for UInt8/16/32/64

Features

  • Added casting support for unsigned Arrow types (sketched after this list)
  • UInt8/16 → Int32 (safe casting to larger signed type)
  • UInt32 → Int64 (safe casting to larger signed type)
  • UInt64 → Error (no safe casting option, explicit error with guidance)
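
For reference, a minimal sketch of the mapping described above (illustrative only, assuming the iceberg and arrow_schema crates; the helper name map_unsigned and the exact error text are hypothetical, not the PR's actual code):

    use arrow_schema::DataType;
    use iceberg::spec::{PrimitiveType, Type};
    use iceberg::{Error, ErrorKind, Result};

    // Illustrative mapping only; the real logic lives in ArrowSchemaConverter's primitive handling.
    fn map_unsigned(dt: &DataType) -> Result<Type> {
        match dt {
            // UInt8/UInt16 fit losslessly in Iceberg's signed 32-bit int.
            DataType::UInt8 | DataType::UInt16 => Ok(Type::Primitive(PrimitiveType::Int)),
            // UInt32 fits losslessly in Iceberg's signed 64-bit long.
            DataType::UInt32 => Ok(Type::Primitive(PrimitiveType::Long)),
            // UInt64 has no lossless signed counterpart, so fail loudly (error text is illustrative).
            DataType::UInt64 => Err(Error::new(
                ErrorKind::DataInvalid,
                "Unsupported Arrow data type: UInt64 is not supported",
            )),
            other => Err(Error::new(
                ErrorKind::DataInvalid,
                format!("not covered by this sketch: {other}"),
            )),
        }
    }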

Code Changes

  • Enhanced ArrowSchemaConverter primitive() method with unsigned type handling
  • Added comprehensive test: test_unsigned_type_casting() for all unsigned variants

Files Modified

  • crates/iceberg/src/arrow/schema.rs

Impact

✅ No breaking changes - existing functionality preserved
✅ Safe type casting prevents overflow issues
✅ Clear error messages for unsupported UInt64 with alternatives
✅ Follows proven PyIceberg implementation approach

Are these changes tested?

  • All existing schema tests pass
  • New comprehensive test covers UInt8, UInt16, UInt32, UInt64 conversion behavior
  • Test verifies proper casting: UInt8/16→Int32, UInt32→Int64, UInt64→Error

@emkornfield (Contributor):

Sorry, I'm new to reviewing (and mostly new to the code base), so take these comments with a grain of salt, but this approach seems brittle:

  1. What happens if someone updates the doc field, removing the type information, and arrow-rs tries to read the data back?
  2. What happens if a non-arrow-rs reader tries to read data written from Arrow with these fields (in particular int32 and int64)?

It seems a more robust solution would be to:

  1. Convert uint32->int64
  2. Either still block uint64, convert uint64 to a Decimal with an appropriate precision to represent the full range (sketched below), or use int64 and validate that no values written are outside the appropriate range.
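
For reference, a minimal sketch of the Decimal option from point 2 (hypothetical helper, not part of this PR): u64::MAX is 18_446_744_073_709_551_615, i.e. 20 decimal digits, so Decimal(20, 0) can represent the full uint64 range without loss.

    use arrow_schema::DataType;
    use iceberg::spec::{PrimitiveType, Type};

    // Hypothetical mapping for the "uint64 -> decimal" alternative discussed above.
    fn uint64_as_decimal(dt: &DataType) -> Option<Type> {
        match dt {
            // Precision 20, scale 0 covers every possible u64 value.
            DataType::UInt64 => Some(Type::Primitive(PrimitiveType::Decimal {
                precision: 20,
                scale: 0,
            })),
            _ => None,
        }
    }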

@CTTY (Contributor) commented Aug 20, 2025

I have the same concern as @emkornfield: using doc to determine the field type seems unsafe to me. I think casting the type should be fine. This way there would be some type loss when converting the Iceberg schema back to an arrow schema, but that should be ok.

Also, the Python implementation can serve as a good reference. Note that PyIceberg uses bit width, while arrow-rs only provides primitive_width(), which returns the width in bytes.

@gkpanda4 (Contributor Author):

@emkornfield Right, the current approach risks silent data corruption because of its dependency on Arrow's doc field. @CTTY Thanks for the references, I will use them for the casting.

My updated approach uses safe bit-width casting for unsigned integer types, following the proven iceberg-python implementation:

  • uint8/uint16 → int32: safe upcast with no overflow risk
  • uint32 → int64: safe upcast preserving the full uint32 range
  • uint64 → explicit error: rather than risk data loss through an unsafe conversion, provide clear guidance directing users to choose between int64 (with range validation) or decimal (with full precision) based on their requirements

Let me know if there are any concerns; otherwise I will have the changes out.

// Cast unsigned types based on bit width (following Python implementation)
DataType::UInt8 | DataType::UInt16 | DataType::UInt32 => {
    // Cast to next larger signed type to prevent overflow
    let bit_width = p.primitive_width().unwrap_or(0) * 8; // Convert bytes to bits
Contributor:

This seems superfluous; can't you just match on the data types and map them directly?

DataType::UInt8 | DataType::UInt16 => Ok(Type::Primitive(PrimitiveType::Int)),
DataType::UInt32 => Ok(Type::Primitive(PrimitiveType::Long)),

Contributor Author:

Yes, I will simplify this logic.

DataType::Int8 | DataType::Int16 | DataType::Int32 => {
    Ok(Type::Primitive(PrimitiveType::Int))
}
// Cast unsigned types based on bit width (following Python implementation)
Contributor:

Suggested change:
- // Cast unsigned types based on bit width (following Python implementation)
+ // Cast unsigned types based on bit width to allow for no data loss

I'm not sure Python compatibility is a direct goal here?

Contributor Author:

Will incorporate this


// Test UInt8/UInt16 → Int32 casting
{
    let arrow_field = Field::new("test", DataType::UInt8, false).with_metadata(
Contributor:

nit: it doesn't look like this is testing UInt16?

Contributor Author:

I have added this scenario


#[test]
fn test_unsigned_type_casting() {
    // Test UInt32 → Int64 casting
Contributor:

Is it possible to parameterize at least the non-error cases with expected input/output pairs, to avoid boilerplate?

Contributor Author:

Done
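
For context, a rough sketch of what such a table-driven test could look like (hypothetical; the merged test may be structured differently, and the "PARQUET:field_id" metadata key plus the field_by_id lookup are assumptions about the converter's requirements):

    #[test]
    fn test_unsigned_type_casting_parameterized() {
        use std::collections::HashMap;
        use arrow_schema::{DataType, Field, Schema as ArrowSchema};
        use iceberg::arrow::arrow_schema_to_schema;
        use iceberg::spec::{PrimitiveType, Type};

        // (input arrow type, expected iceberg primitive) pairs for the non-error cases.
        let cases = [
            (DataType::UInt8, PrimitiveType::Int),
            (DataType::UInt16, PrimitiveType::Int),
            (DataType::UInt32, PrimitiveType::Long),
        ];

        for (arrow_type, expected) in cases {
            // The converter needs a field id on each field; "PARQUET:field_id" is assumed here.
            let metadata = HashMap::from([("PARQUET:field_id".to_string(), "1".to_string())]);
            let field = Field::new("test", arrow_type, false).with_metadata(metadata);
            let arrow_schema = ArrowSchema::new(vec![field]);

            let schema = arrow_schema_to_schema(&arrow_schema).unwrap();
            let converted = &*schema.field_by_id(1).unwrap().field_type;
            assert_eq!(converted, &Type::Primitive(expected));
        }
    }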

}
}

#[test]
Contributor:

This probably isn't the right module for it, but it would be nice to have a test that actually exercises writing these types and then reading them back?

Contributor Author:

I implemented an integration test for an unsigned-type roundtrip, but discovered that ParquetWriter would also need modification to handle unsigned data conversion. The issue stems from a type mismatch between the schema and the data.

The problem occurs because schema conversion (arrow_schema_to_schema) transforms the schema but leaves the actual data unchanged. When writing, Arrow validation fails due to this mismatch.
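
To illustrate the mismatch (a minimal sketch, not part of this PR; the column and field names are made up): the record batch still carries unsigned arrays while the converted arrow schema only has signed types, so the columns would have to be cast (e.g. with arrow's compute::cast) before the arrow writer would accept the batch.

    use std::sync::Arc;

    use arrow::array::{ArrayRef, UInt32Array};
    use arrow::compute::cast;
    use arrow::datatypes::{DataType, Field, Schema};
    use arrow::record_batch::RecordBatch;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Column as the user produced it: unsigned 32-bit values.
        let unsigned: ArrayRef = Arc::new(UInt32Array::from(vec![1, 2, u32::MAX]));

        // Schema the writer expects after the Iceberg -> arrow conversion: no unsigned types.
        let target = Arc::new(Schema::new(vec![Field::new("v", DataType::Int64, false)]));

        // Cast the column to the writer's type first; building the batch with the
        // original UInt32 array against this schema would fail with a type mismatch.
        let signed = cast(&unsigned, &DataType::Int64)?;
        let batch = RecordBatch::try_new(target, vec![signed])?;
        assert_eq!(batch.schema().field(0).data_type(), &DataType::Int64);
        Ok(())
    }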

@CTTY (Contributor) Aug 22, 2025:

I think writing record batches that contain unsigned types is out of the scope of the original issue and can be tricky:

  • ParquetWriter uses AsyncArrowWriter under the hood
  • AsyncArrowWriter uses an arrow schema that was converted from the Iceberg table schema
  • When converting an Iceberg schema to an arrow schema, the arrow schema won't have any unsigned types (and I don't think it makes sense to add them unless there is a valid use case)
  • Because of the schema mismatch between the record batches and that arrow schema, the arrow writer will fail

Contributor:

Thanks, from the original issue the scope seems ambiguous. It seems like this change makes it possible to create a schema from arrow with unsigned types, which might be helpful by itself, but I imagine the next thing a user would want to do is actually write the data?

It seems fine to check this in separately as long as there is a clean failure for the unsigned types (i.e. we don't silently lose data).

@emkornfield (Contributor) left a comment:

A few more comments, mostly nits. The additional test coverage would be my primary concern; the rest are style nits.

@gkpanda4 gkpanda4 requested review from emkornfield and CTTY August 22, 2025 20:37
@emkornfield (Contributor) left a comment:

Seems like we should figure out if this actually closes out the original issue or if we should keep it open for write support, but with my limited knowledge these changes seem reasonable.

@CTTY (Contributor) left a comment:

LGTM!

.to_string()
.contains("UInt64 is not supported")
);
}
Contributor:

nit: I think the brackets here and L1755 are excessive

@Xuanwo (Member) left a comment:

Thank you for working on this!

@Xuanwo Xuanwo merged commit 8bc44a7 into apache:main Sep 3, 2025
17 checks passed
Development

Successfully merging this pull request may close these issues.

bug: ArrowSchemaConverter can't handle unsigned datatypes from arrow