52 changes: 52 additions & 0 deletions crates/iceberg/src/arrow/schema.rs
@@ -378,7 +378,16 @@ impl ArrowSchemaVisitor for ArrowSchemaConverter {
            DataType::Int8 | DataType::Int16 | DataType::Int32 => {
                Ok(Type::Primitive(PrimitiveType::Int))
            }
            DataType::UInt8 | DataType::UInt16 => Ok(Type::Primitive(PrimitiveType::Int)),
            DataType::UInt32 => Ok(Type::Primitive(PrimitiveType::Long)),
            DataType::Int64 => Ok(Type::Primitive(PrimitiveType::Long)),
            DataType::UInt64 => {
                // Block uint64 - no safe casting option
                Err(Error::new(
                    ErrorKind::DataInvalid,
                    "UInt64 is not supported. Use Int64 for values ≤ 9,223,372,036,854,775,807 or Decimal(20,0) for full uint64 range.",
                ))
            }
            DataType::Float32 => Ok(Type::Primitive(PrimitiveType::Float)),
            DataType::Float64 => Ok(Type::Primitive(PrimitiveType::Double)),
            DataType::Decimal128(p, s) => Type::decimal(*p as u32, *s as u32).map_err(|e| {
@@ -1717,6 +1726,49 @@ mod tests {
        }
    }

Contributor:
this probably isn't the right module, but it would probably be nice to have a test that actually exercises writing these types and then reading them back again?

Contributor Author:
I implemented an integration test for the unsigned-type roundtrip, but discovered that ParquetWriter also requires modification to handle the unsigned data conversion. The issue stems from a type mismatch between the schema and the data.

The problem occurs because the schema conversion (arrow_schema_to_schema) transforms the metadata but leaves the actual data unchanged, so when writing, Arrow validation fails due to this mismatch.
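
A minimal sketch of that failure mode, using only arrow-rs (the field name and values are made up for illustration): a batch whose column is still UInt32 fails validation against the converted schema's signed field.

use std::sync::Arc;

use arrow_array::{ArrayRef, RecordBatch, UInt32Array};
use arrow_schema::{DataType, Field, Schema};

fn main() {
    // After schema conversion, the table schema carries the widened signed type.
    let converted = Arc::new(Schema::new(vec![Field::new("n", DataType::Int64, false)]));

    // The incoming data still carries the original unsigned type.
    let column: ArrayRef = Arc::new(UInt32Array::from(vec![1, 2, 3]));

    // RecordBatch::try_new validates each column against the schema,
    // so the UInt32 column is rejected against the Int64 field.
    assert!(RecordBatch::try_new(converted, vec![column]).is_err());
}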

Contributor (@CTTY, Aug 22, 2025):
I think writing record batches that contain unsigned types is out of the scope of the original issue and can be tricky:

  • ParquetWriter uses AsyncArrowWriter under the hood
  • AsyncArrowWriter uses an arrow schema that was converted from the Iceberg table schema
  • When converting an Iceberg schema to an arrow schema, the arrow schema won't have any unsigned types (and I don't think it makes sense to do so unless there is a valid use case)
  • Given the schema mismatch between the record batches and the arrow schema, the arrow writer will fail (see the round-trip sketch below)
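
To make the schema round trip concrete, here is a sketch assuming the crate's public arrow_schema_to_schema and schema_to_arrow_schema helpers plus the "PARQUET:field_id" metadata key used elsewhere in this file: a UInt32 field widens to an Iceberg long on the way in and comes back out as Int64, so the unsignedness is gone by the time the writer sees the schema.

use std::collections::HashMap;

use arrow_schema::{DataType, Field, Schema as ArrowSchema};
use iceberg::arrow::{arrow_schema_to_schema, schema_to_arrow_schema};

fn main() {
    // An arrow field carrying the field-id metadata the converter requires.
    let field = Field::new("n", DataType::UInt32, false).with_metadata(HashMap::from([(
        "PARQUET:field_id".to_string(),
        "1".to_string(),
    )]));
    let arrow_schema = ArrowSchema::new(vec![field]);

    // UInt32 widens to an Iceberg long...
    let iceberg_schema = arrow_schema_to_schema(&arrow_schema).unwrap();

    // ...which converts back to Int64, not UInt32.
    let round_tripped = schema_to_arrow_schema(&iceberg_schema).unwrap();
    assert_eq!(round_tripped.field(0).data_type(), &DataType::Int64);
}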

Contributor:
Thanks, from the original issue the scope seems ambiguous. It seems like this change makes it possible to create a schema from arrow with unsigned types, which might be helpful by itself, but I imagine the next thing the user would want to do is actually write the data?

It seems fine to check this in separately as long as there is a clean failure for the unsigned types (i.e. we don't silently lose data).
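
A caller-side sketch of that widen-or-fail behavior, using arrow-rs's arrow_cast::cast (this flow is illustrative, not part of the PR): unsigned columns are widened losslessly to the signed types the converted schema expects, mirroring the UInt32 -> long mapping above.

use std::sync::Arc;

use arrow_array::{Array, ArrayRef, UInt32Array};
use arrow_cast::cast;
use arrow_schema::DataType;

fn main() {
    let unsigned: ArrayRef = Arc::new(UInt32Array::from(vec![0, 1, u32::MAX]));

    // Lossless widening: every u32 value fits in an i64, so no data
    // is silently dropped on the way to the Int64 field.
    let widened = cast(&unsigned, &DataType::Int64).unwrap();
    assert_eq!(widened.data_type(), &DataType::Int64);
}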

    #[test]
    fn test_unsigned_integer_type_conversion() {
        let test_cases = vec![
            (DataType::UInt8, PrimitiveType::Int),
            (DataType::UInt16, PrimitiveType::Int),
            (DataType::UInt32, PrimitiveType::Long),
        ];

        for (arrow_type, expected_iceberg_type) in test_cases {
            let arrow_field = Field::new("test", arrow_type.clone(), false).with_metadata(
                HashMap::from([(PARQUET_FIELD_ID_META_KEY.to_string(), "1".to_string())]),
            );
            let arrow_schema = ArrowSchema::new(vec![arrow_field]);

            let iceberg_schema = arrow_schema_to_schema(&arrow_schema).unwrap();
            let iceberg_field = iceberg_schema.as_struct().fields().first().unwrap();

            assert!(
                matches!(iceberg_field.field_type.as_ref(), Type::Primitive(t) if *t == expected_iceberg_type),
                "Expected {:?} to map to {:?}",
                arrow_type,
                expected_iceberg_type
            );
        }

        // Test UInt64 blocking
        {
            let arrow_field = Field::new("test", DataType::UInt64, false).with_metadata(
                HashMap::from([(PARQUET_FIELD_ID_META_KEY.to_string(), "1".to_string())]),
            );
            let arrow_schema = ArrowSchema::new(vec![arrow_field]);

            let result = arrow_schema_to_schema(&arrow_schema);
            assert!(result.is_err());
            assert!(
                result
                    .unwrap_err()
                    .to_string()
                    .contains("UInt64 is not supported")
            );
        }
    }

Contributor:
nit: I think the brackets here and L1755 are excessive
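
Separately, the converter's error message points at Decimal(20,0) for the full uint64 range. A minimal sketch of that representation with arrow-rs's Decimal128Array (again illustrative, not part of the PR): u64::MAX has 20 decimal digits, so precision 20 with scale 0 holds every uint64 value.

use std::sync::Arc;

use arrow_array::{Array, ArrayRef, Decimal128Array};

fn main() {
    // u64::MAX = 18_446_744_073_709_551_615 fits in i128, and in a
    // decimal with precision 20 and scale 0.
    let values: Vec<i128> = vec![0, 1, u64::MAX as i128];
    let array: ArrayRef = Arc::new(
        Decimal128Array::from(values)
            .with_precision_and_scale(20, 0)
            .unwrap(),
    );
    assert_eq!(array.len(), 3);
}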

    #[test]
    fn test_datum_conversion() {
        {