Iceberg table writer writes space to _x20 in column names in the Parquet file #23129

yingsu00 · 2024-07-03T15:00:12Z

Using the IcebergQueryRunner, CREATE TABLE followed by INSERT replaces each space in a column name with "_x20". For example, "two words" is written as "two_x20words" in the Parquet file. This doesn't happen with Hive tables, where "two words" is written unchanged as "two words".

Your Environment

N/A

Expected Behavior

The Iceberg table writer writes spaces as is, in alignment with Hive.

Current Behavior

The Iceberg table writer replaces each space " " with "_x20". This is not desired.

Possible Solution

Steps to Reproduce

Start IcebergQueryRunner (Java). In the Presto CLI, run the following:

create table space2 ("two words" int) with (format='parquet');
insert into space2 values (1), (2), (3);

Then put a breakpoint at com/facebook/presto/hive/parquet/ParquetPageSourceFactory.java line 207 and inspect the column names in fileSchema:
MessageType fileSchema = fileMetaData.getSchema();

Repeat the same steps using HiveQueryRunner, and you will see that the column name in the file schema is different.

Screenshots (if appropriate)

Context


karteekmurthys · 2024-07-10T15:47:51Z

The Iceberg module sanitizes column names that contain special characters. Check here:
org/apache/iceberg/parquet/TypeToMessageType.java


    for (NestedField field : schema.columns()) {
      builder.addField(field(field));
    }

    return builder.named(AvroSchemaUtil.makeCompatibleName(name));
  }

Further up the stack it reaches here in org/apache/iceberg/avro/AvroSchemaUtil.java:

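private static String sanitize(char character) {
  // A digit is prefixed with an underscore.
  if (Character.isDigit(character)) {
    return "_" + character;
  }
  // Anything else becomes "_x" plus its hex code, e.g. ' ' (0x20) becomes "_x20".
  return "_x" + Integer.toHexString(character).toUpperCase();
}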
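To make the effect on "two words" concrete, here is a minimal standalone sketch of the sanitization. The per-character method mirrors the snippet above. The string-level loop and the class name ColumnNameSanitizerSketch are assumptions for illustration: the loop assumes a name may contain only letters, digits, and underscores and must not start with a digit, which may differ in detail from what AvroSchemaUtil actually does.

```java
final class ColumnNameSanitizerSketch {
    // Mirrors the per-character method quoted above.
    static String sanitize(char character) {
        if (Character.isDigit(character)) {
            return "_" + character;
        }
        return "_x" + Integer.toHexString(character).toUpperCase();
    }

    // Assumed string-level rule (hypothetical, for illustration only):
    // the first character must be a letter or '_', the remaining
    // characters must be letters, digits, or '_'. Everything else is
    // replaced by its sanitized form.
    static String sanitize(String name) {
        StringBuilder sb = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            boolean valid = (i == 0)
                    ? (Character.isLetter(c) || c == '_')
                    : (Character.isLetterOrDigit(c) || c == '_');
            sb.append(valid ? String.valueOf(c) : sanitize(c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The space (0x20) is the only character that gets rewritten here.
        System.out.println(sanitize("two words")); // prints two_x20words
    }
}
```

A name with no special characters, such as "plain_name", passes through this sketch unchanged.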

The behavior comes from the Iceberg Java library that Presto depends on, not from Presto itself.

hantangwangd · 2024-07-10T16:33:28Z

It seems Iceberg sanitizes all characters that aren't allowed in Avro field names, including the space " ". See apache/iceberg#216 (comment).

yingsu00 · 2024-07-16T16:44:00Z

So this is expected behavior. Closing for now. We will support reading such columns through name-id mapping in Velox. See facebookincubator/velox#10085
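As a rough illustration of why resolving columns by ID sidesteps the renamed column, here is a small sketch. It is not the Velox implementation (Velox is C++), and the class name, method name, and the field ID value 1 are all hypothetical. It only assumes that the table schema and the Parquet file schema both carry a field ID for each column, so the reader can match on the ID instead of the name.

```java
import java.util.Map;
import java.util.Optional;

final class FieldIdResolutionSketch {
    // Looks up the physical column name in the file for a logical table
    // column by going through the shared field ID rather than the name.
    static Optional<String> resolvePhysicalName(
            String tableColumn,
            Map<String, Integer> tableFieldIds,
            Map<Integer, String> fileColumnsById) {
        Integer fieldId = tableFieldIds.get(tableColumn);
        if (fieldId == null) {
            return Optional.empty(); // column is not in the table schema
        }
        return Optional.ofNullable(fileColumnsById.get(fieldId));
    }

    public static void main(String[] args) {
        // Hypothetical IDs: the table column "two words" has field ID 1,
        // and the file stores that field under the sanitized name.
        Map<String, Integer> tableFieldIds = Map.of("two words", 1);
        Map<Integer, String> fileColumnsById = Map.of(1, "two_x20words");

        System.out.println(
                resolvePhysicalName("two words", tableFieldIds, fileColumnsById)
                        .orElse("<not found>")); // prints two_x20words
    }
}
```

With a lookup of this shape, the reader never has to compare "two words" against "two_x20words" directly.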

yingsu00 added the bug label Jul 3, 2024

github-project-automation bot added this to Bugs and support requests Jul 3, 2024

github-project-automation bot moved this to 🆕 Unprioritized in Bugs and support requests Jul 3, 2024

yingsu00 added the iceberg Apache Iceberg related label Jul 3, 2024

github-project-automation bot added this to Iceberg Support Jul 3, 2024

github-project-automation bot moved this to 🆕 Unprioritized in Iceberg Support Jul 3, 2024

yingsu00 mentioned this issue Jul 3, 2024

Presto C++ can't read Iceberg tables with spaces in the column names #23131

Closed

ethanyzhang assigned karteekmurthys and pdabre12 Jul 8, 2024

yingsu00 closed this as completed Jul 16, 2024

github-project-automation bot moved this from 🆕 Unprioritized to ✅ Done in Iceberg Support Jul 16, 2024

github-project-automation bot moved this from 🆕 Unprioritized to ✅ Done in Bugs and support requests Jul 16, 2024
