Iceberg table writer writes space to _x20 in column names in the Parquet file #23129

Closed
yingsu00 opened this issue Jul 3, 2024 · 3 comments
Labels
bug iceberg Apache Iceberg related

Comments

@yingsu00
Contributor

yingsu00 commented Jul 3, 2024

With the IcebergQueryRunner, creating a table and inserting into it changes each space in a column name to "_x20". For example, "two words" is written as "two_x20words" in the Parquet file. This doesn't happen with Hive tables, where "two words" is written unchanged as "two words".

Your Environment

N/A

Expected Behavior

The Iceberg table writer writes spaces as is, in alignment with Hive.

Current Behavior

The Iceberg table writer replaces each space " " with "_x20". This is not desired.

Possible Solution

Steps to Reproduce

Start IcebergQueryRunner (Java). In the Presto CLI, run the following:

create table space2 ("two words" int) with (format='parquet');
insert into space2 values (1), (2), (3);

Then set a breakpoint at com/facebook/presto/hive/parquet/ParquetPageSourceFactory.java, line 207:
MessageType fileSchema = fileMetaData.getSchema();

Repeat the same steps using HiveQueryRunner and inspect fileSchema again: the column name there is "two words" rather than "two_x20words".

Screenshots (if appropriate)

image

Context

@karteekmurthys
Contributor

The Iceberg module sanitizes column names that contain special characters. See org/apache/iceberg/parquet/TypeToMessageType.java:


    for (NestedField field : schema.columns()) {
      builder.addField(field(field));
    }

    return builder.named(AvroSchemaUtil.makeCompatibleName(name));
  }

Further up the stack it reaches here in org/apache/iceberg/avro/AvroSchemaUtil.java:

  private static String sanitize(char character) {
    if (Character.isDigit(character)) {
      return "_" + character;
    }
    return "_x" + Integer.toHexString(character).toUpperCase();
  }

So the renaming comes from the Iceberg Java library that Presto depends on, not from Presto's own writer code.
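
To make the effect concrete, here is a minimal standalone sketch of how a per-character escape like the one above turns "two words" into "two_x20words". It is not Iceberg's actual code: the class `SanitizeSketch` and the methods `sanitizeChar` and `sanitizeName` are hypothetical names, and the validity rule (an ASCII letter or underscore first, then ASCII letters, digits, or underscores) is an assumption based on the Avro naming rules.

```java
final class SanitizeSketch {
    // Hypothetical helper mirroring the per-character escape quoted above:
    // a digit becomes "_<digit>", anything else becomes "_x<HEX CODE>".
    static String sanitizeChar(char c) {
        if (Character.isDigit(c)) {
            return "_" + c;
        }
        return "_x" + Integer.toHexString(c).toUpperCase();
    }

    static boolean isAsciiLetter(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }

    // Assumption: a character is kept if it is an ASCII letter or '_',
    // or an ASCII digit that is not in the first position.
    static String sanitizeName(String name) {
        StringBuilder sb = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            boolean isDigitAfterFirst = i > 0 && c >= '0' && c <= '9';
            boolean allowed = isAsciiLetter(c) || c == '_' || isDigitAfterFirst;
            sb.append(allowed ? String.valueOf(c) : sanitizeChar(c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The space is code point 0x20, hence the "_x20" seen in the Parquet schema.
        System.out.println(sanitizeName("two words")); // prints two_x20words
    }
}
```

Under these assumptions, a name that is already valid, such as "two_words", passes through unchanged, which is why only columns with special characters differ between the Iceberg and Hive files.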

@hantangwangd
Member

It seems Iceberg sanitizes every character that isn't allowed in Avro field names, including the space " ". See apache/iceberg#216 (comment).

@yingsu00
Contributor Author

So this is expected behavior. Closing for now. We will support reading such columns through name-id mapping in Velox. See facebookincubator/velox#10085

@github-project-automation github-project-automation bot moved this from 🆕 Unprioritized to ✅ Done in Iceberg Support Jul 16, 2024
@github-project-automation github-project-automation bot moved this from 🆕 Unprioritized to ✅ Done in Bugs and support requests Jul 16, 2024