
First commit on supporting parquet #650

Open · wants to merge 27 commits into base: main

Conversation

@unical1988 (Author)

Important Read

  • Please ensure the GitHub issue is mentioned at the beginning of the PR

What is the purpose of the pull request

This PR is a first attempt (building on a previous attempt) to add support for syncing Parquet files to XTable.

Brief change log

  • Added a schema extractor for Parquet (closely modeled on Avro's)
  • Added a table extractor using metadata
  • Added a conversion source script

@vinishjail97 (Contributor) left a comment

Thanks for your first contribution @unical1988, can you run mvn spotless:apply and push the PR?

import org.apache.xtable.spi.extractor.ConversionSource;

@Builder
public class ParquetConversionSource implements ConversionSource<Long> {
Contributor

A few clarification questions to ensure we are on the same page.

  1. Will this source assume partitioning based on the directory structure, or can the user choose partition columns from the parquet file schema?
  2. If a parquet file is removed from the source root path, will it be handled or ignored? Using file notifications makes this easier, but we can find a way to do this through listing as well.

Author

  1. I guess getPartitionFromDirectoryStructure() does get the partitions from the directory, but I can optionally add retrieving the partitioning from columns (as set by the user).
  2. I can add a catch for FileNotFoundException to handle the case where the source path is not found; see the sketch below.
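
For reference, a minimal sketch of that listing-plus-missing-path handling, using the Hadoop FileSystem API (the class and method names here are hypothetical, not the PR's actual code):

import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ParquetFileLister {

  // Recursively lists parquet files under basePath; returns an empty list when the
  // source root no longer exists, instead of failing the whole sync.
  public static List<LocatedFileStatus> listParquetFiles(Configuration conf, String basePath) {
    List<LocatedFileStatus> files = new ArrayList<>();
    try {
      Path root = new Path(basePath);
      FileSystem fs = root.getFileSystem(conf);
      RemoteIterator<LocatedFileStatus> it = fs.listFiles(root, true /* recursive */);
      while (it.hasNext()) {
        LocatedFileStatus status = it.next();
        if (status.getPath().getName().endsWith(".parquet")) {
          files.add(status);
        }
      }
    } catch (FileNotFoundException e) {
      return Collections.emptyList();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return files;
  }
}

Note this only covers a missing root; detecting individual files removed since the last sync would still need a diff against previously synced state.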

@ashvina (Contributor) left a comment

Hi @unical1988,
Thank you for your contribution! To ensure we fully understand the scope and assumptions of the Parquet support feature, could you please submit an RFC? A high-level description will make it easier for us to review the PR. For instance, please include details about schema consistency and validation across files, and the different assumptions and error conditions.
Thanks again!

@@ -0,0 +1,249 @@
package org.apache.xtable.parquet;
Contributor

The Apache License header is missing. Please run the spotless plugin on the code.

Author

I just did. On the other hand, I'm not sure what an RFC is; are there docs that explain it?

Contributor

Here is an example: #634

Contributor

@unical1988 Here's the template; I can help if you have more clarifications, and we can discuss in Slack.
https://github.com/apache/incubator-xtable/blob/main/rfc/template.md


Comment on lines 118 to 122
case BYTES:
case JSON:
case BSON:
case FIXED:
logicalType = schema.getLogicalType();
Contributor

Any reason for combining all of them into a single classification? FIXED can be separate IMO; BYTES, JSON, and BSON are byte-array-like types.

Author

I added the BYTE_ARRAY type (is there any metadata to add?), following this: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md.

Comment on lines 48 to 61
@Builder.Default
private static final ParquetSchemaConverter schemaExtractor =
ParquetSchemaConverter.getInstance();

@Builder.Default
private static final ParquetMetadataExtractor parquetMetadataExtractor =
ParquetMetadataExtractor.getInstance();

@Builder.Default
private static final ParquetPartitionHelper parquetPartitionHelper =
ParquetPartitionHelper.getInstance();

private Map<String, List<String>> initPartitionInfo() {
  return getPartitionFromDirectoryStructure(hadoopConf, basePath, Collections.emptyMap());
Contributor

We need a ParquetStatsExtractor that computes column stats from the parquet footer. These are populated in InternalDataFile.
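
For illustration, a minimal sketch of what such an extractor could read from the footer using the parquet-hadoop API (the class shape is an assumption; in XTable these values would be mapped into ColumnStat entries on InternalDataFile rather than printed):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class ParquetStatsExtractor {

  // Reads per-column-chunk statistics (value counts, sizes, min/max/null counts)
  // from the footer of a single parquet file.
  public static void printColumnStats(Configuration conf, Path file) throws IOException {
    try (ParquetFileReader reader = ParquetFileReader.open(HadoopInputFile.fromPath(file, conf))) {
      ParquetMetadata footer = reader.getFooter();
      for (BlockMetaData block : footer.getBlocks()) {
        for (ColumnChunkMetaData column : block.getColumns()) {
          System.out.printf(
              "%s: values=%d, totalSize=%d, uncompressedSize=%d, stats=%s%n",
              column.getPath(),
              column.getValueCount(),
              column.getTotalSize(),
              column.getTotalUncompressedSize(),
              column.getStatistics());
        }
      }
    }
  }
}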

ParquetSchemaConverter.getInstance();

@Builder.Default
private static final ParquetMetadataExtractor parquetMetadataExtractor =
Contributor

Is this class for stats?

Author

@vinishjail97 no, but I just added a first method for that in a class named ParquetStatsExtractor

@vinishjail97 (Contributor) left a comment

Thanks for working on the PR @unical1988, added comments.

There seems to be some confusion about extracting partition values, let me know what you think of this.

basePath/
    p1/..        (can be recursive partitions for parquet files)
    p2/..
    p3/..
    .hoodie/     (Hudi metadata)
    metadata/    (Iceberg metadata)
    _delta_log/  (Delta metadata)

To extract the partition fields (emphasis on fields here, not the actual values) we can do it in two ways:

  1. Assume the table is not partitioned; this would just sync the parquet files to the target formats using the physical paths you have extracted in one of the classes. When you read those tables, partition pruning won't work.
  2. Ask for user input (from YAML configuration) specifying the partition fields from the parquet file schema. Many of these analytical datasets are partitioned by date, either through an actual date column in the parquet file or a timestamp field from which the date is extracted.

@@ -34,12 +34,16 @@
class ExternalTable {
/** The name of the table. */
protected final @NonNull String name;

Contributor

Are these changes coming from mvn spotless:apply? Wondering why the latest main branch doesn't reflect these.


Author

Regarding how to get partitionValues, I think it is best to discuss this in today's meeting, but in any case, can it be asked of the user via YAML configuration?


import org.apache.xtable.model.storage.TableFormat;

/**
* Extracts {@link InternalTable} canonical representation of a table at a point in time for Delta.
Contributor

"for Delta" may be confused with the table format; this should be enough, I guess?
Extracts {@link InternalTable} canonical representation of a table at a point in time

@Builder.Default
private static final ParquetMetadataExtractor parquetMetadataExtractor =
ParquetMetadataExtractor.getInstance();
private Map<String, List<String>> initPartitionInfo() {
Contributor

nit: add a new line between lines 45 and 46.

Comment on lines 55 to 58
Set<String> partitionKeys = initPartitionInfo().keySet();
List<InternalPartitionField> partitionFields =
    partitionExtractor.getInternalPartitionField(partitionKeys, schema);
Contributor

There seems to be some confusion here.

initPartitionInfo().keySet() will return all the unique partition paths for all parquet files combined.

InternalPartitionField is the partition field name for the table, not the values.

@unical1988 (Author) commented on Feb 24, 2025

Does this mean that

  1. getPartitionFromDirectoryStructure() (in ParquetConversionSource.java) should return the partitions accounting for the table name (only one file)?
  2. that same method should perhaps return partition values for the table rather than field names (e.g., in line 254, currentPartitionMap.computeIfAbsent(partitionKeyValue[1]/*instead of the actual partitionKeyValue[0]*/, k -> new ArrayList<>()).add(partitionKeyValue[1]);)?

Contributor

Inferring the partition fields from the directory structure without knowing how the user generated it is difficult, so we can support both of these options. getPartitionsFromUserConfiguration(..) is one name I could think of right now; feel free to change it (a sketch follows below).

#650 (review)
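
A rough sketch of what the suggested method might look like, resolving user-supplied column names against the canonical schema (SchemaFieldFinder and the exact InternalPartitionField builder fields are assumptions based on the surrounding XTable code, not part of this PR):

import java.util.List;
import java.util.stream.Collectors;

import org.apache.xtable.model.schema.InternalPartitionField;
import org.apache.xtable.model.schema.InternalSchema;
import org.apache.xtable.model.schema.PartitionTransformType;
import org.apache.xtable.schema.SchemaFieldFinder;

public class ParquetPartitionExtractor {

  // Builds partition fields from user-configured column names instead of
  // inferring them from the directory layout.
  public List<InternalPartitionField> getPartitionsFromUserConfiguration(
      InternalSchema schema, List<String> configuredFieldNames) {
    return configuredFieldNames.stream()
        .map(
            fieldName ->
                InternalPartitionField.builder()
                    .sourceField(SchemaFieldFinder.getInstance().findFieldByPath(schema, fieldName))
                    .transformType(PartitionTransformType.VALUE) // no transform by default
                    .build())
        .collect(Collectors.toList());
  }
}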

Comment on lines 22 to 23
public List<InternalPartitionField> getInternalPartitionField(
Set<String> partitionList, InternalSchema schema) {
Contributor

I don't think this would work, because partitionList is a list of partition field values, and you won't find those in the schema.

Comment on lines 119 to 134
logicalType = schema.getLogicalType();
// TODO: any metadata to add ?
if (logicalType == LogicalTypes.JSON) {
  newDataType = InternalType.JSON;
} else if (logicalType instanceof LogicalTypes.BSON) {
  newDataType = InternalType.BSON;
} else if (logicalType instanceof LogicalTypes.VARIANT) {
  newDataType = InternalType.VARIANT;
} else if (logicalType instanceof LogicalTypes.GEOMETRY) {
  newDataType = InternalType.GEOMETRY;
} else if (logicalType instanceof LogicalTypes.GEOGRAPHY) {
  newDataType = InternalType.GEOGRAPHY;
Contributor

Would it be simpler to map them to BYTES?

Contributor

Let me know your thoughts as well; I was asking because the targets (Iceberg, Delta, and Hudi) don't seem to support these types and just map them to byte array or binary.

* parquet data types and canonical data types.
*/
@NoArgsConstructor(access = AccessLevel.PRIVATE)
public class ParquetSchemaConverter {
Contributor

This class looks okay, and thanks for putting this up. Can you add unit tests for it as well?

Author

I will add unit tests; are there similar test examples I can start from?


Comment on lines 39 to 41
List<PartitionValue> partitionValues =
    partitionExtractor.getPartitionValue(
        parentPath, file.getPath().toString(), schema, partitionInfo);
Contributor

The partitionValues are not the actual partition values; it's a list of the partitioning fields in the schema with their range.
https://github.com/apache/incubator-xtable/blob/main/xtable-api/src/main/java/org/apache/xtable/model/storage/InternalDataFile.java#L44
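
For context, a small sketch of the shape InternalDataFile expects, assuming XTable's PartitionValue and Range API (the field and value here are made up for illustration):

import java.util.Collections;
import java.util.List;

import org.apache.xtable.model.schema.InternalPartitionField;
import org.apache.xtable.model.stat.PartitionValue;
import org.apache.xtable.model.stat.Range;

public class PartitionValueSketch {

  // One PartitionValue pairs a partition *field* with the range of values
  // (here a single scalar) that a given file covers.
  static List<PartitionValue> partitionValuesFor(InternalPartitionField dateField, String date) {
    return Collections.singletonList(
        PartitionValue.builder()
            .partitionField(dateField)
            .range(Range.scalar(date))
            .build());
  }
}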

@unical1988 (Author)

> [quoting @vinishjail97's comment above on extracting partition fields]

We would want to read the configuration (or the partition fields) into a Java object (if I am not wrong). p1/ could then be date - year - month - day, p2/ could be location, and p3/ could be ID; given these fields, we could extract the partitionValues located in the related subdirectories for a specific parquet file. Is that correct? If yes, how could the Java object be defined?


Stats valueCountStats = new Stats();
Stats allStats = new Stats();
Stats uncStats = new Stats();
Contributor

What does unc mean here?

Author

uncompressed

Contributor

Let's use the full word; it will be clearer.

switch (schema.getName()) {
  // PrimitiveTypes
  case "INT64":
    logicalType = schema.getLogicalTypeAnnotation();
Contributor

Should newDataType be set to LONG when there is no logical type?

Similarly, it seems that wherever a logical type is possible, newDataType is not set at all when that logical type is absent.

Author

If 'no logical type' refers to the case where the logical type is null, then the UNKNOWN case is my answer.

What is meant by "there is a possibility of a logical type"? Do you mean all handled cases except the default one? How is it not set?

Contributor

You do not set newDataType unless there is a logical type. This means that for a plain long, newDataType is simply null, which is unexpected.

@unical1988 (Author) commented on Mar 11, 2025

What is the case of "no logical type" here? Is it the default case?
I would say that if a plain long is detected, then the UNKNOWN case applies. How do you suggest handling that case?

Contributor

When there is no logical type on the INT64, you can set newDataType to InternalType.LONG.
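
To illustrate, a sketch of how the INT64 branch could default (the timestamp mapping is just one example of an annotated INT64; the class and variable names are assumptions mirroring the snippet above):

import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.PrimitiveType;

import org.apache.xtable.model.schema.InternalType;

public class Int64Mapping {

  // Maps a parquet INT64 primitive to the canonical type, defaulting to LONG
  // when no logical type annotation is present (the plain long case).
  static InternalType mapInt64(PrimitiveType primitiveType) {
    LogicalTypeAnnotation logicalType = primitiveType.getLogicalTypeAnnotation();
    if (logicalType == null) {
      return InternalType.LONG;
    }
    if (logicalType instanceof LogicalTypeAnnotation.TimestampLogicalTypeAnnotation) {
      return InternalType.TIMESTAMP;
    }
    // ... other INT64-backed logical types (TIME, INTEGER, etc.) go here
    return InternalType.LONG;
  }
}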

Author

Should INT64 be InternalType.INT or InternalType.LONG?

new org.apache.parquet.parquet.ParquetSchemaConverter()
    .convert(parquetMetadataExtractor.getSchema(parquetMetadata));

Set<String> partitionKeys = initPartitionInfo().keySet();
Contributor

This will be user input.

@vinishjail97 (Contributor)

> [quoting the comment above on extracting partition fields]
public class InputPartitionColumn {
  String fieldName;
  PartitionTransformType transformType;
}

InputPartitionKeyConfig should be part of the Table object in DatasetConfig.

1. No transform -> the values for the partition keys in the parquet file are concatenated and the partitionPath is generated; configure this in the InternalTable object.
2. Transformation -> timestamp -> transform(timestamp) -> year/date/month/xyz.parquet

A sketch of both cases follows.
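
A sketch of how the two cases might turn into partition paths (InputPartitionColumn is the proposed shape above; the concatenation order and date layout are assumptions):

import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class PartitionPathGenerator {

  // Case 1: no transform -> concatenate the raw partition key values.
  static String noTransformPath(List<String> partitionFieldNames, Map<String, Object> record) {
    return partitionFieldNames.stream()
        .map(field -> String.valueOf(record.get(field)))
        .collect(Collectors.joining("/"));
  }

  // Case 2: a timestamp column transformed into year/month/day directories.
  static String timestampTransformPath(long epochMillis) {
    return DateTimeFormatter.ofPattern("yyyy/MM/dd")
        .withZone(ZoneOffset.UTC)
        .format(Instant.ofEpochMilli(epochMillis));
  }
}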
