
Remove the hacked ParquetMetadataConverter.java and the need for the ParquetHadoop module. #901

Closed
jcferretti opened this issue Jul 26, 2021 · 1 comment · Fixed by #4457

jcferretti commented Jul 26, 2021

We have a copy of the parquet-hadoop file ParquetMetadataConverter, with local modifications. We hack it in on top of the real one to get different behavior, presumably to work around some issue in our own reader (since, apparently, nobody else needs to hack this file to make Parquet work in Java).

Figure out what the issue is, fix it, and remove the need for the hacked copy.

Relates to #806 (likely a prerequisite, but not certain if it is enough).

--

I tried. Didn't go well. I removed the code and module in this branch:
https://github.com/jcferretti/deephaven-core/tree/cfs-parquethadoop-module-removal-0

Trying to read back a file written with LZ4 results in:

org.apache.parquet.io.ParquetDecodingException: could not read page in col [store_and_fwd_flag] optional binary store_and_fwd_flag (STRING) as the dictionary was missing for encoding RLE_DICTIONARY
at io.deephaven.parquet.ColumnPageReaderImpl.getDataReader(ColumnPageReaderImpl.java:760)
at io.deephaven.parquet.ColumnPageReaderImpl.readPageV1(ColumnPageReaderImpl.java:333)
at io.deephaven.parquet.ColumnPageReaderImpl.readDataPage(ColumnPageReaderImpl.java:201)
at io.deephaven.parquet.ColumnPageReaderImpl.materialize(ColumnPageReaderImpl.java:75)
at io.deephaven.db.v2.locations.parquet.topage.ToPage.getResult(ToPage.java:52)
at io.deephaven.db.v2.locations.parquet.topage.ToPage.toPage(ToPage.java:77)
at io.deephaven.db.v2.locations.parquet.ColumnChunkPageStore.toPage(ColumnChunkPageStore.java:158)
at io.deephaven.db.v2.locations.parquet.VariablePageSizeColumnChunkPageStore.getPage(VariablePageSizeColumnChunkPageStore.java:111)
at io.deephaven.db.v2.locations.parquet.VariablePageSizeColumnChunkPageStore.getPageContaining(VariablePageSizeColumnChunkPageStore.java:150)
at io.deephaven.db.v2.locations.parquet.VariablePageSizeColumnChunkPageStore.getPageContaining(VariablePageSizeColumnChunkPageStore.java:17)
at io.deephaven.db.v2.sources.chunk.page.PageStore.fillChunk(PageStore.java:67)
at io.deephaven.db.v2.sources.regioned.ParquetColumnRegionBase.fillChunk(ParquetColumnRegionBase.java:50)
at io.deephaven.db.v2.sources.regioned.DeferredColumnRegionBase.fillChunk(DeferredColumnRegionBase.java:71)
at io.deephaven.db.v2.sources.chunk.page.PageStore.fillChunk(PageStore.java:71)
at io.deephaven.db.v2.sources.regioned.RegionedColumnSourceBase.fillChunk(RegionedColumnSourceBase.java:31)
at io.deephaven.db.v2.sources.regioned.RegionedColumnSourceObject$AsValues.fillChunk(RegionedColumnSourceObject.java:37)
at io.deephaven.db.v2.remote.ConstructSnapshot.getSnapshotDataAsChunk(ConstructSnapshot.java:1365)
at io.deephaven.db.v2.remote.ConstructSnapshot.serializeAllTable(ConstructSnapshot.java:1285)
at io.deephaven.db.v2.remote.ConstructSnapshot.lambda$constructBackplaneSnapshotInPositionSpace$2(ConstructSnapshot.java:575)
at io.deephaven.db.v2.remote.ConstructSnapshot.callDataSnapshotFunction(ConstructSnapshot.java:1045)
at io.deephaven.db.v2.remote.ConstructSnapshot.callDataSnapshotFunction(ConstructSnapshot.java:977)
at io.deephaven.db.v2.remote.ConstructSnapshot.constructBackplaneSnapshotInPositionSpace(ConstructSnapshot.java:578)
at io.deephaven.grpc_api.barrage.BarrageMessageProducer.getSnapshot(BarrageMessageProducer.java:1524)
at io.deephaven.grpc_api.barrage.BarrageMessageProducer.updateSubscriptionsSnapshotAndPropagate(BarrageMessageProducer.java:926)
at io.deephaven.grpc_api.barrage.BarrageMessageProducer.access$1400(BarrageMessageProducer.java:89)
at io.deephaven.grpc_api.barrage.BarrageMessageProducer$UpdatePropagationJob.run(BarrageMessageProducer.java:790)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at io.deephaven.grpc_api.runner.DeephavenApiServerModule$ThreadFactory.lambda$newThread$0(DeephavenApiServerModule.java:143)
at java.lang.Thread.run(Thread.java:748)
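
A quick way to narrow this down (a hedged sketch, not part of the investigation so far: it assumes the stock parquet-hadoop API and the file path from the repro below) would be to dump the footer metadata with the unmodified reader and check whether a dictionary page offset was recorded for the failing column chunk at all:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.BlockMetaData;
    import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
    import org.apache.parquet.hadoop.util.HadoopInputFile;

    public class DumpDictionaryOffsets {
        public static void main(String[] args) throws Exception {
            // Open the file with stock parquet-hadoop, bypassing our hacked converter.
            try (ParquetFileReader reader = ParquetFileReader.open(
                    HadoopInputFile.fromPath(new Path("/data/t_LZ4.parquet"), new Configuration()))) {
                for (BlockMetaData block : reader.getFooter().getBlocks()) {
                    for (ColumnChunkMetaData column : block.getColumns()) {
                        // If no dictionary page offset is recorded while the pages are
                        // RLE_DICTIONARY-encoded, the footer metadata (not the LZ4
                        // decompressor) is the likely source of "the dictionary was missing".
                        System.out.printf("%s dictionaryPageOffset=%d firstDataPageOffset=%d%n",
                                column.getPath(), column.getDictionaryPageOffset(),
                                column.getFirstDataPageOffset());
                    }
                }
            }
        }
    }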

The code I used:

t = readTable('/data/eth_v2_p1_cBROTLI.parquet')   # From the deephaven-core-parquet-examples repo
writeTable(t, 'LZ4', '/data/t_LZ4.parquet')
tmore = readTable('/data/t_LZ4.parquet')
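
As a follow-up sanity check (again a sketch, with the path assumed from above), the same file could be read back with the stock parquet-hadoop example reader; if that succeeds, the file our writer produced is well-formed and the bug is confined to our reading path:

    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.hadoop.ParquetReader;
    import org.apache.parquet.hadoop.example.GroupReadSupport;

    public class SanityReadLz4 {
        public static void main(String[] args) throws Exception {
            // Materialize every row through the unmodified parquet-hadoop code path;
            // a genuinely missing dictionary page would fail here as well.
            try (ParquetReader<Group> reader = ParquetReader
                    .builder(new GroupReadSupport(), new Path("/data/t_LZ4.parquet"))
                    .build()) {
                long rows = 0;
                while (reader.read() != null) {
                    rows++;
                }
                System.out.println("read " + rows + " rows");
            }
        }
    }
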
jcferretti added the feature request, triage, and parquet labels on Jul 26, 2021
jcferretti added this to the Backlog milestone on Jul 26, 2021
jcferretti self-assigned this on Jul 26, 2021
jcferretti changed the title from "Remove the hack that makes the ParquetHadoop module necessary and remove the module." to "Remove the hacked ParquetMetadataConverter.java and the need for the ParquetHadoop module." on Aug 9, 2021

rcaudy commented Aug 9, 2021

Even worse, this file has bad conversion support for converted type:

            if (schemaElement.isSetConverted_type()) {
                OriginalType originalType = getLogicalTypeAnnotation(schemaElement.converted_type, schemaElement).toOriginalType();
                OriginalType newOriginalType = (schemaElement.isSetLogicalType() && getLogicalTypeAnnotation(schemaElement.logicalType) != null) ?
                        getLogicalTypeAnnotation(schemaElement.logicalType).toOriginalType() : null;
                if (!originalType.equals(newOriginalType)) {
                    if (newOriginalType != null) {
                        LOG.warn("Converted type and logical type metadata mismatch (convertedType: {}, logical type: {}). Using value in converted type.",
                                schemaElement.converted_type, schemaElement.logicalType);
                    }
                    childBuilder.as(originalType);
                }
            }

The one in ParquetFileReader is better:

            if (schemaElement.isSetConverted_type()) {
                LogicalTypeAnnotation originalType = getLogicalTypeAnnotation(schemaElement.converted_type, schemaElement);
                LogicalTypeAnnotation newOriginalType = schemaElement.isSetLogicalType() && getLogicalTypeAnnotation(schemaElement.logicalType) != null ?
                        getLogicalTypeAnnotation(schemaElement.logicalType) : null;
                if (!originalType.equals(newOriginalType)) {
                    ((org.apache.parquet.schema.Types.Builder)childBuilder).as(originalType);
                }
            }
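
The practical difference: OriginalType is Parquet's deprecated, parameterless enum, while LogicalTypeAnnotation carries the full type parameters, so funneling everything through OriginalType can silently drop information. A minimal sketch (assuming the stock parquet-mr 1.11+ schema builders; the class name is made up) of a type the converted-type path cannot represent:

    import org.apache.parquet.schema.LogicalTypeAnnotation;
    import org.apache.parquet.schema.LogicalTypeAnnotation.TimeUnit;
    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
    import org.apache.parquet.schema.Types;

    public class LogicalVsConvertedType {
        public static void main(String[] args) {
            // A local (not adjusted to UTC) microsecond timestamp exists only as a
            // LogicalTypeAnnotation; no OriginalType / converted type encodes
            // isAdjustedToUTC=false, so the ParquetMetadataConverter snippet above,
            // which converts through OriginalType, loses that flag.
            MessageType schema = Types.buildMessage()
                    .optional(PrimitiveTypeName.INT64)
                    .as(LogicalTypeAnnotation.timestampType(/* isAdjustedToUTC */ false, TimeUnit.MICROS))
                    .named("local_ts")
                    .named("root");
            System.out.println(schema);
        }
    }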
