
Remove the hacked ParquetMetadataConverter.java and the need for the ParquetHadoop module. #901

Closed
jcferretti opened this issue Jul 26, 2021 · 1 comment · Fixed by #4457

jcferretti commented Jul 26, 2021

We have a copy of the parquet-hadoop file ParquetMetadataConverter, with local modifications. We hack it in on top of the real one to get different behavior, presumably to work around some issue in our own reader (since, apparently, nobody else needs to hack this file to make Parquet work in Java).

Figure out what the issue is, fix it, and remove the need for the hacked copy.

Relates to #806 (likely a prerequisite, but not certain if it is enough).

--

I tried. Didn't go well. I removed the code and module in this branch:
https://github.com/jcferretti/deephaven-core/tree/cfs-parquethadoop-module-removal-0

Trying to read back a file written with LZ4 results in:

org.apache.parquet.io.ParquetDecodingException: could not read page in col [store_and_fwd_flag] optional binary store_and_fwd_flag (STRING) as the dictionary was missing for encoding RLE_DICTIONARY
at io.deephaven.parquet.ColumnPageReaderImpl.getDataReader(ColumnPageReaderImpl.java:760)
at io.deephaven.parquet.ColumnPageReaderImpl.readPageV1(ColumnPageReaderImpl.java:333)
at io.deephaven.parquet.ColumnPageReaderImpl.readDataPage(ColumnPageReaderImpl.java:201)
at io.deephaven.parquet.ColumnPageReaderImpl.materialize(ColumnPageReaderImpl.java:75)
at io.deephaven.db.v2.locations.parquet.topage.ToPage.getResult(ToPage.java:52)
at io.deephaven.db.v2.locations.parquet.topage.ToPage.toPage(ToPage.java:77)
at io.deephaven.db.v2.locations.parquet.ColumnChunkPageStore.toPage(ColumnChunkPageStore.java:158)
at io.deephaven.db.v2.locations.parquet.VariablePageSizeColumnChunkPageStore.getPage(VariablePageSizeColumnChunkPageStore.java:111)
at io.deephaven.db.v2.locations.parquet.VariablePageSizeColumnChunkPageStore.getPageContaining(VariablePageSizeColumnChunkPageStore.java:150)
at io.deephaven.db.v2.locations.parquet.VariablePageSizeColumnChunkPageStore.getPageContaining(VariablePageSizeColumnChunkPageStore.java:17)
at io.deephaven.db.v2.sources.chunk.page.PageStore.fillChunk(PageStore.java:67)
at io.deephaven.db.v2.sources.regioned.ParquetColumnRegionBase.fillChunk(ParquetColumnRegionBase.java:50)
at io.deephaven.db.v2.sources.regioned.DeferredColumnRegionBase.fillChunk(DeferredColumnRegionBase.java:71)
at io.deephaven.db.v2.sources.chunk.page.PageStore.fillChunk(PageStore.java:71)
at io.deephaven.db.v2.sources.regioned.RegionedColumnSourceBase.fillChunk(RegionedColumnSourceBase.java:31)
at io.deephaven.db.v2.sources.regioned.RegionedColumnSourceObject$AsValues.fillChunk(RegionedColumnSourceObject.java:37)
at io.deephaven.db.v2.remote.ConstructSnapshot.getSnapshotDataAsChunk(ConstructSnapshot.java:1365)
at io.deephaven.db.v2.remote.ConstructSnapshot.serializeAllTable(ConstructSnapshot.java:1285)
at io.deephaven.db.v2.remote.ConstructSnapshot.lambda$constructBackplaneSnapshotInPositionSpace$2(ConstructSnapshot.java:575)
at io.deephaven.db.v2.remote.ConstructSnapshot.callDataSnapshotFunction(ConstructSnapshot.java:1045)
at io.deephaven.db.v2.remote.ConstructSnapshot.callDataSnapshotFunction(ConstructSnapshot.java:977)
at io.deephaven.db.v2.remote.ConstructSnapshot.constructBackplaneSnapshotInPositionSpace(ConstructSnapshot.java:578)
at io.deephaven.grpc_api.barrage.BarrageMessageProducer.getSnapshot(BarrageMessageProducer.java:1524)
at io.deephaven.grpc_api.barrage.BarrageMessageProducer.updateSubscriptionsSnapshotAndPropagate(BarrageMessageProducer.java:926)
at io.deephaven.grpc_api.barrage.BarrageMessageProducer.access$1400(BarrageMessageProducer.java:89)
at io.deephaven.grpc_api.barrage.BarrageMessageProducer$UpdatePropagationJob.run(BarrageMessageProducer.java:790)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at io.deephaven.grpc_api.runner.DeephavenApiServerModule$ThreadFactory.lambda$newThread$0(DeephavenApiServerModule.java:143)
at java.lang.Thread.run(Thread.java:748)
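
A quick way to narrow this down (a hedged sketch, not part of the investigation so far: it assumes the stock parquet-hadoop API and the file path from the repro below) would be to dump the footer metadata with the unmodified reader and check whether a dictionary page offset was recorded for the failing column chunk at all:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.BlockMetaData;
    import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
    import org.apache.parquet.hadoop.util.HadoopInputFile;

    public class DumpDictionaryOffsets {
        public static void main(String[] args) throws Exception {
            // Open the file with stock parquet-hadoop, bypassing our hacked converter.
            try (ParquetFileReader reader = ParquetFileReader.open(
                    HadoopInputFile.fromPath(new Path("/data/t_LZ4.parquet"), new Configuration()))) {
                for (BlockMetaData block : reader.getFooter().getBlocks()) {
                    for (ColumnChunkMetaData column : block.getColumns()) {
                        // If no dictionary page offset is recorded while the pages are
                        // RLE_DICTIONARY-encoded, the footer metadata (not the LZ4
                        // decompressor) is the likely source of "the dictionary was missing".
                        System.out.printf("%s dictionaryPageOffset=%d firstDataPageOffset=%d%n",
                                column.getPath(), column.getDictionaryPageOffset(),
                                column.getFirstDataPageOffset());
                    }
                }
            }
        }
    }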

The code I used:

t = readTable('/data/eth_v2_p1_cBROTLI.parquet')   # From the deephaven-core-parquet-examples repo
writeTable(t, 'LZ4', '/data/t_LZ4.parquet')
tmore = readTable('/data/t_LZ4.parquet')
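
As a follow-up sanity check (again a sketch, with the path assumed from above), the same file could be read back with the stock parquet-hadoop example reader; if that succeeds, the file our writer produced is well-formed and the bug is confined to our reading path:

    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.hadoop.ParquetReader;
    import org.apache.parquet.hadoop.example.GroupReadSupport;

    public class SanityReadLz4 {
        public static void main(String[] args) throws Exception {
            // Materialize every row through the unmodified parquet-hadoop code path;
            // a genuinely missing dictionary page would fail here as well.
            try (ParquetReader<Group> reader = ParquetReader
                    .builder(new GroupReadSupport(), new Path("/data/t_LZ4.parquet"))
                    .build()) {
                long rows = 0;
                while (reader.read() != null) {
                    rows++;
                }
                System.out.println("read " + rows + " rows");
            }
        }
    }
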
jcferretti added the feature request, triage, and parquet labels on Jul 26, 2021
jcferretti added this to the Backlog milestone on Jul 26, 2021
jcferretti self-assigned this on Jul 26, 2021
jcferretti changed the title from "Remove the hack that makes the ParquetHadoop module necessary and remove the module." to "Remove the hacked ParquetMetadataConverter.java and the need for the ParquetHadoop module." on Aug 9, 2021

rcaudy commented Aug 9, 2021

Even worse, this file has bad conversion support for converted type:

            if (schemaElement.isSetConverted_type()) {
                OriginalType originalType = getLogicalTypeAnnotation(schemaElement.converted_type, schemaElement).toOriginalType();
                OriginalType newOriginalType = (schemaElement.isSetLogicalType() && getLogicalTypeAnnotation(schemaElement.logicalType) != null) ?
                        getLogicalTypeAnnotation(schemaElement.logicalType).toOriginalType() : null;
                if (!originalType.equals(newOriginalType)) {
                    if (newOriginalType != null) {
                        LOG.warn("Converted type and logical type metadata mismatch (convertedType: {}, logical type: {}). Using value in converted type.",
                                schemaElement.converted_type, schemaElement.logicalType);
                    }
                    childBuilder.as(originalType);
                }
            }

The one in ParquetFileReader is better:

            if (schemaElement.isSetConverted_type()) {
                LogicalTypeAnnotation originalType = getLogicalTypeAnnotation(schemaElement.converted_type, schemaElement);
                LogicalTypeAnnotation newOriginalType = schemaElement.isSetLogicalType() && getLogicalTypeAnnotation(schemaElement.logicalType) != null ?
                        getLogicalTypeAnnotation(schemaElement.logicalType) : null;
                if (!originalType.equals(newOriginalType)) {
                    ((org.apache.parquet.schema.Types.Builder)childBuilder).as(originalType);
                }
            }
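
The practical difference: OriginalType is Parquet's deprecated, parameterless enum, while LogicalTypeAnnotation carries the full type parameters, so funneling everything through OriginalType can silently drop information. A minimal sketch (assuming the stock parquet-mr 1.11+ schema builders; the class name is made up) of a type the converted-type path cannot represent:

    import org.apache.parquet.schema.LogicalTypeAnnotation;
    import org.apache.parquet.schema.LogicalTypeAnnotation.TimeUnit;
    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
    import org.apache.parquet.schema.Types;

    public class LogicalVsConvertedType {
        public static void main(String[] args) {
            // A local (not adjusted to UTC) microsecond timestamp exists only as a
            // LogicalTypeAnnotation; no OriginalType / converted type encodes
            // isAdjustedToUTC=false, so the ParquetMetadataConverter snippet above,
            // which converts through OriginalType, loses that flag.
            MessageType schema = Types.buildMessage()
                    .optional(PrimitiveTypeName.INT64)
                    .as(LogicalTypeAnnotation.timestampType(/* isAdjustedToUTC */ false, TimeUnit.MICROS))
                    .named("local_ts")
                    .named("root");
            System.out.println(schema);
        }
    }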
