
feat: Modified parquet decompression from buffered to streaming operation #5712

Merged: 10 commits into deephaven:main on Jul 12, 2024

Conversation

malhotrashivam (Contributor) commented on Jul 3, 2024

This helps reduce memory consumption when reading parquet files by almost 30%.
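
For context, a minimal sketch of the difference between the two approaches (illustrative names and a JDK codec stand-in, not the actual Deephaven parquet code): the buffered path materializes the entire uncompressed page into a byte[] up front, while the streaming path hands the reader an InputStream and lets it pull bytes incrementally, so only the codec's working buffer stays resident.

import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

public final class DecompressionSketch {

    // Buffered: the whole uncompressed page is allocated and filled before the
    // reader sees any of it.
    static byte[] decompressBuffered(final InputStream compressedPage, final int uncompressedSize)
            throws IOException {
        try (final InputStream in = new GZIPInputStream(compressedPage)) {
            return in.readNBytes(uncompressedSize);
        }
    }

    // Streaming: the reader consumes bytes as it decodes, so peak memory is the
    // codec's internal buffer rather than the full uncompressed page.
    static InputStream decompressStreaming(final InputStream compressedPage) throws IOException {
        return new GZIPInputStream(compressedPage);
    }
}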

malhotrashivam added the feature request, parquet, NoDocumentationNeeded, and ReleaseNotesNeeded labels on Jul 3, 2024
malhotrashivam added this to the 0.36.0 milestone on Jul 3, 2024
malhotrashivam self-assigned this on Jul 3, 2024
malhotrashivam changed the title from "Moved parquet decompression from buffered to streaming operation" to "feat: Moved parquet decompression from buffered to streaming operation" on Jul 3, 2024
malhotrashivam changed the title from "feat: Moved parquet decompression from buffered to streaming operation" to "feat: Modified parquet decompression from buffered to streaming operation" on Jul 3, 2024
extensions/parquet/compression/build.gradle (review thread, outdated, resolved)
Comment on lines 60 to 63:

final InputStream decompressedInput =
        super.decompress(bufferedInputStream, compressedSize, uncompressedSize, decompressorCache);
final ByteBuffer decompressedBuffer =
        CompressorAdapter.readNBytes(decompressedInput, uncompressedSize, new byte[uncompressedSize]);
devinrsmith (Member):

It seems very sad that we need to do the full decompression to figure out whether it's LZ4 or LZ4_RAW.

We should either handle the LZ4/LZ4_RAW fallback at a higher layer (so we don't need to materialize everything at once into a ByteBuffer), or we should have a specialized InputStream that can do the reset() + fallback internally without the all-at-once read. Is there a hard limit on how many bytes it takes to fail on an LZ4_RAW payload mislabelled as LZ4?
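
A rough sketch of the "specialized InputStream" idea, under the assumption that a mislabelled page fails on its very first read; the class name and the codec-factory parameters are hypothetical, not part of this PR:

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.function.Function;

// Defers the LZ4 vs LZ4_RAW decision to the first read, using mark/reset on the
// raw compressed bytes instead of materializing the whole uncompressed page.
final class FallbackInputStream extends InputStream {
    private final BufferedInputStream source;
    private final Function<InputStream, InputStream> primary;  // e.g. framed LZ4
    private final Function<InputStream, InputStream> fallback; // e.g. LZ4_RAW
    private InputStream delegate;

    FallbackInputStream(final InputStream compressed, final int compressedSize,
            final Function<InputStream, InputStream> primary,
            final Function<InputStream, InputStream> fallback) {
        this.source = new BufferedInputStream(compressed, compressedSize);
        this.source.mark(compressedSize); // allow a full rewind of the compressed page
        this.primary = primary;
        this.fallback = fallback;
    }

    @Override
    public int read() throws IOException {
        if (delegate == null) {
            try {
                final InputStream candidate = primary.apply(source);
                final int first = candidate.read(); // a mislabelled page is assumed to fail here
                delegate = candidate;
                return first;
            } catch (IOException | RuntimeException e) {
                // Caveat, and exactly the open question above: this only works if the
                // failure surfaces on the first read; a later failure cannot be rewound
                // without buffering the already-decompressed output.
                source.reset();
                delegate = fallback.apply(source);
            }
        }
        return delegate.read();
    }

    @Override
    public void close() throws IOException {
        if (delegate != null) {
            delegate.close();
        }
        source.close();
    }
}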

rcaudy (Member):

I agree with @devinrsmith in principle, but:

  1. At least it's only the first read.
  2. I don't see how moving it up a layer helps, since that would imply that every caller needs to handle the fallback.
  3. I don't see how moving it down into a wrapped InputStream is better than this, since we'd need to write more code in order to present the InputStream interface, and I worry about edge cases where we can read some bytes but not all.

malhotrashivam (Contributor, Author):

I don't think there is an exact limit that I can check here, so I'm just checking for a failure.
And yeah, I couldn't find an easy way to do this without making a bigger change; as Ryan said, this extra double buffering will only happen once.

Member:

Maybe there is room for a better implementation (a unified InputStream) in the future. Or some way for the user to disable this fallback logic, so that if they trust their parquet file when it says LZ4, they don't have to pay this extra buffering cost.
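
As a purely hypothetical illustration of that opt-out (not an existing Deephaven option), the call site could branch on a trust flag so that trusted LZ4 input never pays the mark/reset buffering the fallback path needs; this reuses the FallbackInputStream sketch above.

import java.io.InputStream;
import java.util.function.Function;

final class Lz4FallbackPolicy {
    // Hypothetical knob: callers that trust the file's LZ4 labelling skip the
    // fallback wrapper (and the extra first-read buffering it requires).
    static InputStream wrap(final InputStream compressed, final int compressedSize,
            final Function<InputStream, InputStream> lz4,
            final Function<InputStream, InputStream> lz4Raw,
            final boolean trustLz4Label) {
        if (trustLz4Label) {
            return lz4.apply(compressed);
        }
        return new FallbackInputStream(compressed, compressedSize, lz4, lz4Raw);
    }
}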

rcaudy previously approved these changes on Jul 11, 2024
devinrsmith (Member) left a comment:

It's tough to review and be confident that we are handling things "correctly" in all cases. Looks OK. Need to fix the conflict with main.

devinrsmith previously approved these changes on Jul 12, 2024
rcaudy (Member) left a comment:

.

malhotrashivam merged commit f8b5e19 into deephaven:main on Jul 12, 2024
16 checks passed
github-actions bot locked and limited conversation to collaborators on Jul 12, 2024