-
Notifications
You must be signed in to change notification settings - Fork 81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Modified parquet decompression from buffered to streaming operation #5712
feat: Modified parquet decompression from buffered to streaming operation #5712
Conversation
...mpression/src/main/java/io/deephaven/parquet/compress/DeephavenCompressorAdapterFactory.java
Show resolved
Hide resolved
Util/channel/src/main/java/io/deephaven/util/channel/BaseSeekableChannelContext.java
Outdated
Show resolved
Hide resolved
Util/channel/src/main/java/io/deephaven/util/channel/BaseSeekableChannelContext.java
Outdated
Show resolved
Hide resolved
extensions/parquet/base/src/main/java/io/deephaven/parquet/base/ColumnPageReaderImpl.java
Outdated
Show resolved
Hide resolved
...ession/src/main/java/io/deephaven/parquet/compress/LZ4WithLZ4RawBackupCompressorAdapter.java
Outdated
Show resolved
Hide resolved
final InputStream decompressedInput = | ||
super.decompress(bufferedInputStream, compressedSize, uncompressedSize, decompressorCache); | ||
final ByteBuffer decompressedBuffer = | ||
CompressorAdapter.readNBytes(decompressedInput, uncompressedSize, new byte[uncompressedSize]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems very sad that we need to do the full decompression to figure out whether it's LZ4 or LZ4_RAW.
We should either handle the LZ4/LZ4_RAW exception handling at a higher layer (so we don't need to materialize all at once into ByteBuffer), or we should have a specialized InputStream that could do the reset() + fallback internally without the all-at-once read. Is there a hard limit on how many bytes it takes to fail an LZ4_RAW mislabelled as LZ4?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @devinrsmith in principle, but:
- At least it's only the first read.
- I don't see how moving it up a layer helps, since that would imply that every caller needs to handle the fallback.
- I don't see how moving it down into a wrapped
InputStream
is better than this, since we'd need to write more code in order to present theInputStream
interface, and I worry about edge cases where we can read some bytes but not all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there is an exact limit that I can check here, so just checking for a failure.
And yea, I couldn't find an easy way to do this without making a bigger change, and like Ryan said, this extra double buffering will happen only once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe there is room to have a better impl (unified InputStream) return in the future. Or, some way for the user to disable this fallback logic (so if they trust their parquet when it says LZ4, they don't have to pay this extra buffering cost).
Util/channel/src/main/java/io/deephaven/util/channel/SeekableChannelContext.java
Outdated
Show resolved
Hide resolved
Util/channel/src/main/java/io/deephaven/util/channel/SeekableChannelContext.java
Outdated
Show resolved
Hide resolved
Util/channel/src/main/java/io/deephaven/util/channel/BaseSeekableChannelContext.java
Outdated
Show resolved
Hide resolved
Util/channel/src/main/java/io/deephaven/util/channel/BaseSeekableChannelContext.java
Outdated
Show resolved
Hide resolved
Util/channel/src/main/java/io/deephaven/util/channel/BaseSeekableChannelContext.java
Outdated
Show resolved
Hide resolved
...ession/src/main/java/io/deephaven/parquet/compress/LZ4WithLZ4RawBackupCompressorAdapter.java
Outdated
Show resolved
Hide resolved
final InputStream decompressedInput = | ||
super.decompress(bufferedInputStream, compressedSize, uncompressedSize, decompressorCache); | ||
final ByteBuffer decompressedBuffer = | ||
CompressorAdapter.readNBytes(decompressedInput, uncompressedSize, new byte[uncompressedSize]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @devinrsmith in principle, but:
- At least it's only the first read.
- I don't see how moving it up a layer helps, since that would imply that every caller needs to handle the fallback.
- I don't see how moving it down into a wrapped
InputStream
is better than this, since we'd need to write more code in order to present theInputStream
interface, and I worry about edge cases where we can read some bytes but not all.
...ession/src/main/java/io/deephaven/parquet/compress/LZ4WithLZ4RawBackupCompressorAdapter.java
Outdated
Show resolved
Hide resolved
extensions/parquet/base/src/main/java/io/deephaven/parquet/base/ColumnPageReaderImpl.java
Outdated
Show resolved
Hide resolved
extensions/parquet/base/src/main/java/io/deephaven/parquet/base/ColumnPageReaderImpl.java
Outdated
Show resolved
Hide resolved
extensions/parquet/base/src/main/java/io/deephaven/parquet/base/ColumnPageReaderImpl.java
Outdated
Show resolved
Hide resolved
extensions/parquet/base/src/main/java/io/deephaven/parquet/base/ColumnPageReaderImpl.java
Outdated
Show resolved
Hide resolved
...ession/src/main/java/io/deephaven/parquet/compress/LZ4WithLZ4RawBackupCompressorAdapter.java
Outdated
Show resolved
Hide resolved
...mpression/src/main/java/io/deephaven/parquet/compress/DeephavenCompressorAdapterFactory.java
Outdated
Show resolved
Hide resolved
...sions/parquet/compression/src/main/java/io/deephaven/parquet/compress/CompressorAdapter.java
Outdated
Show resolved
Hide resolved
...sions/parquet/compression/src/main/java/io/deephaven/parquet/compress/CompressorAdapter.java
Outdated
Show resolved
Hide resolved
...ions/parquet/compression/src/main/java/io/deephaven/parquet/compress/DecompressorHolder.java
Outdated
Show resolved
Hide resolved
...mpression/src/main/java/io/deephaven/parquet/compress/DeephavenCompressorAdapterFactory.java
Outdated
Show resolved
Hide resolved
extensions/parquet/base/src/main/java/io/deephaven/parquet/base/ColumnPageReaderImpl.java
Outdated
Show resolved
Hide resolved
extensions/parquet/base/src/main/java/io/deephaven/parquet/base/ColumnPageReaderImpl.java
Outdated
Show resolved
Hide resolved
extensions/parquet/base/src/main/java/io/deephaven/parquet/base/ColumnPageReaderImpl.java
Outdated
Show resolved
Hide resolved
extensions/parquet/base/src/main/java/io/deephaven/parquet/base/ColumnPageReaderImpl.java
Outdated
Show resolved
Hide resolved
extensions/parquet/base/src/main/java/io/deephaven/parquet/base/ColumnPageReaderImpl.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's tough to review and be confident that we are handling things "correctly" in all cases. Looks ok. Need to fix conflict wrt main.
...sions/parquet/compression/src/main/java/io/deephaven/parquet/compress/CompressorAdapter.java
Show resolved
Hide resolved
Util/channel/src/main/java/io/deephaven/util/channel/BaseSeekableChannelContext.java
Outdated
Show resolved
Hide resolved
...mpression/src/main/java/io/deephaven/parquet/compress/DeephavenCompressorAdapterFactory.java
Outdated
Show resolved
Hide resolved
...ession/src/main/java/io/deephaven/parquet/compress/LZ4WithLZ4RawBackupCompressorAdapter.java
Outdated
Show resolved
Hide resolved
Util/channel/src/main/java/io/deephaven/util/channel/BaseSeekableChannelContext.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.
This helps reduce memory consumption when reading parquet files by almost 30%.