
LZ4_RAW parquet support #3148

Closed
niloc132 opened this issue Dec 5, 2022 · 5 comments · Fixed by #4446


niloc132 commented Dec 5, 2022

Description

It appears that the "LZ4" compression option in pandas/pyarrow produces a codec value that can't be read by our Java code.

Steps to reproduce
Create a dataframe and write it to a parquet file with LZ4 compression:

from deephaven import empty_table
from deephaven.pandas import to_pandas
from deephaven.parquet import read

dh_table = empty_table(10).update("I=i")
dataframe = to_pandas(dh_table)
dataframe.to_parquet('data_from_pandas.parquet', compression="LZ4")
# this line will fail
result_table = read('data_from_pandas.parquet')
# next verify contents...
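To confirm what the writer actually produced, the file's footer metadata can be inspected with pyarrow (a minimal sketch, assuming the file written by the repro above):

import pyarrow.parquet as pq

md = pq.ParquetFile('data_from_pandas.parquet').metadata
# On recent pyarrow/pandas this prints 'LZ4_RAW', the codec name the
# Java reader fails to resolve
print(md.row_group(0).column(0).compression)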

Expected results

The parquet file can be decompressed and read

Actual results

java.lang.RuntimeException: Error in Python interpreter:

Type: <class 'deephaven.dherror.DHError'>
Value: failed to read parquet data. : RuntimeError: java.lang.IllegalArgumentException: No enum constant org.apache.parquet.hadoop.metadata.CompressionCodecName.LZ4_RAW
Traceback (most recent call last):
  File "/home/colin/.pyenv/versions/3.7.12/lib/python3.7/site-packages/deephaven/parquet.py", line 88, in read
    return Table(j_table=_JParquetTools.readTable(path))
RuntimeError: java.lang.IllegalArgumentException: No enum constant org.apache.parquet.hadoop.metadata.CompressionCodecName.LZ4_RAW
	at java.base/java.lang.Enum.valueOf(Enum.java:240)
	at org.apache.parquet.hadoop.metadata.CompressionCodecName.valueOf(CompressionCodecName.java:26)
	at io.deephaven.parquet.base.tempfix.ParquetMetadataConverter.fromFormatCodec(ParquetMetadataConverter.java:535)
	at io.deephaven.parquet.base.tempfix.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:1195)
	at io.deephaven.parquet.table.location.ParquetTableLocationKey.getMetadata(ParquetTableLocationKey.java:99)
	at io.deephaven.parquet.table.ParquetTools.readTableInternal(ParquetTools.java:386)
	at io.deephaven.parquet.table.ParquetTools.readTable(ParquetTools.java:77)
	at org.jpy.PyLib.executeCode(Native Method)
	at org.jpy.PyObject.executeCode(PyObject.java:138)
	at io.deephaven.engine.util.PythonEvaluatorJpy.evalScript(PythonEvaluatorJpy.java:73)
	at io.deephaven.integrations.python.PythonDeephavenSession.lambda$evaluate$1(PythonDeephavenSession.java:183)
	at io.deephaven.util.locks.FunctionalLock.doLockedInterruptibly(FunctionalLock.java:50)
	at io.deephaven.integrations.python.PythonDeephavenSession.evaluate(PythonDeephavenSession.java:182)
	at io.deephaven.engine.util.AbstractScriptSession.lambda$evaluateScript$1(AbstractScriptSession.java:145)
	at io.deephaven.engine.context.ExecutionContext.lambda$apply$0(ExecutionContext.java:123)
	at io.deephaven.engine.context.ExecutionContext.apply(ExecutionContext.java:134)
	at io.deephaven.engine.context.ExecutionContext.apply(ExecutionContext.java:122)
	at io.deephaven.engine.util.AbstractScriptSession.evaluateScript(AbstractScriptSession.java:145)
	at io.deephaven.engine.util.DelegatingScriptSession.evaluateScript(DelegatingScriptSession.java:87)
	at io.deephaven.engine.util.ScriptSession.evaluateScript(ScriptSession.java:113)
	at io.deephaven.server.console.ConsoleServiceGrpcImpl.lambda$executeCommand$8(ConsoleServiceGrpcImpl.java:169)
	at io.deephaven.server.session.SessionState$ExportBuilder.lambda$submit$2(SessionState.java:1343)
	at io.deephaven.server.session.SessionState$ExportObject.doExport(SessionState.java:880)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at io.deephaven.server.runner.DeephavenApiServerModule$ThreadFactory.lambda$newThread$0(DeephavenApiServerModule.java:166)
	at java.base/java.lang.Thread.run(Thread.java:829)


Line: 90
Namespace: read
File: /home/colin/.pyenv/versions/3.7.12/lib/python3.7/site-packages/deephaven/parquet.py
Traceback (most recent call last):
  File "<string>", line 4, in <module>
  File "/home/colin/.pyenv/versions/3.7.12/lib/python3.7/site-packages/deephaven/parquet.py", line 90, in read

        at org.jpy.PyLib.executeCode(PyLib.java:-2)
        at org.jpy.PyObject.executeCode(PyObject.java:138)
        at io.deephaven.engine.util.PythonEvaluatorJpy.evalScript(PythonEvaluatorJpy.java:73)

Likely we could resolve this by treating the value "LZ4_RAW" as if it were "LZ4" in our codec factory.
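Until then, one possible user-level workaround (a hedged sketch, not the proposed fix) is to transcode the file with pyarrow, which reads LZ4_RAW natively, into a codec the Java reader already supports:

import pyarrow.parquet as pq

# pyarrow can decode LZ4_RAW, so round-trip the data through a supported codec
table = pq.read_table('data_from_pandas.parquet')
pq.write_table(table, 'data_snappy.parquet', compression='SNAPPY')
result_table = read('data_snappy.parquet')  # deephaven.parquet.read, as above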

Versions

  • Deephaven: 0.18+
  • OS: Linux
  • Docker: N/A
@niloc132 niloc132 added bug Something isn't working python parquet Related to the Parquet integration labels Dec 5, 2022
@niloc132 niloc132 added this to the Dec 2022 milestone Dec 5, 2022
@niloc132 niloc132 self-assigned this Dec 5, 2022
@devinrsmith devinrsmith changed the title Parquet/pandas: cannot roundtrip data between pandas and deephaven when compressed with lz4 LZ4_RAW parquet support Mar 23, 2023

devinrsmith commented Mar 23, 2023

This can't simply be resolved by treating LZ4_RAW as if it were LZ4. I closed #3585, will add that info here.
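The difference is in the bytes, not just the name: the deprecated LZ4 codec wraps each lz4 block in Hadoop's length-prefixed framing, while LZ4_RAW stores the bare block. A rough Python illustration (a sketch using the third-party lz4 package; the single-chunk framing shown here is a simplification of the Hadoop compression library's format):

import struct
import lz4.block

page = b'example parquet page bytes ' * 8

# LZ4_RAW: the page is one bare lz4 block, nothing else
raw_bytes = lz4.block.compress(page, store_size=False)

# Deprecated LZ4: Hadoop framing prefixes big-endian uncompressed and
# compressed lengths before the block (simplified to a single chunk here)
hadoop_bytes = struct.pack('>II', len(page), len(raw_bytes)) + raw_bytes

# A reader expecting one layout cannot decode the other
assert raw_bytes != hadoop_bytes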

@devinrsmith

DH is unable to read "newer" parquet files that are ostensibly written out as 'lz4'. Python/C++ pyarrow treats this user string as the "new" LZ4_RAW codec, with LZ4 being deprecated since at least March 2021; see https://github.com/apache/parquet-format/blob/master/Compression.md#lz4:

LZ4
A deprecated codec loosely based on the LZ4 compression algorithm, but with an additional undocumented framing scheme. The framing is part of the original Hadoop compression library and was historically copied first in parquet-mr, then emulated with mixed results by parquet-cpp.
It is strongly suggested that implementors of Parquet writers deprecate this compression codec in their user-facing APIs, and advise users to switch to the newer, interoperable LZ4_RAW codec.

We will need to bump our version of parquet-hadoop past 1.12.3 once apache/parquet-java#1000 gets officially released to support this feature.

https://github.com/apache/parquet-testing/blob/master/data/lz4_raw_compressed.parquet and https://github.com/apache/parquet-testing/blob/master/data/lz4_raw_compressed_larger.parquet should be the minimum viable files for testing purposes.
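A quick sanity check of those files with pyarrow (a sketch, assuming they have been downloaded into the working directory) could look like:

import pyarrow.parquet as pq

# Confirm the reference files really use LZ4_RAW and are readable end to end
for name in ('lz4_raw_compressed.parquet', 'lz4_raw_compressed_larger.parquet'):
    f = pq.ParquetFile(name)
    print(name, f.metadata.row_group(0).column(0).compression, f.metadata.num_rows)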

pyarrow 4.0.1, released May 30, 2021, was the last Python/C++ release to write out the legacy LZ4 codec. https://pypi.org/project/pyarrow/4.0.1/

https://issues.apache.org/jira/browse/PARQUET-2032

@devinrsmith

Potentially related to #806

@devinrsmith

Looks like there is some "good" news; there has been some activity recently on parquet-mr suggesting they are preparing a new release (first in 2 years):

https://github.com/apache/parquet-mr/releases/tag/apache-parquet-1.13.0-rc0

@devinrsmith devinrsmith self-assigned this Apr 5, 2023
@devinrsmith

Note: I've tested out 1.13.0-SNAPSHOT, which has the proper LZ4_RAW codec - but the Hadoop org.apache.hadoop.io.compress.CompressionCodecFactory#getCodecByName code doesn't work with it out of the box because the enum name has an underscore in it :/
