
LZ4_RAW parquet support #3148

Closed
niloc132 opened this issue Dec 5, 2022 · 5 comments · Fixed by #4446


niloc132 commented Dec 5, 2022

Description

It appears that the "LZ4" compression option in pandas/pyarrow produces a codec value that can't be read by our Java code.

Steps to reproduce
Create a dataframe and write it to a parquet file with LZ4 compression:

from deephaven import empty_table
from deephaven.pandas import to_pandas
from deephaven.parquet import read

dh_table = empty_table(10).update("I=i")
dataframe = to_pandas(dh_table)
dataframe.to_parquet('data_from_pandas.parquet', compression="LZ4")
# this line will fail
result_table = read('data_from_pandas.parquet')
# next verify contents...
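To confirm what the writer actually produced, the file's footer metadata can be inspected with pyarrow (a minimal sketch, assuming the file written by the repro above):

import pyarrow.parquet as pq

md = pq.ParquetFile('data_from_pandas.parquet').metadata
# On recent pyarrow/pandas this prints 'LZ4_RAW', the codec name the
# Java reader fails to resolve
print(md.row_group(0).column(0).compression)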

Expected results

The parquet file can be decompressed and read

Actual results

java.lang.RuntimeException: Error in Python interpreter:

Type: <class 'deephaven.dherror.DHError'>
Value: failed to read parquet data. : RuntimeError: java.lang.IllegalArgumentException: No enum constant org.apache.parquet.hadoop.metadata.CompressionCodecName.LZ4_RAW
Traceback (most recent call last):
  File "/home/colin/.pyenv/versions/3.7.12/lib/python3.7/site-packages/deephaven/parquet.py", line 88, in read
    return Table(j_table=_JParquetTools.readTable(path))
RuntimeError: java.lang.IllegalArgumentException: No enum constant org.apache.parquet.hadoop.metadata.CompressionCodecName.LZ4_RAW
	at java.base/java.lang.Enum.valueOf(Enum.java:240)
	at org.apache.parquet.hadoop.metadata.CompressionCodecName.valueOf(CompressionCodecName.java:26)
	at io.deephaven.parquet.base.tempfix.ParquetMetadataConverter.fromFormatCodec(ParquetMetadataConverter.java:535)
	at io.deephaven.parquet.base.tempfix.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:1195)
	at io.deephaven.parquet.table.location.ParquetTableLocationKey.getMetadata(ParquetTableLocationKey.java:99)
	at io.deephaven.parquet.table.ParquetTools.readTableInternal(ParquetTools.java:386)
	at io.deephaven.parquet.table.ParquetTools.readTable(ParquetTools.java:77)
	at org.jpy.PyLib.executeCode(Native Method)
	at org.jpy.PyObject.executeCode(PyObject.java:138)
	at io.deephaven.engine.util.PythonEvaluatorJpy.evalScript(PythonEvaluatorJpy.java:73)
	at io.deephaven.integrations.python.PythonDeephavenSession.lambda$evaluate$1(PythonDeephavenSession.java:183)
	at io.deephaven.util.locks.FunctionalLock.doLockedInterruptibly(FunctionalLock.java:50)
	at io.deephaven.integrations.python.PythonDeephavenSession.evaluate(PythonDeephavenSession.java:182)
	at io.deephaven.engine.util.AbstractScriptSession.lambda$evaluateScript$1(AbstractScriptSession.java:145)
	at io.deephaven.engine.context.ExecutionContext.lambda$apply$0(ExecutionContext.java:123)
	at io.deephaven.engine.context.ExecutionContext.apply(ExecutionContext.java:134)
	at io.deephaven.engine.context.ExecutionContext.apply(ExecutionContext.java:122)
	at io.deephaven.engine.util.AbstractScriptSession.evaluateScript(AbstractScriptSession.java:145)
	at io.deephaven.engine.util.DelegatingScriptSession.evaluateScript(DelegatingScriptSession.java:87)
	at io.deephaven.engine.util.ScriptSession.evaluateScript(ScriptSession.java:113)
	at io.deephaven.server.console.ConsoleServiceGrpcImpl.lambda$executeCommand$8(ConsoleServiceGrpcImpl.java:169)
	at io.deephaven.server.session.SessionState$ExportBuilder.lambda$submit$2(SessionState.java:1343)
	at io.deephaven.server.session.SessionState$ExportObject.doExport(SessionState.java:880)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at io.deephaven.server.runner.DeephavenApiServerModule$ThreadFactory.lambda$newThread$0(DeephavenApiServerModule.java:166)
	at java.base/java.lang.Thread.run(Thread.java:829)


Line: 90
Namespace: read
File: /home/colin/.pyenv/versions/3.7.12/lib/python3.7/site-packages/deephaven/parquet.py
Traceback (most recent call last):
  File "<string>", line 4, in <module>
  File "/home/colin/.pyenv/versions/3.7.12/lib/python3.7/site-packages/deephaven/parquet.py", line 90, in read

        at org.jpy.PyLib.executeCode(PyLib.java:-2)
        at org.jpy.PyObject.executeCode(PyObject.java:138)
        at io.deephaven.engine.util.PythonEvaluatorJpy.evalScript(PythonEvaluatorJpy.java:73)

Likely we could resolve this by treating the value "LZ4_RAW" as if it were "LZ4" in our codec factory.
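Until then, one possible user-level workaround (a hedged sketch, not the proposed fix) is to transcode the file with pyarrow, which reads LZ4_RAW natively, into a codec the Java reader already supports:

import pyarrow.parquet as pq

# pyarrow can decode LZ4_RAW, so round-trip the data through a supported codec
table = pq.read_table('data_from_pandas.parquet')
pq.write_table(table, 'data_snappy.parquet', compression='SNAPPY')
result_table = read('data_snappy.parquet')  # deephaven.parquet.read, as above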

Versions

  • Deephaven: 0.18+
  • OS: Linux
  • Docker: N/A
@niloc132 niloc132 added bug Something isn't working python parquet Related to the Parquet integration labels Dec 5, 2022
@niloc132 niloc132 added this to the Dec 2022 milestone Dec 5, 2022
@niloc132 niloc132 self-assigned this Dec 5, 2022
@devinrsmith devinrsmith changed the title Parquet/pandas: cannot roundtrip data between pandas and deephaven when compressed with lz4 LZ4_RAW parquet support Mar 23, 2023

devinrsmith commented Mar 23, 2023

This can't simply be resolved by treating LZ4_RAW as if it were LZ4. I closed #3585, will add that info here.
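The difference is in the bytes, not just the name: the deprecated LZ4 codec wraps each lz4 block in Hadoop's length-prefixed framing, while LZ4_RAW stores the bare block. A rough Python illustration (a sketch using the third-party lz4 package; the single-chunk framing shown here is a simplification of the Hadoop compression library's format):

import struct
import lz4.block

page = b'example parquet page bytes ' * 8

# LZ4_RAW: the page is one bare lz4 block, nothing else
raw_bytes = lz4.block.compress(page, store_size=False)

# Deprecated LZ4: Hadoop framing prefixes big-endian uncompressed and
# compressed lengths before the block (simplified to a single chunk here)
hadoop_bytes = struct.pack('>II', len(page), len(raw_bytes)) + raw_bytes

# A reader expecting one layout cannot decode the other
assert raw_bytes != hadoop_bytes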

@devinrsmith

DH is unable to read "newer" parquet files that are ostensibly written out as 'lz4'. Python/C++ pyarrow treats this user string as the "new" LZ4_RAW codec, with LZ4 being deprecated since at least March 2021; see https://github.com/apache/parquet-format/blob/master/Compression.md#lz4:

LZ4
A deprecated codec loosely based on the LZ4 compression algorithm, but with an additional undocumented framing scheme. The framing is part of the original Hadoop compression library and was historically copied first in parquet-mr, then emulated with mixed results by parquet-cpp.
It is strongly suggested that implementors of Parquet writers deprecate this compression codec in their user-facing APIs, and advise users to switch to the newer, interoperable LZ4_RAW codec.

We will need to bump our version of parquet-hadoop past 1.12.3 once apache/parquet-java#1000 gets officially released to support this feature.

https://github.com/apache/parquet-testing/blob/master/data/lz4_raw_compressed.parquet and https://github.com/apache/parquet-testing/blob/master/data/lz4_raw_compressed_larger.parquet should be the minimum viable files for testing purposes.
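A quick sanity check of those files with pyarrow (a sketch, assuming they have been downloaded into the working directory) could look like:

import pyarrow.parquet as pq

# Confirm the reference files really use LZ4_RAW and are readable end to end
for name in ('lz4_raw_compressed.parquet', 'lz4_raw_compressed_larger.parquet'):
    f = pq.ParquetFile(name)
    print(name, f.metadata.row_group(0).column(0).compression, f.metadata.num_rows)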

pyarrow 4.0.1, released May 30, 2021, was the last Python/C++ release to write out the legacy LZ4 codec. https://pypi.org/project/pyarrow/4.0.1/

https://issues.apache.org/jira/browse/PARQUET-2032

@devinrsmith

Potentially related to #806

@devinrsmith

Looks like there is some "good" news; there has been some activity recently on parquet-mr suggesting they are preparing a new release (first in 2 years):

https://github.com/apache/parquet-mr/releases/tag/apache-parquet-1.13.0-rc0

@devinrsmith devinrsmith self-assigned this Apr 5, 2023
@devinrsmith

Note: I've tested out 1.13.0-SNAPSHOT, which has the proper LZ4_RAW codec - but the Hadoop org.apache.hadoop.io.compress.CompressionCodecFactory#getCodecByName code doesn't work with it out of the box because the enum name has an underscore in it :/
