Remove dependency on hadoop-common #806

Closed · devinrsmith opened this issue Jun 29, 2021 · 5 comments · Fixed by #4457
Labels: build, clean up, java, parquet (Related to the Parquet integration)

devinrsmith (Member) commented Jun 29, 2021

Related to #294
Related to #901

During review on #798, I dug more into why we need some Hadoop dependencies.

Essentially, Parquet uses org.apache.hadoop.conf.Configuration from hadoop-common. Unfortunately, hadoop-common has sprawling dependencies that make it undesirable to include as part of a library.
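
To make the coupling concrete, here is a minimal sketch of what even a local-file read looks like through the parquet-hadoop API. The file path is hypothetical; the point is only that Configuration and Path, both hadoop-common types, are threaded through the call:

    import org.apache.hadoop.conf.Configuration;   // hadoop-common
    import org.apache.hadoop.fs.Path;              // hadoop-common
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.util.HadoopInputFile;

    public class HadoopCouplingDemo {
        public static void main(String[] args) throws Exception {
            // Even a purely local read goes through hadoop-common types.
            Configuration conf = new Configuration();
            try (ParquetFileReader reader = ParquetFileReader.open(
                    HadoopInputFile.fromPath(new Path("/tmp/example.parquet"), conf))) {
                System.out.println(reader.getFooter().getFileMetaData().getSchema());
            }
        }
    }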

Some potentially relevant links:

https://issues.apache.org/jira/browse/PARQUET-1822

http://mail-archives.apache.org/mod_mbox/parquet-dev/202001.mbox/%3cCAO4re1m-Y9X3yQABX1_XaSaof4NZWBb8Tg_TBXgepK8rCJfU-g@mail.gmail.com%3e

https://stackoverflow.com/questions/59939309/read-local-parquet-file-without-hadoop-path-api

https://github.com/benwatson528/intellij-avro-parquet-plugin
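
For reference, the workaround those links converge on is Parquet's InputFile abstraction (the subject of PARQUET-1822): implement org.apache.parquet.io.InputFile over plain java.nio and skip Hadoop's Path/FileSystem on the read side entirely. A sketch, where the class name and usage are illustrative rather than anything in this repo:

    import java.io.IOException;
    import java.nio.channels.Channels;
    import java.nio.channels.FileChannel;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import org.apache.parquet.io.DelegatingSeekableInputStream;
    import org.apache.parquet.io.InputFile;
    import org.apache.parquet.io.SeekableInputStream;

    // Hypothetical Hadoop-free InputFile over a local file.
    final class LocalParquetInputFile implements InputFile {
        private final Path path;

        LocalParquetInputFile(Path path) {
            this.path = path;
        }

        @Override
        public long getLength() throws IOException {
            return Files.size(path);
        }

        @Override
        public SeekableInputStream newStream() throws IOException {
            final FileChannel channel = FileChannel.open(path, StandardOpenOption.READ);
            // DelegatingSeekableInputStream supplies the read methods; we only
            // provide position tracking via the underlying channel.
            return new DelegatingSeekableInputStream(Channels.newInputStream(channel)) {
                @Override
                public long getPos() throws IOException {
                    return channel.position();
                }

                @Override
                public void seek(long newPos) throws IOException {
                    channel.position(newPos);
                }
            };
        }
    }

ParquetFileReader.open(new LocalParquetInputFile(...)) then works without any org.apache.hadoop types on the read path, though the writer side and the compression codecs still pull Hadoop in.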

@devinrsmith devinrsmith added this to the Backlog milestone Jun 29, 2021
@rcaudy rcaudy added the parquet Related to the Parquet integration label Jul 20, 2021
jcferretti (Member) commented

I spent a few hours yesterday trying to make this work; I got everything except LZ4 working. This is the branch:
https://github.com/jcferretti/deephaven-core/tree/cfs-parquet-hadoop-deps-0

The remaining failure:

log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

Error writing table to /a0/h/cfs/dh/oss3/deephaven-core/tmp/workspace/TestParquetTools/Table1.parquet
io.deephaven.UncheckedDeephavenException: Error writing table to /a0/h/cfs/dh/oss3/deephaven-core/tmp/workspace/TestParquetTools/Table1.parquet
	at io.deephaven.db.tables.utils.ParquetTools.writeParquetTableImpl(ParquetTools.java:566)
	at io.deephaven.db.tables.utils.ParquetTools.writeTable(ParquetTools.java:249)
	at io.deephaven.db.tables.utils.ParquetTools.writeTable(ParquetTools.java:172)
	at io.deephaven.db.tables.utils.TestParquetTools.compressionCodecTestHelper(TestParquetTools.java:298)
	at io.deephaven.db.tables.utils.TestParquetTools.testParquetLz4CompressionCodec(TestParquetTools.java:309)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
	at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:110)
	at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:58)
	at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:38)
	at org.gradle.api.internal.tasks.testing.junit.AbstractJUnitTestClassProcessor.processTestClass(AbstractJUnitTestClassProcessor.java:62)
	at org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:51)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:36)
	at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
	at org.gradle.internal.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:33)
	at org.gradle.internal.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:94)
	at com.sun.proxy.$Proxy2.processTestClass(Unknown Source)
	at org.gradle.api.internal.tasks.testing.worker.TestWorker.processTestClass(TestWorker.java:119)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:36)
	at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
	at org.gradle.internal.remote.internal.hub.MessageHubBackedObjectConnection$DispatchWrapper.dispatch(MessageHubBackedObjectConnection.java:182)
	at org.gradle.internal.remote.internal.hub.MessageHubBackedObjectConnection$DispatchWrapper.dispatch(MessageHubBackedObjectConnection.java:164)
	at org.gradle.internal.remote.internal.hub.MessageHub$Handler.run(MessageHub.java:414)
	at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:64)
	at org.gradle.internal.concurrent.ManagedExecutorImpl$1.run(ManagedExecutorImpl.java:48)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at org.gradle.internal.concurrent.ThreadFactoryImpl$ManagedThreadRunnable.run(ThreadFactoryImpl.java:56)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: native lz4 library not available
	at org.apache.hadoop.io.compress.Lz4Codec.getCompressorType(Lz4Codec.java:125)
	at org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:150)
	at org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:168)
	at org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.<init>(CodecFactory.java:146)
	at org.apache.parquet.hadoop.CodecFactory.createCompressor(CodecFactory.java:208)
	at org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:191)
	at io.deephaven.parquet.ParquetFileWriter.<init>(ParquetFileWriter.java:58)
	at io.deephaven.db.v2.parquet.ParquetTableWriter.getParquetFileWriter(ParquetTableWriter.java:306)
	at io.deephaven.db.v2.parquet.ParquetTableWriter.write(ParquetTableWriter.java:227)
	at io.deephaven.db.tables.utils.ParquetTools.writeParquetTableImpl(ParquetTools.java:561)
	... 54 more
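
For what it's worth, the failure above comes from Hadoop's Lz4Codec requiring the native libhadoop library; LZ4 itself does not need native code. A pure-Java implementation such as io.airlift:aircompressor can round-trip LZ4 without any natives. A minimal sketch, assuming aircompressor is on the classpath (this illustrates the codec only; it is not a drop-in replacement inside parquet-hadoop's CodecFactory):

    import io.airlift.compress.lz4.Lz4Compressor;
    import io.airlift.compress.lz4.Lz4Decompressor;
    import java.nio.charset.StandardCharsets;

    public class PureJavaLz4Demo {
        public static void main(String[] args) {
            byte[] input = "no native hadoop library required".getBytes(StandardCharsets.UTF_8);

            // Compress into a worst-case-sized buffer.
            Lz4Compressor compressor = new Lz4Compressor();
            byte[] compressed = new byte[compressor.maxCompressedLength(input.length)];
            int compressedLength = compressor.compress(
                    input, 0, input.length, compressed, 0, compressed.length);

            // Decompress and verify the round trip.
            Lz4Decompressor decompressor = new Lz4Decompressor();
            byte[] restored = new byte[input.length];
            int restoredLength = decompressor.decompress(
                    compressed, 0, compressedLength, restored, 0, restored.length);

            System.out.println(new String(restored, 0, restoredLength, StandardCharsets.UTF_8));
        }
    }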

jcferretti (Member) commented

I just tried loading an LZ4 parquet file in the IDEA "Avro and Parquet Viewer" plugin, and it fails. The plugin can load GZIP-compressed files but not LZ4-compressed ones; I tried eth_v1_p1_cLZ4.parquet from Amanda's deephaven-core-parquet-examples repository, which Community Core can load.

malhotrashivam (Contributor) commented Sep 8, 2023

It doesn't look like we can remove the dependency on hadoop-common, because we depend heavily on org.apache.hadoop.conf.Configuration for our compression codecs.

The approach mentioned above, which uses hadoop-client instead of hadoop-common, would not be any better: hadoop-client depends on hadoop-common and thus pulls in all the same dependencies.

Note that hadoop-common is imported with transitive = false to minimize the burden of its dependencies, and we selectively import the runtime dependencies that are needed. The best we can do right now is to change its Gradle dependency type from api to implementation (#4457). A sketch of the build-side shape being described follows below.
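
The coordinates, version, and the re-added artifact here are illustrative, not the project's actual build script; the sketch only shows the two mechanisms mentioned above:

    dependencies {
        // 'implementation' rather than 'api': hadoop-common stays off the
        // compile classpath of anything that depends on this module (#4457).
        implementation('org.apache.hadoop:hadoop-common:3.3.3') {
            transitive = false  // do not pull in hadoop-common's dependency graph
        }
        // Selectively re-add only the runtime dependencies actually exercised
        // (artifact chosen here purely as an example).
        runtimeOnly 'commons-io:commons-io:2.11.0'
    }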

niloc132 (Member) commented Sep 8, 2023

Agreed, I think this issue should be closed, to be reevaluated if we ever rewrite our compression codec handling again.

malhotrashivam (Contributor) commented

Thanks Colin, I can close it once we merge #4457.
