[SUPPORT] Missing dependency on hive-exec (core) #8147

@kkrugler

Description

Describe the problem you faced

When using Flink to do an incremental query read from a Hudi table, with Hudi 0.12.2 and Flink 1.15, I occasionally get a ClassNotFoundException for org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat. This usually happens when running the test from inside Eclipse, and occasionally from the command line.
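
For context, the failing job is essentially a streaming (incremental) read of a Hudi table. A minimal sketch of that shape is below; the schema mirrors the fields in the stack trace, while the table name, path, and option values are placeholders (quoted from memory of the 0.12.x Hudi Flink connector docs) rather than the actual test setup:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class IncrementalReadSketch {

    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Register the Hudi table; schema mirrors the fields in the stack trace below.
        tEnv.executeSql(
            "CREATE TABLE example_table ("
                + " event_time TIMESTAMP(3),"
                + " data STRING,"
                + " enrichment STRING,"
                + " `key` STRING,"
                + " `partition` STRING"
                + ") PARTITIONED BY (`partition`) WITH ("
                + " 'connector' = 'hudi',"
                + " 'path' = 'file:///tmp/example-table',"
                + " 'table.type' = 'MERGE_ON_READ',"
                + " 'read.streaming.enabled' = 'true'," // incremental/streaming read
                + " 'read.start-commit' = 'earliest'"
                + ")");

        // The streaming source's split monitor loads commit metadata while
        // planning input splits; that's where the NoClassDefFoundError surfaces.
        tEnv.executeSql("SELECT * FROM example_table").print();
    }
}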

To Reproduce

Steps to reproduce the behavior:

  1. git clone https://github.com/kkrugler/flink-hudi-query-test
  2. cd flink-hudi-query-test
  3. Edit pom.xml to remove explicit dependency on hive-exec.
  4. mvn clean package

Expected behavior

The tests should all pass.

Environment Description

  • Hudi version : 0.12.2

  • Flink version : 1.15.1

Additional context

I believe the problem is that the hudi-hadoop-mr dependency on hive-exec (with classifier core) is marked as provided, but a typical Flink cluster doesn't have the Hive jars installed. It may be OK for hudi-hadoop-mr to mark this as provided, but hudi-flink should then declare an explicit dependency on the artifact, something like:

        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <classifier>core</classifier>
            <version>${hive.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>*</groupId>
                    <artifactId>*</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

Note the exclusion of all transitive dependencies. All that Hudi needs from hive-exec is the one missing class, since HoodieParquetInputFormatBase, the base class for Hudi's Parquet input formats, extends MapredParquetInputFormat, as per:

/**
 * !!! PLEASE READ CAREFULLY !!!
 *
 * NOTE: Hive bears optimizations which are based upon validating whether {@link FileInputFormat}
 * implementation inherits from {@link MapredParquetInputFormat}.
 *
 * To make sure that Hudi implementations are leveraging these optimizations to the fullest, this class
 * serves as a base-class for every {@link FileInputFormat} implementations working with Parquet file-format.
 *
 * However, this class serves as a simple delegate to the actual implementation hierarchy: it expects
 * either {@link HoodieCopyOnWriteTableInputFormat} or {@link HoodieMergeOnReadTableInputFormat} to be supplied
 * to which it delegates all of its necessary methods.
 */
public abstract class HoodieParquetInputFormatBase extends MapredParquetInputFormat implements Configurable {

And if you don't add this exclusion, you wind up pulling in lots of additional code that isn't needed (AFAICT).
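
As a quick sanity check (a throwaway snippet, not part of the proposed change), you can run mvn dependency:tree to confirm nothing else leaks in through hive-exec, and verify that the one class now resolves on the job's classpath:

public class HiveClassCheck {

    public static void main(String[] args) {
        try {
            // The single class Hudi needs from hive-exec (core).
            Class.forName("org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat");
            System.out.println("MapredParquetInputFormat resolves");
        } catch (ClassNotFoundException e) {
            System.out.println("Still missing: " + e.getMessage());
        }
    }
}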

Stacktrace

23/03/09 10:08:22 INFO executiongraph.ExecutionGraph:1423 - Source: split_monitor(table=[example-table], fields=[event_time, data, enrichment, key, partition]) (1/1) (16f707e9f9462ca1ac57f69e5bc9ae4e) switched from RUNNING to FAILED on 9cbbe102-0f19-48d6-849c-4755cab4fa2d @ localhost (dataPort=-1).
java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/io/parquet/MapredParquetInputFormat
	at java.lang.ClassLoader.defineClass1(Native Method) ~[?:?]
	at java.lang.ClassLoader.defineClass(ClassLoader.java:1016) ~[?:?]
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:174) ~[?:?]
	at jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:800) ~[?:?]
	at jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:698) ~[?:?]
	at jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:621) ~[?:?]
	at jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:579) ~[?:?]
	at jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178) ~[?:?]
	at java.lang.ClassLoader.loadClass(ClassLoader.java:521) ~[?:?]
	at java.lang.ClassLoader.defineClass1(Native Method) ~[?:?]
	at java.lang.ClassLoader.defineClass(ClassLoader.java:1016) ~[?:?]
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:174) ~[?:?]
	at jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:800) ~[?:?]
	at jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:698) ~[?:?]
	at jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:621) ~[?:?]
	at jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:579) ~[?:?]
	at jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178) ~[?:?]
	at java.lang.ClassLoader.loadClass(ClassLoader.java:521) ~[?:?]
	at java.lang.ClassLoader.defineClass1(Native Method) ~[?:?]
	at java.lang.ClassLoader.defineClass(ClassLoader.java:1016) ~[?:?]
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:174) ~[?:?]
	at jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:800) ~[?:?]
	at jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:698) ~[?:?]
	at jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:621) ~[?:?]
	at jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:579) ~[?:?]
	at jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178) ~[?:?]
	at java.lang.ClassLoader.loadClass(ClassLoader.java:521) ~[?:?]
	at org.apache.hudi.sink.partitioner.profile.WriteProfiles.getCommitMetadata(WriteProfiles.java:236) ~[hudi-flink-0.12.2.jar:0.12.2]
	at org.apache.hudi.source.IncrementalInputSplits.lambda$inputSplits$2(IncrementalInputSplits.java:285) ~[hudi-flink-0.12.2.jar:0.12.2]
	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) ~[?:?]
	at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655) ~[?:?]
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) ~[?:?]
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) ~[?:?]
	at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913) ~[?:?]
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]
	at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578) ~[?:?]
	at org.apache.hudi.source.IncrementalInputSplits.inputSplits(IncrementalInputSplits.java:285) ~[hudi-flink-0.12.2.jar:0.12.2]
	at org.apache.hudi.source.StreamReadMonitoringFunction.monitorDirAndForwardSplits(StreamReadMonitoringFunction.java:199) ~[hudi-flink-0.12.2.jar:0.12.2]
	at org.apache.hudi.source.StreamReadMonitoringFunction.run(StreamReadMonitoringFunction.java:172) ~[hudi-flink-0.12.2.jar:0.12.2]
	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:110) ~[flink-streaming-java-1.15.1.jar:1.15.1]
	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:67) ~[flink-streaming-java-1.15.1.jar:1.15.1]
	at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:332) ~[flink-streaming-java-1.15.1.jar:1.15.1]
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
	at jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581) ~[?:?]
	at jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178) ~[?:?]
	at java.lang.ClassLoader.loadClass(ClassLoader.java:521) ~[?:?]
	... 42 more

Status: ✅ Done