Description
parquet-hadoop provides the only mechanism for loading .parquet files, and it declares an optional (provided-scope) dependency on hadoop-common, implying that parquet-hadoop can be used without Hadoop. In practice, however, hadoop-common is required at runtime.
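Concretely, because hadoop-common is provided-scope, any downstream project that wants to instantiate a reader today ends up declaring something like the following in its own build (version numbers here are illustrative, not prescribed by the issue):

```xml
<!-- parquet-hadoop itself -->
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-hadoop</artifactId>
  <version>1.11.1</version>
</dependency>
<!-- ...plus all of hadoop-common, solely to satisfy
     org.apache.hadoop.fs.Path / PathFilter at runtime -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>3.2.1</version>
</dependency>
```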
The following code is needed to instantiate a ParquetFileReader:

```scala
import java.io.{File, FileInputStream}

import org.apache.parquet.io.{DelegatingSeekableInputStream, InputFile, SeekableInputStream}

final class LocalInputFile(file: File) extends InputFile {
  def getLength(): Long = file.length()

  def newStream(): SeekableInputStream = {
    val input = new FileInputStream(file)
    new DelegatingSeekableInputStream(input) {
      def getPos(): Long = input.getChannel().position()
      def seek(bs: Long): Unit = {
        val _ = input.getChannel().position(bs)
      }
    }
  }
}
```

but using this leads to a runtime exception because Hadoop is missing: there is a hard transitive dependency on org.apache.hadoop.fs.PathFilter, which in turn depends on org.apache.hadoop.fs.Path, both of which live in hadoop-common.
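As an aside on the workaround itself: the delegating stream's getPos/seek pair simply mirrors the position of the FileInputStream's underlying FileChannel, which needs nothing from Hadoop. A minimal, parquet-free Java sketch of that mechanic (the class name SeekDemo and the scratch-file contents are illustrative, not from the issue):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class SeekDemo {
    public static void main(String[] args) throws IOException {
        // Write a small scratch file to seek around in.
        File file = File.createTempFile("seek-demo", ".bin");
        file.deleteOnExit();
        try (FileOutputStream out = new FileOutputStream(file)) {
            out.write("abcdefgh".getBytes(StandardCharsets.US_ASCII));
        }

        try (FileInputStream in = new FileInputStream(file)) {
            // getPos() in the workaround is just the channel's current position.
            System.out.println(in.getChannel().position()); // 0 at open

            // seek(n) is implemented by repositioning the underlying channel;
            // the FileInputStream observes the move on its next read.
            in.getChannel().position(4);
            System.out.println((char) in.read());           // 'e'
            System.out.println(in.getChannel().position()); // 5 after the read
        }
    }
}
```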
Requiring downstream users to depend on hadoop-common drags in an extremely large dependency tree, and I would rather this were not the case.
A search for "import org.apache.hadoop" in src/main reveals a few more places where the dependency is hardwired, although often in deprecated static constructors and therefore benign.
Reporter: Sam Halliday
Related issues:
Note: This issue was originally created as PARQUET-1953. Please see the migration documentation for further details.