This project lets you read Parquet files using the Apache Spark RDD API. To get an RDD of type `T` (`RDD[T]`), the API requires you to:

- provide an implementation of `ReadSupport[T]` that transforms each row into a value of `T`
- ensure that the `ReadSupport[T]` is `Serializable`
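For illustration, here is a rough, hypothetical sketch of how such an extension method could be wired (an assumption for explanatory purposes, not this project's actual implementation). It also shows why the `ReadSupport[T]` instance must be serializable: it travels to the executors inside the task closure.

```scala
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetReader
import org.apache.parquet.hadoop.api.ReadSupport
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Hypothetical sketch -- the real ParquetRDD may be implemented differently
object ParquetRDD {
  implicit class ParquetContext(sc: SparkContext) {
    def parquet[T: ClassTag](path: Path, readSupport: ReadSupport[T]): RDD[T] =
      // Hadoop's Path is not serializable either, so only its string form is shipped
      sc.parallelize(Seq(path.toString), 1).flatMap { p =>
        // Runs on an executor: readSupport was serialized as part of this
        // closure, which is why it has to implement Serializable
        val reader = ParquetReader.builder(readSupport, new Path(p)).build()
        // (a production version would also close the reader)
        Iterator.continually(reader.read()).takeWhile(_ != null)
      }
  }
}
```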
Given an example file that looks like this:

| id  | login  | age |
|-----|--------|-----|
| 1   | login1 | 11  |
| 2   | login2 | 12  |
| 3   | login3 | 13  |
| ... | ...    | ... |
We need to provide an instance of `ReadSupport[T]` that is serializable. For this example we will use the `ReadSupport` implementation that ships with the parquet-mr project, called `GroupReadSupport`. The minor problem is that this implementation is not serializable, but since `Serializable` is only a marker interface and `GroupReadSupport` keeps no state, this is easily fixed with a simple trick:
```scala
class SerializableGroupReadSupport extends GroupReadSupport with Serializable
```
We can now read our file from HDFS (or the local file system) by calling `sc.parquet`:

```scala
import org.apache.hadoop.fs.Path
import org.apache.parquet.example.data.Group
import org.apache.spark.rdd.RDD
import ParquetRDD._

val path: Path = new Path("hdfs://localhost:9000/example.parquet")
val rdd: RDD[Group] = sc.parquet(path, new SerializableGroupReadSupport())
rdd.collect().foreach(println)
```
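Each `Group` exposes typed accessors, so the raw rows can be mapped into your own types. A small sketch, assuming the example schema above (the `User` case class is just for illustration):

```scala
case class User(id: Int, login: String, age: Int)

// Group.getInteger/getString take the field name and the value index
// (0 for non-repeated fields)
val users: RDD[User] = rdd.map { g =>
  User(g.getInteger("id", 0), g.getString("login", 0), g.getInteger("age", 0))
}
users.collect().foreach(println)
```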
You may also want to read only specific Parquet columns (a.k.a. projection). Just implement a `ReadSupport` that does the projection for you:
```scala
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.hadoop.api.ReadSupport.ReadContext
import org.apache.parquet.hadoop.example.GroupReadSupport
import org.apache.parquet.schema.{MessageType, MessageTypeParser}

class ProjectableGroupReadSupport(private val projectionStr: String)
    extends GroupReadSupport
    with Serializable {

  // Request only the projected columns instead of the full file schema
  override def init(configuration: Configuration,
                    keyValueMetaData: java.util.Map[String, String],
                    fileSchema: MessageType): ReadContext =
    new ReadContext(MessageTypeParser.parseMessageType(projectionStr))
}
```
Then use it in the same way:

```scala
import ParquetRDD._

val path: Path = new Path("hdfs://localhost:9000/example.parquet")
val projection = "message User {\n" +
  "  required int32 age;\n" +
  "}"
val rdd: RDD[Group] = sc.parquet(path, new ProjectableGroupReadSupport(projection))
rdd.collect().foreach(println)
```
Note that `ProjectableGroupReadSupport` holds a reference to the "serialized" `String` representation of the `MessageType` (a.k.a. the schema), because `MessageType` does not implement `Serializable`. The string is parsed back into a `MessageType` inside `init`, which runs on the executors after the read support has been deserialized.
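With the projection applied, each `Group` carries only the `age` column, which you can pull out with the typed accessor (a small usage sketch):

```scala
// Only the projected "age" field is materialized per row
val ages: RDD[Int] = rdd.map(_.getInteger("age", 0))
println(ages.collect().mkString(", "))   // e.g. 11, 12, 13
```

As a side note, if you already have a `MessageType` in hand, its `toString` renders this same textual schema format, which is a convenient way to build such a projection string from an existing schema.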