Conversation

@marmbrus
Contributor

@marmbrus marmbrus commented Mar 4, 2016

HadoopFsRelation is used for reading most files into Spark SQL. However, today this class mixes the concerns of file management, schema reconciliation, scan building, bucketing, partitioning, and writing data. As a result, many data sources are forced to reimplement the same functionality and the various layers have accumulated a fair bit of inefficiency. This PR is a first cut at separating this into several components / interfaces that are each described below. Additionally, all implementations inside of Spark (parquet, csv, json, text, orc, libsvm) have been ported to the new API FileFormat. External libraries, such as spark-avro, will also need to be ported to work with Spark 2.0.

HadoopFsRelation

A simple case class that acts as a container for all of the metadata required to read from a data source. All discovery, resolution, and merging logic for schemas and partitions has been removed. This is an internal representation that no longer needs to be exposed to developers.

case class HadoopFsRelation(
    sqlContext: SQLContext,
    location: FileCatalog,
    partitionSchema: StructType,
    dataSchema: StructType,
    bucketSpec: Option[BucketSpec],
    fileFormat: FileFormat,
    options: Map[String, String]) extends BaseRelation
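
As a side note (not code from this PR), the full output schema of such a relation can be thought of as the data schema followed by any partition columns not already present in it. A minimal sketch of that combination, with purely illustrative schemas and a hypothetical helper name:

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical helper: the relation's output schema is assumed to be the data
// schema plus any partition columns the data schema does not already contain.
def fullSchema(dataSchema: StructType, partitionSchema: StructType): StructType = {
  val dataFieldNames = dataSchema.fieldNames.toSet
  StructType(dataSchema.fields ++
    partitionSchema.fields.filterNot(f => dataFieldNames.contains(f.name)))
}

// Example: data columns (id, value) stored under date=... directories.
val schema = fullSchema(
  StructType(Seq(StructField("id", IntegerType), StructField("value", StringType))),
  StructType(Seq(StructField("date", StringType))))
// schema.fieldNames == Array("id", "value", "date")
```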

FileFormat

The primary interface that will be implemented by each different format, including those provided by external libraries. Implementors are responsible for reading a given format and converting it into InternalRow, as well as writing out an InternalRow. A format can optionally return a schema that is inferred from a set of files.

trait FileFormat {
  def inferSchema(
      sqlContext: SQLContext,
      options: Map[String, String],
      files: Seq[FileStatus]): Option[StructType]

  def prepareWrite(
      sqlContext: SQLContext,
      job: Job,
      options: Map[String, String],
      dataSchema: StructType): OutputWriterFactory

  def buildInternalScan(
      sqlContext: SQLContext,
      dataSchema: StructType,
      requiredColumns: Array[String],
      filters: Array[Filter],
      bucketSet: Option[BitSet],
      inputFiles: Array[FileStatus],
      broadcastedConf: Broadcast[SerializableConfiguration],
      options: Map[String, String]): RDD[InternalRow]
}

The current interface is based on what was required to get all the tests passing again, but it still mixes a couple of concerns (e.g., bucketSet is passed down to the scan instead of being resolved by the planner). Additionally, scans still return RDDs instead of iterators for single files. In a future PR, bucketing should be removed from this interface and the scan should be isolated to a single file.
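
For illustration, a toy implementation of the trait above might look like the sketch below. The format itself (one string column per line), the class name, and the decision to leave the write path unimplemented are all hypothetical, not part of this PR; it also assumes the FileFormat and OutputWriterFactory definitions shown here are in scope.

```scala
import org.apache.hadoop.fs.FileStatus
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.unsafe.types.UTF8String
import org.apache.spark.util.SerializableConfiguration
import org.apache.spark.util.collection.BitSet

// Toy format: every line of every input file becomes a row with a single string column.
class SimpleLineFormat extends FileFormat {

  // The schema never depends on the data for this format, so no files are inspected.
  override def inferSchema(
      sqlContext: SQLContext,
      options: Map[String, String],
      files: Seq[FileStatus]): Option[StructType] =
    Some(StructType(StructField("value", StringType) :: Nil))

  // The write path is out of scope for this sketch.
  override def prepareWrite(
      sqlContext: SQLContext,
      job: Job,
      options: Map[String, String],
      dataSchema: StructType): OutputWriterFactory =
    throw new UnsupportedOperationException("write path not sketched here")

  // Read the input files with textFile and wrap each line in an InternalRow.
  // Column pruning, filters, and bucketing are ignored here (assumes >= 1 input file).
  override def buildInternalScan(
      sqlContext: SQLContext,
      dataSchema: StructType,
      requiredColumns: Array[String],
      filters: Array[Filter],
      bucketSet: Option[BitSet],
      inputFiles: Array[FileStatus],
      broadcastedConf: Broadcast[SerializableConfiguration],
      options: Map[String, String]): RDD[InternalRow] = {
    val paths = inputFiles.map(_.getPath.toString).mkString(",")
    sqlContext.sparkContext
      .textFile(paths)
      .map(line => InternalRow(UTF8String.fromString(line)))
  }
}
```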

FileCatalog

This interface is used to list the files that make up a given relation, as well as to handle directory-based partitioning.

trait FileCatalog {
  def paths: Seq[Path]
  def partitionSpec(schema: Option[StructType]): PartitionSpec
  def allFiles(): Seq[FileStatus]
  def getStatus(path: Path): Array[FileStatus]
  def refresh(): Unit
}

Currently there are two implementations:

  • HDFSFileCatalog - based on code from the old HadoopFsRelation. Infers partitioning by recursively listing directories and caches this data for performance.
  • HiveFileCatalog - based on the above, but uses the partition spec from the Hive Metastore.
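
A third, purely illustrative implementation (not part of this PR) could wrap an explicit, unpartitioned list of paths. The class name and the way an empty PartitionSpec is constructed below are assumptions:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}
import org.apache.spark.sql.types.StructType

// Hypothetical catalog over a fixed list of paths: no partitioning, no caching,
// every call re-lists the file system directly.
class StaticFileCatalog(hadoopConf: Configuration, val paths: Seq[Path]) extends FileCatalog {

  // This toy catalog never discovers partitions.
  override def partitionSpec(schema: Option[StructType]): PartitionSpec =
    PartitionSpec(StructType(Nil), Seq.empty)

  override def getStatus(path: Path): Array[FileStatus] =
    path.getFileSystem(hadoopConf).listStatus(path)

  override def allFiles(): Seq[FileStatus] = paths.flatMap(getStatus)

  // Nothing is cached, so there is nothing to refresh.
  override def refresh(): Unit = ()
}
```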

ResolvedDataSource

Produces a logical plan given the following description of a Data Source (which can come from DataFrameReader or a metastore):

  • paths: Seq[String] = Nil
  • userSpecifiedSchema: Option[StructType] = None
  • partitionColumns: Array[String] = Array.empty
  • bucketSpec: Option[BucketSpec] = None
  • provider: String
  • options: Map[String, String]

This class is responsible for deciding which of the Data Source APIs a given provider is using (including the non-file-based ones). All reconciliation of partitions, buckets, and schemas (whether they come from a metastore, a user specification, or inference) is done here.
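
For instance, a read like the one below would arrive at this resolution step as provider = "json", a single path, no user-specified schema, and the given options. The path and option are illustrative, and sqlContext is assumed to be an existing SQLContext; this sketches the mapping, not the resolution code itself:

```scala
// Hypothetical read; the comments show the Data Source description it maps to.
val df = sqlContext.read
  .format("json")                        // provider            = "json"
  .option("primitivesAsString", "true")  // options             = Map("primitivesAsString" -> "true")
  .load("/data/events")                  // paths               = Seq("/data/events")
                                         // userSpecifiedSchema = None
                                         // partitionColumns    = Array.empty
                                         // bucketSpec          = None
```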

DataSourceAnalysis / DataSourceStrategy

Responsible for analyzing and planning reading/writing of data using any of the Data Source APIs, including:

  • pruning, based on filters, the files from partitions that will be read (a sketch follows below).
  • appending partition columns*
  • applying additional filters when a data source cannot evaluate them internally.
  • constructing an RDD that is bucketed correctly when required*
  • sanity checking that schemas match up, and other analysis, when writing.

*In the future we should do the following:

  • Break out file handling into its own Strategy, as it is sufficiently complex / isolated.
  • Push the appending of partition columns down into FileFormat to avoid an extra copy / unvectorization.
  • Use a custom RDD for scans instead of SQLNewNewHadoopRDD2
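
To make the partition-pruning bullet above concrete, here is a hedged sketch with simplified stand-ins for the planner's data structures (PartitionValues, PartitionDir, and the predicate representation are all hypothetical, not the actual planner code):

```scala
import org.apache.hadoop.fs.Path

// Simplified stand-ins for discovered partitions and partition-column predicates.
case class PartitionValues(values: Map[String, String])   // e.g. Map("date" -> "2016-03-04")
case class PartitionDir(values: PartitionValues, path: Path)

// Keep only the directories whose partition values satisfy every predicate;
// files under the dropped directories are never listed or scanned.
def prunePartitions(
    partitions: Seq[PartitionDir],
    partitionFilters: Seq[PartitionValues => Boolean]): Seq[PartitionDir] =
  partitions.filter(p => partitionFilters.forall(pred => pred(p.values)))

// Example: only date=2016-03-04 survives out of three discovered partitions.
val discovered = Seq(
  PartitionDir(PartitionValues(Map("date" -> "2016-03-03")), new Path("/t/date=2016-03-03")),
  PartitionDir(PartitionValues(Map("date" -> "2016-03-04")), new Path("/t/date=2016-03-04")),
  PartitionDir(PartitionValues(Map("date" -> "2016-03-05")), new Path("/t/date=2016-03-05")))

val kept = prunePartitions(discovered, Seq(p => p.values("date") == "2016-03-04"))
// kept.map(_.path.toString) == Seq("/t/date=2016-03-04")
```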

marmbrus and others added 23 commits February 26, 2016 14:39
Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
	sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala
	sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala
	sql/hive/src/test/scala/org/apache/spark/sql/sources/CommitFailureTestRelationSuite.scala
	sql/hive/src/test/scala/org/apache/spark/sql/sources/SimpleTextRelation.scala
Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
	sql/hive/src/test/scala/org/apache/spark/sql/sources/CommitFailureTestRelationSuite.scala
@marmbrus
Contributor Author

marmbrus commented Mar 4, 2016

@SparkQA

SparkQA commented Mar 4, 2016

Test build #52439 has finished for PR 11509 at commit ac54278.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
}

test("call failure callbacks before close writer - default") {
Contributor

This is deleted because it's flaky? Or because it does not work with new APIs?

Contributor Author

This needs to be rewritten to work against the new API. Filed SPARK-13681

class DefaultSource extends HadoopFsRelationProvider with DataSourceRegister {
class DefaultSource extends FileFormat with DataSourceRegister {

override def shortName(): String = "csv"
Contributor

Add toString for this?

Contributor Author

done

@davies
Contributor

davies commented Mar 5, 2016

Did one pass on this, looks great! All the comments are minor, it's fine to be addressed later.

options: Map[String, String]): RDD[InternalRow] = {
// TODO: This does not handle cases where column pruning has been performed.

verifySchema(dataSchema)
Contributor

Should we also verify the schema when writing, i.e. in prepareWrite?

Contributor Author

I think that we do already, on line 69

marmbrus added 3 commits March 7, 2016 10:43
Conflicts:
	sql/hive/src/test/scala/org/apache/spark/sql/sources/SimpleTextHadoopFsRelationSuite.scala
	sql/hive/src/test/scala/org/apache/spark/sql/sources/SimpleTextRelation.scala
@SparkQA

SparkQA commented Mar 7, 2016

Test build #52590 has finished for PR 11509 at commit 3e5c7b7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Mar 7, 2016

Going to merge this in master.

We should rename HiveFileCatalog to MetastoreFileCatalog. cc @andrewor14

@asfgit asfgit closed this in e720dda Mar 7, 2016
@SparkQA

SparkQA commented Mar 7, 2016

Test build #52582 has finished for PR 11509 at commit fd65bcb.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.


private var _partitionSpec: PartitionSpec = _
/**
* Used to read a write data in files to [[InternalRow]] format.
Contributor

nit: a write -> and write

@koertkuipers
Contributor

I believe the need to pass all files along (e.g. inputFiles: Array[FileStatus]) instead of just the input paths came from the need to cache them so that things looked snappy on S3, which has slow metadata operations. However, it is not very realistic to pass along all files for real datasets, since the list can easily reach 100k+ entries (and some people on the mailing list reported using millions of files).

Because of this inputFiles param we now need driver programs with 16G of heap or larger (before, 1G was enough), and even then it doesn't always work on very large datasets. I would hate to see inputFiles make it into the Spark 2.0 API instead of just inputPaths.

@marmbrus
Contributor Author

marmbrus commented Mar 8, 2016

@koertkuipers improving the efficiency of working with large files was certainly a goal in this refactoring and this API is definitely not done yet. That said, I'm not really sure that the correct thing to do is to avoid listing all of the files at the driver. Every version of Spark SQL has done this listing AFAIK during split planning even before we added a caching layer.

@koertkuipers
Contributor

If it did, then it was not always in the APIs, I think? I remember the APIs having paths: Seq[String] instead of files: Seq[FileStatus]. By explicitly requiring the user to list all files in the API you make it impossible not to, even if it turns out it is not always necessary. For 1MM files that's no joke.

I found it was relatively straightforward to revert back to paths: Seq[String] once I ripped out the cache, modified partition discovery, and disabled some kind of data size estimation. So I more or less assumed it wasn't used anywhere else, but I might have missed split planning.


ghost pushed a commit to dbtsai/spark that referenced this pull request Mar 8, 2016
Follow-up to apache#11509, that simply refactors the interface that we use when resolving a pluggable `DataSource`.
 - Multiple functions share the same set of arguments so we make this a case class, called `DataSource`.  Actual resolution is now done by calling a function on this class.
 - Instead of having multiple methods named `apply` (some of which do writing some of which do reading) we now explicitly have `resolveRelation()` and `write(mode, df)`.
 - Get rid of `Array[String]` since this is an internal API and was forcing us to awkwardly call `toArray` in a bunch of places.

Author: Michael Armbrust <michael@databricks.com>

Closes apache#11572 from marmbrus/dataSourceResolution.
ghost pushed a commit to dbtsai/spark that referenced this pull request Mar 9, 2016
…cked.

## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-13728

apache#11509 makes the output a single ORC file.
It used to be 10 files, but after that change only a single file is written, so the pushed-down filters could not skip stripes in ORC.
So, this PR simply repartitions the data into 10 partitions so that the test can pass.
## How was this patch tested?

unittest and `./dev/run_tests` for code style test.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#11593 from HyukjinKwon/SPARK-13728.
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
Author: Michael Armbrust <michael@databricks.com>
Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#11509 from marmbrus/fileDataSource.
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
Author: Michael Armbrust <michael@databricks.com>

Closes apache#11572 from marmbrus/dataSourceResolution.
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#11593 from HyukjinKwon/SPARK-13728.