[SPARK-13665][SQL] Separate the concerns of HadoopFsRelation #11509
Conversation
Conflicts:
    sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
    sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala

fix all tests

Conflicts:
    sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala
    sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala
    sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala
    sql/hive/src/test/scala/org/apache/spark/sql/sources/CommitFailureTestRelationSuite.scala
    sql/hive/src/test/scala/org/apache/spark/sql/sources/SimpleTextRelation.scala

Conflicts:
    sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
    sql/hive/src/test/scala/org/apache/spark/sql/sources/CommitFailureTestRelationSuite.scala
Test build #52439 has finished for PR 11509 at commit
| test("call failure callbacks before close writer - default") { |
This is deleted because it's flaky? Or because it does not work with new APIs?
This needs to be rewritten to work against the new API. Filed SPARK-13681
-  class DefaultSource extends HadoopFsRelationProvider with DataSourceRegister {
+  class DefaultSource extends FileFormat with DataSourceRegister {

   override def shortName(): String = "csv"
Add toString for this?
done
Did one pass on this, looks great! All the comments are minor; it's fine to address them later.
      options: Map[String, String]): RDD[InternalRow] = {
    // TODO: This does not handle cases where column pruning has been performed.
    verifySchema(dataSchema)
Should we also verify the schema when writing, i.e. in prepareWrite?
I think that we do already, on line 69
Conflicts:
    sql/hive/src/test/scala/org/apache/spark/sql/sources/SimpleTextHadoopFsRelationSuite.scala
    sql/hive/src/test/scala/org/apache/spark/sql/sources/SimpleTextRelation.scala
Test build #52590 has finished for PR 11509 at commit
Going to merge this in master. We should rename HiveFileCatalog to MetastoreFileCatalog. cc @andrewor14
Test build #52582 has finished for PR 11509 at commit
  private var _partitionSpec: PartitionSpec = _

  /**
   * Used to read a write data in files to [[InternalRow]] format.
nit: a write -> and write
I believe the need to pass all files along (e.g. inputFiles: Array[FileStatus]) instead of just the input paths came from the need to cache them so that things looked snappy on S3, which has slow metadata operations. Because of this inputFiles param we now need driver programs with 16G of heap or larger (before 1G was enough), and even then it doesn't always work on very large datasets. I would hate to see inputFiles make it into the Spark 2.0 API instead of just inputPaths.
@koertkuipers improving the efficiency of working with large files was certainly a goal in this refactoring and this API is definitely not done yet. That said, I'm not really sure that the correct thing to do is to avoid listing all of the files at the driver. Every version of Spark SQL has done this listing AFAIK during split planning even before we added a caching layer.
If it did, then it was not always in the APIs, I think? With the APIs I remember, I found it was relatively straightforward to revert back to paths.
Follow-up to apache#11509, that simply refactors the interface that we use when resolving a pluggable `DataSource`.
- Multiple functions share the same set of arguments so we make this a case class, called `DataSource`. Actual resolution is now done by calling a function on this class.
- Instead of having multiple methods named `apply` (some of which do writing some of which do reading) we now explicitly have `resolveRelation()` and `write(mode, df)`.
- Get rid of `Array[String]` since this is an internal API and was forcing us to awkwardly call `toArray` in a bunch of places.

Author: Michael Armbrust <michael@databricks.com>

Closes apache#11572 from marmbrus/dataSourceResolution.
…cked.

## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-13728

apache#11509 makes the output only single ORC file. It was 10 files but this PR writes only single file. So, this could not skip stripes in ORC by the pushed down filters. So, this PR simply repartitions data into 10 so that the test could pass.

## How was this patch tested?

unittest and `./dev/run_tests` for code style test.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#11593 from HyukjinKwon/SPARK-13728.
`HadoopFsRelation` is used for reading most files into Spark SQL. However, today this class mixes the concerns of file management, schema reconciliation, scan building, bucketing, partitioning, and writing data. As a result, many data sources are forced to reimplement the same functionality and the various layers have accumulated a fair bit of inefficiency. This PR is a first cut at separating this into several components / interfaces that are each described below. Additionally, all implementations inside of Spark (parquet, csv, json, text, orc, libsvm) have been ported to the new API `FileFormat`. External libraries, such as spark-avro, will also need to be ported to work with Spark 2.0.
### HadoopFsRelation
A simple `case class` that acts as a container for all of the metadata required to read from a data source. All discovery, resolution and merging logic for schemas and partitions has been removed. This is an internal representation that no longer needs to be exposed to developers.
```scala
case class HadoopFsRelation(
sqlContext: SQLContext,
location: FileCatalog,
partitionSchema: StructType,
dataSchema: StructType,
bucketSpec: Option[BucketSpec],
fileFormat: FileFormat,
options: Map[String, String]) extends BaseRelation
```
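As a rough illustration (not code from this PR), wiring one of these up might look like the sketch below; `sqlContext`, `catalog`, the two schemas, and `format` are assumed to have been built elsewhere:
```scala
// Hypothetical construction of the container; every input on the right-hand
// side is assumed to already exist (catalog, schemas, format).
val relation = HadoopFsRelation(
  sqlContext = sqlContext,
  location = catalog,                 // a FileCatalog listing the input files
  partitionSchema = partitionSchema,  // columns encoded in the directory layout
  dataSchema = dataSchema,            // columns stored inside the files
  bucketSpec = None,                  // no bucketing
  fileFormat = format,                // e.g. the parquet or csv FileFormat
  options = Map("path" -> "/some/base/path"))
```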
### FileFormat
The primary interface that will be implemented by each different format, including external libraries. Implementors are responsible for reading a given format and converting it into `InternalRow` as well as writing out an `InternalRow`. A format can optionally return a schema that is inferred from a set of files.
```scala
trait FileFormat {
def inferSchema(
sqlContext: SQLContext,
options: Map[String, String],
files: Seq[FileStatus]): Option[StructType]
def prepareWrite(
sqlContext: SQLContext,
job: Job,
options: Map[String, String],
dataSchema: StructType): OutputWriterFactory
def buildInternalScan(
sqlContext: SQLContext,
dataSchema: StructType,
requiredColumns: Array[String],
filters: Array[Filter],
bucketSet: Option[BitSet],
inputFiles: Array[FileStatus],
broadcastedConf: Broadcast[SerializableConfiguration],
options: Map[String, String]): RDD[InternalRow]
}
```
The current interface is based on what was required to get all the tests passing again, but still mixes a couple of concerns (i.e. `bucketSet` is passed down to the scan instead of being resolved by the planner). Additionally, scans are still returning `RDD`s instead of iterators for single files. In a future PR, bucketing should be removed from this interface and the scan should be isolated to a single file.
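For an external library, plugging into this API would roughly mean providing an implementation shaped like the following sketch. The class name, the `shortName`, and the unimplemented bodies are illustrative placeholders, and the wildcard import assumes the new interfaces live alongside the existing `org.apache.spark.sql.sources` traits:
```scala
import org.apache.hadoop.fs.FileStatus
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources._ // Filter, DataSourceRegister, and (assumed) FileFormat / OutputWriterFactory
import org.apache.spark.sql.types.StructType
import org.apache.spark.util.SerializableConfiguration
import org.apache.spark.util.collection.BitSet

// Skeleton of a hypothetical external format. Each method maps to one of the
// concerns described above: schema inference, write preparation, and scanning.
class MyAvroFormat extends FileFormat with DataSourceRegister {

  override def shortName(): String = "myavro"

  // Sample the given files and return a schema, or None if the format
  // requires a user-supplied schema.
  override def inferSchema(
      sqlContext: SQLContext,
      options: Map[String, String],
      files: Seq[FileStatus]): Option[StructType] = None

  // Configure the Hadoop Job for the output format and return a factory that
  // creates one OutputWriter per write task.
  override def prepareWrite(
      sqlContext: SQLContext,
      job: Job,
      options: Map[String, String],
      dataSchema: StructType): OutputWriterFactory =
    sys.error("left unimplemented in this sketch")

  // Produce the rows for the listed files, applying column pruning and filter
  // pushdown where the underlying format supports them.
  override def buildInternalScan(
      sqlContext: SQLContext,
      dataSchema: StructType,
      requiredColumns: Array[String],
      filters: Array[Filter],
      bucketSet: Option[BitSet],
      inputFiles: Array[FileStatus],
      broadcastedConf: Broadcast[SerializableConfiguration],
      options: Map[String, String]): RDD[InternalRow] =
    sys.error("left unimplemented in this sketch")
}
```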
### FileCatalog
This interface is used to list the files that make up a given relation, as well as handle directory based partitioning.
```scala
trait FileCatalog {
def paths: Seq[Path]
def partitionSpec(schema: Option[StructType]): PartitionSpec
def allFiles(): Seq[FileStatus]
def getStatus(path: Path): Array[FileStatus]
def refresh(): Unit
}
```
Currently there are two implementations:
- `HDFSFileCatalog` - based on code from the old `HadoopFsRelation`. Infers partitioning by recursive listing and caches this data for performance
- `HiveFileCatalog` - based on the above, but it uses the partition spec from the Hive Metastore.
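A rough usage sketch, assuming `catalog` is one of the implementations above (illustrative consumption of the trait, not actual planner code):
```scala
// Hypothetical planner-side consumption of a FileCatalog.
val spec  = catalog.partitionSpec(schema = None)  // discovered partition columns and values
val files = catalog.allFiles()                    // every FileStatus backing the relation
val byDir = catalog.paths.map(p => p -> catalog.getStatus(p)).toMap
catalog.refresh()                                 // invalidate cached listings after new files land
```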
### ResolvedDataSource
Produces a logical plan given the following description of a Data Source (which can come from DataFrameReader or a metastore):
- `paths: Seq[String] = Nil`
- `userSpecifiedSchema: Option[StructType] = None`
- `partitionColumns: Array[String] = Array.empty`
- `bucketSpec: Option[BucketSpec] = None`
- `provider: String`
- `options: Map[String, String]`
This class is responsible for deciding which of the Data Source APIs a given provider is using (including the non-file based ones). All reconciliation of partitions, buckets, schema from metastores or inference is done here.
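As a hedged sketch of the call shape (mirroring the inputs listed above rather than the exact signature in this PR):
```scala
// Hypothetical resolution call; parameter names follow the description above,
// plus the active SQLContext. The result wraps the resolved BaseRelation from
// which the logical plan is built. After apache#11572 this becomes the
// DataSource case class with resolveRelation() / write(mode, df).
val resolved = ResolvedDataSource(
  sqlContext,
  paths = Seq("/data/events"),
  userSpecifiedSchema = None,             // let the FileFormat infer it
  partitionColumns = Array.empty[String],
  bucketSpec = None,
  provider = "parquet",                   // short name or fully qualified class name
  options = Map("mergeSchema" -> "false"))
```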
### DataSourceAnalysis / DataSourceStrategy
Responsible for analyzing and planning reading/writing of data using any of the Data Source APIs, including:
- pruning the files from partitions that will be read based on filters (a simplified sketch follows this list).
- appending partition columns*
- applying additional filters when a data source can not evaluate them internally.
- constructing an RDD that is bucketed correctly when required*
- sanity checking schema match-up and other analysis when writing.
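To make the pruning step concrete, here is a simplified, hypothetical sketch of the idea; the `Partition` holder and `prunePartitions` helper are illustrative, not the actual strategy code:
```scala
import org.apache.hadoop.fs.FileStatus

// Hypothetical holder pairing a partition's column values with its files.
case class Partition(values: Map[String, Any], files: Seq[FileStatus])

// Keep only the files of partitions whose values satisfy the predicate that
// was compiled from the partition-column filters.
def prunePartitions(
    keep: Map[String, Any] => Boolean,
    partitions: Seq[Partition]): Seq[FileStatus] =
  partitions.filter(p => keep(p.values)).flatMap(_.files)
```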
*In the future we should do the following:
- Break out file handling into its own Strategy as it's sufficiently complex / isolated.
- Push the appending of partition columns down into `FileFormat` to avoid an extra copy / unvectorization.
- Use a custom RDD for scans instead of `SQLNewNewHadoopRDD2`
Author: Michael Armbrust <michael@databricks.com>
Author: Wenchen Fan <wenchen@databricks.com>
Closes apache#11509 from marmbrus/fileDataSource.