[SPARK-13665][SQL] Separate the concerns of HadoopFsRelation #11509
Changes from all commits
LibSVMRelationSuite.scala

```diff
@@ -22,7 +22,7 @@ import java.io.{File, IOException}
 import com.google.common.base.Charsets
 import com.google.common.io.Files
 
-import org.apache.spark.SparkFunSuite
+import org.apache.spark.{SparkException, SparkFunSuite}
 import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vectors}
 import org.apache.spark.mllib.util.MLlibTestSparkContext
 import org.apache.spark.sql.SaveMode
```
```diff
@@ -88,7 +88,8 @@ class LibSVMRelationSuite extends SparkFunSuite with MLlibTestSparkContext {
     val df = sqlContext.read.format("libsvm").load(path)
     val tempDir2 = Utils.createTempDir()
     val writepath = tempDir2.toURI.toString
-    df.write.format("libsvm").mode(SaveMode.Overwrite).save(writepath)
+    // TODO: Remove requirement to coalesce by supporting multiple reads.
+    df.coalesce(1).write.format("libsvm").mode(SaveMode.Overwrite).save(writepath)
```
Contributor: I don't get this; LibSVMRelation doesn't support multiple reads even before your PR, right?

Contributor: Same here. What does "multiple reads" mean here, and why is the coalesce needed?

Author: Yeah, honestly I'm not sure where we were implicitly coalescing before, but this is required to make the test case pass. Before this PR the implementation had a restriction that it throws an error if there is more than one file, and I did not try to remove that.
```diff
     val df2 = sqlContext.read.format("libsvm").load(writepath)
     val row1 = df2.first()
```
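To make the restriction the author describes concrete, here is a minimal, hypothetical sketch of a reader-side guard that rejects multi-file input; the names are illustrative, not Spark's actual LibSVMRelation internals. This is why the test must `coalesce(1)` before writing.

```scala
// Hypothetical sketch: a single-file guard of the kind the author describes.
object SingleFileCheck extends App {
  def verifySinglePath(paths: Seq[String]): String = {
    require(paths.length == 1,
      s"This data source only supports a single input file, but found ${paths.length}")
    paths.head
  }

  println(verifySinglePath(Seq("part-00000")))         // ok: exactly one file
  // verifySinglePath(Seq("part-00000", "part-00001")) // would throw IllegalArgumentException
}
```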
```diff
@@ -98,9 +99,8 @@ class LibSVMRelationSuite extends SparkFunSuite with MLlibTestSparkContext {
   test("write libsvm data failed due to invalid schema") {
     val df = sqlContext.read.format("text").load(path)
-    val e = intercept[IOException] {
+    val e = intercept[SparkException] {
       df.write.format("libsvm").save(path + "_2")
     }
     assert(e.getMessage.contains("Illegal schema for libsvm data"))
   }
 }
```
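The expected exception type changes from `IOException` to `SparkException`, presumably because the schema failure now occurs inside a Spark write job, where the original cause surfaces wrapped in a job-level exception. A standalone ScalaTest sketch of the `intercept` pattern used above; the `RuntimeException` wrapping here is an assumption made to simulate that wrapping:

```scala
import java.io.IOException
import org.scalatest.FunSuite

// Standalone sketch of the intercept pattern: intercept returns the thrown
// exception so the test can assert on its message or its cause.
class InterceptSketch extends FunSuite {
  test("asserting on a wrapped exception") {
    val e = intercept[RuntimeException] {
      throw new RuntimeException("Job aborted",
        new IOException("Illegal schema for libsvm data"))
    }
    // The original message is still reachable through the cause.
    assert(e.getCause.getMessage.contains("Illegal schema for libsvm data"))
  }
}
```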
DataType.scala

```diff
@@ -103,7 +103,7 @@ object DataType {
   /** Given the string representation of a type, return its DataType */
   private def nameToType(name: String): DataType = {
-    val FIXED_DECIMAL = """decimal\(\s*(\d+)\s*,\s*(\d+)\s*\)""".r
+    val FIXED_DECIMAL = """decimal\(\s*(\d+)\s*,\s*(\-?\d+)\s*\)""".r
```
Member: Why change this?

Author: This test case is hard-coded to have a negative value here. I'm not actually sure if that's correct, but I would argue either way that we should be more permissive about parsing (otherwise the error message is …)
```diff
     name match {
       case "decimal" => DecimalType.USER_DEFAULT
       case FIXED_DECIMAL(precision, scale) => DecimalType(precision.toInt, scale.toInt)
```
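To see what the added `-?` buys, here is a small, self-contained sketch of the pattern in isolation; the object wrapper and `parse` helper are for illustration only:

```scala
// Demonstrates the relaxed FIXED_DECIMAL pattern: the added `-?` accepts a
// negative scale that the previous pattern failed to match.
object DecimalNameParsing extends App {
  val FIXED_DECIMAL = """decimal\(\s*(\d+)\s*,\s*(\-?\d+)\s*\)""".r

  def parse(name: String): Option[(Int, Int)] = name match {
    case FIXED_DECIMAL(precision, scale) => Some((precision.toInt, scale.toInt))
    case _ => None
  }

  println(parse("decimal(10, 2)"))  // Some((10,2))
  println(parse("decimal(2,-9)"))   // Some((2,-9)); with the old regex, parse returns None
}
```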
DataFrameWriter.scala

```diff
@@ -366,13 +366,6 @@ final class DataFrameWriter private[sql](df: DataFrame) {
       case (true, SaveMode.ErrorIfExists) =>
         throw new AnalysisException(s"Table $tableIdent already exists.")
 
-      case (true, SaveMode.Append) =>
-        // If it is Append, we just ask insertInto to handle it. We will not use insertInto
-        // to handle saveAsTable with Overwrite because saveAsTable can change the schema of
-        // the table. But, insertInto with Overwrite requires the schema of data be the same
-        // the schema of the table.
-        insertInto(tableIdent)
-
       case _ =>
```
Contributor: Why remove this?

Author: I consolidated the code paths. Any writing to …
```diff
         val cmd =
           CreateTableUsingAsSelect(
```
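A simplified, self-contained sketch of the consolidation described above; the names and return strings are illustrative, not Spark's actual internals. With the dedicated Append branch gone, Append falls through to the same generic command as the other modes, carrying the mode with it:

```scala
// Illustrative model of the dispatch after the removal: only Ignore and
// ErrorIfExists keep special handling for existing tables; everything else,
// including Append, reaches the generic create-table-as-select style path.
object SaveModeDispatch extends App {
  sealed trait Mode
  case object Append extends Mode
  case object Overwrite extends Mode
  case object ErrorIfExists extends Mode
  case object Ignore extends Mode

  def saveAsTable(tableExists: Boolean, mode: Mode): String =
    (tableExists, mode) match {
      case (true, Ignore)        => "no-op"
      case (true, ErrorIfExists) => "error: table already exists"
      // The (true, Append) => insertInto branch was removed.
      case _                     => s"create-table-as-select with mode $mode"
    }

  println(saveAsTable(tableExists = true, Append)) // goes through the generic path
}
```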
PhysicalRDD (ExistingRDD.scala)

```diff
@@ -25,7 +25,7 @@ import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext
 import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Statistics}
 import org.apache.spark.sql.catalyst.plans.physical.{HashPartitioning, Partitioning, UnknownPartitioning}
-import org.apache.spark.sql.execution.datasources.parquet.ParquetRelation
+import org.apache.spark.sql.execution.datasources.parquet.{DefaultSource => ParquetSource}
 import org.apache.spark.sql.execution.metric.SQLMetrics
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.sources.{BaseRelation, HadoopFsRelation}
```
```diff
@@ -226,16 +226,17 @@ private[sql] object PhysicalRDD {
       rdd: RDD[InternalRow],
       relation: BaseRelation,
       metadata: Map[String, String] = Map.empty): PhysicalRDD = {
-    val outputUnsafeRows = if (relation.isInstanceOf[ParquetRelation]) {
-      // The vectorized parquet reader does not produce unsafe rows.
-      !SQLContext.getActive().get.conf.getConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED)
-    } else {
-      // All HadoopFsRelations output UnsafeRows
-      relation.isInstanceOf[HadoopFsRelation]
+    val outputUnsafeRows = relation match {
+      case r: HadoopFsRelation if r.fileFormat.isInstanceOf[ParquetSource] =>
+        !SQLContext.getActive().get.conf.getConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED)
+      case _: HadoopFsRelation => true
+      case _ => false
```
Contributor: Just to confirm, after merging this and all the other planned …

Author: Yes, there is no interface to do that today since the only version returns an RDD[InternalRow], but SPARK-13682 tracks fixing that. When we do this we should avoid hacks that rely on erasure in the bytecode. Instead, just have an internal function that returns …
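The erasure hazard the author alludes to can be shown in a few lines of plain Scala; this is a generic illustration, not Spark code. After erasure, a runtime type test cannot see a collection's element type, so distinguishing, say, an RDD[Row] from an RDD[InternalRow] via isInstanceOf is unreliable:

```scala
// Generic illustration of type erasure: both checks compile with an
// "unchecked" warning and both print true, because at runtime only the Seq
// part of the type can be tested; the element type has been erased.
object ErasureDemo extends App {
  val ints: Seq[Any]    = Seq(1, 2)
  val strings: Seq[Any] = Seq("a", "b")
  println(ints.isInstanceOf[Seq[String]])  // true, despite holding Ints
  println(strings.isInstanceOf[Seq[Int]])  // true, despite holding Strings
}
```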
```diff
     }
 
     val bucketSpec = relation match {
-      case r: HadoopFsRelation => r.getBucketSpec
+      // TODO: this should be closer to bucket planning.
+      case r: HadoopFsRelation if r.sqlContext.conf.bucketingEnabled() => r.bucketSpec
       case _ => None
     }
```
Comment: Should we also verify the schema on write, i.e. in prepareWrite?

Reply: I think that we do already, on line 69.
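For context, a hypothetical sketch of what such a write-side schema check could look like; the method shape and the `typeName` test for the vector column are assumptions made for illustration, and the actual check is whatever sits at the "line 69" referenced above:

```scala
import java.io.IOException
import org.apache.spark.sql.types.{DoubleType, StructType}

// Hypothetical sketch of a libsvm write-side schema check: exactly two
// columns, a Double label and a features column holding vectors. The
// typeName comparison for the vector column is assumed, not Spark's code.
object LibSVMSchemaCheck {
  def verifySchema(dataSchema: StructType): Unit = {
    val valid = dataSchema.size == 2 &&
      dataSchema(0).dataType == DoubleType &&
      dataSchema(1).dataType.typeName == "vector"
    if (!valid) {
      throw new IOException(s"Illegal schema for libsvm data, schema=$dataSchema")
    }
  }
}
```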