[SPARK-27128][SQL] Migrate JSON to File Data Source V2 #24058
Conversation
Test build #103339 has finished for PR 24058 at commit
Force-pushed from 7b7fb79 to 7133bd2.
Test build #104023 has finished for PR 24058 at commit
Test build #104039 has finished for PR 24058 at commit
retest this please.
Test build #104075 has finished for PR 24058 at commit
retest this please.
Test build #104080 has finished for PR 24058 at commit
Test build #104379 has finished for PR 24058 at commit
Test build #104380 has finished for PR 24058 at commit
Test build #104382 has finished for PR 24058 at commit
Test build #104517 has finished for PR 24058 at commit
retest this please.
Test build #104537 has finished for PR 24058 at commit
retest this please.
Test build #104542 has finished for PR 24058 at commit
Test build #104549 has finished for PR 24058 at commit
This is ready. Please help review it. @cloud-fan @dongjoon-hyun @HyukjinKwon
hmm, it doesn't say "JSON" now?
The current error message was changed to `Unable to infer schema for $tableName`, where `tableName` is `shortName + path`. I can create another PR to fix that.
Looks good if this is matched to the CSV one. Will take a closer look later this week.
…ailure in file source V2

## What changes were proposed in this pull request?

Since https://github.com/apache/spark/pull/23383/files#diff-db4a140579c1ac4b1dbec7fe5057eecaR36, the exception message of schema inference failure in file source V2 is `tableName`, which is equivalent to `shortName + path`. While in file source V1, the message is `Unable to infer schema from ORC/CSV/JSON...`. We should make the message in V2 consistent with V1, so that in the future migration the related test cases don't need to be modified. #24058 (review)

## How was this patch tested?

Revert the modified unit test cases in https://github.com/apache/spark/pull/24005/files#diff-b9ddfbc9be8d83ecf100b3b8ff9610b9R431 and https://github.com/apache/spark/pull/23383/files#diff-9ab56940ee5a53f2bb81e3c008653362R577, and test with them.

Closes #24369 from gengliangwang/reviseInferSchemaMessage.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
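The inconsistency described in the commit above can be sketched in plain Scala. This is a hypothetical, simplified reconstruction (the object and method names are illustrative, not Spark's actual code): V1 names the format explicitly, while the pre-fix V2 message interpolated `shortName + path` as a single table identifier.

```scala
// Hypothetical sketch of the two message styles discussed above; not Spark code.
object InferSchemaMessages {
  // File source V1 style: names the format ("JSON", "CSV", "ORC", ...).
  def v1Style(format: String): String =
    s"Unable to infer schema for $format. It must be specified manually."

  // Pre-fix file source V2 style: `tableName` was shortName + path,
  // so the format name and the input paths were fused together.
  def v2StyleBeforeFix(shortName: String, paths: Seq[String]): String = {
    val tableName = shortName + " " + paths.mkString(",")
    s"Unable to infer schema for $tableName"
  }
}
```

Making V2 emit the V1-style message means existing V1 test expectations keep passing unchanged after migration.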
Test build #104588 has finished for PR 24058 at commit
retest this please.
val df = spark.read.json(path.getCanonicalPath).select("CoL1", "CoL5", "CoL3").distinct()
checkAnswer(df, Row("a", "e", "c"))
df.explain(true)
we should remove it
  test("Incorrect result caused by the rule OptimizeMetadataOnlyQuery") {
-   withSQLConf(OPTIMIZER_METADATA_ONLY.key -> "true") {
+   withSQLConf(OPTIMIZER_METADATA_ONLY.key -> "true",
+     SQLConf.USE_V1_SOURCE_READER_LIST.key -> "json") {
isn't v2 disabled by default?
V2 reader is enabled by default.
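For context: since the V2 readers were on by default at this point, tests that depend on V1-only behavior (such as the OptimizeMetadataOnlyQuery test in the diff above) opt back into V1 per format. A minimal sketch of that pattern — `withSQLConf` and `SQLConf` are assumed from Spark's test framework, as in the diff, so this is a fragment rather than a self-contained program:

```scala
// Fall back to the V1 JSON reader for the scope of this block only;
// other formats keep using their V2 implementations.
withSQLConf(SQLConf.USE_V1_SOURCE_READER_LIST.key -> "json") {
  // ... assertions that rely on V1 JSON behavior ...
}
```

Listing a format's short name in this config routes its reads through the old FileFormat path, which is how migration PRs like this one keep a per-format escape hatch.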
}.getMessage
assert(msg.contains("only include the internal corrupt record column"))
intercept[catalyst.errors.TreeNodeException[_]] {
  spark.read.schema(schema).json(path).filter($"_corrupt_record".isNotNull).count()
do we change the behavior for this case?
See the discussion in https://github.com/apache/spark/pull/24005/files#r263881555
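The restriction exercised in the test above is that a query over raw JSON referencing only the internal `_corrupt_record` column is rejected. Spark's documented workaround is to materialize the parsed result first. A sketch of that pattern, assuming an active `spark` session and a `schema` that includes `_corrupt_record` (a fragment, not a self-contained program):

```scala
// Cache the parsed rows first; afterwards, queries that touch only
// _corrupt_record no longer hit the raw-file restriction.
val parsed = spark.read.schema(schema).json(path).cache()
parsed.filter($"_corrupt_record".isNotNull).count()
```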
Test build #104652 has finished for PR 24058 at commit
Test build #104659 has finished for PR 24058 at commit
Test build #104668 has finished for PR 24058 at commit
Retest this please.
Test build #104788 has finished for PR 24058 at commit
thanks, merging to master!
What changes were proposed in this pull request?
Migrate JSON to File Data Source V2.
How was this patch tested?
Unit tests.