[SPARK-27128][SQL] Migrate JSON to File Data Source V2 #24058
Conversation
Test build #103339 has finished for PR 24058 at commit
Force-pushed from 7b7fb79 to 7133bd2.
Test build #104023 has finished for PR 24058 at commit
Test build #104039 has finished for PR 24058 at commit
retest this please.
Test build #104075 has finished for PR 24058 at commit
retest this please.
Test build #104080 has finished for PR 24058 at commit
Test build #104379 has finished for PR 24058 at commit
Test build #104380 has finished for PR 24058 at commit
Test build #104382 has finished for PR 24058 at commit
Test build #104517 has finished for PR 24058 at commit
retest this please.
Test build #104537 has finished for PR 24058 at commit
retest this please.
Test build #104542 has finished for PR 24058 at commit
Test build #104549 has finished for PR 24058 at commit
This is ready. Please help review it. @cloud-fan @dongjoon-hyun @HyukjinKwon
hmm, it doesn't say "JSON" now?
The current error message was changed to `Unable to infer schema for $tableName`, where `tableName` is `shortName + path`. I can create another PR to fix that.
Looks good if this is matched to the CSV one. Will take a closer look later this week.
…ailure in file source V2

## What changes were proposed in this pull request?

Since https://github.com/apache/spark/pull/23383/files#diff-db4a140579c1ac4b1dbec7fe5057eecaR36, the exception message of schema inference failure in file source V2 is `tableName`, which is equivalent to `shortName + path`. While in file source V1, the message is `Unable to infer schema from ORC/CSV/JSON...`. We should make the message in V2 consistent with V1, so that in the future migration the related test cases don't need to be modified. #24058 (review)

## How was this patch tested?

Revert the modified unit test cases in https://github.com/apache/spark/pull/24005/files#diff-b9ddfbc9be8d83ecf100b3b8ff9610b9R431 and https://github.com/apache/spark/pull/23383/files#diff-9ab56940ee5a53f2bb81e3c008653362R577, and test with them.

Closes #24369 from gengliangwang/reviseInferSchemaMessage.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
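The inconsistency described in the commit above can be sketched in plain Scala. This is a hypothetical, simplified reconstruction (the object and method names are illustrative, not Spark's actual code): V1 names the format explicitly, while the pre-fix V2 message interpolated `shortName + path` as a single table identifier.

```scala
// Hypothetical sketch of the two message styles discussed above; not Spark code.
object InferSchemaMessages {
  // File source V1 style: names the format ("JSON", "CSV", "ORC", ...).
  def v1Style(format: String): String =
    s"Unable to infer schema for $format. It must be specified manually."

  // Pre-fix file source V2 style: `tableName` was shortName + path,
  // so the format name and the input paths were fused together.
  def v2StyleBeforeFix(shortName: String, paths: Seq[String]): String = {
    val tableName = shortName + " " + paths.mkString(",")
    s"Unable to infer schema for $tableName"
  }
}
```

Making V2 emit the V1-style message means existing V1 test expectations keep passing unchanged after migration.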
Test build #104588 has finished for PR 24058 at commit
retest this please.
val df = spark.read.json(path.getCanonicalPath).select("CoL1", "CoL5", "CoL3").distinct()
checkAnswer(df, Row("a", "e", "c"))
df.explain(true)
we should remove it
  test("Incorrect result caused by the rule OptimizeMetadataOnlyQuery") {
-   withSQLConf(OPTIMIZER_METADATA_ONLY.key -> "true") {
+   withSQLConf(OPTIMIZER_METADATA_ONLY.key -> "true",
+     SQLConf.USE_V1_SOURCE_READER_LIST.key -> "json") {
isn't v2 disabled by default?
V2 reader is enabled by default.
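For context: since the V2 readers were on by default at this point, tests that depend on V1-only behavior (such as the OptimizeMetadataOnlyQuery test in the diff above) opt back into V1 per format. A minimal sketch of that pattern — `withSQLConf` and `SQLConf` are assumed from Spark's test framework, as in the diff, so this is a fragment rather than a self-contained program:

```scala
// Fall back to the V1 JSON reader for the scope of this block only;
// other formats keep using their V2 implementations.
withSQLConf(SQLConf.USE_V1_SOURCE_READER_LIST.key -> "json") {
  // ... assertions that rely on V1 JSON behavior ...
}
```

Listing a format's short name in this config routes its reads through the old FileFormat path, which is how migration PRs like this one keep a per-format escape hatch.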
}.getMessage
assert(msg.contains("only include the internal corrupt record column"))
intercept[catalyst.errors.TreeNodeException[_]] {
  spark.read.schema(schema).json(path).filter($"_corrupt_record".isNotNull).count()
do we change the behavior for this case?
See the discussion in https://github.com/apache/spark/pull/24005/files#r263881555
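The restriction exercised in the test above is that a query over raw JSON referencing only the internal `_corrupt_record` column is rejected. Spark's documented workaround is to materialize the parsed result first. A sketch of that pattern, assuming an active `spark` session and a `schema` that includes `_corrupt_record` (a fragment, not a self-contained program):

```scala
// Cache the parsed rows first; afterwards, queries that touch only
// _corrupt_record no longer hit the raw-file restriction.
val parsed = spark.read.schema(schema).json(path).cache()
parsed.filter($"_corrupt_record".isNotNull).count()
```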
Test build #104652 has finished for PR 24058 at commit
Test build #104659 has finished for PR 24058 at commit
Test build #104668 has finished for PR 24058 at commit
Retest this please.
Test build #104788 has finished for PR 24058 at commit
thanks, merging to master!
What changes were proposed in this pull request?
Migrate JSON to File Data Source V2.
How was this patch tested?
Unit tests.