[SPARK-33134][SQL][3.0] Return partial results only for root JSON objects #30032

MaxGekk · 2020-10-13T16:48:58Z

What changes were proposed in this pull request?

In this PR, I propose to restrict the partial-result feature to root JSON objects only. The JSON datasource, as well as from_json(), will return null for malformed nested JSON objects.

Why are the changes needed?

1. To avoid raising an exception to users in PERMISSIVE mode.
2. To fix a regression and restore the behavior of Spark 2.4.x.
3. The current implementation of partial results is supposed to work only for root (top-level) JSON objects, and it is not tested for malformed nested complex JSON fields.

Does this PR introduce any user-facing change?

Yes. Before the changes, the code below:

    // Assumes a spark-shell session, where spark.implicits._ is already imported.
    import org.apache.spark.sql.functions.from_json
    import org.apache.spark.sql.types._

    val pokerhand_raw = Seq("""[{"cards": [19], "playerId": 123456}]""").toDF("events")
    val event = new StructType().add("playerId", LongType).add("cards", ArrayType(new StructType().add("id", LongType).add("rank", StringType)))
    val pokerhand_events = pokerhand_raw.select(from_json($"events", ArrayType(event)).as("event"))
    pokerhand_events.show

throws an exception even in the default PERMISSIVE mode:

java.lang.ClassCastException: java.lang.Long cannot be cast to org.apache.spark.sql.catalyst.util.ArrayData
  at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48)
  at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48)
  at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195)

After the changes:

+-----+
|event|
+-----+
| null|
+-----+

How was this patch tested?

Added a test to JsonFunctionsSuite.
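The test added to JsonFunctionsSuite is not reproduced here. As a rough illustration only, the example from the description can be turned into a standalone check. The object name `NestedMalformedJsonCheck`, the `parse` helper, and the local `SparkSession` settings are assumptions made for this sketch, and the assertion is only expected to hold on a build that includes this change.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{ArrayType, LongType, StringType, StructType}

// Hypothetical standalone check; this is not the test added in the PR.
object NestedMalformedJsonCheck {
  // Schema from the PR description: `cards` is an array of structs.
  val event: StructType = new StructType()
    .add("playerId", LongType)
    .add("cards", ArrayType(new StructType().add("id", LongType).add("rank", StringType)))

  // Parses the PR's sample input, in which `cards` holds a number
  // where the schema expects a struct.
  def parse(spark: SparkSession): Row = {
    import spark.implicits._
    Seq("""[{"cards": [19], "playerId": 123456}]""")
      .toDF("events")
      .select(from_json($"events", ArrayType(event)).as("event"))
      .head()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SPARK-33134 check")
      .getOrCreate()
    try {
      // Per the PR description, the whole `event` value should be null
      // rather than the query failing with a ClassCastException.
      assert(parse(spark).isNullAt(0))
    } finally {
      spark.stop()
    }
  }
}
```

Without the change, the same program would be expected to fail with the ClassCastException shown in the description instead of reaching the assertion.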

HyukjinKwon · 2020-10-14T03:14:11Z

Merged to branch-3.0.

Closes #30032 from MaxGekk/json-skip-row-wrong-schema-3.0.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>


MaxGekk added 5 commits October 13, 2020 19:30

Add a test

9f2b03a

Fix

50e2393

Improve test

4da25ea

Simplify test

1ea8e77

Test more cases

a8c1403

MaxGekk mentioned this pull request Oct 13, 2020

[SPARK-33134][SQL] Return partial results only for root JSON objects #30031

Closed

MaxGekk added 2 commits October 13, 2020 22:16

Test refactoring

7073a43

Fix test

ebcb728

HyukjinKwon closed this Oct 14, 2020

MaxGekk deleted the json-skip-row-wrong-schema-3.0 branch December 11, 2020 20:28
