[SPARK-11980] [SQL] Fix json_tuple and add test cases for SPARK-10621 #9977
Conversation
Review comment on `python/pyspark/sql/functions.py` (outdated):
can you test both string names and columns? e.g.

```python
df.select(isnan("a").alias("r1"), isnan(df.a).alias("r2")).collect()
```
and do the same thing for the rest of the functions
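The reason both call forms need coverage is that PySpark functions typically accept either a string column name or a `Column` object and normalize the argument internally. A minimal sketch of that pattern (the `Column` class and `_to_column` helper below are illustrative stand-ins, not PySpark's actual internals):

```python
# Illustrative sketch of the string-or-Column pattern used by PySpark functions.
# `Column` and `_to_column` here are simplified stand-ins, not the real PySpark API.
class Column:
    def __init__(self, name):
        self.name = name

def _to_column(col):
    # Accept either a column name (str) or a Column object.
    return col if isinstance(col, Column) else Column(col)

def isnan_expr(col):
    # Build the same expression regardless of which form the caller passed.
    return f"isnan({_to_column(col).name})"

# Both call forms should produce the same expression:
print(isnan_expr("a"))          # isnan(a)
print(isnan_expr(Column("a")))  # isnan(a)
```

Testing only one of the two forms would leave the normalization path for the other form unexercised, which is why the reviewer asks for both.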
Test build #46709 has finished for PR 9977 at commit

Test build #46713 has finished for PR 9977 at commit
cc @davies for a final look. The changes LGTM.
In the next few days, I will look at the implementation of … If you compare the results of … Thank you!
Just did a quick check. I can confirm this is not caused by Python. I reproduced it using the Scala API.
Narrowed down to the following code in jsonExpressions.scala:

```scala
val output = new ByteArrayOutputStream()
val matched = Utils.tryWithResource(
  jsonFactory.createGenerator(output, JsonEncoding.UTF8)) { generator =>
  parser.nextToken()
  evaluatePath(parser, generator, RawStyle, parsed.get)
}
```

So far, our parser returns the same results for the two cases below:

```scala
val tuple: Seq[(String, String)] = ("5", """{"f1": null}""") :: Nil
val df: DataFrame = tuple.toDF("key", "jstring")
val res = df.select(functions.get_json_object($"jstring", "$.f1")).collect()

val tuple2: Seq[(String, String)] = ("5", """{"f1": "null"}""") :: Nil
val df2: DataFrame = tuple2.toDF("key", "jstring")
val res3 = df2.select(functions.get_json_object($"jstring", "$.f1")).collect()
```
Found a discussion about this issue: … Please let me know what I should do next. Thanks! @rxin @davies @marmbrus @cloud-fan
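The underlying ambiguity is that a JSON `null` value and the four-character string `"null"` are different things, and `get_json_object` should distinguish them. Spark's parser is Jackson-based, but the same distinction can be illustrated with Python's standard `json` module:

```python
import json

# A JSON null value and the literal string "null" parse to different results.
doc1 = json.loads('{"f1": null}')    # f1 is a JSON null
doc2 = json.loads('{"f1": "null"}')  # f1 is the string "null"

print(doc1["f1"] is None)  # True
print(doc2["f1"])          # null  (a real str, not None)
```

If the extraction path collapses both cases to the same output, callers can no longer tell a missing/null field apart from a field whose value happens to be the string `"null"`.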
Review comment on `python/pyspark/sql/functions.py` (outdated):
nit: I think one simple case should be enough for the Python tests; other corner cases should be tested in Scala.
The Python doctests become part of the API docs, so they should be reader-friendly.
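For context on why readability matters here: Python doctests are executed as tests and also rendered verbatim in the generated API documentation, so each example does double duty. A minimal, generic illustration (the `add` function is just a placeholder, not a Spark API):

```python
import doctest

def add(a, b):
    """Return the sum of a and b.

    The example below is both documentation and an executable test:

    >>> add(1, 2)
    3
    """
    return a + b

# Running the module's doctests verifies that the documented example still holds.
results = doctest.testmod()
print(results.failed)  # 0
```

This is why the reviewer prefers one clear case in the Python docstring and pushes corner cases into the Scala test suite, where readability of generated docs is not a concern.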
Sure, will do. I will move the test case for get_json_object to the Scala test file, and simplify the existing test cases for get_json_object and json_tuple. Thanks!
I'd just simplify the test case as Davies suggested, and then merge this in. In parallel you can work on a patch to fix whatever bugs you find.
Test build #46750 has finished for PR 9977 at commit
Thanks - I'm going to merge this.
Added Python test cases for the functions `isnan`, `isnull`, `nanvl` and `json_tuple`. Fixed a bug in the function `json_tuple`.

@rxin, could you help me review my changes? Please let me know if anything is missing. Thank you! Have a good Thanksgiving day!

Author: gatorsmile <gatorsmile@gmail.com>

Closes #9977 from gatorsmile/json_tuple.

(cherry picked from commit 068b643)

Signed-off-by: Reynold Xin <rxin@databricks.com>