Conversation

@HyukjinKwon
Member

What changes were proposed in this pull request?

The main purpose of schema_of_json is to be used in combination with from_json (to fill the gap left by schema inference), which takes its schema only as a literal; however, schema_of_json currently allows JSON input as a non-literal expression (e.g., a column).

This was mistakenly allowed; we don't need to account for usages other than the main purpose for now.

This PR is a follow-up that only allows literals as schema_of_json's JSON input. We can allow non-literal expressions later when they are needed or a use case arises.
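A minimal sketch of the resulting behaviour (df is a hypothetical DataFrame with a string column json; this is not taken from the patch itself):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{from_json, lit, schema_of_json}

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq("""{"a": 1}""").toDF("json")  // hypothetical sample data

    // Main use case: infer the schema from a literal sample, then parse with from_json.
    df.select(from_json($"json", schema_of_json(lit("""{"a": 1}""")))).show()

    // Disallowed by this change: non-literal (column) JSON input.
    // df.select(schema_of_json($"json"))  // AnalysisException: ...should be a string literal...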

How was this patch tested?

Unit tests were added.

@HyukjinKwon
Member Author

This should be targeted at 2.4; otherwise we should describe the behaviour change in the migration note.

@SparkQA

SparkQA commented Oct 19, 2018

Test build #97605 has finished for PR 22775 at commit 2705aa8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Oct 19, 2018

Test build #97610 has finished for PR 22775 at commit 2705aa8.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

The latest Python failure looks relevant to this PR.

AnalysisException: u"cannot resolve 'schemaofjson(`value`)' due to data type mismatch: The input json should be a string literal and not null; however, got `value`.;;\n'Project [schemaofjson(value#2191) AS json#2194]\n+- LogicalRDD [key#2190L, value#2191], false\n"

@HyukjinKwon
Member Author

Yup, will fix.

@SparkQA

SparkQA commented Oct 20, 2018

Test build #97631 has finished for PR 22775 at commit 9cb0b94.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

Ah, let me rebase and sync the tests.

@SparkQA

SparkQA commented Oct 20, 2018

Test build #97640 has finished for PR 22775 at commit c132407.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Oct 20, 2018

Test build #97646 has finished for PR 22775 at commit c132407.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Oct 20, 2018

Test build #97649 has finished for PR 22775 at commit c132407.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Oct 20, 2018

Test build #97661 has finished for PR 22775 at commit c132407.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

@cloud-fan, mind taking a look please?

@SparkQA

SparkQA commented Oct 23, 2018

Test build #97905 has finished for PR 22775 at commit dbf9cab.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Oct 23, 2018

Test build #97910 has finished for PR 22775 at commit dbf9cab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@HyukjinKwon
Member Author

@cloud-fan, it looks like we are going to start another RC. Would you mind taking a quick look before the new RC?

@SparkQA

SparkQA commented Oct 25, 2018

Test build #98001 has finished for PR 22775 at commit dbf9cab.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Oct 25, 2018

Test build #98008 has finished for PR 22775 at commit dbf9cab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

I think I'm not qualified to make the decision here, as I don't fully understand the use case.

It looks to me that one use case would be to run schema_of_json on a column to manually figure out the final schema, and then use that schema in from_json. If schema_of_json only accepts literals, I'm not sure how users would use it.

@HyukjinKwon
Member Author

HyukjinKwon commented Oct 25, 2018

Actually, that use case can more easily be accomplished by simply inferring the schema with the JSON datasource. Yea, I indeed suggested that as a workaround for this issue before. Let's say, spark.read.json(df.select("json").as[String]).schema.
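A runnable sketch of that workaround (assuming a DataFrame df with a string column json):

    // Infer one schema over the whole column with the JSON datasource,
    // instead of inferring a schema per row with schema_of_json.
    import spark.implicits._
    val inferred = spark.read.json(df.select("json").as[String]).schema
    println(inferred)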

I know the use case of schema_of_json is not super clear, and that's actually partly why I want to allow only what we need for this expression now.

@rxin, WDYT? This PR tries to allow only what we need for now. That is, disallow:

schema_of_json(column)

and only allow

schema_of_json(literal)

because the main use case is:

from_json(schema_of_json(literal))

The case below:

from_json(schema_of_json(column))

is already not supported.

My judgement was that schema_of_json(column) doesn't have to be exposed for now, and frankly I want to talk with @MaxGekk about this when he comes back from his vacation (middle of next month).

@HyukjinKwon
Member Author

Maybe I am being too careful, but I am kind of nervous about this column case. I don't intend to disallow it entirely, only for Spark 2.4. We might have to find a way to use a column with from_json, but I want to defer that to 3.0.

@rxin
Contributor

rxin commented Oct 25, 2018

I agree it should be a literal value.

>>> df.select(schema.alias("json")).collect()
[Row(json=u'struct<a:bigint>')]
"""
if isinstance(col, basestring):
Contributor

shall we do the same for the Scala API, i.e., create def schema_of_json(json: String)?
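For instance, a minimal sketch of such an overload (an assumption, just wrapping the string in a literal and delegating to the existing Column variant):

    def schema_of_json(json: String): Column = schema_of_json(lit(json))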


override def dataType: DataType = StringType

override def inputTypes: Seq[DataType] = Seq(StringType)
Contributor

why do we need it since we already override checkInputDataTypes?
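For context, the literal-only validation that would make inputTypes redundant could look like this sketch (names follow the quoted snippets, not necessarily the merged code):

    override def checkInputDataTypes(): TypeCheckResult = child match {
      case Literal(s, StringType) if s != null => super.checkInputDataTypes()
      case _ => TypeCheckResult.TypeCheckFailure(
        s"The input json should be a string literal and not null; however, got ${child.sql}.")
    }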

s"The input json should be a string literal and not null; however, got ${child.sql}.")
}

override def eval(v: InternalRow = EmptyRow): Any = {
Contributor

when implementing eval, we usually don't provide a default value for the parameter. Shall we follow that code style?

@cloud-fan
Contributor

if we are OK with this direction, this LGTM except for a few minor comments. Thanks!

@SparkQA

SparkQA commented Oct 26, 2018

Test build #98066 has finished for PR 22775 at commit e2ca651.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 26, 2018

Test build #98067 has finished for PR 22775 at commit 03f34d9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

seems like a real test failure

@HyukjinKwon
Member Author

Yup, yup .. I should sync the tests

@SparkQA

SparkQA commented Oct 26, 2018

Test build #98078 has finished for PR 22775 at commit 8e6b97a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

>>> df.select(schema.alias("json")).collect()
[Row(json=u'struct<a:bigint>')]
"""
if isinstance(json, basestring):
Contributor

after more thought, maybe we should not add new features to 2.4? We can accept strings directly in 3.0.

Member Author

+1

Member Author
HyukjinKwon commented Oct 26, 2018

Actually, Wenchen, after a few more thoughts I think that's going to make it a bit complicated .. I think it's okay to go ahead. It was kind of a mistake that we had to fix in 2.4.

@cloud-fan
Contributor

thanks, merging to master! Can you send a new PR for 2.4? It conflicts.

@HyukjinKwon
Member Author

Sure!

@asfgit closed this in 33e337c Oct 26, 2018
@cloud-fan
Contributor

Actually this is not that hard. The conflict comes from the fact that in 2.4 schema_of_json doesn't take an options parameter.
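For reference, a sketch of the divergence as described here (signatures are assumptions inferred from this comment, not copied from the source):

    // branch-2.4: only the Column variant exists.
    def schema_of_json(e: Column): Column
    // master additionally accepts options, which is what conflicted.
    def schema_of_json(e: Column, options: java.util.Map[String, String]): Column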

I've fixed the conflict and pushed to 2.4. You can take a look at the commit and see if anything is wrong. I ran the touched tests locally to verify it.

asfgit pushed a commit that referenced this pull request Oct 26, 2018
…eral only

The main purpose of `schema_of_json` is to be used in combination with `from_json` (to fill the gap left by schema inference), which takes its schema only as a literal; however, `schema_of_json` currently allows JSON input as a non-literal expression (e.g., a column).

This was mistakenly allowed; we don't need to account for usages other than the main purpose for now.

This PR is a follow-up that only allows literals as `schema_of_json`'s JSON input. We can allow non-literal expressions later when they are needed or a use case arises.

Unit tests were added.

Closes #22775 from HyukjinKwon/SPARK-25447-followup.

Lead-authored-by: hyukjinkwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 33e337c)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@HyukjinKwon
Member Author

Oh, you mean fixing the conflict is not that hard. Thanks for doing this, @cloud-fan. I planned to do it today .. :-)

asfgit pushed a commit that referenced this pull request Oct 28, 2018
## What changes were proposed in this pull request?

after backporting #22775 to 2.4, the 2.4 sbt Jenkins QA job is broken; see https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.4-test-sbt-hadoop-2.7/147/console

This PR adds `if sys.version >= '3': basestring = str`, which only exists in master.

## How was this patch tested?

existing test

Closes #22858 from cloud-fan/python.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Aug 1, 2019
zhongjinhan pushed a commit to zhongjinhan/spark-1 that referenced this pull request Sep 3, 2019
@HyukjinKwon deleted the SPARK-25447-followup branch March 3, 2020 01:20
@nchammas
Contributor

nchammas commented Mar 3, 2020

This change seems like a step back from the original version introduced in #21686.

I have a DataFrame with a JSON column. I suspect the JSON values have an inconsistent schema, so I want to first check whether a single schema can apply before trying to parse the column.

With the original version of schema_of_json(), I could do something like this to check whether or not I have a consistent schema:

df.select(schema_of_json(...)).distinct().count()

But now I can't do that. I can't even wrap schema_of_json() in a UDF to get something like that, because it returns a Column. It seems surprising from an API design point of view for a function to only accept literals but return Columns. And it seems inconsistent with the general tenor of Spark SQL functions for a function not to accept Columns as input.

Can we revisit the design of this function (as well as that of its cousin, schema_of_csv())?

Alternately, would it make sense to deprecate these functions and instead recommend the approach that @HyukjinKwon suggested?

Actually, that use case can more easily be accomplished by simply inferring the schema with the JSON datasource. Yea, I indeed suggested that as a workaround for this issue before. Let's say, spark.read.json(df.select("json").as[String]).schema.

This demonstrates good Spark style (at least to me), and perhaps we can just promote this as a solution in the docs somewhere and do away with these functions.

For the passing reader, the Python equivalent of Hyukjin's suggestion is:

spark.read.json(df.rdd.map(lambda x: x[0])).schema

@HyukjinKwon
Member Author

@nchammas, thanks for some input here.

I could do something like this to check whether or not I have a consistent schema:

df.select(schema_of_json(...)).distinct().count()

I wanted to keep the expression only for the specific use case and to avoid having multiple ways to do the same thing. Other cases can easily be worked around. For the case you mentioned, it can be worked around as below:

spark.read.json(df.select("json").as[String]).schema == StructType.fromDDL(...)
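A sketch of that check (the expected DDL string is a hypothetical placeholder):

    import org.apache.spark.sql.types.StructType
    // Compare the schema inferred over the whole column against the expected one.
    val inferred = spark.read.json(df.select("json").as[String]).schema
    val consistent = inferred == StructType.fromDDL("a BIGINT")  // hypothetical DDL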

I think it is fine to have an expression that takes a literal and returns a column, to support a missing use case that has been requested multiple times.
