Conversation

@HyukjinKwon
Member

What changes were proposed in this pull request?

The main purpose of schema_of_json is to be used in combination with from_json (to fill the gap left by schema inference), which takes its schema only as a literal; however, schema_of_json currently allows JSON input as a non-literal expression (e.g., a column).

This was mistakenly allowed; we don't need to account for usages other than the main purpose for now.

This PR is a follow-up that only allows literals as schema_of_json's JSON input. We can allow non-literal expressions later when they are needed or a use case arises.
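A minimal sketch of the resulting behaviour (df is a hypothetical DataFrame with a string column json; this is not taken from the patch itself):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{from_json, lit, schema_of_json}

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq("""{"a": 1}""").toDF("json")  // hypothetical sample data

    // Main use case: infer the schema from a literal sample, then parse with from_json.
    df.select(from_json($"json", schema_of_json(lit("""{"a": 1}""")))).show()

    // Disallowed by this change: non-literal (column) JSON input.
    // df.select(schema_of_json($"json"))  // AnalysisException: ...should be a string literal...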

How was this patch tested?

Unit tests were added.

@HyukjinKwon
Member Author

This should be targeted at 2.4; otherwise we should describe the behaviour change in the migration note.

@SparkQA

SparkQA commented Oct 19, 2018

Test build #97605 has finished for PR 22775 at commit 2705aa8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Oct 19, 2018

Test build #97610 has finished for PR 22775 at commit 2705aa8.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

The latest Python failure looks relevant to this PR.

AnalysisException: u"cannot resolve 'schemaofjson(`value`)' due to data type mismatch: The input json should be a string literal and not null; however, got `value`.;;\n'Project [schemaofjson(value#2191) AS json#2194]\n+- LogicalRDD [key#2190L, value#2191], false\n"

@HyukjinKwon
Member Author

Yup, will fix.

@SparkQA

SparkQA commented Oct 20, 2018

Test build #97631 has finished for PR 22775 at commit 9cb0b94.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

Ah, let me rebase and sync the tests.

@SparkQA

SparkQA commented Oct 20, 2018

Test build #97640 has finished for PR 22775 at commit c132407.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Oct 20, 2018

Test build #97646 has finished for PR 22775 at commit c132407.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Oct 20, 2018

Test build #97649 has finished for PR 22775 at commit c132407.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Oct 20, 2018

Test build #97661 has finished for PR 22775 at commit c132407.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

@cloud-fan, mind taking a look please?

@SparkQA

SparkQA commented Oct 23, 2018

Test build #97905 has finished for PR 22775 at commit dbf9cab.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Oct 23, 2018

Test build #97910 has finished for PR 22775 at commit dbf9cab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@HyukjinKwon
Member Author

@cloud-fan, it looks like we are going to start another RC. Would you mind taking a quick look before the new RC?

@SparkQA

SparkQA commented Oct 25, 2018

Test build #98001 has finished for PR 22775 at commit dbf9cab.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Oct 25, 2018

Test build #98008 has finished for PR 22775 at commit dbf9cab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

I think I'm not qualified to make the decision here, as I don't fully understand the use case.

It looks to me that one use case would be to run schema_of_json on a column to manually figure out the final schema, and then use that schema in from_json. If schema_of_json only accepts literals, I'm not sure how users would use it.

@HyukjinKwon
Member Author

HyukjinKwon commented Oct 25, 2018

Actually, that use case can more easily be accomplished by simply inferring the schema with the JSON datasource. Yea, I indeed suggested that as a workaround for this issue before. Let's say, spark.read.json(df.select("json").as[String]).schema.
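A runnable sketch of that workaround (assuming a DataFrame df with a string column json):

    // Infer one schema over the whole column with the JSON datasource,
    // instead of inferring a schema per row with schema_of_json.
    import spark.implicits._
    val inferred = spark.read.json(df.select("json").as[String]).schema
    println(inferred)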

I know the use case of schema_of_json is not super clear, and that's actually partly why I want to allow only what we need for this expression now.

@rxin, WDYT? This PR tries to allow only what we need for now. That is, disallow:

schema_of_json(column)

and only allow

schema_of_json(literal)

because the main use case is:

from_json(schema_of_json(literal))

The case below:

from_json(schema_of_json(column))

is already not supported.

My judgement was that schema_of_json(column) doesn't have to be exposed for now, and frankly I want to talk with @MaxGekk about this when he comes back from his vacation (middle of next month).

@HyukjinKwon
Member Author

Maybe I am being too careful, but I am kind of nervous about this column case. I don't intend to disallow it entirely, only for Spark 2.4. We might have to find a way to use a column with from_json, but I want to defer that to 3.0.

@rxin
Contributor

rxin commented Oct 25, 2018

I agree it should be a literal value.

>>> df.select(schema.alias("json")).collect()
[Row(json=u'struct<a:bigint>')]
"""
if isinstance(col, basestring):
Contributor

shall we do the same for the Scala API, i.e., create def schema_of_json(json: String)?
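For instance, a minimal sketch of such an overload (an assumption, just wrapping the string in a literal and delegating to the existing Column variant):

    def schema_of_json(json: String): Column = schema_of_json(lit(json))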


override def dataType: DataType = StringType

override def inputTypes: Seq[DataType] = Seq(StringType)
Contributor

why do we need it since we already override checkInputDataTypes?
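For context, the literal-only validation that would make inputTypes redundant could look like this sketch (names follow the quoted snippets, not necessarily the merged code):

    override def checkInputDataTypes(): TypeCheckResult = child match {
      case Literal(s, StringType) if s != null => super.checkInputDataTypes()
      case _ => TypeCheckResult.TypeCheckFailure(
        s"The input json should be a string literal and not null; however, got ${child.sql}.")
    }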

s"The input json should be a string literal and not null; however, got ${child.sql}.")
}

override def eval(v: InternalRow = EmptyRow): Any = {
Contributor

when implementing eval, we usually don't provide a default value for the parameter. Shall we follow that code style?

@cloud-fan
Contributor

if we are OK with this direction, this LGTM except for a few minor comments. Thanks!

@SparkQA

SparkQA commented Oct 26, 2018

Test build #98066 has finished for PR 22775 at commit e2ca651.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 26, 2018

Test build #98067 has finished for PR 22775 at commit 03f34d9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

seems like a real test failure

@HyukjinKwon
Member Author

Yup, yup .. I should sync the tests

@SparkQA

SparkQA commented Oct 26, 2018

Test build #98078 has finished for PR 22775 at commit 8e6b97a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

>>> df.select(schema.alias("json")).collect()
[Row(json=u'struct<a:bigint>')]
"""
if isinstance(json, basestring):
Contributor

after more thought, maybe we should not add new features to 2.4? We can accept strings directly in 3.0.

Member Author

+1

Member Author
HyukjinKwon commented Oct 26, 2018

Actually, Wenchen, after a few more thoughts I think that's going to make it a bit complicated .. I think it's okay to go ahead. It was kind of a mistake that we had to fix in 2.4.

@cloud-fan
Contributor

thanks, merging to master! Can you send a new PR for 2.4? It conflicts.

@HyukjinKwon
Member Author

Sure!

@asfgit closed this in 33e337c Oct 26, 2018
@cloud-fan
Contributor

Actually this is not that hard. The conflict comes from the fact that in 2.4 schema_of_json doesn't take an options parameter.
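For reference, a sketch of the divergence as described here (signatures are assumptions inferred from this comment, not copied from the source):

    // branch-2.4: only the Column variant exists.
    def schema_of_json(e: Column): Column
    // master additionally accepts options, which is what conflicted.
    def schema_of_json(e: Column, options: java.util.Map[String, String]): Column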

I've fixed the conflict and pushed to 2.4. You can take a look at the commit and see if anything is wrong. I ran the touched tests locally to verify it.

asfgit pushed a commit that referenced this pull request Oct 26, 2018
…eral only

The main purpose of `schema_of_json` is to be used in combination with `from_json` (to fill the gap left by schema inference), which takes its schema only as a literal; however, `schema_of_json` currently allows JSON input as a non-literal expression (e.g., a column).

This was mistakenly allowed; we don't need to account for usages other than the main purpose for now.

This PR is a follow-up that only allows literals as `schema_of_json`'s JSON input. We can allow non-literal expressions later when they are needed or a use case arises.

Unit tests were added.

Closes #22775 from HyukjinKwon/SPARK-25447-followup.

Lead-authored-by: hyukjinkwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 33e337c)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@HyukjinKwon
Member Author

Oh, you mean fixing the conflict is not that hard. Thanks for doing this, @cloud-fan. I planned to do it today .. :-)

asfgit pushed a commit that referenced this pull request Oct 28, 2018
## What changes were proposed in this pull request?

after backporting #22775 to 2.4, the 2.4 sbt Jenkins QA job is broken; see https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.4-test-sbt-hadoop-2.7/147/console

This PR adds `if sys.version >= '3': basestring = str`, which only exists in master.

## How was this patch tested?

existing test

Closes #22858 from cloud-fan/python.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Aug 1, 2019
zhongjinhan pushed a commit to zhongjinhan/spark-1 that referenced this pull request Sep 3, 2019
@HyukjinKwon deleted the SPARK-25447-followup branch March 3, 2020 01:20
@nchammas
Contributor

nchammas commented Mar 3, 2020

This change seems like a step back from the original version introduced in #21686.

I have a DataFrame with a JSON column. I suspect the JSON values have an inconsistent schema, so I want to first check whether a single schema can apply before trying to parse the column.

With the original version of schema_of_json(), I could do something like this to check whether or not I have a consistent schema:

df.select(schema_of_json(...)).distinct().count()

But now I can't do that. I can't even wrap schema_of_json() in a UDF to get something like that, because it returns a Column. It seems surprising from an API design point of view for a function to only accept literals but return Columns. And it seems inconsistent with the general tenor of Spark SQL functions for a function not to accept Columns as input.

Can we revisit the design of this function (as well as that of its cousin, schema_of_csv())?

Alternately, would it make sense to deprecate these functions and instead recommend the approach that @HyukjinKwon suggested?

Actually, that use case can more easily be accomplished by simply inferring the schema with the JSON datasource. Yea, I indeed suggested that as a workaround for this issue before. Let's say, spark.read.json(df.select("json").as[String]).schema.

This demonstrates good Spark style (at least to me), and perhaps we can just promote this as a solution in the docs somewhere and do away with these functions.

For the passing reader, the Python equivalent of Hyukjin's suggestion is:

spark.read.json(df.rdd.map(lambda x: x[0])).schema

@HyukjinKwon
Member Author

@nchammas, thanks for some input here.

I could do something like this to check whether or not I have a consistent schema:

df.select(schema_of_json(...)).distinct().count()

I wanted to keep the expression only for the specific use case and to avoid having multiple ways to do the same thing. Other cases can easily be worked around. For the case you mentioned, it can be worked around as below:

spark.read.json(df.select("json").as[String]).schema == StructType.fromDDL(...)
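A sketch of that check (the expected DDL string is a hypothetical placeholder):

    import org.apache.spark.sql.types.StructType
    // Compare the schema inferred over the whole column against the expected one.
    val inferred = spark.read.json(df.select("json").as[String]).schema
    val consistent = inferred == StructType.fromDDL("a BIGINT")  // hypothetical DDL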

I think it is fine to have an expression that takes a literal and returns a column, to support a missing use case that has been requested multiple times.
