[SPARK-24709][SQL][FOLLOW-UP] Make schema_of_json's input json as literal only #22775
Conversation
|
This should be targeted at 2.4 .. otherwise we should describe the behaviour change in the migration note. |
|
Test build #97605 has finished for PR 22775 at commit
|
|
retest this please |
|
Test build #97610 has finished for PR 22775 at commit
|
|
The latest Python failure looks relevant to this PR. AnalysisException: u"cannot resolve 'schemaofjson(`value`)' due to data type mismatch: The input json should be a string literal and not null; however, got `value`.;;\n'Project [schemaofjson(value#2191) AS json#2194]\n+- LogicalRDD [key#2190L, value#2191], false\n" |
|
Yup, will fix. |
|
Test build #97631 has finished for PR 22775 at commit
|
|
Ah.. let me rebase and sync the tests |
Force-pushed from 9cb0b94 to c132407
|
Test build #97640 has finished for PR 22775 at commit
|
|
retest this please |
|
Test build #97646 has finished for PR 22775 at commit
|
|
retest this please |
|
Test build #97649 has finished for PR 22775 at commit
|
|
retest this please |
|
Test build #97661 has finished for PR 22775 at commit
|
Force-pushed from c132407 to dbf9cab
|
@cloud-fan, mind taking a look please? |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
|
Test build #97905 has finished for PR 22775 at commit
|
|
retest this please |
|
Test build #97910 has finished for PR 22775 at commit
|
|
retest this please |
|
@cloud-fan, looks like we are going to start another RC. Would you mind taking a quick look before the new RC? |
|
Test build #98001 has finished for PR 22775 at commit
|
|
retest this please |
|
Test build #98008 has finished for PR 22775 at commit
|
|
I think I'm not qualified to make the decision here, as I don't fully understand the use case. It looks to me that one use case would be to run |
|
Actually, that use case can more easily be accomplished by simply inferring the schema via the JSON datasource; I indeed suggested that as a workaround for this issue before. I know it's not super clear what the use case is. @rxin, WDYT? This PR tries to allow only what we need for now: disallow `schema_of_json(column)` and allow only `schema_of_json(literal)`, because the main use case is `from_json(schema_of_json(literal))`. The case below, `from_json(schema_of_json(column))`, is already not supported. |
|
Maybe I am being too careful about it, but I am kind of nervous about this column case. I don't intend to disallow it entirely, only for Spark 2.4. We might have to find a way to use a column with |
|
I agree it should be a literal value. |
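The rule the thread settles on — the JSON input must be a non-null string literal — can be sketched in plain Python. This is a simplified, hypothetical illustration (the `Literal`/`Column` stand-ins and the `check_schema_of_json_input` helper are mine), not Catalyst's actual `checkInputDataTypes` implementation:

```python
from dataclasses import dataclass
from typing import Optional

# "Literal" stands in for Catalyst's foldable string expression;
# anything else (e.g. a Column reference) is rejected, mirroring the
# "string literal and not null" error message quoted earlier.

@dataclass
class Literal:
    value: Optional[str]

@dataclass
class Column:
    name: str

def check_schema_of_json_input(expr):
    """Return None if the input is acceptable, else an error message."""
    if isinstance(expr, Literal) and expr.value is not None:
        return None
    got = expr.name if isinstance(expr, Column) else repr(expr)
    return ("The input json should be a string literal and not null; "
            "however, got %s." % got)
```

Under this sketch, `check_schema_of_json_input(Column("value"))` produces the same kind of analysis error the failing Python test reported above.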
python/pyspark/sql/functions.py
Outdated
>>> df.select(schema.alias("json")).collect()
[Row(json=u'struct<a:bigint>')]
"""
if isinstance(col, basestring):
shall we do the same for the Scala API? i.e. create `def schema_of_json(json: String)`
|
|
override def dataType: DataType = StringType

override def inputTypes: Seq[DataType] = Seq(StringType)
why do we need it since we already override checkInputDataTypes?
s"The input json should be a string literal and not null; however, got ${child.sql}.")
}

override def eval(v: InternalRow = EmptyRow): Any = {
when implementing eval, we usually don't put the default value. Shall we follow this code style?
|
if we are ok with this direction, this LGTM except a few minor comments. Thanks! |
|
Test build #98066 has finished for PR 22775 at commit
|
|
Test build #98067 has finished for PR 22775 at commit
|
|
seems like a real test failure |
|
Yup, yup .. I should sync the tests |
|
Test build #98078 has finished for PR 22775 at commit
|
>>> df.select(schema.alias("json")).collect()
[Row(json=u'struct<a:bigint>')]
"""
if isinstance(json, basestring):
after more thought, maybe we should not add new features to 2.4? We can accept strings directly in 3.0.
+1
Actually, Wenchen, I think that's going to make it a bit complicated after a few more thoughts .. I think it's okay to go ahead. It's kind of a mistake that we had to fix in 2.4.
|
thanks, merging to master! can you send a new PR for 2.4? it conflicts |
|
Sure! |
|
Actually this is not that hard. I've fixed the conflict and pushed it to 2.4. You can take a look at the commit and see if there is something wrong. I ran the touched tests locally to verify it. |
…eral only The main purpose of `schema_of_json` is its combination with `from_json` (to make up for the lack of schema inference), which takes its schema only as a literal; however, `schema_of_json` currently allows JSON input as non-literal expressions (e.g., a column). This was mistakenly allowed - we don't have to take usages other than the main purpose into account for now. This PR is a follow-up to only allow literals for `schema_of_json`'s JSON input. We can allow non-literal expressions later when needed or when there is a use case for it. Unit tests were added. Closes #22775 from HyukjinKwon/SPARK-25447-followup. Lead-authored-by: hyukjinkwon <gurwls223@apache.org> Co-authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 33e337c) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
|
Oh you mean the conflict fixing is not that hard. Thanks for doing this @cloud-fan. I planned to do this today .. :-). |
## What changes were proposed in this pull request? After backporting #22775 to 2.4, the 2.4 sbt Jenkins QA job is broken, see https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.4-test-sbt-hadoop-2.7/147/console This PR adds `if sys.version >= '3': basestring = str`, which only exists in master. ## How was this patch tested? Existing test. Closes #22858 from cloud-fan/python. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>
|
This change seems like a step back from the original version introduced in #21686. I have a DataFrame with a JSON column. I suspect the JSON values have an inconsistent schema, so I want to first check whether a single schema can apply before trying to parse the column. With the original version I could run `df.select(schema_of_json(...)).distinct().count()`, but now I can't do that. Can we revisit the design of this function? Alternately, would it make sense to deprecate these functions and instead recommend the approach that @HyukjinKwon suggested?
This demonstrates good Spark style (at least to me), and perhaps we can just promote this as a solution in the docs somewhere and do away with these functions. For the passing reader, the Python equivalent of Hyukjin's suggestion is: `spark.read.json(df.rdd.map(lambda x: x[0])).schema` |
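To make the docstring's `struct<a:bigint>` example concrete without a running Spark session, here is a toy, pure-Python analogue of the value `schema_of_json` returns for a flat JSON object. The `toy_schema_of_json` name and the flat-object-only behaviour are my own simplifications; Spark's real inference also handles nesting, arrays, nulls, and type widening:

```python
import json

# Map Python value types to the Spark SQL type names that appear in the
# DDL-style struct strings shown in the docstring example above.
_TYPE = {bool: "boolean", int: "bigint", float: "double", str: "string"}

def toy_schema_of_json(sample: str) -> str:
    """Infer a DDL-style struct string from one flat JSON object."""
    obj = json.loads(sample)
    fields = ",".join(
        "%s:%s" % (k, _TYPE.get(type(v), "string")) for k, v in obj.items()
    )
    return "struct<%s>" % fields
```

For instance, `toy_schema_of_json('{"a": 1}')` yields `'struct<a:bigint>'`, matching the docstring output.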
|
@nchammas, thanks for the input here.
I wanted to make the expression only for the specific use case and avoid having multiple ways to do the same thing. Other cases can easily be worked around. For the case you mentioned, it can be worked around as below: `spark.read.json(df.select("json").as[String]).schema == StructType.fromDDL(...)`. I think it is fine to have an expression that takes a literal and returns a column to support a missing use case that has been requested multiple times. |
What changes were proposed in this pull request?
The main purpose of `schema_of_json` is its combination with `from_json` (to make up for the lack of schema inference), which takes its schema only as a literal; however, `schema_of_json` currently allows JSON input as non-literal expressions (e.g., a column). This was mistakenly allowed - we don't have to take usages other than the main purpose into account for now.
This PR is a follow-up to only allow literals for `schema_of_json`'s JSON input. We can allow non-literal expressions later when needed or when there is a use case for it.
How was this patch tested?
Unit tests were added.
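The intended pattern the PR settles on, `from_json(schema_of_json(literal))`, can be sketched with pure-Python stand-ins (the `infer_fields` and `parse` helpers are hypothetical, not the Spark API): infer the schema once from a literal sample, then parse every row of a JSON column against it, with non-matching rows becoming `None`, loosely mimicking `from_json` yielding null for malformed input.

```python
import json

def infer_fields(sample):
    # Stand-in for schema_of_json(literal): derive the expected field
    # set from one literal JSON sample.
    return set(json.loads(sample))

def parse(row, fields):
    # Stand-in for from_json: rows that fail to parse or do not match
    # the inferred fields come back as None (Spark would yield null).
    try:
        obj = json.loads(row)
    except ValueError:
        return None
    return obj if isinstance(obj, dict) and set(obj) == fields else None

fields = infer_fields('{"a": 1}')          # the literal sample
rows = ['{"a": 2}', '{"b": 3}', 'not json']
parsed = [parse(r, fields) for r in rows]  # [{'a': 2}, None, None]
```

The key design point from the discussion is visible here: the sample must be a literal known up front, since the inferred schema has to exist before any row is parsed.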