@@ -19,6 +19,7 @@ package org.apache.spark.sql.catalyst.expressions

 import java.io.{ByteArrayInputStream, ByteArrayOutputStream, CharArrayWriter, InputStreamReader, StringWriter}

+import scala.util.control.NonFatal
 import scala.util.parsing.combinator.RegexParsers

 import com.fasterxml.jackson.core._
@@ -615,7 +616,7 @@ case class JsonToStructs(
     }
   }

-  override def inputTypes: Seq[AbstractDataType] = StringType :: Nil
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType)

   override def sql: String = schema match {
     case _: MapType => "entries"
@@ -747,8 +748,13 @@ case class StructsToJson(

 object JsonExprUtils {

-  def validateSchemaLiteral(exp: Expression): StructType = exp match {
-    case Literal(s, StringType) => CatalystSqlParser.parseTableSchema(s.toString)
+  def validateSchemaLiteral(exp: Expression): DataType = exp match {
+    case Literal(s, StringType) =>
+      try {
+        DataType.fromJson(s.toString)
[Review thread on DataType.fromJson(s.toString)]

@HyukjinKwon (Member) on Jun 1, 2018:

I don't think we should support the JSON format here. A DDL-formatted schema is preferred. JSON in functions.scala is supported for backward compatibility because the SQL functions weren't added first; after that, we added SQL functions with DDL-formatted schema support.

Member Author:

I believe we should support the JSON format because:

  • The functionality of SQL and the Scala (and other languages') DSL should be equal; otherwise we push users toward the Scala DSL because SQL has fewer features.
  • The feature allows saving/restoring a schema in JSON format. The customer's use case is to have data in JSON format plus meta info, including the schema, in JSON format too. A schema in JSON format gives them more opportunities for processing it programmatically.
  • For now the JSON format gives us more flexibility and allows MapType (and ArrayType) as the root type for the result of from_json; see the sketch below.
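
As a minimal sketch of that last point (using only the public org.apache.spark.sql.types API), DataType.fromJson accepts a non-struct root type that the DDL field-list form cannot express:

import org.apache.spark.sql.types._

// JSON schema format: any DataType can be the root, including MapType.
val mapSchema: DataType = DataType.fromJson(
  """{"type":"map","keyType":"string","valueType":"integer","valueContainsNull":false}""")
// mapSchema == MapType(StringType, IntegerType, valueContainsNull = false)

// DDL field-list format: the root is always a StructType.
val structSchema: StructType = StructType.fromDDL("a INT")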

Member:

Usually they should be consistent, but we don't necessarily have to add obsolete functionality in new places just for consistency. I'm not sure how common it is to write a JSON literal as a schema via SQL. How do they get the metadata, and how do they insert it into SQL? Is that the only way to do it?

Member Author:

> How do they get the metadata ...

Metadata is stored together with the data in a distributed file system and loaded by the standard facilities of the language.

> and how do they insert it into SQL?

SQL statements are formed programmatically as strings, and the loaded schemas are inserted at particular positions in the string (you can think of it as quasiquotes in Scala). The formed SQL statements are sent via JDBC to Spark.

> Is that the only way to do it?

Probably it is possible to convert schemas from JSON format to DDL format, but a schema in DDL supports only StructType as the root type. It is not possible to specify MapType like in the test.
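
A hypothetical illustration of that flow (the payload column, the events table, and the schema-loading step are assumptions for the example, not part of this PR):

// The schema was saved next to the data as a JSON string and is loaded back as-is.
val schemaJson: String =
  """{"type":"struct","fields":[{"name":"a","type":"integer","nullable":true}]}"""

// The schema is spliced into the SQL statement, quasiquote-style, and the
// statement is then sent to Spark, e.g. over JDBC.
val stmt = s"""SELECT from_json(payload, '$schemaJson') FROM events"""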

Member:

Shall we add support for the type string itself, via CatalystSqlParser.parseDataType, too? Also, are you able to use catalogString?

Member Author:

> Shall we add support for the type string itself, via CatalystSqlParser.parseDataType, too?

I will, but it won't fully solve the customer's problem.

> Also, are you able to use catalogString?

I just checked that:

val schema = MapType(StringType, IntegerType).catalogString
val ds = spark.sql(
  s"""
     |select from_json('{"a":1}', '$schema')
   """.stripMargin)
ds.show()

and got this:

extraneous input '<' expecting {'SELECT', 'FROM', ...}(line 1, pos 3)

== SQL ==
map<string,int>
---^^^
; line 2 pos 7

The same happens with val schema = new StructType().add("a", IntegerType).catalogString:

== SQL ==
struct<a:int>
------^^^
; line 2 pos 7
org.apache.spark.sql.AnalysisException

Am I doing something wrong?
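
A likely reason, judging from the pre-PR code in the diff above: the schema literal was parsed with CatalystSqlParser.parseTableSchema, which accepts a field list but not a bare type string such as catalogString produces:

import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

// parseTableSchema expects "name type" pairs, as in a CREATE TABLE column list.
CatalystSqlParser.parseTableSchema("a INT")            // ok: struct<a:int>

// A bare type string is rejected: "map" is read as a column name and the
// following '<' is reported as extraneous input, matching the error above.
CatalystSqlParser.parseTableSchema("map<string,int>")  // throws ParseException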

@HyukjinKwon (Member) on Jun 1, 2018:

I mean adding type-string support via CatalystSqlParser.parseDataType (like array<...> or map<...>) to from_json, so that it can support struct<a:int>, if I am not mistaken.

Member:

I mean, like what we do on the Python side:

try:
    # DDL format, "fieldname datatype, fieldname datatype".
    return from_ddl_schema(s)
except Exception as e:
    try:
        # For backwards compatibility, "integer", "struct<fieldname: datatype>" and etc.
        return from_ddl_datatype(s)
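
A sketch of the analogous fallback chain on the Scala side (a hypothetical helper, not code from this PR):

import scala.util.control.NonFatal

import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.types.DataType

// Mirror the Python logic: try the DDL field-list form first, then fall back
// to a bare type string such as "struct<a:int>" or "map<string,int>".
def parseSchemaString(s: String): DataType =
  try {
    CatalystSqlParser.parseTableSchema(s)
  } catch {
    case NonFatal(_) => CatalystSqlParser.parseDataType(s)
  }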

Member:

If possible, I like @HyukjinKwon's approach. If I remember correctly, we keep the JSON schema format only for backward compatibility; in a future major release, I think we could possibly drop that support.

Member Author:

@maropu @HyukjinKwon Please have a look at the PR: #21550

+      } catch {
+        case NonFatal(_) => StructType.fromDDL(s.toString)
+      }
     case e => throw new AnalysisException(s"Expected a string literal instead of $e")
   }

@@ -25,6 +25,10 @@ select from_json('{"a":1}', 'a InvalidType');
select from_json('{"a":1}', 'a INT', named_struct('mode', 'PERMISSIVE'));
select from_json('{"a":1}', 'a INT', map('mode', 1));
select from_json();
-- from_json - schema in json format
select from_json('{"a":1}', '{"type":"struct","fields":[{"name":"a","type":"integer", "nullable":true}]}');
select from_json('{"a":1}', '{"type":"map", "keyType":"string", "valueType":"integer","valueContainsNull":false}');

Member:

To make the output file changes smaller, can you add the new tests at the end of the file?

 -- json_tuple
 SELECT json_tuple('{"a" : 1, "b" : 2}', CAST(NULL AS STRING), 'b', CAST(NULL AS STRING), 'a');
 CREATE TEMPORARY VIEW jsonTable(jsonField, a) AS SELECT * FROM VALUES ('{"a": 1, "b": 2}', 'a');
@@ -1,5 +1,5 @@
 -- Automatically generated by SQLQueryTestSuite
--- Number of queries: 26
+-- Number of queries: 28


 -- !query 0
@@ -229,32 +229,48 @@ Invalid number of arguments for function from_json. Expected: one of 2 and 3; Fo


 -- !query 22
-SELECT json_tuple('{"a" : 1, "b" : 2}', CAST(NULL AS STRING), 'b', CAST(NULL AS STRING), 'a')
+select from_json('{"a":1}', '{"type":"struct","fields":[{"name":"a","type":"integer", "nullable":true}]}')
 -- !query 22 schema
-struct<c0:string,c1:string,c2:string,c3:string>
+struct<jsontostructs({"a":1}):struct<a:int>>
 -- !query 22 output
-NULL 2 NULL 1
+{"a":1}


 -- !query 23
-CREATE TEMPORARY VIEW jsonTable(jsonField, a) AS SELECT * FROM VALUES ('{"a": 1, "b": 2}', 'a')
+select from_json('{"a":1}', '{"type":"map", "keyType":"string", "valueType":"integer","valueContainsNull":false}')
 -- !query 23 schema
-struct<>
+struct<entries:map<string,int>>
 -- !query 23 output
-
+{"a":1}


 -- !query 24
-SELECT json_tuple(jsonField, 'b', CAST(NULL AS STRING), a) FROM jsonTable
+SELECT json_tuple('{"a" : 1, "b" : 2}', CAST(NULL AS STRING), 'b', CAST(NULL AS STRING), 'a')
 -- !query 24 schema
-struct<c0:string,c1:string,c2:string>
+struct<c0:string,c1:string,c2:string,c3:string>
 -- !query 24 output
-2 NULL 1
+NULL 2 NULL 1


 -- !query 25
-DROP VIEW IF EXISTS jsonTable
+CREATE TEMPORARY VIEW jsonTable(jsonField, a) AS SELECT * FROM VALUES ('{"a": 1, "b": 2}', 'a')
 -- !query 25 schema
 struct<>
 -- !query 25 output

+
+-- !query 26
+SELECT json_tuple(jsonField, 'b', CAST(NULL AS STRING), a) FROM jsonTable
+-- !query 26 schema
+struct<c0:string,c1:string,c2:string>
+-- !query 26 output
+2 NULL 1
+
+
+-- !query 27
+DROP VIEW IF EXISTS jsonTable
+-- !query 27 schema
+struct<>
+-- !query 27 output