
@HyukjinKwon (Member) commented Jul 10, 2017

What changes were proposed in this pull request?

This PR deals with four points as below:

  • Reuse existing DDL parser APIs rather than reimplementing within PySpark

  • Support DDL formatted strings (e.g., a INT, b STRING), "field_name: field_type" strings (e.g., a: int, b: string), and bare data type strings (e.g., int).

  • Support case-insensitivity for parsing (see the example after the After output below).

  • Support nested data types as below:

    Before

    >>> spark.createDataFrame([[[1]]], "struct<a: struct<b: int>>").show()
    ...
    ValueError: The strcut field string format is: 'field_name:field_type', but got: a: struct<b: int>
    
    >>> spark.createDataFrame([[[1]]], "a: struct<b: int>").show()
    ...
    ValueError: The strcut field string format is: 'field_name:field_type', but got: a: struct<b: int>
    
    >>> spark.createDataFrame([[1]], "a int").show()
    ...
    ValueError: Could not parse datatype: a int
    

    After

    >>> spark.createDataFrame([[[1]]], "struct<a: struct<b: int>>").show()
    +---+
    |  a|
    +---+
    |[1]|
    +---+
    
    >>> spark.createDataFrame([[[1]]], "a: struct<b: int>").show()
    +---+
    |  a|
    +---+
    |[1]|
    +---+
    
    >>> spark.createDataFrame([[1]], "a int").show()
    +---+
    |  a|
    +---+
    |  1|
    +---+
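
    Case-insensitive parsing (the third bullet above) goes through the same path; a hedged example, assuming the output matches the "a int" run above:

    >>> spark.createDataFrame([[1]], "a INT").show()
    +---+
    |  a|
    +---+
    |  1|
    +---+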
    

How was this patch tested?

    import org.apache.spark.sql.types.DataType

    private[sql] object PythonSQLUtils {
      def parseDataType(typeText: String): DataType = CatalystSqlParser.parseDataType(typeText)
Member Author (HyukjinKwon):
Without this, I would have to do something like ...

    getattr(getattr(sc._jvm.org.apache.spark.sql.catalyst.parser, "CatalystSqlParser$"), "MODULE$").parseDataType("a")
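
With the PythonSQLUtils helper above in place, the same call becomes a plain static method lookup. A minimal sketch, assuming an active SparkContext sc and that the object lives under org.apache.spark.sql.api.python as in the merged commit:

    # Minimal sketch; assumes an active SparkContext `sc` and the package path
    # org.apache.spark.sql.api.python from the merged commit. Returns a JVM
    # DataType whose JSON form the Python side can round-trip.
    jdt = sc._jvm.org.apache.spark.sql.api.python.PythonSQLUtils.parseDataType("struct<a: int>")
    print(jdt.json())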

    def test_parse_datatype_string(self):
        from pyspark.sql.types import _all_atomic_types, _parse_datatype_string
        for k, t in _all_atomic_types.items():
            if t != NullType:
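
The quoted diff cuts off inside the loop; a plausible self-contained completion, hedged because the assertion body is an assumption modeled on the surrounding diff rather than the merged code, checks that every atomic keyword except null round-trips through the shared parser:

    import unittest

    from pyspark.sql.types import NullType, _all_atomic_types, _parse_datatype_string


    class ParseDataTypeStringTest(unittest.TestCase):
        # Hedged sketch; requires an active SparkContext because
        # _parse_datatype_string delegates to the Scala-side parser.
        def test_parse_datatype_string(self):
            for k, t in _all_atomic_types.items():
                if t != NullType:
                    # Each atomic keyword ("integer", "string", ...) should
                    # round-trip back to an instance of its Python type.
                    self.assertEqual(t(), _parse_datatype_string(k))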
Member Author (HyukjinKwon):

So, if I haven't missed anything, this PR drops support for parsing the null type. I guess it is quite rare that we explicitly set a type to null. Also, IIRC, we will soon support NullType via void (SPARK-20680) as a workaround.

@HyukjinKwon commented Jul 10, 2017

cc @cloud-fan, @felixcheung and @zero323, whom I remember talking with about similar issues before.

@felixcheung

Add
@gatorsmile
@holdenk

@SparkQA commented Jul 10, 2017

Test build #79470 has finished for PR 18590 at commit 3472873.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon

retest this please

@SparkQA commented Jul 11, 2017

Test build #79481 has finished for PR 18590 at commit 3472873.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 11, 2017

Test build #79492 has finished for PR 18590 at commit 9d857e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

        return from_ddl_datatype(s)
    except:
        try:
            # For backwards compatibility, "fieldname: datatype, fieldname: datatype" case.
Contributor:

won't "fieldname: datatype, fieldname: datatype" be parsed as a DDL schema?

Member Author (HyukjinKwon):

I tested a few cases, but it looks like it is not:

    scala> StructType.fromDDL("a struct<a: INT, b: STRING>")
    res5: org.apache.spark.sql.types.StructType = StructType(StructField(a,StructType(StructField(a,IntegerType,true), StructField(b,StringType,true)),true))

    scala> StructType.fromDDL("a INT, b STRING")
    res6: org.apache.spark.sql.types.StructType = StructType(StructField(a,IntegerType,true), StructField(b,StringType,true))

    scala> StructType.fromDDL("a: INT, b: STRING")
    org.apache.spark.sql.catalyst.parser.ParseException:
    extraneous input ':' expecting ...
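
For reference, the fallback order under discussion, as a simplified sketch of _parse_datatype_string based on this PR's diff (the JVM package path is an assumption from the merged commit): the DDL schema form is tried first, then a single data type, and only then the backwards-compatible "fieldname: datatype" form wrapped in struct<...>, which is why "a: INT, b: STRING" never reaches the DDL-schema branch.

    from pyspark import SparkContext
    from pyspark.sql.types import _parse_datatype_json_string


    def _parse_datatype_string(s):
        # Simplified sketch of the fallback chain in this PR's diff.
        sc = SparkContext._active_spark_context

        def from_ddl_schema(type_str):
            # DDL schema: "fieldname datatype, fieldname datatype".
            return _parse_datatype_json_string(
                sc._jvm.org.apache.spark.sql.types.StructType.fromDDL(type_str).json())

        def from_ddl_datatype(type_str):
            # Single data type: "integer", "struct<fieldname: datatype>", etc.
            return _parse_datatype_json_string(
                sc._jvm.org.apache.spark.sql.api.python.PythonSQLUtils.parseDataType(type_str).json())

        try:
            return from_ddl_schema(s)
        except Exception as e:
            try:
                return from_ddl_datatype(s)
            except:
                try:
                    # Backwards compatibility: "fieldname: datatype, ..." only
                    # parses after being wrapped in struct<...>.
                    return from_ddl_datatype("struct<%s>" % s.strip())
                except:
                    raise e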

@holdenk (Contributor) left a comment

Thanks for working on this; unifying the parsing logic on the Scala side seems like a good idea.

            else func.__class__.__name__)

    @property
    def returnType(self):
Contributor (holdenk):

We have pretty similar logic below; would it make sense to think about whether there is a nicer, more general way to handle these delayed-initialization classes?

Member Author (HyukjinKwon):

Hmm.. I tried several ways, the best I could think of, but I could not figure one out ...
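
One generic pattern that might fit what was asked for, a hypothetical sketch rather than anything this PR adopts, is a non-data descriptor that computes the value once and caches it on the instance:

    class lazy_property(object):
        # Hypothetical helper, not part of this PR: call the wrapped method
        # once, then cache the result in the instance __dict__ so later
        # lookups bypass the descriptor entirely.
        def __init__(self, func):
            self._func = func
            self.__doc__ = func.__doc__

        def __get__(self, obj, objtype=None):
            if obj is None:
                return self
            value = self._func(obj)
            obj.__dict__[self._func.__name__] = value
            return value

Properties like returnType above could then be declared with @lazy_property; whether that is actually nicer for these classes is the open question in this exchange.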

@cloud-fan

LGTM, merging to master!

@asfgit closed this in ebc124d Jul 11, 2017
@HyukjinKwon deleted the deduplicate-python-ddl branch January 2, 2018 03:41
