[SPARK-46382][SQL][FOLLOW-UP] XML: Refactor inference and parsing #44571

shujingyang-db · 2024-01-03T08:01:29Z

What changes were proposed in this pull request?

This follow-up refactors the handling of value tags and endElement.

Value tags exist only in structured data, so their handling is now confined to the inferObject method and no longer needs to happen in inferField. As a result, whenever we encounter non-whitespace characters we can invoke inferObject. During schema inference, a structure whose only field is a primitive value tag is simplified into that primitive type.
We also make sure that the entire entry, including the start tag, value, and end tag, is consumed by the time parsing completes.

Why are the changes needed?

This follow-up simplifies the handling of value tags.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests

Was this patch authored or co-authored using generative AI tooling?

No

sandip-db · 2024-01-03T21:57:01Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala

+          shouldStop = parser.peek().isInstanceOf[EndDocument] || StaxXmlParserUtils
+              .getName(e.getName, options) == startElementName


Suggested change

-          shouldStop = parser.peek().isInstanceOf[EndDocument] || StaxXmlParserUtils
-              .getName(e.getName, options) == startElementName
+          shouldStop = if (firstEventIsStartElement) {
+            StaxXmlParserUtils.checkEndElement(parser)
+          } else {
+            true
+          }

We discussed offline and decided to simplify this to shouldStop = true. With that, we also make sure that the entire entry, including the start tag, value, and end tag, is consumed when we complete the parsing.
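The invariant mentioned here, that a parsed entry leaves the reader positioned just past its end tag, can be illustrated with a minimal sketch on the plain JDK StAX API. The names ConsumeWholeEntry, reader, and readSimpleEntry are hypothetical and exist only for this illustration; this is not Spark's parsing code.

```scala
import java.io.StringReader
import javax.xml.stream.{XMLEventReader, XMLInputFactory}

// Hypothetical illustration; not Spark's StaxXmlParser.
object ConsumeWholeEntry {
  def reader(xml: String): XMLEventReader =
    XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml))

  // Reads <name>text</name> and returns the text. On return, the start tag,
  // the value, and the end tag have all been consumed.
  def readSimpleEntry(parser: XMLEventReader): String = {
    val start = parser.nextEvent()
    require(start.isStartElement(), s"expected a start tag but found: $start")
    val text = new StringBuilder
    while (parser.peek().isCharacters()) {
      text.append(parser.nextEvent().asCharacters().getData())
    }
    val end = parser.nextEvent()
    require(end.isEndElement(), s"expected an end tag but found: $end")
    text.toString
  }
}
```

For the input `<ROW><a>1</a><b>2</b></ROW>`, once the StartDocument and `<ROW>` events have been read, calling readSimpleEntry should return the text of `<a>` and leave the start tag of `<b>` as the next event, so the caller never has to clean up a dangling end tag.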

sandip-db · 2024-01-03T21:58:44Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala

   */
  private def inferObject(
      parser: XMLEventReader,
+      startElementName: String,


This won't be necessary after the change below.

sandip-db · 2024-01-04T23:48:33Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala

-          case _: EndElement if data.isEmpty => NullType
-          case _: EndElement if options.nullValue == "" => NullType
-          case _: EndElement => StringType
+          case _: EndElement if data.trim.isEmpty =>


Won't data.trim.isEmpty always be true here, since c.isWhiteSpace is true?

data.trim.isEmpty isn't always true. We call parser.nextEvent() first, so c.isWhiteSpace is only guaranteed to be true for the previous event, whereas here we are checking the current event.
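The difference between the event that was just consumed and the one that peek returns can be seen in a small sketch on the JDK StAX API. WhitespaceLookahead and classify are hypothetical names for this illustration, and the returned labels are made up; the sketch only shows the lookahead pattern, not the actual inference logic.

```scala
import java.io.StringReader
import javax.xml.stream.XMLInputFactory
import javax.xml.stream.events.{Characters, EndElement, StartElement}

// Hypothetical illustration of nextEvent() versus peek().
object WhitespaceLookahead {
  def classify(xml: String): String = {
    val parser = XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml))
    parser.nextEvent() // StartDocument
    parser.nextEvent() // start tag of the root element
    parser.nextEvent() match {
      case c: Characters if c.getData.trim.isEmpty =>
        // The whitespace event is now consumed; peek() shows the event after it.
        parser.peek() match {
          case _: EndElement => "whitespace-only value"
          case _: StartElement => "whitespace between elements"
          case other => s"whitespace followed by: $other"
        }
      case other => s"not whitespace: $other"
    }
  }
}
```

With this sketch, `<a> </a>` is classified as a whitespace-only value, while `<a> <b>1</b></a>` is classified as whitespace between elements, because the decision depends on the event that follows the whitespace rather than on the whitespace itself.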

sandip-db · 2024-01-04T23:51:19Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala

+          case _: EndElement if data.trim.isEmpty =>
+            StaxXmlParserUtils.consumeNextEndElement(parser)
+            NullType
+          case _: EndElement if options.nullValue == "" =>


Given that c.isWhiteSpace is true, for EndElement, the previous case will be true and this will never be reached.

sandip-db · 2024-01-04T23:54:57Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala

        // if this is really data or space between other elements.
        val data = c.getData
        parser.nextEvent()
        parser.peek match {


This case (case c: Characters if c.isWhiteSpace) can be combined with the next one with a minor change suggested below

sandip-db · 2024-01-04T23:57:40Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala

+              && isPrimitiveType(structType.fields.head.dataType)
+              && isValueTagField(structType.fields.head) =>
+            simpleType.fields.head.dataType
+          case _ => structType


Remove the case (case c: Characters if c.isWhiteSpace) and add the following case here:

Suggested change

-          case _ => structType
+          case simpleType if structType.isEmpty => NullType
+          case _ => structType

👍 Added case _ if structType.fields.isEmpty =>
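The two rules discussed in this thread (collapse a struct whose only field is a primitive value tag, and treat an empty struct as null) can be sketched on a tiny stand-in type model. Everything here is hypothetical: InferredType, Field, StructT, and simplify are illustration-only names, Spark's own StructType and StructField are not used, and "_VALUE" is assumed as the value tag name.

```scala
// Hypothetical miniature of an inferred-type model; not Spark's DataType hierarchy.
sealed trait InferredType
case object NullT extends InferredType
case object LongT extends InferredType
case object StringT extends InferredType
final case class Field(name: String, dataType: InferredType)
final case class StructT(fields: List[Field]) extends InferredType

object ValueTagSimplification {
  // Assumed value tag name for this sketch.
  val valueTag = "_VALUE"

  private def isPrimitive(t: InferredType): Boolean = t match {
    case _: StructT => false
    case _ => true
  }

  def simplify(struct: StructT): InferredType = struct.fields match {
    // No fields at all: nothing was observed, so report a null type.
    case Nil => NullT
    // A single primitive value tag: collapse the struct into that primitive type.
    case Field(`valueTag`, dt) :: Nil if isPrimitive(dt) => dt
    // Anything else stays a struct.
    case _ => struct
  }
}
```

Under these assumptions, a struct holding only a `_VALUE` field of type LongT simplifies to LongT, whereas a struct that also has another field is returned unchanged.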

sandip-db · 2024-01-05T01:50:55Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala

    (parser.peek, dataType) match {
      case (_: StartElement, dt: DataType) => convertComplicatedType(dt, attributes)
      case (_: EndElement, _: StringType) =>
+        StaxXmlParserUtils.consumeNextEndElement(parser)


Suggested change

-        StaxXmlParserUtils.consumeNextEndElement(parser)
+        parser.next

sandip-db · 2024-01-05T01:51:05Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala

        }
-      case (_: EndElement, _: DataType) => null
+      case (_: EndElement, _: DataType) =>
+        StaxXmlParserUtils.consumeNextEndElement(parser)


Suggested change

-        StaxXmlParserUtils.consumeNextEndElement(parser)
+        parser.next

sandip-db · 2024-01-05T02:02:37Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala

-            }
-        }
+        val value = convertTo(c.getData, st)
+        StaxXmlParserUtils.consumeNextEndElement(parser)


The check in this function will fail if the next event is Comment, Cdata, etc.

Got it! If the next event is a Comment, CDATA, etc., we want to skip those events as well, and then the end element.
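One way to tolerate extra events before the end tag is to keep reading until the EndElement appears. The sketch below uses only the JDK StAX API and handles Comment events only; LenientEndElement is a hypothetical name, and this consumeNextEndElement is an illustrative variant rather than the helper added in this PR.

```scala
import java.io.StringReader
import javax.xml.stream.{XMLEventReader, XMLInputFactory}
import javax.xml.stream.events.{Comment, EndElement}
import scala.annotation.tailrec

// Hypothetical variant; not the helper in StaxXmlParserUtils.
object LenientEndElement {
  def reader(xml: String): XMLEventReader =
    XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml))

  // Consumes events up to and including the next end tag, skipping comments.
  @tailrec
  def consumeNextEndElement(parser: XMLEventReader): Unit =
    parser.nextEvent() match {
      case _: EndElement => ()
      case _: Comment => consumeNextEndElement(parser)
      case other =>
        throw new IllegalStateException(s"Unexpected event before end tag: $other")
    }
}
```

Any event other than a comment or an end tag still raises an error in this sketch, which keeps the strictness of the original check while allowing inputs such as `<a>1<!--note--></a>`.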

sandip-db · 2024-01-05T02:03:01Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala

+        val value = convertTo(c.getData, dt)
        parser.next
-        convertTo(c.getData, dt)
+        StaxXmlParserUtils.consumeNextEndElement(parser)


The check in this function will fail if the next event is Comment, Cdata, etc.
<ROW><foo attr="a">1</foo></ROW>

sandip-db · 2024-01-05T02:27:14Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala

                StaxXmlParserUtils.skipChildren(parser)
+                StaxXmlParserUtils.consumeNextEndElement(parser)


Why not skip EndElement in skipChildren itself?

skipChildren is called recursively, so we only want to skip the EndElement once all of its children have been skipped.
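The split of responsibilities described here can be sketched as follows: a recursive routine skips everything inside the current element and stops in front of its end tag, and the caller consumes that end tag afterwards. SkipChildrenSketch is a hypothetical name and this skipChildren is a simplified stand-in written against the JDK StAX API, not Spark's implementation.

```scala
import java.io.StringReader
import javax.xml.stream.{XMLEventReader, XMLInputFactory}
import javax.xml.stream.events.{EndDocument, EndElement, StartElement}

// Simplified stand-in; not Spark's StaxXmlParserUtils.skipChildren.
object SkipChildrenSketch {
  def reader(xml: String): XMLEventReader =
    XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml))

  // Skips the content of the current element and stops right before its end tag.
  def skipChildren(parser: XMLEventReader): Unit = {
    var done = false
    while (!done) {
      parser.peek() match {
        case _: StartElement =>
          parser.nextEvent()   // nested start tag
          skipChildren(parser) // everything inside the nested element
          parser.nextEvent()   // nested end tag
        case _: EndElement | _: EndDocument =>
          done = true          // the enclosing end tag is left for the caller
        case _ =>
          parser.nextEvent()   // text and other non-element events
      }
    }
  }
}
```

Because each recursion level leaves its own end tag untouched, the top-level caller follows skipChildren with one explicit read of the end tag, which mirrors the skipChildren and consumeNextEndElement pair in the diff above.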


sandip-db · 2024-01-05T17:06:24Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala

          val valueTagType = inferFrom(c.getData)
          addOrUpdateType(nameToDataType, options.valueTag, valueTagType)

        case _: EndElement =>


Is EndDocument needed here as well?

sandip-db · 2024-01-05T17:10:20Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlSuite.scala

+         |                    <array2>4</array2><!--A comment within tags-->
         |                </struct3>
         |                <string>string</string>
         |                value12


GitHub doesn't allow adding a review comment on an arbitrary line, so I'm commenting here.

Add a comment between value16 and </ROW> with a space before and after the comment.

sandip-db · 2024-01-05T17:11:56Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlSuite.scala

         |                <struct3>
         |                    value5
-         |                    <array2>1</array2>
+         |                    <array2>1<!--A comment within tags--></array2>


Adding multiple comments with spaces between them:

Suggested change

-         |                    <array2>1<!--A comment within tags--></array2>
+         |                    <array2>1 <!--First comment--> <!--Second comment--> </array2>

sandip-db · 2024-01-05T17:12:43Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlSuite.scala

         |                    <array2>3</array2>
         |                    value11
-         |                    <array2>4</array2>
+         |                    <array2>4</array2><!--A comment within tags-->


Adding multiple comments with spaces between them:

Suggested change

-         |                    <array2>4</array2><!--A comment within tags-->
+         |                    <array2>4</array2> <!--First comment--> <!--Second comment-->

sandip-db · 2024-01-05T17:16:13Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParserUtils.scala

    }
  }
+
+  def consumeNextEndElement(parser: XMLEventReader): Unit = {


As a defensive measure, consider passing the name of the StartElement and asserting that it matches the EndElement:

Suggested change

-  def consumeNextEndElement(parser: XMLEventReader): Unit = {
+  def skipNextEndElement(parser: XMLEventReader): Unit = {

Added an assertion on the endElementName
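A minimal sketch of such a name check, written against the JDK StAX API, is shown here. StrictEndElement and the expectedName parameter are hypothetical; the signature and error message of the helper actually added in this PR may differ.

```scala
import java.io.StringReader
import javax.xml.stream.{XMLEventReader, XMLInputFactory}
import javax.xml.stream.events.EndElement

// Hypothetical illustration of asserting on the end tag name.
object StrictEndElement {
  def reader(xml: String): XMLEventReader =
    XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml))

  // Consumes the next event and fails unless it is the end tag of expectedName.
  def consumeNextEndElement(parser: XMLEventReader, expectedName: String): Unit =
    parser.nextEvent() match {
      case e: EndElement if e.getName.getLocalPart == expectedName => ()
      case other =>
        throw new IllegalStateException(s"Expected </${expectedName}> but found: $other")
    }
}
```

Checking the name turns a silent mismatch between start and end tags into an immediate failure, which makes consumption bugs easier to locate.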

sandip-db · 2024-01-05T17:18:12Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParserUtils.scala

+    parser.nextEvent() match {
+      case _: EndElement => // do nothing
+      case _ => throw new IllegalStateException("Invalid state")
+    }


Extra events (comments, etc.) before EndElement need to be skipped.

I think comments are removed before they reach the parser. WDYT?

You are right. We are filtering comments here. Do we need special handling for CDATA or EntityReferences?

Discussed offline. The current implementation should work fine with CDATA, and we will filter out EntityReferences.
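Filtering events before they reach the parsing code can be done with the JDK's XMLInputFactory.createFilteredReader and an EventFilter. The sketch below drops comments and entity references; FilteredReaderSketch and dropNoise are hypothetical names, and the exact set of event types that Spark filters may differ from this one.

```scala
import java.io.StringReader
import javax.xml.stream.{EventFilter, XMLEventReader, XMLInputFactory, XMLStreamConstants}
import javax.xml.stream.events.XMLEvent

// Hypothetical illustration of an upstream event filter.
object FilteredReaderSketch {
  // Rejects comments and entity references; every other event passes through.
  private val dropNoise: EventFilter = new EventFilter {
    override def accept(event: XMLEvent): Boolean = event.getEventType match {
      case XMLStreamConstants.COMMENT | XMLStreamConstants.ENTITY_REFERENCE => false
      case _ => true
    }
  }

  def filteredReader(xml: String): XMLEventReader = {
    val factory = XMLInputFactory.newInstance()
    factory.createFilteredReader(factory.createXMLEventReader(new StringReader(xml)), dropNoise)
  }
}
```

With the filter in place, downstream code that expects an end tag right after a value does not need to special-case comments, since for `<a>1<!--note--></a>` the text event is followed directly by the end tag.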


HyukjinKwon · 2024-01-07T00:59:10Z

#44601 had a conflict with this PR. Would you mind resolving the conflicts, please?


HyukjinKwon · 2024-01-07T23:53:27Z

Merged to master.

init

f8961e8

github-actions bot added the SQL label Jan 3, 2024

shujingyang-db changed the title ~~[SPARK-46382][SQL] XML: Capture values interspersed between elements follow-up~~ [SPARK-46382][SQL] XML: Refactor the handling of values interspersed between elements Jan 3, 2024

shujingyang-db added 3 commits January 3, 2024 12:45

fix for rdc

fc0edc9

revert AstBuilder

d98a51b

nit

e6654d5

sandip-db reviewed Jan 3, 2024


shujingyang-db added 3 commits January 3, 2024 16:54

ckpt

d368ba0

ckpt

4eb292d

rm startElementName

4506de2

shujingyang-db requested a review from sandip-db January 4, 2024 05:37

shujingyang-db added 7 commits January 3, 2024 23:35

revert and style

596cc90

revert

0c0f47c

addOrUpdate

a2cd1ef

revert

98effee

nullType for newline character

f9e445d

currentStructureAsString

49f8b4e

rm valueTag field in a test case

3cdfe1f

sandip-db reviewed Jan 5, 2024


shujingyang-db added 2 commits January 4, 2024 17:05

simplify whitespace

e1585cf

whitespace

66f2449

shujingyang-db requested a review from sandip-db January 5, 2024 01:49

sandip-db suggested changes Jan 5, 2024


shujingyang-db added 3 commits January 4, 2024 23:18

add comments in testcase

d6cd704

badRecord

d4e82fc

badrecords

3ea45ae

sandip-db reviewed Jan 5, 2024


shujingyang-db added 3 commits January 5, 2024 10:00

badRecord:

1e61fff

comments

fe4cf28

cdata and unsupported events

9c8adb6

shujingyang-db added 3 commits January 5, 2024 11:20

comments

e87ab03

assert on endElement name

8f18b64

endDocument

685fd8b

shujingyang-db requested a review from sandip-db January 5, 2024 20:03

sandip-db approved these changes Jan 5, 2024


shujingyang-db and others added 6 commits January 5, 2024 13:40

valuetag case insensitive

4f56bf5

Merge remote-tracking branch 'origin/cpature-values-follow-up' into c…

271e99d

…pature-values-follow-up

nit

2d10bd0

Update sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/…

fb5d6a0

…XmlInferSchema.scala Co-authored-by: Sandip Agarwala <131817656+sandip-db@users.noreply.github.com>

fix partial results test case

c70236c

Merge branch 'cpature-values-follow-up' of github.com:shujingyang-db/…

0e26f71

…spark into cpature-values-follow-up

shujingyang-db changed the title ~~[SPARK-46382][SQL] XML: Refactor the handling of values interspersed between elements~~ [SPARK-46382][SQL] XML: Refactor inference and parsing Jan 5, 2024

fix StaxXmlParserUtilsSuite

d0dc7a0

HyukjinKwon approved these changes Jan 7, 2024


shujingyang-db added 3 commits January 6, 2024 19:43

Merge remote-tracking branch 'spark/master' into cpature-values-follo…

ec8d4f0

…w-up # Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala

rm unused functions

ec9ab95

rm unused functions

99f5204

shujingyang-db requested a review from HyukjinKwon January 7, 2024 06:20

HyukjinKwon closed this in 71e0fdb Jan 7, 2024

HyukjinKwon changed the title ~~[SPARK-46382][SQL] XML: Refactor inference and parsing~~ [SPARK-46382][SQL][FOLLOW-UP] XML: Refactor inference and parsing Jan 7, 2024
