Skip to content

Conversation

@shujingyang-db
Copy link
Contributor

@shujingyang-db shujingyang-db commented Jan 3, 2024

What changes were proposed in this pull request?

This follow-up refactors the handling of value tags and endElement.

  1. As value tags only exist in structure data, their handling will be confined to the inferObject method, eliminating the need for processing in inferField. This implies that when we encounter non-whitespace characters, we can invoke inferObject. For structures with a single primitive field, we'll simplify them into primitive types during the schema inference.
  2. We wanted to make sure that the entire entry, including the starting tag, value, and ending tag are all consumed when we completed the parsing.

Why are the changes needed?

This follow-up simplifies the handling of value tags.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Jan 3, 2024
@shujingyang-db shujingyang-db changed the title [SPARK-46382][SQL] XML: Capture values interspersed between elements follow-up [SPARK-46382][SQL] XML: Refactor the handling of values interspersed between elements Jan 3, 2024
Comment on lines 277 to 278
shouldStop = parser.peek().isInstanceOf[EndDocument] || StaxXmlParserUtils
.getName(e.getName, options) == startElementName
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
shouldStop = parser.peek().isInstanceOf[EndDocument] || StaxXmlParserUtils
.getName(e.getName, options) == startElementName
shouldStop = if (firstEventIsStartElement) {
StaxXmlParserUtils.checkEndElement(parser)
} else {
true
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed offline and decided to simplify to shouldStop = true. With that, we will also make sure that the entire entry, including the starting tag, value, and ending tag are all consumed when we complete the parsing.

*/
private def inferObject(
parser: XMLEventReader,
startElementName: String,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't be necessary after the change below.

case _: EndElement if data.isEmpty => NullType
case _: EndElement if options.nullValue == "" => NullType
case _: EndElement => StringType
case _: EndElement if data.trim.isEmpty =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't data.trim.isEmpty will always be true because c.isWhiteSpace is true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

data.trim.isEmpty isn't always true. We do a parser.nextEvent(). c.isWhiteSpace is always true for the last event and we're checking the current event.

case _: EndElement if data.trim.isEmpty =>
StaxXmlParserUtils.consumeNextEndElement(parser)
NullType
case _: EndElement if options.nullValue == "" =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that c.isWhiteSpace is true, for EndElement, the previous case will be true and this will never be reached.

// if this is really data or space between other elements.
val data = c.getData
parser.nextEvent()
parser.peek match {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case (case c: Characters if c.isWhiteSpace) can be combined with the next one with a minor change suggested below

&& isPrimitiveType(structType.fields.head.dataType)
&& isValueTagField(structType.fields.head) =>
simpleType.fields.head.dataType
case _ => structType
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the case (case c: Characters if c.isWhiteSpace) and add the following case here:

Suggested change
case _ => structType
case simpleType if structType.isEmpty => NullType
case _ => structType

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Added case _ if structType.fields.isEmpty =>

(parser.peek, dataType) match {
case (_: StartElement, dt: DataType) => convertComplicatedType(dt, attributes)
case (_: EndElement, _: StringType) =>
StaxXmlParserUtils.consumeNextEndElement(parser)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
StaxXmlParserUtils.consumeNextEndElement(parser)
parser.next

}
case (_: EndElement, _: DataType) => null
case (_: EndElement, _: DataType) =>
StaxXmlParserUtils.consumeNextEndElement(parser)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
StaxXmlParserUtils.consumeNextEndElement(parser)
parser.next

}
}
val value = convertTo(c.getData, st)
StaxXmlParserUtils.consumeNextEndElement(parser)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check in this function will fail if the next event is Comment, Cdata, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got you! I guess if the next event is Comment, Cdata, etc., we wanted to skip those events as well and also the end element.

val value = convertTo(c.getData, dt)
parser.next
convertTo(c.getData, dt)
StaxXmlParserUtils.consumeNextEndElement(parser)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check in this function will fail if the next event is Comment, Cdata, etc.
<ROW><foo attr="a">1<!--This is a comment--></foo></ROW>

Comment on lines 398 to 399
StaxXmlParserUtils.skipChildren(parser)
StaxXmlParserUtils.consumeNextEndElement(parser)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not skip EndElement in skipChildren itself?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

skipChildren is called recursively. We just wanted to skip EndElement after skipping all its children

val valueTagType = inferFrom(c.getData)
addOrUpdateType(nameToDataType, options.valueTag, valueTagType)

case _: EndElement =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is EndDocument needed here as well?

| <array2>4</array2><!--A comment within tags-->
| </struct3>
| <string>string</string>
| value12
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GitHub doesn't allow to add review comment to any arbirary line. So commenting here.

Add a comment between value16 and </ROW> with a space before and after the comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added 👍

| <struct3>
| value5
| <array2>1</array2>
| <array2>1<!--A comment within tags--></array2>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding multiple comments with spaces between them:

Suggested change
| <array2>1<!--A comment within tags--></array2>
| <array2>1 <!--First comment--> <!--Second comment--> </array2>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added 👍

| <array2>3</array2>
| value11
| <array2>4</array2>
| <array2>4</array2><!--A comment within tags-->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding multiple comments with spaces between them:

Suggested change
| <array2>4</array2><!--A comment within tags-->
| <array2>4</array2> <!--First comment--> <!--Second comment-->

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added 👍

}
}

def consumeNextEndElement(parser: XMLEventReader): Unit = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For defensive purpose, consider passing the name of the StartElement and assert that it matches with the EndElement:

Suggested change
def consumeNextEndElement(parser: XMLEventReader): Unit = {
def skipNextEndElement(parser: XMLEventReader): Unit = {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an assertion on the endElementName

Comment on lines 184 to 187
parser.nextEvent() match {
case _: EndElement => // do nothing
case _ => throw new IllegalStateException("Invalid state")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra events (comments, etc.) before EndElement need to be skipped.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think comments are removed before it reach parser. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. We are filtering comments here. Do we need special handling for CDATA or EntityReferences?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline. The current implementation should work fine with CDATA and we will filter out EntityReferences

@shujingyang-db shujingyang-db changed the title [SPARK-46382][SQL] XML: Refactor the handling of values interspersed between elements [SPARK-46382][SQL] XML: Refactor inference and parsing Jan 5, 2024
@HyukjinKwon
Copy link
Member

#44601 had a conflict with this PR. would you mind resolving the conflicts please?

…w-up

# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala
@HyukjinKwon
Copy link
Member

Merged to master.

@HyukjinKwon HyukjinKwon changed the title [SPARK-46382][SQL] XML: Refactor inference and parsing [SPARK-46382][SQL][FOLLOW-UP] XML: Refactor inference and parsing Jan 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants