[SPARK-46382][SQL][FOLLOW-UP] XML: Refactor inference and parsing #44571
| shouldStop = parser.peek().isInstanceOf[EndDocument] || StaxXmlParserUtils | ||
| .getName(e.getName, options) == startElementName |
| shouldStop = parser.peek().isInstanceOf[EndDocument] || StaxXmlParserUtils | |
| .getName(e.getName, options) == startElementName | |
| shouldStop = if (firstEventIsStartElement) { | |
| StaxXmlParserUtils.checkEndElement(parser) | |
| } else { | |
| true | |
| } |
We discussed offline and decided to simplify this to shouldStop = true. With that, we will also make sure that the entire entry, including the starting tag, value, and ending tag, is consumed by the time parsing completes.
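To illustrate what "consuming the entire entry" means at the StAX level, here is a minimal sketch. It is not the code in this PR: `ConsumeWholeEntry`, `readSimpleField`, and `open` are hypothetical names, and the helper only handles a flat `<name>text</name>` field. It uses only the standard `javax.xml.stream` API.

```scala
import java.io.StringReader
import javax.xml.stream.{XMLEventReader, XMLInputFactory}

object ConsumeWholeEntry {
  // Hypothetical helper (not Spark's API): reads <name>text</name> and returns
  // with the reader positioned just past the closing tag.
  def readSimpleField(parser: XMLEventReader): (String, String) = {
    // 1. Consume the starting tag.
    val name = parser.nextEvent().asStartElement().getName.getLocalPart
    // 2. Consume the value (text may arrive as several Characters events).
    val text = new StringBuilder
    while (parser.peek().isCharacters()) {
      text.append(parser.nextEvent().asCharacters().getData)
    }
    // 3. Consume the ending tag and check that it matches the starting tag.
    val end = parser.nextEvent()
    if (!end.isEndElement() || end.asEndElement().getName.getLocalPart != name) {
      throw new IllegalStateException(s"Expected the end tag of '$name'")
    }
    (name, text.toString)
  }

  // Opens a reader and steps past StartDocument and the root start tag.
  def open(xml: String): XMLEventReader = {
    val parser = XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml))
    parser.nextEvent() // StartDocument
    parser.nextEvent() // root start tag, e.g. <ROW>
    parser
  }
}
```

Because each call leaves the reader just after the field's end tag, the caller can read the next sibling immediately without any extra bookkeeping about where the previous field stopped.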
| */ | ||
| private def inferObject( | ||
| parser: XMLEventReader, | ||
| startElementName: String, |
This won't be necessary after the change below.
| case _: EndElement if data.isEmpty => NullType | ||
| case _: EndElement if options.nullValue == "" => NullType | ||
| case _: EndElement => StringType | ||
| case _: EndElement if data.trim.isEmpty => |
Won't data.trim.isEmpty always be true here, since c.isWhiteSpace is true?
data.trim.isEmpty isn't always true. We call parser.nextEvent() first, so c.isWhiteSpace is always true for the last (already consumed) event, while here we're checking the current event.
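For readers less familiar with StAX, the difference between the event returned by `nextEvent()` and the one returned by `peek()` can be sketched as follows. This is an illustrative example only, not the inference code under review; `PeekVsNext` and `describe` are hypothetical names.

```scala
import java.io.StringReader
import javax.xml.stream.XMLInputFactory
import javax.xml.stream.events.{Characters, EndElement, StartElement}

object PeekVsNext {
  private val factory = XMLInputFactory.newInstance()
  // Merge adjacent text chunks so each run of text is a single Characters event.
  factory.setProperty(XMLInputFactory.IS_COALESCING, java.lang.Boolean.TRUE)

  // Describes the first piece of content inside the root element.
  def describe(xml: String): String = {
    val parser = factory.createXMLEventReader(new StringReader(xml))
    parser.nextEvent() // StartDocument
    parser.nextEvent() // root start tag
    // nextEvent() consumes and returns an event; peek() shows the following
    // event without consuming it.
    parser.nextEvent() match {
      case c: Characters if c.isWhiteSpace() =>
        parser.peek() match {
          case _: StartElement => "whitespace before a child element"
          case _: EndElement => "whitespace-only value"
          case _ => "whitespace before something else"
        }
      case c: Characters => s"text '${c.getData.trim}'"
      case _ => "no leading text"
    }
  }
}
```

The consumed `Characters` event and the peeked event are two different events, so a guard on one says nothing by itself about the other.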
| case _: EndElement if data.trim.isEmpty => | ||
| StaxXmlParserUtils.consumeNextEndElement(parser) | ||
| NullType | ||
| case _: EndElement if options.nullValue == "" => |
Given that c.isWhiteSpace is true, for EndElement, the previous case will be true and this will never be reached.
| // if this is really data or space between other elements. | ||
| val data = c.getData | ||
| parser.nextEvent() | ||
| parser.peek match { |
This case (case c: Characters if c.isWhiteSpace) can be combined with the next one with a minor change suggested below
| && isPrimitiveType(structType.fields.head.dataType) | ||
| && isValueTagField(structType.fields.head) => | ||
| simpleType.fields.head.dataType | ||
| case _ => structType |
Remove the case (case c: Characters if c.isWhiteSpace) and add the following case here:
| case _ => structType | |
| case simpleType if structType.isEmpty => NullType | |
| case _ => structType |
👍 Added case _ if structType.fields.isEmpty =>
| (parser.peek, dataType) match { | ||
| case (_: StartElement, dt: DataType) => convertComplicatedType(dt, attributes) | ||
| case (_: EndElement, _: StringType) => | ||
| StaxXmlParserUtils.consumeNextEndElement(parser) |
| StaxXmlParserUtils.consumeNextEndElement(parser) | |
| parser.next |
| } | ||
| case (_: EndElement, _: DataType) => null | ||
| case (_: EndElement, _: DataType) => | ||
| StaxXmlParserUtils.consumeNextEndElement(parser) |
| StaxXmlParserUtils.consumeNextEndElement(parser) | |
| parser.next |
| } | ||
| } | ||
| val value = convertTo(c.getData, st) | ||
| StaxXmlParserUtils.consumeNextEndElement(parser) |
The check in this function will fail if the next event is Comment, Cdata, etc.
Got it! I guess if the next event is a Comment, Cdata, etc., we want to skip those events as well, along with the end element.
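To make the concern concrete, the sketch below lists the events an unfiltered StAX event reader produces for a small input with a comment between the text and the end tag. It is an illustration under stated assumptions, not part of the PR: `EventKinds` and `kinds` are hypothetical names, and the reader here has no event filter installed.

```scala
import java.io.StringReader
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}
import scala.collection.mutable.ListBuffer

object EventKinds {
  // Returns a readable name for every event an unfiltered reader yields.
  def kinds(xml: String): List[String] = {
    val parser = XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml))
    val out = ListBuffer.empty[String]
    while (parser.hasNext()) {
      out += (parser.nextEvent().getEventType match {
        case XMLStreamConstants.START_DOCUMENT => "StartDocument"
        case XMLStreamConstants.END_DOCUMENT => "EndDocument"
        case XMLStreamConstants.START_ELEMENT => "StartElement"
        case XMLStreamConstants.END_ELEMENT => "EndElement"
        case XMLStreamConstants.CHARACTERS => "Characters"
        case XMLStreamConstants.COMMENT => "Comment"
        case other => s"Other($other)"
      })
    }
    out.toList
  }
}
```

For `<ROW><foo attr="a">1<!--This is a comment--></foo></ROW>`, the Comment event sits between the Characters event and the EndElement of `foo`, so a helper that expects the very next event to be an EndElement would fail unless such events are skipped or filtered beforehand.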
| val value = convertTo(c.getData, dt) | ||
| parser.next | ||
| convertTo(c.getData, dt) | ||
| StaxXmlParserUtils.consumeNextEndElement(parser) |
The check in this function will fail if the next event is Comment, Cdata, etc.
<ROW><foo attr="a">1<!--This is a comment--></foo></ROW>
| StaxXmlParserUtils.skipChildren(parser) | ||
| StaxXmlParserUtils.consumeNextEndElement(parser) |
Why not skip EndElement in skipChildren itself?
skipChildren is called recursively. We just want to skip the EndElement after skipping all of its children.
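A simplified sketch of why the end tag is handled outside the recursive function is shown below. It is not Spark's `skipChildren` implementation: `SkipChildrenSketch` is a hypothetical object, and this version assumes the caller has already consumed the parent's start tag.

```scala
import javax.xml.stream.XMLEventReader
import javax.xml.stream.events.{EndDocument, EndElement, StartElement}

object SkipChildrenSketch {
  // Skips everything inside the current element and stops right before the
  // element's own end tag, which is left for the caller to consume.
  def skipChildren(parser: XMLEventReader): Unit = {
    var done = false
    while (!done && parser.hasNext()) {
      parser.peek() match {
        case _: StartElement =>
          parser.nextEvent()   // nested start tag
          skipChildren(parser) // nested content, recursively
          parser.nextEvent()   // nested end tag
        case _: EndElement | _: EndDocument =>
          done = true          // parent's end tag: do not consume it here
        case _ =>
          parser.nextEvent()   // text and other non-element events
      }
    }
  }
}
```

With this shape, every level of the recursion treats end tags the same way: the level that consumed a start tag is also the one that consumes the matching end tag, once all children have been skipped.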
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParserUtils.scala
| val valueTagType = inferFrom(c.getData) | ||
| addOrUpdateType(nameToDataType, options.valueTag, valueTagType) | ||
|
|
||
| case _: EndElement => |
Is EndDocument needed here as well?
| | <array2>4</array2><!--A comment within tags--> | ||
| | </struct3> | ||
| | <string>string</string> | ||
| | value12 |
GitHub doesn't allow adding a review comment to an arbitrary line, so I'm commenting here.
Add a comment between value16 and </ROW> with a space before and after the comment.
Added 👍
| | <struct3> | ||
| | value5 | ||
| | <array2>1</array2> | ||
| | <array2>1<!--A comment within tags--></array2> |
Adding multiple comments with spaces between them:
| | <array2>1<!--A comment within tags--></array2> | |
| | <array2>1 <!--First comment--> <!--Second comment--> </array2> |
Added 👍
| | <array2>3</array2> | ||
| | value11 | ||
| | <array2>4</array2> | ||
| | <array2>4</array2><!--A comment within tags--> |
Adding multiple comments with spaces between them:
| | <array2>4</array2><!--A comment within tags--> | |
| | <array2>4</array2> <!--First comment--> <!--Second comment--> |
Added 👍
| } | ||
| } | ||
|
|
||
| def consumeNextEndElement(parser: XMLEventReader): Unit = { |
For defensive purposes, consider passing the name of the StartElement and asserting that it matches the EndElement:
| def consumeNextEndElement(parser: XMLEventReader): Unit = { | |
| def skipNextEndElement(parser: XMLEventReader): Unit = { |
Added an assertion on the endElementName
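One possible shape for such a defensive check is sketched below. The two-argument signature and the `expectedName` parameter are assumptions made for illustration; the thread does not show the final signature, and this sketch throws `IllegalStateException` in both failure cases for simplicity.

```scala
import javax.xml.stream.XMLEventReader
import javax.xml.stream.events.EndElement

object EndElementCheck {
  // Hypothetical variant: consumes the next event and verifies that it is the
  // end tag matching the given start element name.
  def skipNextEndElement(parser: XMLEventReader, expectedName: String): Unit = {
    parser.nextEvent() match {
      case e: EndElement =>
        val actual = e.getName.getLocalPart
        if (actual != expectedName) {
          throw new IllegalStateException(
            s"Expected end element '$expectedName' but found '$actual'")
        }
      case other =>
        throw new IllegalStateException(
          s"Expected end element '$expectedName' but found event type ${other.getEventType}")
    }
  }
}
```

Checking the name turns a silent misalignment between start and end tags into an immediate, descriptive failure, which is usually easier to debug than a wrong value several fields later.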
| parser.nextEvent() match { | ||
| case _: EndElement => // do nothing | ||
| case _ => throw new IllegalStateException("Invalid state") | ||
| } |
Extra events (comments, etc.) before EndElement need to be skipped.
I think comments are removed before they reach the parser. WDYT?
You are right. We are filtering comments here. Do we need special handling for CDATA or EntityReferences?
Discussed offline. The current implementation should work fine with CDATA, and we will filter out EntityReferences.
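As a rough illustration of filtering events before they reach the parsing logic, the sketch below wraps an event reader with the standard `XMLInputFactory.createFilteredReader` and an `EventFilter`. It is not the filter used in Spark: `FilteredReaderSketch` and `eventTypes` are hypothetical names, and the choice to drop exactly Comment and EntityReference events is an assumption based on this thread.

```scala
import java.io.StringReader
import javax.xml.stream.{EventFilter, XMLEventReader, XMLInputFactory, XMLStreamConstants}
import javax.xml.stream.events.XMLEvent
import scala.collection.mutable.ListBuffer

object FilteredReaderSketch {
  // Rejects comments and entity references; every other event passes through.
  private val filter = new EventFilter {
    override def accept(event: XMLEvent): Boolean = event.getEventType match {
      case XMLStreamConstants.COMMENT | XMLStreamConstants.ENTITY_REFERENCE => false
      case _ => true
    }
  }

  def filteredReader(xml: String): XMLEventReader = {
    val factory = XMLInputFactory.newInstance()
    factory.createFilteredReader(factory.createXMLEventReader(new StringReader(xml)), filter)
  }

  // Collects the event type codes seen through the filtered reader.
  def eventTypes(xml: String): List[Int] = {
    val parser = filteredReader(xml)
    val out = ListBuffer.empty[Int]
    while (parser.hasNext()) out += parser.nextEvent().getEventType
    out.toList
  }
}
```

With the filter in place, code that runs after reading a value sees the EndElement directly after the Characters event, so it does not need its own comment-skipping loop.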
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala
…pature-values-follow-up
…XmlInferSchema.scala Co-authored-by: Sandip Agarwala <131817656+sandip-db@users.noreply.github.com>
…spark into cpature-values-follow-up
#44601 had a conflict with this PR. Would you mind resolving the conflicts, please?
…w-up # Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala
Merged to master.
What changes were proposed in this pull request?
This follow-up refactors the handling of value tags and endElement.
Why are the changes needed?
This follow-up simplifies the handling of value tags.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests
Was this patch authored or co-authored using generative AI tooling?
No