[SPARK-44732][SQL] Built-in XML data source support #41832

sandip-db · 2023-07-03T17:25:49Z

What changes were proposed in this pull request?

XML is a widely used data format. An external spark-xml package (https://github.com/databricks/spark-xml) is available to read and write XML data in Spark. Making spark-xml built-in will provide a better user experience for Spark SQL and Structured Streaming. The proposal is to inline the code from the spark-xml package.

The PR has the following commits:
i) The first commit has the following:
   - Copy of spark-xml src files.
   - Update license
   - Scala style and format fixes
   - Change AnyFunSuite to SparkFunSuite

ii) Miscellaneous import and Scala style fixes.
iii) Add library dependencies
iv) Resource file path fixes and change AnyFunSuite to SharedSparkSession or SQLTestUtils
v) Exclude XML test resource files from license check
vi) Change import from scala.jdk.Collections to scala.collection.JavaConverters
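
Item vi) swaps one Java/Scala collection-conversion import for another. As a minimal sketch (not code from this PR), the `scala.collection.JavaConverters` form looks roughly like the following; it compiles on both Scala 2.12 and 2.13, although 2.13 marks it deprecated in favour of `scala.jdk.CollectionConverters`. The object and helper names are hypothetical, and reading "scala.jdk.Collections" as that 2.13-only import is an assumption.

```scala
import java.util.{ArrayList => JArrayList}
// Available on Scala 2.12 and 2.13 (deprecated on 2.13).
import scala.collection.JavaConverters._

object ConvertersSketch {
  // Hypothetical helper: turn a Java list of element names into an immutable Scala List.
  def toScalaNames(names: java.util.List[String]): List[String] =
    names.asScala.toList

  def main(args: Array[String]): Unit = {
    val javaNames = new JArrayList[String]()
    javaNames.add("book")
    javaNames.add("author")
    println(toScalaNames(javaNames)) // prints List(book, author)
  }
}
```

The explicit `.asScala` call keeps the conversion visible at the call site, which is the same style on either import.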

Why are the changes needed?

Built-in support for the XML data source would provide a better user experience than having to import an external package.

Does this PR introduce any user-facing change?

Yes. It adds built-in support for the XML data source.

How was this patch tested?

Ran the new unit tests that came with the imported spark-xml package.
Also ran ./dev/run-tests

sandip-db · 2023-07-03T22:11:45Z

@HyukjinKwon

HyukjinKwon · 2023-07-03T23:39:17Z

This would need an SPIP.

sandip-db · 2023-08-03T00:48:53Z

> This would need an SPIP.

SPIP link
JIRA
Discussion Thread
Vote thread

HyukjinKwon · 2023-08-03T03:34:49Z

Seems the test failure is flakiness. Mind retriggering https://github.com/sandip-db/spark/actions/runs/5745270542/job/15573043078 please @sandip-db ?

gengliangwang · 2023-08-03T03:37:06Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/XmlInputFormat.scala

+  }
+
+  private def readUntilStartElement(): Boolean = {
+    currentStartTag = startTag


QQ: is there an existing lib we can use for the parsing?

sandip-db · Aug 3, 2023

XMLEventReader is being used to parse fields from XML records. It may be possible to use XMLEventReader to extract records as well. The current scope of the PR is to inline spark-xml as is.
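
The reply above suggests that `XMLEventReader` could also be used to split a document into records. A rough sketch of that idea, using only the JDK's StAX API, might look like this. `RecordExtractionSketch`, `extractRecords`, and the sample input are hypothetical and not part of spark-xml; the sketch collects only each record's text content rather than its raw XML, and it ignores input splits, which a Hadoop input format has to handle.

```scala
import java.io.StringReader
import javax.xml.stream.{XMLEventReader, XMLInputFactory}
import scala.collection.mutable.ArrayBuffer

object RecordExtractionSketch {
  /** Returns the trimmed text content of every element named `rowTag`, one entry per record. */
  def extractRecords(xml: String, rowTag: String): Seq[String] = {
    val factory = XMLInputFactory.newInstance()
    // Disable DTDs and external entities so untrusted input cannot pull in outside resources.
    factory.setProperty(XMLInputFactory.SUPPORT_DTD, java.lang.Boolean.FALSE)
    factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, java.lang.Boolean.FALSE)
    val reader: XMLEventReader = factory.createXMLEventReader(new StringReader(xml))

    val records = ArrayBuffer.empty[String]
    val current = new StringBuilder
    var depth = 0 // 0 means "outside any record"; 1 means "directly inside a rowTag element"

    try {
      while (reader.hasNext) {
        val event = reader.nextEvent()
        if (event.isStartElement) {
          if (depth > 0) {
            depth += 1 // nested element inside the current record
          } else if (event.asStartElement.getName.getLocalPart == rowTag) {
            depth = 1 // a new record starts here
            current.clear()
          }
        } else if (event.isEndElement) {
          if (depth > 0) {
            depth -= 1
            if (depth == 0) records += current.toString.trim // record closed
          }
        } else if (event.isCharacters && depth > 0) {
          current.append(event.asCharacters.getData)
        }
      }
    } finally {
      reader.close()
    }
    records.toSeq
  }

  def main(args: Array[String]): Unit = {
    val xml = "<books><book><title>A</title></book><book><title>B</title></book></books>"
    extractRecords(xml, "book").foreach(println)
  }
}
```

Tracking depth rather than matching tag strings lets the event reader deal with attributes, comments, and nested elements that happen to reuse the row tag's name.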

sandip-db · 2023-08-03T18:02:06Z

> Seems the test failure is flakiness. Mind retriggering https://github.com/sandip-db/spark/actions/runs/5745270542/job/15573043078 please @sandip-db ?

The following test (unrelated to this PR) is failing consistently. It was supposedly fixed by the Jira/PR, but it's continuing to fail even after the fix. How do we get past this? @HyukjinKwon

2023-08-03T17:34:49.5298626Z [info] *** 1 TEST FAILED ***
2023-08-03T17:34:49.5412181Z [error] Failed: Total 9521, Failed 1, Errors 0, Passed 9520, Ignored 27
2023-08-03T17:34:49.5674077Z [error] Failed tests:
2023-08-03T17:34:49.5674974Z [error] org.apache.spark.sql.execution.QueryExecutionSuite

StackTrace:
2023-08-03T17:07:15.5739831Z [info] - Logging plan changes for execution *** FAILED *** (11 milliseconds)
2023-08-03T17:07:15.5740917Z [info] testAppender.loggingEvents.exists(((x$10: org.apache.logging.log4j.core.LogEvent) => x$10.getMessage().getFormattedMessage().contains(expectedMsg))) was false (QueryExecutionSuite.scala:233)
2023-08-03T17:07:15.5741729Z [info] org.scalatest.exceptions.TestFailedException:
2023-08-03T17:07:15.5742406Z [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
2023-08-03T17:07:15.5743140Z [info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
2023-08-03T17:07:15.5743968Z [info] at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
2023-08-03T17:07:15.5744873Z [info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
2023-08-03T17:07:15.5745601Z [info] at org.apache.spark.sql.execution.QueryExecutionSuite.$anonfun$new$34(QueryExecutionSuite.scala:233)
2023-08-03T17:07:15.5746281Z [info] at scala.collection.immutable.List.foreach(List.scala:431)
2023-08-03T17:07:15.5746979Z [info] at org.apache.spark.sql.execution.QueryExecutionSuite.$anonfun$new$31(QueryExecutionSuite.scala:232)

… or SQLTestUtils. Also converted AnyFunSuite to SharedSparkSession or SQLTestUtils. The following test in XmlSuite.scala failed with SharedSparkSession, but works with SQLTestUtils: test("from_xml array basic test")

HyukjinKwon · 2023-08-09T00:20:00Z

@sandip-db would you mind adding some comments about which code is new (e.g., pom.xml)? Then I think it'd be much easier to merge, since most of the code is already existing.

I took a cursory look, and it seems pretty fine to me.

sandip-db · 2023-08-09T00:24:58Z

> @sandip-db would you mind adding some comments about which code is new (e.g., pom.xml)? Then I think it'd be much easier to merge, since most of the code is already existing.
>
> I took a cursory look, and it seems pretty fine to me.

Hi @HyukjinKwon and reviewers,
As indicated in the description, there is very little new code in this PR. Most of it is picked up from the spark-xml package.
The first commit is the vanilla copy of spark-xml with license and Scala style fixes.
Reviewers should focus only on the new code in the other commits, which doesn't change any functionality. These changes address the dependency and test issues. If it helps, I can squash those other commits into one.

HyukjinKwon · 2023-08-09T00:28:05Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/XmlDataToCatalyst.scala

+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+case class XmlDataToCatalyst(


Would need to register this to SQL expression. Can be done in a followup.

HyukjinKwon · 2023-08-09T00:29:51Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/XmlReader.scala

+/**
+ * A collection of static functions for working with XML files in Spark SQL
+ */
+class XmlReader(private var schema: StructType,


Actually I think we can remove the whole file but can be done separately in a followup.

Should probably add a couple of methods in DataFrameReader/DataStreamReader

sandip-db · Aug 9, 2023

I have a PR ready to be sent out with these and other changes you have suggested. Just have to get this merged first :-)

HyukjinKwon · 2023-08-09T00:31:51Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/XmlRelation.scala

+        fs.delete(filesystemPath, true)
+      } catch {
+        case e: IOException =>
+          throw new IOException(


As a followup, we should leverage the error framework we added (see QueryExecutionErrors)

HyukjinKwon · 2023-08-09T00:32:12Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/functions.scala

+ * Support functions for working with XML columns directly.
+ */
+// scalastyle:off: object.name
+object functions {


As a followup, we should move this function to org.apache.spark.sql.functions

HyukjinKwon · 2023-08-09T00:32:59Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/package.scala

+import org.apache.spark.sql.execution.datasources.xml.util.{InferSchema, XmlFile}
+import org.apache.spark.sql.types.{ArrayType, StructType}
+
+package object xml {


Would have to remove this file too, and move those utils such as schema_of_xml to somewhere else.

HyukjinKwon · 2023-08-09T00:33:39Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/util/InferSchema.scala

@@ -0,0 +1,336 @@
+/*


Would have to be placed together with CSVInferSchema at catalyst.

the parsers too

HyukjinKwon · 2023-08-09T00:34:11Z

.../main/scala/org/apache/spark/sql/execution/datasources/xml/util/PartialResultException.scala

+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.datasources.xml.util


We actually have it in the codebase already :-).

HyukjinKwon · 2023-08-09T00:34:53Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/util/XmlFile.scala

+import org.apache.spark.sql.execution.datasources.xml.{XmlInputFormat, XmlOptions}
+import org.apache.spark.sql.execution.datasources.xml.parsers.StaxXmlGenerator
+
+private[xml] object XmlFile {


We likely won't need this file either.

sandip-db · Aug 10, 2023

All of the above comments will be addressed in this sub-task:
https://issues.apache.org/jira/browse/SPARK-44751

HyukjinKwon · 2023-08-09T00:35:24Z

sql/core/src/test/resources/test-data/xml-resources/ages-with-spaces.xml

@@ -0,0 +1,20 @@
+<people>
+  <person>
+    <age born=" 1990-02-24 "> 25 </age>


I was so young :-)

HyukjinKwon · 2023-08-09T00:36:37Z

sql/core/src/test/java/test/org/apache/spark/sql/execution/datasources/xml/JavaXmlSuite.java

+public final class JavaXmlSuite {
+
+    private static final int numBooks = 12;
+    private static final String booksFile = "src/test/resources/test-data/xml-resources/books.xml";


I wonder if we should just name it resources instead of xml-resources. Or create a sub directory?

HyukjinKwon · 2023-08-09T00:37:33Z

I think I'd prefer to merge this first, and address them as a followup so other people can work together. I will review one more time tmr but would be great if others can review this too.

HyukjinKwon · 2023-08-09T01:50:37Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/XmlOptions.scala

+
+  def this() = this(Map.empty)
+
+  val charset = parameters.getOrElse("charset", XmlOptions.DEFAULT_CHARSET)


should document it in https://spark.apache.org/docs/latest/sql-data-sources.html

sandip-db · Aug 10, 2023

Created a sub-task for documentation: https://issues.apache.org/jira/browse/SPARK-44752
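
The `charset` line in the hunk above follows a common pattern for data source options: look up a key in the user-supplied map and fall back to a default. A minimal sketch of that pattern, outside Spark, could look like the following. `SketchXmlOptions` is a hypothetical stand-in, the UTF-8 default is an assumption (the diff only shows the name `XmlOptions.DEFAULT_CHARSET`, not its value), and the eager `Charset.forName` check is an illustrative addition rather than something the PR is known to do.

```scala
import java.nio.charset.{Charset, StandardCharsets}

// Hypothetical stand-in for an options class; only the `charset` lookup mirrors the diff.
class SketchXmlOptions(parameters: Map[String, String]) {
  def this() = this(Map.empty)

  // Use the caller's value when present, otherwise the default.
  val charset: String = parameters.getOrElse("charset", SketchXmlOptions.DefaultCharset)

  // Illustrative addition: resolve eagerly so an unsupported name fails at construction time.
  val resolvedCharset: Charset = Charset.forName(charset)
}

object SketchXmlOptions {
  // Assumed default; the real value of DEFAULT_CHARSET is not shown in the diff.
  val DefaultCharset: String = StandardCharsets.UTF_8.name
}
```

Documenting each such key together with its default, as the review comment asks, is what lets users discover options that are otherwise only visible in the options class.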

HyukjinKwon · 2023-08-09T01:51:09Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/functions.scala

+   *   column is string-valued, or ArrayType[StructType] if column is an array of strings
+   * @param options key-value pairs that correspond to those supported by [[XmlOptions]]
+   */
+  def from_xml(e: Column, schema: DataType, options: Map[String, String] = Map.empty): Column = {


should add Python and SparkR bindings, including Spark Connect

sandip-db · Aug 10, 2023

Added a sub-task: https://issues.apache.org/jira/browse/SPARK-44753

HyukjinKwon

I have no more high-level comments. I will file JIRAs for my own comments tmr.

HyukjinKwon · 2023-08-10T02:23:14Z

Merged to master.

HyukjinKwon · 2023-08-10T02:23:40Z

@sandip-db I would appreciate it if you could create sub-task JIRAs based on my comments when you find some time.

Closes apache#41832 from sandip-db/spark-xml-master.

Authored-by: Sandip Agarwala <131817656+sandip-db@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

### What changes were proposed in this pull request?
This is the second PR related to the built-in XML data source implementation ([jira](https://issues.apache.org/jira/browse/SPARK-44751)). The previous [PR](#41832) ported the spark-xml package. This PR addresses the following:
- Implement FileFormat interface
- Address the review comments in the previous [XML PR](#41832)
- Moved from_xml and schema_of_xml to sql/functions
- Moved ".xml" to DataFrameReader/DataFrameWriter
- Removed old APIs like XmlRelation, XmlReader, etc.
- StaxXmlParser changes:
  - Use FailureSafeParser
  - Convert 'Row' usage to 'InternalRow'
  - Convert String to UTF8String
  - Handle MapData and ArrayData for MapType and ArrayType respectively
  - Use TimestampFormatter to parse timestamp
  - Use DateFormatter to parse date
- StaxXmlGenerator changes:
  - Convert 'Row' usage to 'InternalRow'
  - Handle UTF8String for StringType
  - Handle MapData and ArrayData for MapType and ArrayType respectively
  - Use TimestampFormatter to format timestamp
  - Use DateFormatter to format date
- Update XML tests accordingly because of the above changes

### Why are the changes needed?
These changes are required to bring XML data source capability at par with CSV and JSON and supports features like streaming, which requires FileFormat interface to be implemented.

### Does this PR introduce _any_ user-facing change?
Yes, it adds support for XML data source.

### How was this patch tested?
- Ran all the XML unit tests.
- Github Action

Closes #42462 from sandip-db/xml-file-format-master.

Authored-by: Sandip Agarwala <131817656+sandip-db@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

github-actions bot added SQL BUILD labels Jul 3, 2023

HyukjinKwon marked this pull request as draft July 3, 2023 23:45

sandip-db force-pushed the spark-xml-master branch from 96156c1 to b0fdc6a Compare July 11, 2023 20:08

sandip-db marked this pull request as ready for review July 11, 2023 20:09

sandip-db force-pushed the spark-xml-master branch from b0fdc6a to cc0807b Compare August 3, 2023 00:45

gengliangwang reviewed Aug 3, 2023

sandip-db added 6 commits August 7, 2023 09:34

Built-in xml data source implementation

bef243c

- Copy spark-xml sources
- Update license
- Scala style and format fixes
- AnyFunSuite --> SparkFunSuite

Some import and scala style fixes

5db61e8

Add library dependencies

e5b6cfd

Exclude XML test resource files from license check

21766cc

Change import from scala.jdk.Collections to scala.collection.JavaConverters

e4a3300

sandip-db force-pushed the spark-xml-master branch from cc0807b to e4a3300 Compare August 7, 2023 16:34

sandip-db added 2 commits August 7, 2023 15:15

Switch to default logger

ba62d47

Switch to using SharedSparkSession in XmlSuite and few other XML tests

bccc6b1

HyukjinKwon mentioned this pull request Aug 9, 2023

Document that spark-xml is in maintenance mode databricks/spark-xml#653

Closed

HyukjinKwon reviewed Aug 9, 2023

HyukjinKwon changed the title ~~[SPARK-44265][SQL] Built-in XML data source support~~ [SPARK-44732][SQL] Built-in XML data source support Aug 9, 2023

HyukjinKwon approved these changes Aug 9, 2023

HyukjinKwon reviewed Aug 9, 2023

HyukjinKwon approved these changes Aug 9, 2023

HyukjinKwon closed this in 1846991 Aug 10, 2023

sandip-db mentioned this pull request Aug 11, 2023

[SPARK-44751][SQL] XML FileFormat Interface implementation #42462

Closed
