This repository was archived by the owner on Mar 24, 2025. It is now read-only.

Fails on some comments with Scala: MatchError #164

@metador

Description


I'm trying to load an XML file and parsing fails on some of the comments.
The comment <!-- 1 Beefy Fritos Bur --> fails with scala.MatchError,
but <!-- TLD Generator V1.28c --> is parsed without any problems (presumably because it appears before the rowTag element, so it is never seen during row schema inference).

Here is the XML:

<?xml version="1.0" encoding="utf-8"?>
<!-- TLD Generator V1.28c -->
<POSLog xmlns="http://www.nrf-arts.org/IXRetail/namespace/" xmlns:ns1="xxx" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.nrf-arts.org/IXRetail/namespace/ ../XSD/POSLogForFoodserviceV3.0.0.xsd" xmlns:xyz="xxx">
  <Transaction MajorVersion="3" MinorVersion="0" FixVersion="0" TrainingModeFlag="false">
    <RetailStoreID>027505</RetailStoreID>
    <OrganizationHierarchy Level="OperatingCompany">NMBR</OrganizationHierarchy>
    <OrganizationHierarchy Level="Division">TBC</OrganizationHierarchy>
    <OrganizationHierarchy Level="Concept">TBC</OrganizationHierarchy>
    <WorkstationID>1</WorkstationID>
    <SequenceNumber>131629</SequenceNumber>
    <BusinessDayDate>2016-03-18</BusinessDayDate>
    <BeginDateTime>2016-03-18T22:13:18</BeginDateTime>
    <EndDateTime>2016-03-18T22:14:18</EndDateTime>
    <OperatorID OperatorType="Cashier" OperatorName="R, R">32</OperatorID>
    <ReceiptDateTime>2016-03-18T22:13:21</ReceiptDateTime>
    <RetailTransaction TransactionStatus="Finished">
      <PriceDerivationResult DiscountBenefit="Manager">
        <SequenceNumber>1</SequenceNumber>
        <Percent Action="Subtract">100.00</Percent>
        <PriceDerivationRule>
          <PriceDerivationRuleID>100% Manager Disc.</PriceDerivationRuleID>
          <Amount Action="Subtract">2.00</Amount>
        </PriceDerivationRule>
        <ReasonCode>FIXED DISCOUNT</ReasonCode>
        <OperatorBypassApproval>
          <SequenceNumber>1</SequenceNumber>
          <ApproverID>39</ApproverID>
        </OperatorBypassApproval>
      </PriceDerivationResult>
      <LineItem CancelFlag="false" VoidFlag="false">
        <!-- 1 Beefy Fritos Bur -->
        <SequenceNumber>2</SequenceNumber>
        <Sale>
          <POSIdentity>
            <POSItemID>B-BFFT</POSItemID>
          </POSIdentity>
          <RegularSalesUnitPrice>1.00</RegularSalesUnitPrice>
          <ActualSalesUnitPrice>1.00</ActualSalesUnitPrice>
          <ExtendedAmount>1.00</ExtendedAmount>
          <Quantity>1</Quantity>
        </Sale>
      </LineItem>
      <LineItem CancelFlag="false" VoidFlag="false">
        <!-- 1 Shrd Chk Mini Qu -->
        <SequenceNumber>3</SequenceNumber>
        <Sale>
          <POSIdentity>
            <POSItemID>MQ-SSC</POSItemID>
          </POSIdentity>
          <RegularSalesUnitPrice>1.00</RegularSalesUnitPrice>
          <ActualSalesUnitPrice>1.00</ActualSalesUnitPrice>
          <ExtendedAmount>1.00</ExtendedAmount>
          <Quantity>1</Quantity>
        </Sale>
      </LineItem>
      <!-- SOS -->
      <Total TotalType="TransactionGrossAmount">2.00</Total>
      <Total TotalType="TransactionNetAmount">0.00</Total>
      <Total TotalType="TransactionTaxAmount">0.00</Total>
      <Total TotalType="TransactionNonSalesAmount">2.00</Total>
      <Total TotalType="X:TransactionPromoAmount">0.00</Total>
      <Customer>
        <CustomerID>
        </CustomerID>
        <CustomerName>
          <FullName>
          </FullName>
        </CustomerName>
      </Customer>
      <Foodservice DestinationType="XXX">
      </Foodservice>
    </RetailTransaction>
  </Transaction>
</POSLog>

Here is the code I am trying to run:

trans = sqlContext.read.format('com.databricks.spark.xml').option("rowTag","Transaction").load("testing/sampleTLDnocomments.xml")
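As a temporary workaround, I am considering stripping the comments out of the file before handing it to spark-xml. A quick plain-Python sketch (hypothetical helper name; the regex assumes comments never span a CDATA section containing a literal --> and that the file fits in memory):

```python
import re

def strip_xml_comments(xml_text):
    # Drop every <!-- ... --> span so schema inference never
    # encounters a comment event. Non-greedy match; DOTALL lets
    # comments span multiple lines.
    return re.sub(r"<!--.*?-->", "", xml_text, flags=re.DOTALL)

snippet = ('<LineItem CancelFlag="false" VoidFlag="false">'
           '<!-- 1 Beefy Fritos Bur -->'
           '<SequenceNumber>2</SequenceNumber>'
           '</LineItem>')
print(strip_xml_comments(snippet))
```

This preprocesses the input rather than fixing the library, so the underlying MatchError would still need a fix in InferSchema.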

The error I am running into is scala.MatchError. Here is the log:

>>> trans1 = sqlContext.read.format('com.databricks.spark.xml').option("rowTag", "Transaction").load("testing/sampleTLD.xml")
16/08/29 21:45:48 INFO storage.MemoryStore: Block broadcast_31 stored as values in memory (estimated size 202.6 KB, free 202.6 KB)
16/08/29 21:45:48 INFO storage.MemoryStore: Block broadcast_31_piece0 stored as bytes in memory (estimated size 23.5 KB, free 226.1 KB)
16/08/29 21:45:48 INFO storage.BlockManagerInfo: Added broadcast_31_piece0 in memory on 172.31.30.114:60795 (size: 23.5 KB, free: 530.3 MB)
16/08/29 21:45:48 INFO spark.SparkContext: Created broadcast 31 from newAPIHadoopFile at XmlFile.scala:39
16/08/29 21:45:48 INFO input.FileInputFormat: Total input paths to process : 1
16/08/29 21:45:48 INFO spark.SparkContext: Starting job: treeAggregate at InferSchema.scala:103
16/08/29 21:45:48 INFO scheduler.DAGScheduler: Got job 13 (treeAggregate at InferSchema.scala:103) with 1 output partitions
16/08/29 21:45:48 INFO scheduler.DAGScheduler: Final stage: ResultStage 13 (treeAggregate at InferSchema.scala:103)
16/08/29 21:45:48 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/08/29 21:45:48 INFO scheduler.DAGScheduler: Missing parents: List()
16/08/29 21:45:48 INFO scheduler.DAGScheduler: Submitting ResultStage 13 (MapPartitionsRDD[77] at treeAggregate at InferSchema.scala:103), which has no missing parents
16/08/29 21:45:48 INFO storage.MemoryStore: Block broadcast_32 stored as values in memory (estimated size 4.1 KB, free 230.1 KB)
16/08/29 21:45:48 INFO storage.MemoryStore: Block broadcast_32_piece0 stored as bytes in memory (estimated size 2.3 KB, free 232.4 KB)
16/08/29 21:45:48 INFO storage.BlockManagerInfo: Added broadcast_32_piece0 in memory on 172.31.30.114:60795 (size: 2.3 KB, free: 530.3 MB)
16/08/29 21:45:48 INFO spark.SparkContext: Created broadcast 32 from broadcast at DAGScheduler.scala:1006
16/08/29 21:45:48 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 13 (MapPartitionsRDD[77] at treeAggregate at InferSchema.scala:103)
16/08/29 21:45:48 INFO cluster.YarnScheduler: Adding task set 13.0 with 1 tasks
16/08/29 21:45:49 INFO spark.ExecutorAllocationManager: Requesting 1 new executor because tasks are backlogged (new desired total will be 1)
16/08/29 21:45:51 INFO cluster.YarnClientSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (ip-172-31-30-113.ec2.internal:56600) with ID 12
16/08/29 21:45:51 INFO spark.ExecutorAllocationManager: New executor 12 has registered (new total is 1)
16/08/29 21:45:51 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 13.0 (TID 28, ip-172-31-30-113.ec2.internal, partition 0,NODE_LOCAL, 2293 bytes)
16/08/29 21:45:51 INFO storage.BlockManagerMasterEndpoint: Registering block manager ip-172-31-30-113.ec2.internal:46617 with 530.3 MB RAM, BlockManagerId(12, ip-172-31-30-113.ec2.internal, 46617)
16/08/29 21:45:52 INFO storage.BlockManagerInfo: Added broadcast_32_piece0 in memory on ip-172-31-30-113.ec2.internal:46617 (size: 2.3 KB, free: 530.3 MB)
16/08/29 21:45:52 INFO storage.BlockManagerInfo: Added broadcast_31_piece0 in memory on ip-172-31-30-113.ec2.internal:46617 (size: 23.5 KB, free: 530.3 MB)
16/08/29 21:45:53 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 13.0 (TID 28, ip-172-31-30-113.ec2.internal): scala.MatchError: <!-- 1 Beefy Fritos Bur --> (of class com.sun.xml.internal.stream.events.CommentEvent)
        at com.databricks.spark.xml.util.InferSchema$.inferField(InferSchema.scala:134)
        at com.databricks.spark.xml.util.InferSchema$.com$databricks$spark$xml$util$InferSchema$$inferObject(InferSchema.scala:171)
        at com.databricks.spark.xml.util.InferSchema$.inferField(InferSchema.scala:135)
        at com.databricks.spark.xml.util.InferSchema$.com$databricks$spark$xml$util$InferSchema$$inferObject(InferSchema.scala:171)
        at com.databricks.spark.xml.util.InferSchema$$anonfun$3$$anonfun$apply$2.apply(InferSchema.scala:94)
        at com.databricks.spark.xml.util.InferSchema$$anonfun$3$$anonfun$apply$2.apply(InferSchema.scala:83)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
        at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
        at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
        at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
        at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1135)
        at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1135)
        at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1136)
        at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1136)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

16/08/29 21:45:53 INFO scheduler.TaskSetManager: Starting task 0.1 in stage 13.0 (TID 29, ip-172-31-30-113.ec2.internal, partition 0,NODE_LOCAL, 2293 bytes)
16/08/29 21:45:53 INFO scheduler.TaskSetManager: Lost task 0.1 in stage 13.0 (TID 29) on executor ip-172-31-30-113.ec2.internal: scala.MatchError (<!-- 1 Beefy Fritos Bur --> (of class com.sun.xml.internal.stream.events.CommentEvent)) [duplicate 1]
16/08/29 21:45:53 INFO scheduler.TaskSetManager: Starting task 0.2 in stage 13.0 (TID 30, ip-172-31-30-113.ec2.internal, partition 0,NODE_LOCAL, 2293 bytes)
16/08/29 21:45:53 INFO scheduler.TaskSetManager: Lost task 0.2 in stage 13.0 (TID 30) on executor ip-172-31-30-113.ec2.internal: scala.MatchError (<!-- 1 Beefy Fritos Bur --> (of class com.sun.xml.internal.stream.events.CommentEvent)) [duplicate 2]
16/08/29 21:45:53 INFO scheduler.TaskSetManager: Starting task 0.3 in stage 13.0 (TID 31, ip-172-31-30-113.ec2.internal, partition 0,NODE_LOCAL, 2293 bytes)
16/08/29 21:45:53 INFO scheduler.TaskSetManager: Lost task 0.3 in stage 13.0 (TID 31) on executor ip-172-31-30-113.ec2.internal: scala.MatchError (<!-- 1 Beefy Fritos Bur --> (of class com.sun.xml.internal.stream.events.CommentEvent)) [duplicate 3]
16/08/29 21:45:53 ERROR scheduler.TaskSetManager: Task 0 in stage 13.0 failed 4 times; aborting job
16/08/29 21:45:53 INFO cluster.YarnScheduler: Removed TaskSet 13.0, whose tasks have all completed, from pool
16/08/29 21:45:53 INFO cluster.YarnScheduler: Cancelling stage 13
16/08/29 21:45:53 INFO scheduler.DAGScheduler: ResultStage 13 (treeAggregate at InferSchema.scala:103) failed in 5.268 s
16/08/29 21:45:53 INFO scheduler.DAGScheduler: Job 13 failed: treeAggregate at InferSchema.scala:103, took 5.286607 s
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/spark/python/pyspark/sql/readwriter.py", line 137, in load
    return self._df(self._jreader.load(path))
  File "/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/spark/python/pyspark/sql/utils.py", line 45, in deco
    return f(*a, **kw)
  File "/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o622.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 13.0 failed 4 times, most recent failure: Lost task 0.3 in stage 13.0 (TID 31, ip-172-31-30-113.ec2.internal): scala.MatchError: <!-- 1 Beefy Fritos Bur --> (of class com.sun.xml.internal.stream.events.CommentEvent)
        at com.databricks.spark.xml.util.InferSchema$.inferField(InferSchema.scala:134)
        at com.databricks.spark.xml.util.InferSchema$.com$databricks$spark$xml$util$InferSchema$$inferObject(InferSchema.scala:171)
        at com.databricks.spark.xml.util.InferSchema$.inferField(InferSchema.scala:135)
        at com.databricks.spark.xml.util.InferSchema$.com$databricks$spark$xml$util$InferSchema$$inferObject(InferSchema.scala:171)
        at com.databricks.spark.xml.util.InferSchema$$anonfun$3$$anonfun$apply$2.apply(InferSchema.scala:94)
        at com.databricks.spark.xml.util.InferSchema$$anonfun$3$$anonfun$apply$2.apply(InferSchema.scala:83)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
        at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
        at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
        at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
        at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1135)
        at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1135)
        at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1136)
        at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1136)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1843)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1963)
        at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1025)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
        at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007)
        at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1150)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
        at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1127)
        at com.databricks.spark.xml.util.InferSchema$.infer(InferSchema.scala:103)
        at com.databricks.spark.xml.XmlRelation$$anonfun$1.apply(XmlRelation.scala:46)
        at com.databricks.spark.xml.XmlRelation$$anonfun$1.apply(XmlRelation.scala:46)
        at scala.Option.getOrElse(Option.scala:120)
        at com.databricks.spark.xml.XmlRelation.<init>(XmlRelation.scala:45)
        at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:66)
        at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:44)
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:209)
        at java.lang.Thread.run(Thread.java:745)

Any idea why this is happening and how it can be solved?
Thanks
