
[Read an Excel File]: GC overhead limit exceeded #322

Open
cedtnee opened this issue Nov 26, 2020 · 14 comments

@cedtnee commented Nov 26, 2020

Expected Behavior

I am trying to read a 35 MB Excel file and write the result as ORC files.
The input file has a single sheet containing only values, no formulas or macros.

Current Behavior

In my readerExcel.scala, I do this:
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// fields: Seq[String] of column names, defined elsewhere
val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true") // Required
  .option("treatEmptyValuesAsNulls", "false") // Optional, default: true
  .option("inferSchema", "false") // Optional, default: false
  .option("sheetName", "Feuil1")
  // Optional, default: either inferred schema, or all columns as Strings
  .schema(StructType(
    fields.map(fieldName => StructField(fieldName, StringType, nullable = true))))
  .load("file.xlsx")

I get this error:
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.regex.Pattern.matcher(Pattern.java:1093)
at shadeio.poi.ss.util.CellReference.separateRefParts(CellReference.java:396)
at shadeio.poi.ss.util.CellReference.<init>(CellReference.java:113)
at shadeio.poi.xssf.usermodel.XSSFCell.<init>(XSSFCell.java:118)
at shadeio.poi.xssf.usermodel.XSSFRow.<init>(XSSFRow.java:75)
at shadeio.poi.xssf.usermodel.XSSFSheet.initRows(XSSFSheet.java:268)
at shadeio.poi.xssf.usermodel.XSSFSheet.read(XSSFSheet.java:231)
at shadeio.poi.xssf.usermodel.XSSFSheet.onDocumentRead(XSSFSheet.java:218)
at shadeio.poi.xssf.usermodel.XSSFWorkbook.parseSheet(XSSFWorkbook.java:448)
at shadeio.poi.xssf.usermodel.XSSFWorkbook.onDocumentRead(XSSFWorkbook.java:413)
at shadeio.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:184)
at shadeio.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:282)
at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:88)
at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:135)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at shadeio.poi.ss.usermodel.WorkbookFactory.createWorkbook(WorkbookFactory.java:339)
at shadeio.poi.ss.usermodel.WorkbookFactory.createXSSFWorkbook(WorkbookFactory.java:314)
at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:232)
at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:198)
at com.crealytics.spark.excel.DefaultWorkbookReader$$anonfun$openWorkbook$1.apply(WorkbookReader.scala:50)
at com.crealytics.spark.excel.DefaultWorkbookReader$$anonfun$openWorkbook$1.apply(WorkbookReader.scala:50)
at scala.Option.fold(Option.scala:158)
at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:50)
at com.crealytics.spark.excel.WorkbookReader$class.withWorkbook(WorkbookReader.scala:14)
at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:46)
at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:30)
at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:30)
at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:104)
at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:103)
2020-11-26 21:36:57,729 [Driver] INFO org.apache.spark.deploy.yarn.ApplicationMaster - Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.io.IOException: GC overhead limit exceeded
at shadeio.poi.ss.usermodel.WorkbookFactory.createWorkbook(WorkbookFactory.java:351)
at shadeio.poi.ss.usermodel.WorkbookFactory.createXSSFWorkbook(WorkbookFactory.java:314)
at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:232)
at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:198)
at com.crealytics.spark.excel.DefaultWorkbookReader$$anonfun$openWorkbook$1.apply(WorkbookReader.scala:50)
at com.crealytics.spark.excel.DefaultWorkbookReader$$anonfun$openWorkbook$1.apply(WorkbookReader.scala:50)
at scala.Option.fold(Option.scala:158)
at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:50)
at com.crealytics.spark.excel.WorkbookReader$class.withWorkbook(WorkbookReader.scala:14)
at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:46)
at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:30)
at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:30)
at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:104)
at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:103)
at com.crealytics.spark.excel.ExcelRelation.headerColumnForName$lzycompute(ExcelRelation.scala:32)
at com.crealytics.spark.excel.ExcelRelation.headerColumnForName(ExcelRelation.scala:32)
at com.crealytics.spark.excel.ExcelRelation.com$crealytics$spark$excel$ExcelRelation$$columnExtractor(ExcelRelation.scala:51)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$4.apply(ExcelRelation.scala:61)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$4.apply(ExcelRelation.scala:61)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at com.crealytics.spark.excel.ExcelRelation.buildScan(ExcelRelation.scala:61)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$11.apply(DataSourceStrategy.scala:300)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$11.apply(DataSourceStrategy.scala:300)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:338)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:337)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProjectRaw(DataSourceStrategy.scala:415)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProject(DataSourceStrategy.scala:333)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy.apply(DataSourceStrategy.scala:296)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3248)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2484)

Troubleshooting

I increased the memory size, but I still get the same issue.
I set the Spark options related to the heap, but I still get the same issue.

Your Environment

  • Spark version: 2.3 (language: Scala)
  • Spark-Excel version: com.crealytics:spark-excel_2.11:0.13.1
@nightscape (Owner)
Can you try something along the lines of .option("maxRowsInMemory", 20) (see the README)?
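
For reference, a minimal sketch of the suggested option applied to the read from the report above (other options carried over; per the README, maxRowsInMemory switches spark-excel to its streaming reader):

// maxRowsInMemory is the number of rows buffered on the heap at a time,
// not a limit on how many rows are read.
val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")
  .option("sheetName", "Feuil1")
  .option("maxRowsInMemory", 20)
  .load("file.xlsx")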

@cedtnee (Author) commented Nov 30, 2020

When I used .option("maxRowsInMemory", 20), the resulting DataFrame was empty.
When I increased it to .option("maxRowsInMemory", 30000), I got this error:
An exception or error caused a run to abort: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.sun.xml.internal.stream.events.StartElementEvent.init(StartElementEvent.java:76)
at com.sun.xml.internal.stream.events.StartElementEvent.<init>(StartElementEvent.java:64)
at com.sun.xml.internal.stream.events.XMLEventAllocatorImpl.getXMLEvent(XMLEventAllocatorImpl.java:76)
at com.sun.xml.internal.stream.events.XMLEventAllocatorImpl.allocate(XMLEventAllocatorImpl.java:53)
at com.sun.xml.internal.stream.XMLEventReaderImpl.nextEvent(XMLEventReaderImpl.java:84)
at shadeio.monitorjbl.xlsx.impl.StreamingSheetReader.getRow(StreamingSheetReader.java:71)
at shadeio.monitorjbl.xlsx.impl.StreamingSheetReader.access$200(StreamingSheetReader.java:32)
at shadeio.monitorjbl.xlsx.impl.StreamingSheetReader$StreamingRowIterator.hasNext(StreamingSheetReader.java:402)
at shadeio.monitorjbl.xlsx.impl.StreamingSheetReader$StreamingRowIterator.<init>(StreamingSheetReader.java:396)
at shadeio.monitorjbl.xlsx.impl.StreamingSheetReader.iterator(StreamingSheetReader.java:382)
at shadeio.monitorjbl.xlsx.impl.StreamingSheet.iterator(StreamingSheet.java:49)
at com.crealytics.spark.excel.AreaDataLocator$class.readFromSheet(DataLocator.scala:91)
at com.crealytics.spark.excel.CellRangeAddressDataLocator.readFromSheet(DataLocator.scala:134)
at com.crealytics.spark.excel.CellRangeAddressDataLocator.readFrom(DataLocator.scala:144)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$buildScan$2.apply(ExcelRelation.scala:63)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$buildScan$2.apply(ExcelRelation.scala:62)
at com.crealytics.spark.excel.WorkbookReader$class.withWorkbook(WorkbookReader.scala:15)
at com.crealytics.spark.excel.StreamingWorkbookReader.withWorkbook(WorkbookReader.scala:55)
at com.crealytics.spark.excel.ExcelRelation.buildScan(ExcelRelation.scala:62)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$11.apply(DataSourceStrategy.scala:300)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$11.apply(DataSourceStrategy.scala:300)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:338)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:337)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProjectRaw(DataSourceStrategy.scala:415)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProject(DataSourceStrategy.scala:333)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy.apply(DataSourceStrategy.scala:296)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)

@nightscape (Owner)
I'm not surprised by the OutOfMemoryError, but getting an empty result is weird...
Could you try some more values between 20 and 30000?

@AlexZhang267 commented Mar 18, 2021

I used .option("maxRowsInMemory", 200), but it's still often to get an OOM error. My excel file is only 16MB. This is error message.


diagnostics: User class threw exception: java.lang.OutOfMemoryError: Java heap space
at org.apache.xmlbeans.impl.store.Cur.createElementXobj(Cur.java:260)
at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.startElement(Cur.java:2997)
at org.apache.xmlbeans.impl.store.Locale$SaxHandler.startElement(Locale.java:3164)
at org.apache.xerces.parsers.AbstractSAXParser.startElement(Unknown Source)
at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3422)
at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1272)
at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1259)
at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
at org.openxmlformats.schemas.spreadsheetml.x2006.main.SstDocument$Factory.parse(Unknown Source)
at shadeio.poi.xssf.model.SharedStringsTable.readFrom(SharedStringsTable.java:123)
at shadeio.poi.xssf.model.SharedStringsTable.<init>(SharedStringsTable.java:111)
at shadeio.poi.xssf.eventusermodel.XSSFReader.getSharedStringsTable(XSSFReader.java:115)
at shadeio.pjfanning.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:132)
at shadeio.pjfanning.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:90)
at shadeio.pjfanning.xlsx.StreamingReader$Builder.open(StreamingReader.java:307)
at com.crealytics.spark.excel.StreamingWorkbookReader.openWorkbook(WorkbookReader.scala:63)
at com.crealytics.spark.excel.WorkbookReader$class.withWorkbook(WorkbookReader.scala:14)
at com.crealytics.spark.excel.StreamingWorkbookReader.withWorkbook(WorkbookReader.scala:54)
at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:31)
at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:31)
at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:102)
at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:101)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:163)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:162)

@nightscape (Owner)
@AlexZhang267 I unfortunately can't invest much time into spark-excel at the moment.
If possible, I would recommend increasing Spark's memory to work around this.
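
A minimal sketch of that workaround (values are illustrative, not a tested recommendation; note that driver-side memory has to be set before the driver JVM starts, e.g. via spark-submit --driver-memory, not from inside a running application):

import org.apache.spark.sql.SparkSession

// Raise executor memory when building the session. Since this workbook is
// parsed on the driver, spark.driver.memory matters most, but that setting
// is ignored here once the driver JVM is already running.
val spark = SparkSession.builder()
  .appName("excel-ingest") // illustrative app name
  .config("spark.executor.memory", "8g") // illustrative value
  .getOrCreate()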

@kennydataml commented Aug 5, 2021

I'm getting the same problem here.
I tried using "maxRowsInMemory" with values like 1000, 10000, and 100000.
It works fine for the first 3 Excel files, but I have 8 to process, and the driver node always dies on the 4th file. The files are roughly 600 MB each.
I tried spark.driver.extraJavaOptions -XX:+UseG1GC and increased driver memory to 56 GB, but the driver node still crashes.
My understanding is that the driver will crash if there are too many full garbage collections.
I also noticed the cluster doesn't scale up beyond 1 worker node, even though I have auto-scaling set to 4 worker nodes.
So there is clearly some kind of collect() happening on the driver node with this library, and it doesn't seem to use Spark's parallelism very well?

@quanghgx (Collaborator) commented Aug 6, 2021

Hi @kennydataml,
I am not sure if this is related. How do you read these 8 files? It would help if we had simplified (without sensitive data) and reproducible code. I am preparing a PR for native support of loading multiple files.
Sincerely,

@kennydataml commented Aug 6, 2021

I unfortunately can't get a scrubbed Excel file for you, since it's on a client laptop.
The code is simple; I only read in one Excel file at a time with a for loop. So, basically:

for xlpath in excels:
    # placeholder for the author's path logic ("xlpath split join yadayda")
    csvpath = make_csv_path(xlpath)  # make_csv_path is hypothetical
    try:  # exception handling since we don't know the number of sheets
        for i in range(15):  # dynamic number of sheets
            df = (spark.read
                  .format("com.crealytics.spark.excel")
                  .option("dataAddress", f"{i}!A1")  # sub sheet index here
                  .option("header", "true")
                  .option("maxRowsInMemory", 100000)
                  .load(xlpath))
            # write the sheet out as CSV; repartition() belongs on the
            # DataFrame, before .write
            (df.repartition(200)  # attempting to circumvent memory issues
               .write
               .format("csv")
               .mode("append")
               .option("header", "true")
               .save(csvpath))
    except Exception as err:
        print(repr(err))

I've narrowed the problem down to 1 of the 8 Excel files. I can consistently reproduce it on that particular file. It opens just fine in Microsoft Excel, so I'm puzzled why only this one file gives me trouble.
The behaviour I'm observing is that I can read the first 6 sheets, but it hangs on the 7th sheet when it tries to append-write to the CSV path. No Spark job is initiated for the write as would be expected, and it just hangs for an hour straight.
I've tried running the code without looping through all the Excel files, targeting only that particular file, and it hangs consistently on the 7th sheet.
It's really weird; I'm going to do some more testing tomorrow with a fresh copy of the data.

@NestorAGC123
I have the same issue, and my Excel file is only 2 MB. It also happens only on some specific files.

@quanghgx (Collaborator)
Hi @NestorAGC123,
Is it possible to share your Excel file after removing sensitive data?

@pjfanning (Collaborator)
The problem is that spark-excel reads the file as an input stream, which uses far more memory than reading it as a java.io.File. I have logged https://bz.apache.org/bugzilla/show_bug.cgi?id=65581 for a possible solution.
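
To illustrate the difference with plain Apache POI (spark-excel ships it shaded under shadeio.poi; this is a sketch of the underlying API, not spark-excel's actual code path):

import java.io.{File, FileInputStream}
import org.apache.poi.ss.usermodel.WorkbookFactory

// From an InputStream, POI must buffer the whole OOXML package in memory
// before parsing; from a File it can read the package with random access,
// which needs far less heap.
val fromStream = WorkbookFactory.create(new FileInputStream("file.xlsx")) // memory-hungry
val fromFile   = WorkbookFactory.create(new File("file.xlsx"))            // file-backed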

@quanghgx added the large label Oct 3, 2021
@yanghong
Is this problem still open? I have the same problem.

@pjfanning (Collaborator)
@yanghong maybe you could do a PR based on the https://bz.apache.org/bugzilla/show_bug.cgi?id=65581 changes?

@yanghong
@yanghong maybe you could do a PR based on the https://bz.apache.org/bugzilla/show_bug.cgi?id=65581 changes?

Fine, it's POI's limitation.
