Java Heap Out of Memory Error #11

soleyjh · 2021-12-02T06:20:22Z

Hi! And thanks so much for writing this great package!

When I run the following command:

df = spark.read.format("com.elastacloud.spark.excel").option("cellAddress", "A1").load(file_location)

I get the following error:

java.lang.OutOfMemoryError: Java heap space

The excel file is 218MB roughly 750K rows with ~50 or so fields (no long text strings).

and I'm running in Azure Databricks: 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12)
Running: Standard_F4 with 8GB Memory and 4 Cores

I downloaded the following JAR and Installed on the Cluster (Above): spark_excel_3_1_2_0_1_7.jar

Any Help or Advice would be appreciated.

The text was updated successfully, but these errors were encountered:

dazfuller · 2021-12-03T08:35:09Z

Hi, and thanks for the nice feedback 🙂

I'm guessing that the file you're opening is an .xlsx file which means that 218MB is it's compressed size. The library has to decompress the entire file so that it can read it (this is so the formula evaluation works), which can lead to OOM errors. I'm looking at having a simplified version which lets the library stream the data out without the formula evaluation.

One of the things you could try is to use a memory optimized sku, or a standard sku with more memory available. There's also the option of setting the JVM options to allow for a bigger heap size, -Xmx4096m for instance.

dazfuller · 2021-12-20T19:59:42Z

Hi @soleyjh did this help out at all?

josecsotomorales · 2024-06-26T17:47:58Z

I also faced this issue, trying to think of a workaround, can we disable the formula evaluation as part of the configuration and just read static data from Excel?

dazfuller · 2024-06-27T08:30:10Z

It's definitely doable, I'll need to handle the cell evaluation slightly differently based on the config, but let me have a look and see what I can do with it

dazfuller · 2024-06-27T16:37:06Z

Okay, the code is in there to handle this now in the 0.1.13 branch, so you can pass in an option like this

.option("evaluateFormulae", "false")

What this will do instead is extract the formula itself (e.g. "A7*2"), but it won't attempt to evaluate it. If you let me know which spark version you're targeting I can create a jar for you from that branch to test with. Or check out the branch and build it yourself if fancy :)

josecsotomorales · 2024-06-27T18:15:54Z

I'm going to test it on my local by checking out the branch

josecsotomorales · 2024-06-27T18:33:09Z

Alright, I have some insights, it helped a LOT but I still have a Java Heap OOM. The file is XLSX approx 300MB. I processed it with https://github.com/crealytics/spark-excel using the streaming reader config and it works:

.option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files (will fail if used with xls format files)

I don't know how we can add this streaming as an option to this project.

dazfuller · 2024-06-27T19:14:09Z

It is something I've looked at before, but you lose a lot of the features using the streaming reader. My current thought is to have a different parser which uses the stream reader and a heavy check on the options provided.

Is processing locally your usage environment?

josecsotomorales · 2024-06-27T20:25:04Z

That makes sense. I'm currently performing my testing by running Spark on K8S driver 8CPU 64GB RAM + executors same size.

dazfuller · 2024-06-28T07:29:45Z

What's the JVM heap size set to?

josecsotomorales · 2024-06-28T14:15:46Z

It's set up to 4GB

dazfuller · 2024-06-28T15:45:16Z

Any success if you increase that size?

dazfuller · 2024-06-29T12:23:09Z

I'm going to get the 0.1.13 release out and then start looking at the 0.1.14 with a view to creating a companion parser which uses the SXSSF (streaming) reader. It means that it won't have the ability to do formula evaluation, and merged cells over a certain size, but it should reduce the memory required by the POI library

dazfuller · 2024-07-01T10:55:47Z

For info, there's a branch called dazfuller/streaming-reading which has a working version of the streaming reader. Needs quite a lot more testing yet, but the core if it is in place

josecsotomorales · 2024-07-01T13:14:24Z

Excellent work @dazfuller!! Will review and perform some testing on that branch

dazfuller self-assigned this Dec 3, 2021

dazfuller added the question Further information is requested label Dec 3, 2021

dazfuller closed this as completed Feb 6, 2022

dazfuller reopened this Jun 28, 2024

dazfuller added enhancement New feature or request and removed question Further information is requested labels Jun 28, 2024

dazfuller added this to the 0.1.14 Release milestone Jun 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Java Heap Out of Memory Error #11

Java Heap Out of Memory Error #11

soleyjh commented Dec 2, 2021 •

edited

Loading

dazfuller commented Dec 3, 2021

dazfuller commented Dec 20, 2021

josecsotomorales commented Jun 26, 2024

dazfuller commented Jun 27, 2024

dazfuller commented Jun 27, 2024

josecsotomorales commented Jun 27, 2024

josecsotomorales commented Jun 27, 2024

dazfuller commented Jun 27, 2024

josecsotomorales commented Jun 27, 2024

dazfuller commented Jun 28, 2024

josecsotomorales commented Jun 28, 2024

dazfuller commented Jun 28, 2024

dazfuller commented Jun 29, 2024

dazfuller commented Jul 1, 2024

josecsotomorales commented Jul 1, 2024

Java Heap Out of Memory Error #11

Java Heap Out of Memory Error #11

Comments

soleyjh commented Dec 2, 2021 • edited Loading

dazfuller commented Dec 3, 2021

dazfuller commented Dec 20, 2021

josecsotomorales commented Jun 26, 2024

dazfuller commented Jun 27, 2024

dazfuller commented Jun 27, 2024

josecsotomorales commented Jun 27, 2024

josecsotomorales commented Jun 27, 2024

dazfuller commented Jun 27, 2024

josecsotomorales commented Jun 27, 2024

dazfuller commented Jun 28, 2024

josecsotomorales commented Jun 28, 2024

dazfuller commented Jun 28, 2024

dazfuller commented Jun 29, 2024

dazfuller commented Jul 1, 2024

josecsotomorales commented Jul 1, 2024

soleyjh commented Dec 2, 2021 •

edited

Loading