Error while reading mounted xlsx: Could not initialize class shadeio.poi.xssf.model.SharedStringsTable #438

Open
ghost opened this issue Oct 15, 2021 · 5 comments
Labels
cloud Usage of spark-excel on cloud storage & platform

ghost commented Oct 15, 2021

I am using Azure Databricks and trying to read an Excel file (xlsx) from a storage account (ADLS Gen2). Because I get an 'Anonymous access' error when I connect to the file using the wasbs path, I mounted the storage and tried to read the Excel file from there. This is my code:

`df = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("delimiter", ";") \
    .load("/mnt/mountPoint/Budget.csv")

df = spark.read \
    .format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("sheetName", "Sheet1") \
    .load("/mnt/mountPoint/Budget.xls")

df = spark.read \
    .format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("sheetName", "Sheet1") \
    .load("/mnt/mountPoint/Budget.xlsx")`

The first command succeeds and I get the headers from the file. A df.show() will show me the content. The second command (using the xls) succeeds as well and I get the schema and content. The third command fails with this error:
java.lang.NoClassDefFoundError: Could not initialize class shadeio.poi.xssf.model.SharedStringsTable
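
On the JVM, a `NoClassDefFoundError` with the text "Could not initialize class" generally means the class was found but its static initializer failed on an earlier attempt, so the original cause was reported only once. The following is a minimal sketch, not part of spark-excel, of how one might probe for that cause from a notebook. The helper name `probe_class` is hypothetical, and it assumes that `spark._jvm` (a py4j view of the driver JVM) can see the library's classes, which may not hold if the cluster loads attached libraries through a separate classloader.

```python
def probe_class(jvm, class_name):
    """Try to load a JVM class by name and report any failure.

    `jvm` is assumed to be a py4j JVM view such as `spark._jvm`.
    Returns (True, None) on success or (False, message) on failure.
    """
    try:
        # Class.forName triggers class loading and static initialization.
        jvm.java.lang.Class.forName(class_name)
        return True, None
    except Exception as exc:  # py4j wraps JVM-side failures in Python exceptions
        return False, str(exc)


# Hypothetical usage in a notebook where `spark` already exists:
# ok, err = probe_class(spark._jvm, "shadeio.poi.xssf.model.SharedStringsTable")
# print(ok, err)
```

Run on a freshly restarted cluster, the first probe may surface the underlying initialization error (often a missing or conflicting dependency) rather than the generic message.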

I am using Databricks runtime 8.3 with Apache Spark 3.1.1 and Scala 2.12. What I have tried so far (all with the same error):

  • Different versions of the crealytics library. I tried 14.0, 13.7 and 13.6, all of them for Scala 2.12.
  • The above code is in Python; I also tried it in Scala.
  • I copied the content of the file (just the cells with data) to a new file and saved it as xlsx and xls.
  • Used different sheet names. The file has just one sheet named 'Sheet1'.

This is the full stack trace. Any help is very much appreciated!
`---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
in
11 .load("/mnt/mountPoint/Budget.xls")
12
---> 13 df = spark.read
14 .format("com.crealytics.spark.excel")
15 .option("header", "true") \

/databricks/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
202 self.options(**options)
203 if isinstance(path, str):
--> 204 return self._df(self._jreader.load(path))
205 elif path is not None:
206 if type(path) != list:

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
1302
1303 answer = self.gateway_client.send_command(command)
-> 1304 return_value = get_return_value(
1305 answer, self.gateway_client, self.target_id, self.name)
1306

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
115 def deco(*a, **kw):
116 try:
--> 117 return f(*a, **kw)
118 except py4j.protocol.Py4JJavaError as e:
119 converted = convert_exception(e.java_exception)

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o714.load.
: java.lang.NoClassDefFoundError: Could not initialize class shadeio.poi.xssf.model.SharedStringsTable
at shadeio.poi.ooxml.POIXMLFactory.createDocumentPart(POIXMLFactory.java:61)
at shadeio.poi.ooxml.POIXMLDocumentPart.read(POIXMLDocumentPart.java:684)
at shadeio.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:180)
at shadeio.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:288)
at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:97)
at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:147)
at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:256)
at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:221)
at com.crealytics.spark.excel.DefaultWorkbookReader.$anonfun$openWorkbook$1(WorkbookReader.scala:49)
at scala.Option.fold(Option.scala:251)
at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:49)
at com.crealytics.spark.excel.WorkbookReader.withWorkbook(WorkbookReader.scala:14)
at com.crealytics.spark.excel.WorkbookReader.withWorkbook$(WorkbookReader.scala:13)
at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:45)
at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:31)
at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:31)
at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:102)
at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:101)
at com.crealytics.spark.excel.ExcelRelation.$anonfun$inferSchema$1(ExcelRelation.scala:163)
at scala.Option.getOrElse(Option.scala:189)
at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:162)
at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:35)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:35)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:13)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:390)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:432)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:399)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:399)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
at sun.reflect.GeneratedMethodAccessor274.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)`

udossa commented Nov 18, 2021

Hi guys, is there any update on this error? I have the same issue.

@quanghgx quanghgx added the cloud Usage of spark-excel on cloud storage & platform label Nov 18, 2021
@quanghgx
Collaborator

Hi @thijsnijhuis and @udossa

  • Could you please try again, changing the format from "com.crealytics.spark.excel" to "excel"?
    .format("excel")
  • Also, please take a look at the list of dependencies that spark-excel needs in order to work. This wiki might have some useful ideas.

Credit to #133 (Apache commons dependency issue) by @jakeatmsft and to the solution by @fwani.

ghost (Author) commented Dec 10, 2021

@quanghgx, thanks for your reply.
I have changed it, but now I simply get this error:
java.lang.ClassNotFoundException: Failed to find data source: excel. Please find packages at http://spark.apache.org/third-party-projects.html

I will need to take a look at the wiki link later on. Thanks!
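
For readers following along, the two data source names discussed above can be tried in sequence. This is a minimal sketch under stated assumptions, not an official spark-excel recipe: `read_excel` is a hypothetical helper, it assumes the "Failed to find data source" text appears in the Python exception message (as it does in the error quoted above), and it sets only the `header` option that the original code already uses.

```python
def read_excel(spark, path, formats=("excel", "com.crealytics.spark.excel")):
    """Try each data source name in turn and return the first successful load.

    Only "data source not found" failures trigger the fallback; any other
    error (for example the NoClassDefFoundError above) is re-raised as-is.
    """
    if not formats:
        raise ValueError("at least one format name is required")
    last_error = None
    for fmt in formats:
        try:
            return spark.read.format(fmt).option("header", "true").load(path)
        except Exception as exc:
            # Assumption: Spark's message for an unknown source contains this text.
            if "Failed to find data source" not in str(exc):
                raise
            last_error = exc
    raise last_error


# Hypothetical usage in a notebook where `spark` already exists:
# df = read_excel(spark, "/mnt/mountPoint/Budget.xlsx")
```

If every name fails with "Failed to find data source", the library is most likely not attached to the cluster at all, which points to a dependency problem rather than a problem with the file.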

fwani commented Dec 13, 2021

@thijsnijhuis
I think you should first add the Excel dependency, com.crealytics:spark-excel_2.12, with a specific version
(because the error is java.lang.ClassNotFoundException: Failed to find data source: excel).
https://github.com/crealytics/spark-excel#linking
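
As an illustration of the dependency suggestion above, the Maven coordinate can be assembled from the Scala version and the library version. This is a small sketch: `spark_excel_coordinate` is a hypothetical helper, and the version `0.14.0` is only a placeholder assumed from the "14.0" mentioned earlier in this thread; pick the release that matches your Spark and Scala versions.

```python
def spark_excel_coordinate(library_version, scala_version="2.12"):
    """Build the Maven coordinate for spark-excel as group:artifact:version."""
    if not library_version:
        raise ValueError("library_version must be a non-empty string")
    return f"com.crealytics:spark-excel_{scala_version}:{library_version}"


# Placeholder version, assumed for illustration only.
coordinate = spark_excel_coordinate("0.14.0")
print(coordinate)

# Outside Databricks, the coordinate can be passed to Spark when the session
# is created, via the standard `spark.jars.packages` setting:
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.config("spark.jars.packages", coordinate).getOrCreate()
```

On Databricks, the same coordinate would be entered when installing the library on the cluster from Maven, rather than set in code.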

@abhisrphoenix

Please try changing the library installation to Maven; that resolved my issue.
