Reading in pyspark #240

vbethams · 2020-04-29T10:45:44Z

How do I read Excel files in Python on Azure Databricks?

Your issue may already be reported!
Please search the issue tracker before creating one.
Moreover, please read the CHANGELOG.md file for any changes you might have missed.

Expected Behavior

If you're describing a bug, tell us what should happen
If you're suggesting a change/improvement, tell us how it should work

Current Behavior

If describing a bug, tell us what happens instead of the expected behavior
If suggesting a change/improvement, explain the difference from current behavior.
If you have a stack trace or any helpful information from the console, paste it in its entirety.
If the problem happens with a certain file, upload it somewhere and paste a link.

Possible Solution

Not obligatory, but suggest a fix/reason for the bug,
or ideas how to implement the addition or change

Steps to Reproduce (for bugs)

Provide a link to a live example, or an unambiguous set of steps to
reproduce this bug. Include code to reproduce, if relevant.
Example:

Download the example file uploaded here
Start Spark from command line as spark-shell --packages com.crealytics:spark-excel_2.12:x.y.z --foo=bar

Read the downloaded example file

val df = spark.read
.format("com.crealytics.spark.excel")
.option("dataAddress", "'My Sheet'!B3:C35")
.load("example_file_exhibiting_bug.xlsx")

Context

How has this issue affected you? What are you trying to accomplish?
Providing context helps us come up with a solution that is most useful in the real world

Your Environment

Include as many relevant details about the environment you experienced the bug in

Spark version and language (Scala, Java, Python, R, ...):
Spark-Excel version:
Operating System and version, cluster environment, ...:


nightscape · 2020-04-29T21:45:47Z

Hi @vbethams, the issue template is meant to be filled out 😉
What did you try? What did not work?

vbethams · 2020-04-30T05:22:31Z

Hi @nightscape,
I am using Azure Databricks with a cluster configured for Spark 2.4.5 and Scala 2.11.
I installed the spark-excel JAR, version 2.11:0.11.1.
When I tried to read the Excel file in a PySpark script, it threw a ClassNotFoundException.
I then removed that version and installed 2.11:0.13.1.
This time I get the error "java.lang.IllegalArgumentException: Parameter "header" is missing in options."

Below is my code:

(spark.read.format("com.crealytics.spark.excel")
    .option("useHeader", "true")
    .option("treatEmptyValuesAsNulls", "false")
    .option("inferSchema", "false")
    .option("addColorColumns", "false")
    .load(path))

nightscape · 2020-04-30T09:30:23Z

Yes, that's why the issue template recommends reading the CHANGELOG 😉
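
For readers landing here with the same error, a minimal sketch of the change follows. It assumes, based on the error message above, that spark-excel 0.13.x expects an option named "header" where the code above passes "useHeader". The names `RENAMED_OPTIONS`, `migrate_options`, and `read_excel` are hypothetical helpers for illustration and are not part of spark-excel; the CHANGELOG remains the authoritative list of renamed options.

```python
# Assumption: "useHeader" was replaced by "header" in spark-excel 0.13.x,
# as the IllegalArgumentException in this thread suggests.
# RENAMED_OPTIONS is a hypothetical lookup table, not part of spark-excel.
RENAMED_OPTIONS = {"useHeader": "header"}


def migrate_options(options):
    """Return a copy of `options` with old option names replaced by new ones."""
    return {RENAMED_OPTIONS.get(key, key): value for key, value in options.items()}


def read_excel(spark, path, options):
    """Read an Excel file via spark-excel; `spark` is an existing SparkSession."""
    reader = spark.read.format("com.crealytics.spark.excel")
    for key, value in migrate_options(options).items():
        reader = reader.option(key, value)
    return reader.load(path)


# The options from the snippet posted above.
old_options = {
    "useHeader": "true",
    "treatEmptyValuesAsNulls": "false",
    "inferSchema": "false",
    "addColorColumns": "false",
}
new_options = migrate_options(old_options)
```

On a cluster, `read_excel(spark, path, old_options)` would then issue the read with "header" in place of "useHeader"; the other three options pass through unchanged.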

vbethams · 2020-04-30T10:31:50Z

I have changed it to "Header"; now I am getting the following error:

"java.io.IOException: org/apache/commons/collections4/IteratorUtils"

nightscape · 2020-04-30T13:37:34Z

Please post the full stack trace. This line by itself doesn't help much.

vbethams · 2020-04-30T14:05:24Z

Posting the entire stack trace:
Py4JJavaError: An error occurred while calling o392.load.
: java.io.IOException: org/apache/commons/collections4/IteratorUtils
at shadeio.poi.ss.usermodel.WorkbookFactory.createWorkbook(WorkbookFactory.java:351)
at shadeio.poi.ss.usermodel.WorkbookFactory.createXSSFWorkbook(WorkbookFactory.java:314)
at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:232)
at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:198)
at com.crealytics.spark.excel.DefaultWorkbookReader$$anonfun$openWorkbook$1.apply(WorkbookReader.scala:50)
at com.crealytics.spark.excel.DefaultWorkbookReader$$anonfun$openWorkbook$1.apply(WorkbookReader.scala:50)
at scala.Option.fold(Option.scala:158)
at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:50)
at com.crealytics.spark.excel.WorkbookReader$class.withWorkbook(WorkbookReader.scala:14)
at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:46)
at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:30)
at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:30)
at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:104)
at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:103)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:168)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:167)
at scala.Option.getOrElse(Option.scala:121)
at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:167)
at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:34)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:40)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:18)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:12)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:347)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:311)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:297)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:214)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoClassDefFoundError: org/apache/commons/collections4/IteratorUtils
at shadeio.poi.openxml4j.util.ZipInputStreamZipEntrySource.getEntries(ZipInputStreamZipEntrySource.java:61)
at shadeio.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:288)
at shadeio.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:726)
at shadeio.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:304)
at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:134)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at shadeio.poi.ss.usermodel.WorkbookFactory.createWorkbook(WorkbookFactory.java:339)
... 36 more
Caused by: java.lang.ClassNotFoundException: org.apache.commons.collections4.IteratorUtils
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader.loadClass(ClassLoaders.scala:151)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 46 more
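
When reading a long JVM trace like this one, the last "Caused by:" entry is usually the most informative. A small sketch of pulling it out programmatically is shown below; `root_cause` and `CAUSE_PATTERN` are hypothetical names, and the sample is a shortened excerpt of the trace above. Here the root cause is a class from the `org.apache.commons.collections4` package (Apache Commons Collections 4) that the class loader could not find.

```python
import re

# Matches lines such as "Caused by: java.lang.Foo: some message".
CAUSE_PATTERN = re.compile(r"^Caused by: ([\w.$]+)(?:: (.*))?$")


def root_cause(trace):
    """Return (exception_class, message) from the last 'Caused by:' line, or None."""
    cause = None
    for line in trace.splitlines():
        match = CAUSE_PATTERN.match(line.strip())
        if match:
            # Later matches overwrite earlier ones, so the deepest cause wins.
            cause = (match.group(1), match.group(2) or "")
    return cause


# Shortened excerpt of the stack trace posted in this thread.
sample = """\
: java.io.IOException: org/apache/commons/collections4/IteratorUtils
at shadeio.poi.ss.usermodel.WorkbookFactory.createWorkbook(WorkbookFactory.java:351)
Caused by: java.lang.NoClassDefFoundError: org/apache/commons/collections4/IteratorUtils
at shadeio.poi.openxml4j.util.ZipInputStreamZipEntrySource.getEntries(ZipInputStreamZipEntrySource.java:61)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.collections4.IteratorUtils
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
"""

exception_class, missing = root_cause(sample)
```

The helper returns `None` when a trace has no "Caused by:" line, so callers should check for that before unpacking.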

nightscape · 2020-04-30T15:48:32Z

This seems to be the same problem as described here:
#133
It would be great if someone could come up with a PR to fix this 😃

nightscape closed this as completed Apr 30, 2020
