Reading in pyspark #240

Closed
vbethams opened this issue Apr 29, 2020 · 7 comments

Comments

@vbethams

How do I read Excel files in Python (PySpark) on Azure Databricks?

Your issue may already be reported!
Please search the issue tracker before creating one.
Moreover, please read the CHANGELOG.md file for any changes you might have missed.

Expected Behavior

If you're describing a bug, tell us what should happen
If you're suggesting a change/improvement, tell us how it should work

Current Behavior

If describing a bug, tell us what happens instead of the expected behavior
If suggesting a change/improvement, explain the difference from current behavior.
If you have a stack trace or any helpful information from the console, paste it in its entirety.
If the problem happens with a certain file, upload it somewhere and paste a link.

Possible Solution

Not obligatory, but suggest a fix/reason for the bug,
or ideas how to implement the addition or change

Steps to Reproduce (for bugs)

Provide a link to a live example, or an unambiguous set of steps to
reproduce this bug. Include code to reproduce, if relevant.
Example:

  1. Download the example file uploaded here
  2. Start Spark from command line as spark-shell --packages com.crealytics:spark-excel_2.12:x.y.z --foo=bar
  3. Read the downloaded example file
    val df = spark.read
    .format("com.crealytics.spark.excel")
    .option("dataAddress", "'My Sheet'!B3:C35")
    .load("example_file_exhibiting_bug.xlsx")
    

Context

How has this issue affected you? What are you trying to accomplish?
Providing context helps us come up with a solution that is most useful in the real world

Your Environment

Include as many relevant details about the environment you experienced the bug in

  • Spark version and language (Scala, Java, Python, R, ...):
  • Spark-Excel version:
  • Operating System and version, cluster environment, ...:
@nightscape
Owner

Hi @vbethams, the issue template is meant to be filled out 😉
What did you try? What did not work?

@vbethams
Author

Hi @nightscape,
I am using Azure Databricks with a cluster running Spark 2.4.5 and Scala 2.11.
I installed the spark-excel jar, version 2.11:0.11.1,
but when I tried to read the Excel file in a PySpark script, it threw a ClassNotFoundException.
I then removed that version and installed 2.11:0.13.1.
This time I am getting this error: "java.lang.IllegalArgumentException: Parameter "header" is missing in options."

Below is my code:

    spark.read.format("com.crealytics.spark.excel") \
        .option("useHeader", "true") \
        .option("treatEmptyValuesAsNulls", "false") \
        .option("inferSchema", "false") \
        .option("addColorColumns", "false") \
        .load(path)

@nightscape
Owner

Yes, that's why the issue template recommends reading the CHANGELOG 😉
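
For readers following along: the error message quoted above says that the `header` parameter is missing, which suggests that `useHeader` from the earlier snippet is no longer recognized in 0.13.1. The following is a minimal sketch of the same read with `header` in its place. The helper names `excel_options` and `read_excel` are hypothetical, and it is an assumption that the other option names from the original snippet are still accepted unchanged.

```python
def excel_options(header=True, infer_schema=False,
                  treat_empty_values_as_nulls=False, add_color_columns=False):
    """Build the option map for the spark-excel reader (hypothetical helper).

    Uses "header" rather than "useHeader", following the error message
    reported in this thread. The remaining keys are copied from the
    original snippet and are assumed to be unchanged.
    """
    def flag(value):
        # Spark reader options are passed as strings.
        return "true" if value else "false"

    return {
        "header": flag(header),
        "treatEmptyValuesAsNulls": flag(treat_empty_values_as_nulls),
        "inferSchema": flag(infer_schema),
        "addColorColumns": flag(add_color_columns),
    }


def read_excel(spark, path, **kwargs):
    """Read an Excel file through the given SparkSession (hypothetical helper)."""
    reader = spark.read.format("com.crealytics.spark.excel")
    for key, value in excel_options(**kwargs).items():
        reader = reader.option(key, value)
    return reader.load(path)
```

In a Databricks notebook this would be called as `df = read_excel(spark, path)`, where `spark` is the session the notebook provides and `path` is the same variable as in the original snippet.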

@vbethams
Author

I have changed it to "Header"; now I am getting the following error:

"java.io.IOException: org/apache/commons/collections4/IteratorUtils"

@nightscape
Owner

Please post the full stack trace. This line by itself doesn't help much.

@vbethams
Author

posting the entire stacktrace:
Py4JJavaError: An error occurred while calling o392.load.
: java.io.IOException: org/apache/commons/collections4/IteratorUtils
at shadeio.poi.ss.usermodel.WorkbookFactory.createWorkbook(WorkbookFactory.java:351)
at shadeio.poi.ss.usermodel.WorkbookFactory.createXSSFWorkbook(WorkbookFactory.java:314)
at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:232)
at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:198)
at com.crealytics.spark.excel.DefaultWorkbookReader$$anonfun$openWorkbook$1.apply(WorkbookReader.scala:50)
at com.crealytics.spark.excel.DefaultWorkbookReader$$anonfun$openWorkbook$1.apply(WorkbookReader.scala:50)
at scala.Option.fold(Option.scala:158)
at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:50)
at com.crealytics.spark.excel.WorkbookReader$class.withWorkbook(WorkbookReader.scala:14)
at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:46)
at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:30)
at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:30)
at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:104)
at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:103)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:168)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:167)
at scala.Option.getOrElse(Option.scala:121)
at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:167)
at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:34)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:40)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:18)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:12)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:347)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:311)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:297)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:214)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoClassDefFoundError: org/apache/commons/collections4/IteratorUtils
at shadeio.poi.openxml4j.util.ZipInputStreamZipEntrySource.getEntries(ZipInputStreamZipEntrySource.java:61)
at shadeio.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:288)
at shadeio.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:726)
at shadeio.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:304)
at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:134)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at shadeio.poi.ss.usermodel.WorkbookFactory.createWorkbook(WorkbookFactory.java:339)
... 36 more
Caused by: java.lang.ClassNotFoundException: org.apache.commons.collections4.IteratorUtils
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader.loadClass(ClassLoaders.scala:151)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 46 more

@nightscape
Owner

This seems to be the same problem as described here:
#133
It would be great if someone could come up with a PR to fix this 😃
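
As the request for the full stack trace shows, the useful information sits in the last "Caused by:" entry rather than in the top-level `IOException`. The sketch below pulls that root cause out of a pasted trace and, if it is a class-loading failure, reports which class is missing. The function names `root_cause` and `missing_class` are hypothetical, and the regular expression only assumes the standard "Caused by: <exception>: <message>" line format seen in the trace above.

```python
import re

# Matches lines such as:
#   Caused by: java.lang.ClassNotFoundException: org.apache.commons.collections4.IteratorUtils
CAUSE_RE = re.compile(
    r"^Caused by: (?P<exc>[\w.$]+)(?:: (?P<msg>.*))?$", re.MULTILINE
)

CLASS_LOADING_ERRORS = {
    "java.lang.ClassNotFoundException",
    "java.lang.NoClassDefFoundError",
}


def root_cause(trace):
    """Return (exception_class, message) for the last 'Caused by:' line, or None."""
    matches = list(CAUSE_RE.finditer(trace))
    if not matches:
        return None
    last = matches[-1]
    return last.group("exc"), (last.group("msg") or "").strip()


def missing_class(trace):
    """Return the dotted name of the class that failed to load, or None."""
    cause = root_cause(trace)
    if cause is None:
        return None
    exc, msg = cause
    if exc in CLASS_LOADING_ERRORS and msg:
        # NoClassDefFoundError reports slash-separated names; normalize to dots.
        return msg.replace("/", ".")
    return None
```

Run against the trace posted in this thread, `missing_class` would report the class named in the final `ClassNotFoundException`, which tells you which dependency the cluster's class loader could not find.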
