
How to use spark-excel_2.11-0.13.5.jar to set up an Azure Synapse Spark Pool batch job? #282

Closed
yang-jiayi opened this issue Aug 16, 2020 · 10 comments
Labels
cloud Usage of spark-excel on cloud storage & platform

Comments

@yang-jiayi

yang-jiayi commented Aug 16, 2020

Hi.

Azure Synapse Spark Pool does support importing third party packages.
Reference URL:
https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-job-definitions#create-an-apache-spark-job-definition-for-apache-sparkscala

This article (https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-azure-portal-add-libraries)
states that ".jar based packages can be added at the Spark job definition level,"
so I believe the existing Spark Pool can be extended this way.

I downloaded spark-excel_2.11-0.13.5.jar from this URL (https://mvnrepository.com/artifact/com.crealytics/spark-excel_2.11/0.13.5).
In Azure Synapse Studio, I created a Spark job against the Spark Pool.
For the Main definition file value, I entered the ADLS Gen2 address (abfss://rawdata@xyz.dfs.core.windows.net/SparkExcelLibrary/spark-excel_2.11-0.13.5.jar).
For the Main class name value, I entered com.crealytics.spark.excel.

Unfortunately, I got the following error:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.6.99.201-15911041/spark2/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.6.99.201-15911041/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/08/16 16:30:37 INFO SignalUtils: Registered signal handler for TERM
20/08/16 16:30:37 INFO SignalUtils: Registered signal handler for HUP
20/08/16 16:30:37 INFO SignalUtils: Registered signal handler for INT
20/08/16 16:30:37 INFO SecurityManager: Changing view acls to: trusted-service-user
20/08/16 16:30:37 INFO SecurityManager: Changing modify acls to: trusted-service-user
20/08/16 16:30:37 INFO SecurityManager: Changing view acls groups to:
20/08/16 16:30:37 INFO SecurityManager: Changing modify acls groups to:
20/08/16 16:30:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(trusted-service-user); groups with view permissions: Set(); users with modify permissions: Set(trusted-service-user); groups with modify permissions: Set()
20/08/16 16:30:37 INFO ApplicationMaster: Preparing Local resources
20/08/16 16:30:38 INFO MetricsConfig: loaded properties from hadoop-metrics2.properties
20/08/16 16:30:38 INFO MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
20/08/16 16:30:38 INFO MetricsSystemImpl: azure-file-system metrics system started
20/08/16 16:30:38 INFO ApplicationMaster: ApplicationAttemptId: appattempt_1597593360079_0004_000001
20/08/16 16:30:39 INFO ApplicationMaster: Starting the user application in a separate Thread
20/08/16 16:30:39 ERROR ApplicationMaster: Uncaught exception:
java.lang.ClassNotFoundException: com.crealytics.spark.excel
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at org.apache.spark.deploy.yarn.ApplicationMaster.startUserApplication(ApplicationMaster.scala:674)
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:461)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
20/08/16 16:30:39 INFO ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: Uncaught exception: java.lang.ClassNotFoundException: com.crealytics.spark.excel
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at org.apache.spark.deploy.yarn.ApplicationMaster.startUserApplication(ApplicationMaster.scala:674)
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:461)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
)
20/08/16 16:30:39 INFO ShutdownHookManager: Shutdown hook called
20/08/16 16:30:39 INFO MetricsSystemImpl: Stopping azure-file-system metrics system...
20/08/16 16:30:39 INFO MetricsSystemImpl: azure-file-system metrics system stopped.
20/08/16 16:30:39 INFO MetricsSystemImpl: azure-file-system metrics system shutdown complete.

End of LogType:stderr


Question:
How can I correctly add spark-excel_2.11-0.13.5.jar to the Azure Synapse Spark Pool?

I'm looking forward to your advice and reply.
Thanks.

Best Regards,
Yang

@nightscape
Owner

Hi Yang, I haven't worked with Azure Synapse, so I can't provide great insights...
What seems a little suspicious is the following line:

java.lang.ClassNotFoundException: com.crealytics.spark.excel

because com.crealytics.spark.excel is a package, not a class.
Might be unrelated though...
Can you try with any other Spark plugin and check if that makes a difference?
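
For reference, here is a minimal sketch of how spark-excel is normally invoked from Spark code (e.g. pasted into spark-shell); the path and option values are placeholders, not taken from this issue:

import org.apache.spark.sql.SparkSession

// spark-excel is a data source: you call it through the DataFrame
// reader API from your own code; it is never launched as a main class.
val spark = SparkSession.builder().appName("example").getOrCreate()
val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")       // treat the first row as column names
  .load("/path/to/workbook.xlsx") // placeholder path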

@yang-jiayi
Author

yang-jiayi commented Aug 17, 2020

@nightscape
Hi Martin (@nightscape).

Thank you for your reply.
I tried the WordCount.jar sample from Microsoft Docs.
The Main class was "WordCount", as written in Microsoft Docs.
The docs describe the Main class name as "the fully qualified identifier or the main class that is in the main definition file."

I want to know the Main class of "spark-excel_2.11-0.13.5.jar". How should I check it?

@yang-jiayi
Author

yang-jiayi commented Aug 17, 2020

@nightscape
Hi Martin (@nightscape).

java -jar "C:\Users\Administrator\Downloads\spark-excel_2.11-0.13.5.jar"
no main manifest attribute

At "C:\Users\Administrator\Downloads\spark-excel_2.11-0.13.5\META-INF\MANIFEST.MF" has not Main-Class:

Manifest-Version: 1.0
Implementation-Title: spark-excel
Implementation-Version: 0.13.5
Spark-HasRPackage: false
Specification-Vendor: com.crealytics
Specification-Title: spark-excel
Implementation-Vendor-Id: com.crealytics
Specification-Version: 0.13.5
Implementation-URL: https://github.com/crealytics/spark-excel
Implementation-Vendor: com.crealytics

@nightscape
Owner

Hi @yang-jiayi, spark-excel does not have a Main-Class because it is not a standalone program, but a Spark plugin.
You would need to create a JAR with a main class yourself. This JAR would contain your code that does whatever business logic you want to achieve with Spark.
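
A minimal sketch of such a driver class might look like this (the package, object name, and argument handling are illustrative assumptions, not an official example):

package com.example

import org.apache.spark.sql.SparkSession

// Hypothetical driver: com.example.ExcelJob is what you would enter as
// the Main class name in the Spark job definition.
object ExcelJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ExcelJob").getOrCreate()

    // Your business logic goes here; spark-excel is used as a dependency.
    val df = spark.read
      .format("com.crealytics.spark.excel")
      .option("header", "true") // first row holds column names
      .load(args(0))            // workbook path passed as the first argument

    df.show()
    spark.stop()
  }
}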

@yang-jiayi
Author

@nightscape
Thank you for your reply.
Can you provide instructions for rebuilding spark-excel as a standalone JAR?
I understand now that I need to build it myself. Since the current MS Docs for the Azure Synapse Spark Pool don't have much information on Spark plugins, I will consider implementing spark-excel with Azure Databricks instead.
Thank you so much. Please close the issue.

@nightscape
Owner

nightscape commented Aug 17, 2020

Hi @yang-jiayi, you shouldn't have to rebuild spark-excel as a standalone JAR with a main class.
What you have to do is package the Spark code you write as a JAR that either depends on or bundles spark-excel in a so-called "Fat JAR".
For the "depends on" option you would have to publish both the JAR and a pom.xml which specifies the dependencies (e.g. spark-excel).
For the Fat JAR option you could e.g. use sbt-assembly.
Here's a tutorial that shows how to do this: https://queirozf.com/entries/creating-scala-fat-jars-for-spark-on-sbt-with-sbt-assembly-plugin#spark2
You would then spark-submit the JAR to your cluster.
If you're running locally you actually have to write a main class that instantiates Spark and then executes your code.
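
For the sbt-assembly route, a minimal build.sbt might look like this (the project name and versions are assumptions; match the Scala and Spark versions of your pool):

// build.sbt -- sketch for bundling spark-excel into a Fat JAR.
// Requires addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")
// in project/plugins.sbt.
name := "excel-job"
version := "0.1.0"
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  // Spark is provided by the cluster, so keep it out of the assembly.
  "org.apache.spark" %% "spark-sql"   % "2.4.5" % "provided",
  // spark-excel is not on the cluster, so it must be bundled.
  "com.crealytics"   %% "spark-excel" % "0.13.5"
)

Running sbt assembly should then produce something like target/scala-2.11/excel-job-assembly-0.1.0.jar, which is the JAR you would spark-submit or point the Synapse job definition at, with your own driver class (not a spark-excel class) as the Main class name. Depending on the dependency tree, you may also need an assemblyMergeStrategy for conflicting META-INF files.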

@yang-jiayi
Author

yang-jiayi commented Aug 19, 2020

Hi @nightscape
Thank you for your reply.
I don't have much experience with Java, so I looked at the URL you gave me, but I didn't understand how to create a Fat JAR or a pom.xml.
I would like to know the detailed steps for building a Fat JAR. Please provide details if you can.
Once I've created the JAR file, I want to run a Spark job against the Azure Synapse Spark Pool, extend Spark, and validate spark-excel.
By the way, I am currently validating spark-excel with Azure Databricks, and with Databricks I was able to install it directly as a library.
Thanks a lot!

@nightscape
Owner

Hi @yang-jiayi, you'd have to search for instructions specific to your programming language and build tool.
For Java you could e.g. search for "maven fat jar" and dig through some tutorials.
After reading those, my hints above should make more sense 😉

@yang-jiayi
Author

Hi @nightscape
OK, I will try it.
Thanks.

@quanghgx added the cloud (Usage of spark-excel on cloud storage & platform) label on Oct 3, 2021
@lewisdba

lewisdba commented Feb 3, 2023

@yang-jiayi
Thanks for sharing this.
Are there full instructions to follow for uploading the spark-excel*.jar file to Spark in Azure Synapse?
Thanks
