
How to use spark-excel_2.11-0.13.5.jar to set up an Azure Synapse Spark Pool batch job? #282

Closed
yang-jiayi opened this issue Aug 16, 2020 · 10 comments
Labels
cloud Usage of spark-excel on cloud storage & platform

Comments

@yang-jiayi

yang-jiayi commented Aug 16, 2020

Hi.

Azure Synapse Spark Pool does support importing third party packages.
Reference URL:
https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-job-definitions#create-an-apache-spark-job-definition-for-apache-sparkscala

This article (https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-azure-portal-add-libraries)
states that ".jar based packages can be added at the Spark job definition level,"
so I believe the existing Spark Pool can be extended this way.

I downloaded spark-excel_2.11-0.13.5.jar from this URL (https://mvnrepository.com/artifact/com.crealytics/spark-excel_2.11/0.13.5).
In Azure Synapse Studio, I created a Spark job against the Spark Pool.
For the Main definition file value, I entered the ADLS Gen2 address (abfss://rawdata@xyz.dfs.core.windows.net/SparkExcelLibrary/spark-excel_2.11-0.13.5.jar).
For the Main class name value, I entered com.crealytics.spark.excel.

Unfortunately, I got the following error:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.6.99.201-15911041/spark2/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.6.99.201-15911041/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/08/16 16:30:37 INFO SignalUtils: Registered signal handler for TERM
20/08/16 16:30:37 INFO SignalUtils: Registered signal handler for HUP
20/08/16 16:30:37 INFO SignalUtils: Registered signal handler for INT
20/08/16 16:30:37 INFO SecurityManager: Changing view acls to: trusted-service-user
20/08/16 16:30:37 INFO SecurityManager: Changing modify acls to: trusted-service-user
20/08/16 16:30:37 INFO SecurityManager: Changing view acls groups to:
20/08/16 16:30:37 INFO SecurityManager: Changing modify acls groups to:
20/08/16 16:30:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(trusted-service-user); groups with view permissions: Set(); users with modify permissions: Set(trusted-service-user); groups with modify permissions: Set()
20/08/16 16:30:37 INFO ApplicationMaster: Preparing Local resources
20/08/16 16:30:38 INFO MetricsConfig: loaded properties from hadoop-metrics2.properties
20/08/16 16:30:38 INFO MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
20/08/16 16:30:38 INFO MetricsSystemImpl: azure-file-system metrics system started
20/08/16 16:30:38 INFO ApplicationMaster: ApplicationAttemptId: appattempt_1597593360079_0004_000001
20/08/16 16:30:39 INFO ApplicationMaster: Starting the user application in a separate Thread
20/08/16 16:30:39 ERROR ApplicationMaster: Uncaught exception:
java.lang.ClassNotFoundException: com.crealytics.spark.excel
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at org.apache.spark.deploy.yarn.ApplicationMaster.startUserApplication(ApplicationMaster.scala:674)
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:461)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
20/08/16 16:30:39 INFO ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: Uncaught exception: java.lang.ClassNotFoundException: com.crealytics.spark.excel
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at org.apache.spark.deploy.yarn.ApplicationMaster.startUserApplication(ApplicationMaster.scala:674)
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:461)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
)
20/08/16 16:30:39 INFO ShutdownHookManager: Shutdown hook called
20/08/16 16:30:39 INFO MetricsSystemImpl: Stopping azure-file-system metrics system...
20/08/16 16:30:39 INFO MetricsSystemImpl: azure-file-system metrics system stopped.
20/08/16 16:30:39 INFO MetricsSystemImpl: azure-file-system metrics system shutdown complete.

End of LogType:stderr


Question:
How can I correctly add spark-excel_2.11-0.13.5.jar to the Azure Synapse Spark Pool?

I'm looking forward to your advice and reply.
Thanks.

Best Regards,
Yang

@nightscape
Owner

Hi Yang, I haven't worked with Azure Synapse, so I can't provide great insights...
What seems a little suspicious is the following line:

java.lang.ClassNotFoundException: com.crealytics.spark.excel

because com.crealytics.spark.excel is a package, not a class.
Might be unrelated though...
Can you try with any other Spark plugin and check if that makes a difference?
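
For reference, here is a minimal sketch of how spark-excel is normally invoked from Spark code (e.g. pasted into spark-shell); the path and option values are placeholders, not taken from this issue:

import org.apache.spark.sql.SparkSession

// spark-excel is a data source: you call it through the DataFrame
// reader API from your own code; it is never launched as a main class.
val spark = SparkSession.builder().appName("example").getOrCreate()
val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")       // treat the first row as column names
  .load("/path/to/workbook.xlsx") // placeholder path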

@yang-jiayi
Author

yang-jiayi commented Aug 17, 2020

@nightscape
Hi Martin (@nightscape).

Thank you for your reply.
I tried the WordCount.jar sample from Microsoft Docs.
The Main class was "WordCount", as written in Microsoft Docs.
The docs describe the Main class name as "the fully qualified identifier or the main class that is in the main definition file."

I want to know the Main class of "spark-excel_2.11-0.13.5.jar". How should I check it?

@yang-jiayi
Author

yang-jiayi commented Aug 17, 2020

@nightscape
Hi Martin (@nightscape).

java -jar "C:\Users\Administrator\Downloads\spark-excel_2.11-0.13.5.jar"
no main manifest attribute

At "C:\Users\Administrator\Downloads\spark-excel_2.11-0.13.5\META-INF\MANIFEST.MF" has not Main-Class:

Manifest-Version: 1.0
Implementation-Title: spark-excel
Implementation-Version: 0.13.5
Spark-HasRPackage: false
Specification-Vendor: com.crealytics
Specification-Title: spark-excel
Implementation-Vendor-Id: com.crealytics
Specification-Version: 0.13.5
Implementation-URL: https://github.com/crealytics/spark-excel
Implementation-Vendor: com.crealytics

@nightscape
Owner

Hi @yang-jiayi, spark-excel does not have a Main-Class because it is not a standalone program, but a Spark plugin.
You would need to create a JAR with a main class yourself. This JAR would contain your code that does whatever business logic you want to achieve with Spark.
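
A minimal sketch of such a driver class might look like this (the package, object name, and argument handling are illustrative assumptions, not an official example):

package com.example

import org.apache.spark.sql.SparkSession

// Hypothetical driver: com.example.ExcelJob is what you would enter as
// the Main class name in the Spark job definition.
object ExcelJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ExcelJob").getOrCreate()

    // Your business logic goes here; spark-excel is used as a dependency.
    val df = spark.read
      .format("com.crealytics.spark.excel")
      .option("header", "true") // first row holds column names
      .load(args(0))            // workbook path passed as the first argument

    df.show()
    spark.stop()
  }
}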

@yang-jiayi
Author

@nightscape
Thank you for your reply.
Can you provide instructions for rebuilding spark-excel as a standalone JAR?
I understand now that I need to build it myself. Since the current MS Docs for the Azure Synapse Spark Pool don't have much information on Spark plugins, I will consider implementing spark-excel with Azure Databricks instead.
Thank you so much. Please close the issue.

@nightscape
Owner

nightscape commented Aug 17, 2020

Hi @yang-jiayi, you shouldn't have to rebuild spark-excel as a standalone JAR with a main class.
What you have to do is package the Spark code you write as a JAR that either depends on or bundles spark-excel in a so-called "Fat JAR".
For the "depends on" option you would have to publish both the JAR and a pom.xml which specifies the dependencies (e.g. spark-excel).
For the Fat JAR option you could e.g. use sbt-assembly.
Here's a tutorial that shows how to do this: https://queirozf.com/entries/creating-scala-fat-jars-for-spark-on-sbt-with-sbt-assembly-plugin#spark2
You would then spark-submit the JAR to your cluster.
If you're running locally you actually have to write a main class that instantiates Spark and then executes your code.
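
For the sbt-assembly route, a minimal build.sbt might look like this (the project name and versions are assumptions; match the Scala and Spark versions of your pool):

// build.sbt -- sketch for bundling spark-excel into a Fat JAR.
// Requires addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")
// in project/plugins.sbt.
name := "excel-job"
version := "0.1.0"
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  // Spark is provided by the cluster, so keep it out of the assembly.
  "org.apache.spark" %% "spark-sql"   % "2.4.5" % "provided",
  // spark-excel is not on the cluster, so it must be bundled.
  "com.crealytics"   %% "spark-excel" % "0.13.5"
)

Running sbt assembly should then produce something like target/scala-2.11/excel-job-assembly-0.1.0.jar, which is the JAR you would spark-submit or point the Synapse job definition at, with your own driver class (not a spark-excel class) as the Main class name. Depending on the dependency tree, you may also need an assemblyMergeStrategy for conflicting META-INF files.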

@yang-jiayi
Author

yang-jiayi commented Aug 19, 2020

Hi @nightscape
Thank you for your reply.
I don't have much experience with Java, so I looked at the URL you gave me, but I didn't understand how to create a Fat JAR or a pom.xml.
I would like to know the detailed steps for building a Fat JAR. Please provide details if you can.
Once I've created the JAR file, I want to run a Spark job against the Azure Synapse Spark Pool, extend Spark, and validate spark-excel.
By the way, I am currently validating spark-excel with Azure Databricks, and with Databricks I was able to install it directly as a library.
Thanks a lot!

@nightscape
Owner

Hi @yang-jiayi, you'd have to search for instructions specific to your programming language and build tool.
For Java you could e.g. search for "maven fat jar" and dig through some tutorials.
After reading those, my hints above should make more sense 😉

@yang-jiayi
Author

Hi @nightscape
OK, I will try it.
Thanks.

@quanghgx added the cloud (Usage of spark-excel on cloud storage & platform) label on Oct 3, 2021
@lewisdba

lewisdba commented Feb 3, 2023

@yang-jiayi
Thanks for sharing this.
Are there full instructions to follow for uploading the spark-excel*.jar file to Spark in Azure Synapse?
Thanks
