This example includes:
- Two Spark/Scala ML example docker containers and sample data
- Can be run locally (instructions below)
- Can be run on AWS EKS (Kubernetes) & S3 (no EMR/HDFS needed)
See AWS-Setup-Guide-Spark-EKS.md for the setup steps for Spark on AWS EKS. NOTE: we used the 'kops' tool for this example, as it was required by EKS at the time we wrote this example.
Derived from these sources:
- SGD Linear Regression Example with Apache Spark by Walker Rowe, published May 23, 2017.
  - That post's linear regression example has been modified to run as an app rather than in the interactive shell. Update: the newer example is the Linear Regression example from the Spark documentation, extended with model serialization/deserialization and a split between training and test data (see the sketch after this list).
- Further reference: Predicting Breast Cancer Using Apache Spark Machine Learning Logistic Regression by Carol McDonald, published October 17, 2016.
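For orientation, here is a minimal sketch of that updated flow: a train/test split, a fitted linear regression model that is serialized and then deserialized before scoring. This is illustrative, not the project's actual source; the object name, data path, and save path are assumptions.

```scala
import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}
import org.apache.spark.sql.SparkSession

// Illustrative sketch of the updated example; names and paths are assumptions.
object LinearRegressionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("LinearRegressionSketch")
      .getOrCreate()

    // Load LIBSVM-formatted sample data (path is an assumption)
    val data = spark.read.format("libsvm").load("sample_linear_regression_data.txt")

    // Split between training and test data
    val Array(training, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

    val model = new LinearRegression().setMaxIter(10).fit(training)

    // Serialize the fitted model, then deserialize it for scoring
    model.write.overwrite().save("/tmp/lr-model")
    val restored = LinearRegressionModel.load("/tmp/lr-model")

    // Prints a label/features/prediction table like the sample run below
    restored.transform(test).show()

    spark.stop()
  }
}
```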
This is an sbt project. Assuming we have a working Scala and sbt installation, execute `sbt run` from the project root. In addition to the log4j output from Spark, you should also see a few lines of output from our example:
```
18/04/17 15:15:56 INFO TaskSetManager: Finished task 0.0 in stage 14.0 (TID 17) in 21 ms on localhost (executor driver) (1/1)
18/04/17 15:15:56 INFO TaskSchedulerImpl: Removed TaskSet 14.0, whose tasks have all completed, from pool
18/04/17 15:15:56 INFO DAGScheduler: ResultStage 14 (show at Main.scala:55) finished in 0.022 s
18/04/17 15:15:56 INFO DAGScheduler: Job 12 finished: show at Main.scala:55, took 0.026455 s
18/04/17 15:15:56 INFO CodeGenerator: Code generated in 8.628902 ms
+-------------------+--------------------+-------------------+
|              label|            features|         prediction|
+-------------------+--------------------+-------------------+
|-28.571478869743427|(10,[0,1,2,3,4,5,...|-1.5332357772511678|
|-26.736207182601724|(10,[0,1,2,3,4,5,...|-3.1990639907463776|
|-22.949825936196074|(10,[0,1,2,3,4,5,...|  2.068559275392233|
|-20.212077258958672|(10,[0,1,2,3,4,5,...| 0.5963989456221626|
|-17.026492264209548|(10,[0,1,2,3,4,5,...|-0.7387387189956682|
|-15.348871155379253|(10,[0,1,2,3,4,5,...|  -1.98575929759793|
|-13.039928064104615|(10,[0,1,2,3,4,5,...| 0.5942050121612523|
| -12.92222310337042|(10,[0,1,2,3,4,5,...|  2.203905559769596|
|-12.773226999251197|(10,[0,1,2,3,4,5,...| -2.736222698097398|
|-12.558575788856189|(10,[0,1,2,3,4,5,...|0.10007973294293643|
|-12.479280211451497|(10,[0,1,2,3,4,5,...|-0.9022515201372355|
| -12.46765638103286|(10,[0,1,2,3,4,5,...|-1.4621820914334354|
|-11.904986902675114|(10,[0,1,2,3,4,5,...|-0.3122307364002444|
| -11.87816749996684|(10,[0,1,2,3,4,5,...| 0.1338819458914437|
| -11.43180236554046|(10,[0,1,2,3,4,5,...| 0.5248457739492374|
|-11.328415936777782|(10,[0,1,2,3,4,5,...| 0.1542916456260936|
|-11.039347808253828|(10,[0,1,2,3,4,5,...|-1.3518353509995789|
|-10.600130341909033|(10,[0,1,2,3,4,5,...| 0.4030016168294734|
|-10.293714040655924|(10,[0,1,2,3,4,5,...| -1.364529194363915|
| -9.892155927826222|(10,[0,1,2,3,4,5,...| -1.068980736429676|
+-------------------+--------------------+-------------------+
only showing top 20 rows

18/04/17 15:15:56 INFO SparkUI: Stopped Spark web UI at http://192.168.86.21:4040
18/04/17 15:15:56 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/04/17 15:15:56 INFO MemoryStore: MemoryStore cleared
18/04/17 15:15:56 INFO BlockManager: BlockManager stopped
18/04/17 15:15:56 INFO BlockManagerMaster: BlockManagerMaster stopped
18/04/17 15:15:56 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/04/17 15:15:56 INFO SparkContext: Successfully stopped SparkContext
```
Assuming we have a working local Docker installation, execute `sbt docker:publishLocal` to create the Docker image. Once the command completes, execute `docker images` to view the image; you should see output similar to the following:
```
REPOSITORY        TAG          IMAGE ID       CREATED          SIZE
sagemaker-spark   0.1          29dc8b3b2dc8   20 seconds ago   379MB
openjdk           jre-alpine   b1bd879ca9b3   2 months ago     82MB
```
Now start a container to run the image by executing `docker run --rm sagemaker-spark:0.1`. You should see output very similar to the output from the local run.
Note: You may see an error related to insufficient memory, like the one shown below. In that case, try increasing your Docker engine's memory allocation to 4GB.
```
18/04/05 18:40:29 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: System memory 466092032 must be at least 471859200. Please increase heap size using the --driver-memory option or spark.driver.memory in Spark configuration.
  at org.apache.spark.memory.UnifiedMemoryManager$.getMaxMemory(UnifiedMemoryManager.scala:216)
  at org.apache.spark.memory.UnifiedMemoryManager$.apply(UnifiedMemoryManager.scala:198)
  at org.apache.spark.SparkEnv$.create(SparkEnv.scala:330)
  at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:174)
  at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:257)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:432)
  at FullModel.Main$.main(Main.scala:17)
  at FullModel.Main.main(Main.scala)
```
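As the exception message suggests, another option is to give the driver JVM a larger heap directly. We believe the `ash` start script generated by sbt-native-packager passes the `JAVA_OPTS` environment variable through to the JVM (worth verifying for your plugin version), so something like `docker run --rm -e JAVA_OPTS="-Xmx1g" sagemaker-spark:0.1` should work; the exact heap size is an assumption to tune for your data.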
Reference: Lightweight docker containers for Scala apps by Jeroen Rosenberg, published August 14, 2017.
- Create an `object Main` with a `main` function. Applications that extend `scala.App` will not work correctly[^1]. (A skeleton appears after this list.)
- Add the example code.
- Since the example code does not demonstrate how to establish a `SparkContext`, add the following:

  ```scala
  // Startup
  val conf = new SparkConf()
    .setMaster("local[2]")
    .setAppName("SGD")
    .set("spark.executor.memory", "1g")
  val sc = new SparkContext(conf)
  ```
- Make sure to clean up the context by adding the following at the end of the example:

  ```scala
  // Shut down
  sc.stop()
  ```
- Update `build.sbt` to run locally
  - The Scala version must be no greater than `2.11.x`[^2]
  - The Spark version should be `2.1.0`, and the dependency block should look like the following (a complete minimal `build.sbt` sketch follows this item):

    ```scala
    libraryDependencies ++= {
      val sparkVer = "2.1.0"
      Seq(
        "org.apache.spark" %% "spark-core"  % sparkVer,
        "org.apache.spark" %% "spark-mllib" % sparkVer
      )
    }
    ```
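  For reference, a complete minimal `build.sbt` might look like the sketch below. The project name and version are assumptions inferred from the `sagemaker-spark:0.1` image tag shown earlier, and any `2.11.x` patch release should do:

  ```scala
  // Hypothetical minimal build.sbt; name, version, and the exact Scala
  // patch release are assumptions, not the project's actual file.
  name := "sagemaker-spark"
  version := "0.1"
  scalaVersion := "2.11.12"

  libraryDependencies ++= {
    val sparkVer = "2.1.0"
    Seq(
      "org.apache.spark" %% "spark-core"  % sparkVer,
      "org.apache.spark" %% "spark-mllib" % sparkVer
    )
  }
  ```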
- Update `project/build.properties`, setting `sbt.version` to `0.13.17`[^3]
- Add Docker support
  - Add a `project/plugins.sbt` file with the following contents:

    ```scala
    addSbtPlugin("com.typesafe.sbt" % "sbt-native-packager" % "1.2.1")
    ```
  - Update `build.sbt` by adding the following to the bottom of the file:

    ```scala
    enablePlugins(JavaAppPackaging)
    enablePlugins(DockerPlugin)
    enablePlugins(AshScriptPlugin)

    mainClass in Compile := Some("FullModel.Main")
    dockerBaseImage := "openjdk:jre-alpine"
    mappings in Universal += file("lpsa.data") -> "lpsa.data"
    ```
- These sbt settings enable plugins that:
  - Package our app as jars, with an executable shell script to run it.
  - Publish a Docker image with the packaged app.
  - Use `ash` as the shell instead of `bash` (needed for `alpine`-based images).
- Next we declare the `mainClass` so that the generated app script knows which class to execute.
- Then we instruct the Docker plugin to use a smaller `alpine`-based image rather than the default Debian-based image.
- Finally we provide a file mapping which instructs the packaging system to include our data file in the staging directory that is later used to construct the image[^4]. The skeleton after this list shows these pieces together.
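To tie the steps together, here is a hedged skeleton of what the resulting `Main` looks like. The body between startup and shutdown stands in for the example code; the `count` line is a placeholder, not the project's actual logic:

```scala
package FullModel

import org.apache.spark.{SparkConf, SparkContext}

// Skeleton only: an explicit main method (not `extends App`), the
// startup/shutdown snippets from above, and a relative-path read of
// lpsa.data, which the Universal mapping places alongside the app.
object Main {
  def main(args: Array[String]): Unit = {
    // Startup
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("SGD")
      .set("spark.executor.memory", "1g")
    val sc = new SparkContext(conf)

    // The mapped data file can be read with a relative path inside the image
    val data = sc.textFile("lpsa.data")
    println(s"Loaded ${data.count()} rows") // placeholder for the example code

    // Shut down
    sc.stop()
  }
}
```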