- Module Description - What the module does and why it is useful
- Setup - The basics of getting started with spark
- Usage - Configuration options and additional functionality
- Reference - An under-the-hood peek at what the module is doing and how
- Limitations - OS compatibility, etc.
- Development - Guide for contributing to the module
This puppet module installs and sets up an Apache Spark cluster, optionally with security. Both YARN and Spark Master cluster modes are supported.
- Packages: installs Spark packages as needed (core, python, history server, ...)
- Files modified:
- /etc/spark/conf/spark-defaults.conf
- /etc/spark/conf/spark-env.sh (modified when the environment parameter is set)
- /etc/default/spark
- /etc/profile.d/hadoop-spark.csh (frontend)
- /etc/profile.d/hadoop-spark.sh (frontend)
- Permissions modified:
- /etc/security/keytab/spark.service.keytab (historyserver)
- Alternatives:
- alternatives are used for /etc/spark/conf in Cloudera
- this module switches to the new alternative by default, so the Cloudera original configuration can be kept intact
- Services:
- master server (when spark::master or spark::master::service included)
- history server (when spark::historyserver or spark::historyserver::service included)
- worker node (when spark::worker or spark::worker::service included)
- Helper files:
- /var/lib/hadoop-hdfs/.puppet-spark-*
There are several known or intended limitations in this module.
Be aware of:
- Hadoop repositories:
  - neither Cloudera nor Hortonworks repositories are configured by this module (for Cloudera you can find the list and key files here: http://archive.cloudera.com/cdh5/debian/wheezy/amd64/cdh/)
- Java is not installed by this module (openjdk-7-jre-headless is OK for Debian 7/wheezy); see the sketch after this list
- No inter-node dependencies: a working HDFS is required before deploying the Spark History Server; the dependency of the Spark HDFS initialization on the HDFS namenode is handled properly only if the class spark::hdfs is included on the HDFS namenode (see the examples)
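For illustration only, a minimal sketch (not part of this module) of providing the Java prerequisite on Debian 7/wheezy; the Cloudera repository itself still has to be configured separately, for example with the puppetlabs-apt module or by hand using the list and key files from the URL above:

# example prerequisite, not managed by this module:
# Java runtime for Debian 7/wheezy
package { 'openjdk-7-jre-headless':
  ensure => installed,
}
# the Cloudera CDH repository (list and key files from
# http://archive.cloudera.com/cdh5/debian/wheezy/amd64/cdh/)
# must be configured separately before installing the Spark packages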
There are two cluster modes in which Spark can be used (both modes can be enabled at the same time):
- YARN mode: Hadoop is used for computing and scheduling
- Spark mode: Spark Master Server and Worker Nodes are used for computing and scheduling
Optionally the Spark History Server can be used (in both YARN and Spark modes); it also requires Hadoop HDFS.
The Spark mode doesn't support security; only the YARN mode can be used with a secured Hadoop cluster.
Puppet classes to include:
- everywhere: spark
- YARN mode (requires Hadoop cluster with YARN, see CESNET Hadoop puppet module):
- client: spark::frontend
- Spark mode:
- master: spark::master
- slaves: spark::worker
- optionally History Server (requires Hadoop cluster with HDFS, see CESNET Hadoop puppet module):
- spark::historyserver
- on HDFS namenode: spark::hdfs
Example: Apache Spark over Hadoop cluster:
For simplicity a one-machine Hadoop cluster is used (everything is on $::fqdn, replication factor 1).
class{'hadoop':
  hdfs_hostname => $::fqdn,
  yarn_hostname => $::fqdn,
  slaves        => [ $::fqdn ],
  frontends     => [ $::fqdn ],
  realm         => '',
  properties    => {
    'dfs.replication' => 1,
  },
}

class{'spark':
  # defaultFS is taken from hadoop class
}

node default {
  include stdlib
  include hadoop::namenode
  include hadoop::resourcemanager
  include hadoop::historyserver
  include hadoop::datanode
  include hadoop::nodemanager
  include hadoop::frontend
  include spark::frontend
  # should be collocated with hadoop::namenode
  include spark::hdfs
}
Notes:
- if collocated with the HDFS namenode, add the dependency Class['hadoop::namenode::service'] -> Class['spark::historyserver::service']
- if not collocated, the HDFS namenode needs to be running first (launch puppet again later if the Spark History Server fails to start because HDFS is not available yet)
- for Spark clients (in YARN mode): the user must log out and log in again, or run ". /etc/profile.d/hadoop-spark.sh"
Now you can submit spark jobs in the cluster mode over Hadoop YARN:
spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode cluster --master yarn /usr/lib/spark/lib/spark-examples-1.2.0-cdh5.3.1-hadoop2.5.0-cdh5.3.1.jar 10
Example: Apache Spark in Spark cluster mode:
A two-node cluster is used here.
$master_hostname = 'spark-master.example.com'

class{'hadoop':
  realm         => '',
  hdfs_hostname => $master_hostname,
  slaves        => ['spark1.example.com', 'spark2.example.com'],
}

class{'spark':
  master_hostname        => $master_hostname,
  historyserver_hostname => $master_hostname,
  yarn_enable            => false,
}

node 'spark-master.example.com' {
  include spark::master
  include spark::historyserver
  include hadoop::namenode
  include spark::hdfs
}

node /spark(1|2).example.com/ {
  include spark::worker
  include hadoop::datanode
}

node 'client.example.com' {
  include hadoop::frontend
  include spark::frontend
}
Notes:
- the Spark History Server (spark::historyserver) is also enabled here, which requires HDFS (master: hadoop::namenode, slaves: hadoop::datanode)
- YARN is disabled completely; to enable YARN, also include hadoop::nodemanager on the slave nodes (collocation with spark::worker is not needed) and hadoop::resourcemanager on the master, as sketched below (see also the previous example or the CESNET Hadoop puppet module)
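A sketch of the same two-node cluster with YARN enabled in addition to the Spark mode (yarn_hostname, hadoop::resourcemanager, and hadoop::nodemanager are the additions; everything else is as above):

class{'hadoop':
  realm         => '',
  hdfs_hostname => $master_hostname,
  yarn_hostname => $master_hostname,
  slaves        => ['spark1.example.com', 'spark2.example.com'],
}

class{'spark':
  master_hostname        => $master_hostname,
  historyserver_hostname => $master_hostname,
  # yarn_enable defaults to true
}

node 'spark-master.example.com' {
  include spark::master
  include spark::historyserver
  include hadoop::namenode
  include hadoop::resourcemanager
  include spark::hdfs
}

node /spark(1|2).example.com/ {
  include spark::worker
  include hadoop::datanode
  include hadoop::nodemanager
}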
The spark-assembly.jar file is copied into HDFS on each job submission. It is possible to optimize this by copying it to HDFS beforehand. Keep in mind that the jar file needs to be refreshed on HDFS after each Spark software update.
...
class{'spark':
  jar_enable => true,
}
...
Copy the jar file after installation and deployment (superuser credentials are needed if security in Hadoop is enabled):
hdfs dfs -put /usr/lib/spark/spark-assembly.jar /user/spark/share/lib/spark-assembly.jar
The Spark History Server stores details about Spark jobs. It is provided by the class spark::historyserver. The parameter historyserver_hostname needs to be specified as well (replace $::fqdn with the real hostname), and an HDFS cluster is required:
...
class{'spark':
  ...
  historyserver_hostname => $::fqdn,
}

node default {
  ...
  include spark::historyserver
}
Multihome is not supported.
You may also need to set SPARK_LOCAL_IP to bind the RPC listen address explicitly (either to all interfaces, or to the address of a particular interface):

environment => {
  'SPARK_LOCAL_IP' => '0.0.0.0',
  #'SPARK_LOCAL_IP' => $::ipaddress_eth0,
}
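In context, this is the environment parameter of the spark class (a sketch; other parameters as in the examples above):

class{'spark':
  master_hostname => $master_hostname,
  environment     => {
    'SPARK_LOCAL_IP' => '0.0.0.0',
    #'SPARK_LOCAL_IP' => $::ipaddress_eth0,
  },
}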
If more HDFS namenodes are used in the Hadoop cluster (high availability, namespaces, ...), the 'spark' system user needs to exist on all of them for authorization to work properly. You could install the full Spark client there (using spark::frontend::install), but just creating the user is enough (using spark::user).
Note that the spark::hdfs class must be used too, but only on one of the HDFS namenodes. It already includes spark::user.
Example:
node <HDFS_NAMENODE> {
  include spark::hdfs
}

node <HDFS_OTHER_NAMENODE> {
  include spark::user
}
The best way is to refresh the configuration from the new original (i.e. remove the old one) and launch puppet again on top of it. There is also a problem with the start-up scripts on Debian, which needs to be worked around where the Spark History Server is used.
For example:
alternative='cluster'
d='spark'
mv /etc/${d}/conf.${alternative} /etc/${d}/conf.cdhXXX
update-alternatives --auto ${d}-conf
service spark-history-server stop || :
mv /etc/init.d/spark-history-server /etc/init.d/spark-history-server.prev
# upgrade
...
puppet agent --test
#or: puppet apply ...
# restore start-up script from spark-history-server.dpkg-new or spark-history-server.prev
...
service spark-history-server start
spark
: Main configuration class for CESNET Apache Spark puppet module

spark::common
spark::common::config
spark::common::postinstall

spark::frontend
: Apache Spark Client

spark::frontend::config
spark::frontend::install

spark::hdfs
: HDFS initialization

spark::historyserver
: Apache Spark History Server

spark::historyserver::config
spark::historyserver::install
spark::historyserver::service

spark::master
: Apache Spark Master Server

spark::master::config
spark::master::install
spark::master::service

spark::worker
: Apache Spark Worker Node

spark::worker::config
spark::worker::install
spark::worker::service

spark::params

spark::user
: Create spark system user
####Parameters
#####alternatives
Switches the alternatives used for the configuration. Default: 'cluster' (Debian) or undef.
It can be used only when supported (for example with Cloudera distribution).
#####confdir
Spark config directory. Default: platform specific ('/etc/spark/conf' or '/etc/spark').
#####defaultFS
Filesystem URI. Default: '::default' (from $::hadoop::_defaultFS).
Examples:
- hdfs://hdfs.example.com:8020
- hdfs://mycluster
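For example, to set the filesystem URI explicitly instead of taking it from the hadoop class, a minimal sketch (the hostname is illustrative):

class{'spark':
  defaultFS => 'hdfs://hdfs.example.com:8020',
}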
#####hive_configfile
Hive config file. Default: platform specific ('../../hive/conf/hive-site.xml' or '../etc/hive/hive-site.xml').
#####keytab
Spark Historyserver keytab file. Default: '/etc/security/keytab/spark.service.keytab'.
#####keytab_source
Puppet source for the Spark keytab file. Default: undef.
When specified, the Spark keytab file is created from this puppet source (or sources). Otherwise only permissions are set on the keytab file.
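A sketch of a secured setup where the keytab is distributed by puppet; the realm and the puppet source path are illustrative assumptions:

class{'spark':
  historyserver_hostname => $::fqdn,
  realm                  => 'EXAMPLE.COM',
  # illustrative puppet fileserver path, adjust to your setup
  keytab_source          => 'puppet:///files/spark.service.keytab',
}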
#####logdir
Event log directory and history server log directory without the defaultFS prefix. Default: '/user/spark/applicationHistory'.
Note, this parameter is ignored by the spark::hdfs class. When using a non-default value, this directory must be created explicitly.
#####master_hostname
Spark Master hostname. Default: undef.
#####master_port
Spark Master port. Default: '7077'.
#####master_ui_port
Spark Master Web UI port. Default: '18080'.
#####historyserver_hostname
Spark History server hostname. Default: undef.
#####historyserver_port
Spark History Server Web UI port. Default: '18088'.
Notes:
- the Spark default value is 18080, which conflicts with default for Master server
- no historyserver_ui_port parameter (Web UI port is the same as the RPC port)
#####worker_port
Spark Worker node port. Default: '7078'.
#####worker_ui_port
Spark Worker node Web UI port. Default: '18081'.
#####environment
Environments to set for Apache Spark. Default: undef.
The value is a hash. The '::undef' values will unset the particular variables.
Example: you may need to increase memory in case of a large number of jobs:
environment => {
  'SPARK_DAEMON_MEMORY' => '4096m',
}
#####properties
Spark properties to set. Default: undef.
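The hash keys are Spark property names, for example (a sketch; the values are illustrative):

class{'spark':
  properties => {
    'spark.executor.memory' => '2g',
    'spark.serializer'      => 'org.apache.spark.serializer.KryoSerializer',
  },
}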
#####realm
Kerberos realm. Default: undef.
Non-empty string enables security.
#####hive_enable
Enable support for Hive metastore. Default: true.
This just creates a symlink to the Hive configuration file in the Spark configuration directory on the frontend.
It is also required to install the Hive JDBC driver (or a Spark assembly with Hive JDBC) on all worker nodes.
#####jar_enable
Configure Apache Spark to look for the Spark jar file at $hdfs_hostname/user/spark/share/lib/spark-assembly.jar. Default: false.
The jar needs to be copied to HDFS manually after installation, and also manually updated after each Spark SW update:
hdfs dfs -put /usr/lib/spark/spark-assembly.jar /user/spark/share/lib/spark-assembly.jar
#####yarn_enable
Enable YARN mode. Default: true.
This requires Hadoop configured using the CESNET Hadoop puppet module.
Tested with Cloudera distribution.
See also Setup requirements.
- Repository: https://github.com/MetaCenterCloudPuppet/cesnet-spark
- Tests:
- basic: see .travis.yml
- vagrant: https://github.com/MetaCenterCloudPuppet/hadoop-tests
- Email: František Dvořák <valtri@civ.zcu.cz>