This project is a modernized version of Spelk, which was published by IBM on GitHub under the Apache 2.0 license. The license permits derivative works, so this project was created to make Spelk compatible with recent Elasticsearch versions.
The main issue with the IBM version of Spelk is that it only works with older Elasticsearch versions: Elasticsearch has since changed how mapping types work (an index may contain only one type). This project makes Spelk work against an Elasticsearch 7 instance.
In addition, a few minor extensions have been implemented, including capturing the fully qualified hostname and process ID, and using rolling indexes (i.e. one index per day).
See:
- https://github.com/dgloeckner/spelk
- https://github.com/ibmruntimes/spelk
Spelk (spark-elk) is an add-on to Apache Spark (http://spark.apache.org/) that reports metrics to an Elasticsearch server and allows visualization using Kibana.
Spelk is built using Apache Maven. The Spark and Scala versions can be specified on the command line:
mvn -Dspark.version=2.0.0 -Dscala.binary.version=2.11 clean package
The latest release of Spelk is available on GitHub.
To use Spelk you need to add the Spelk jar to the Spark driver/executor classpaths and enable the metrics sink.
Add the Spelk jar to the classpaths
Update [spark home]/conf/spark-defaults.conf, e.g.
spark.driver.extraClassPath=<path to Spelk>/spelk-0.1.0.jar
spark.executor.extraClassPath=<path to Spelk>/spelk-0.1.0.jar
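Alternatively, the same properties can be set per job on the spark-submit command line instead of editing spark-defaults.conf. A minimal sketch; the jar path, application class and application jar below are placeholders:
# Classpath settings passed per job; adjust the placeholder
# paths and application class/jar to your setup.
spark-submit \
  --conf spark.driver.extraClassPath=/path/to/spelk-0.1.0.jar \
  --conf spark.executor.extraClassPath=/path/to/spelk-0.1.0.jar \
  --class com.example.MyApp my-app.jar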
Enable the Elasticsearch sink
Update [spark home]/conf/metrics.properties, e.g.
# org.apache.spark.elk.metrics.sink.ElasticsearchSink
driver.sink.elk.class=org.apache.spark.elk.metrics.sink.ElasticsearchSink
executor.sink.elk.class=org.apache.spark.elk.metrics.sink.ElasticsearchSink
# You may enable the ElasticsearchSink for all components also:
# *.sink.elk.class=org.apache.spark.elk.metrics.sink.ElasticsearchSink
# Name     Default   Description
# host     none      Elasticsearch server host
# port     none      Elasticsearch server port
# index    spark     Elasticsearch index name
# period   10        polling period
# units    seconds   polling period units
*.sink.elk.host=localhost
*.sink.elk.port=9200
# The index name supports a format pattern for the current date;
# %tF expands to an ISO date (e.g. spark-2020-02-02), i.e. one index per day.
*.sink.elk.index=spark-%tF
*.sink.elk.period=10
*.sink.elk.unit=seconds
# Optional path of Elasticsearch API ("http://host:port/path/...")
# *.sink.elk.path=test
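Once a Spark application runs with the sink enabled, you can check that metric documents are arriving. A quick sanity check, assuming the default host/port and the rolling index pattern from above:
# Fetch one sample metric document from the rolling Spark indexes
curl "http://localhost:9200/spark-*/_search?size=1&pretty"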
Run Kibana for testing (replace mbp with the host name of your Elasticsearch server):
docker run -e "ELASTICSEARCH_HOSTS=http://mbp:9200" -p 5601:5601 docker.elastic.co/kibana/kibana:7.4.2
There's a Docker Compose file (src/main/docker/docker-compose.yml) which brings up Elasticsearch and Kibana. Kibana will be set up with Spark-related dashboards.
cd src/main/docker; docker-compose up
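To verify that the stack is up, query Elasticsearch's health endpoint (this assumes the compose file publishes the default port 9200):
# Elasticsearch reports cluster status once it has started
curl "http://localhost:9200/_cluster/health?pretty"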
- Listing indexes: `curl http://localhost:9200/_cat/indices`
- Deleting an index by name: `curl -XDELETE http://localhost:9200/spark-2020-02-02`
Exporting a Kibana dashboard to a file ($id is the ID of the dashboard to export):
curl -XGET "http://localhost:5601/api/kibana/dashboards/export?dashboard=$id" > dashboard.json
A dockerized Spark cluster with the following features is provided:
- Spark and Hadoop versions can be configured through src/main/docker/.env (see the sketch after this list)
- Master, worker and history server are configured and started automatically
- Data and code can be provided to the Spark cluster via Docker mounts. See compose file for examples.
- Spark master, worker and Spark shell are configured for reporting to Elasticsearch
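A sketch of what src/main/docker/.env might look like; the variable names and version numbers below are assumptions for illustration, not taken from the repository, so check the actual file:
# Hypothetical .env contents; variable names are assumptions
SPARK_VERSION=2.4.5
HADOOP_VERSION=2.7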
Running the whole cluster is as easy as:
cd src/main/docker; docker-compose -f docker-compose-spark-cluster.yml up
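After the containers have started, the standalone master's web UI should be reachable, assuming the compose file publishes the standard master UI port 8080:
# The master's web UI lists registered workers and running apps
curl -s http://localhost:8080 | head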
Once the cluster is up, you can start a Spark shell:
docker exec -it docker_spark_1 /spark/bin/spark-shell
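Any job run in the shell generates driver and executor metrics that end up in Elasticsearch. A trivial way to produce some load non-interactively (the job itself is arbitrary):
# Pipe a one-line Spark job into the shell; sc is the
# SparkContext that spark-shell provides automatically.
echo 'sc.parallelize(1 to 1000000).count()' | \
  docker exec -i docker_spark_1 /spark/bin/spark-shell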
Pointing your local Spark installation to the dockerized Spark cluster is easy: just set the master URL to spark://localhost:7077.
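For example, from a local Spark installation (add the Spelk classpath and metrics settings from above if this shell's driver should report metrics as well):
# Connect a locally installed Spark shell to the dockerized master
./bin/spark-shell --master spark://localhost:7077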