Merge pull request #1145 from romainx/spark_version
Resolves #1131: Allow alternative Spark version
romainx authored Aug 17, 2020
2 parents 7d0e50e + c288e77 commit 13b866f
Showing 2 changed files with 88 additions and 56 deletions.
120 changes: 71 additions & 49 deletions docs/using/specifics.md
@@ -2,21 +2,81 @@

This page provides details about features specific to one or more images.

## Apache Spark

**Specific Docker Image Options**
### Specific Docker Image Options

* `-p 4040:4040` - The `jupyter/pyspark-notebook` and `jupyter/all-spark-notebook` images open [SparkUI (Spark Monitoring and Instrumentation UI)](http://spark.apache.org/docs/latest/monitoring.html) at default port `4040`. This option maps port `4040` inside the container to port `4040` on the host machine. Note that every new Spark context is put onto an incrementing port (i.e. `4040`, `4041`, `4042`, etc.), so it might be necessary to open multiple ports, for example: `docker run -d -p 8888:8888 -p 4040:4040 -p 4041:4041 jupyter/pyspark-notebook`. A quick way to check which port a running context actually picked is sketched below.
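
A minimal sketch of checking the UI port from a Python notebook inside the container (assuming the image's stock PySpark installation; the application name is arbitrary):

```python
from pyspark.sql import SparkSession

# Start a local session and print where its UI ended up; if port 4040 is
# already taken by another context, Spark falls back to 4041, 4042, ...
spark = SparkSession.builder.appName("UIPortCheck").master("local[*]").getOrCreate()
print(spark.sparkContext.uiWebUrl)  # e.g. http://<container-hostname>:4040

spark.stop()
```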

**Usage Examples**
### Build an Image with a Different Version of Spark

You can build a `pyspark-notebook` image (and also the downstream `all-spark-notebook` image) with a different version of Spark by overriding the default values of the following arguments at build time.

* The Spark distribution is defined by the combination of the Spark and Hadoop versions and verified by the package checksum; see [Download Apache Spark](https://spark.apache.org/downloads.html) for more information.
    * `spark_version`: The Spark version to install (`3.0.0`).
    * `hadoop_version`: The Hadoop version (`3.2`).
    * `spark_checksum`: The package checksum (`BFE4540...`).
* Spark is shipped with a version of Py4J that has to be referenced in the `PYTHONPATH`.
    * `py4j_version`: The Py4J version (`0.10.9`); see the tip below.
* Spark can run with different OpenJDK versions.
    * `openjdk_version`: The version of the OpenJDK (JRE headless) distribution (`11`); see [Ubuntu packages](https://packages.ubuntu.com/search?keywords=openjdk).

For example, here is how to build a `pyspark-notebook` image with Spark `2.4.6`, Hadoop `2.7`, and OpenJDK `8`.

```bash
# From the root of the project
# Build the image with different arguments
docker build --rm --force-rm \
-t jupyter/pyspark-notebook:spark-2.4.6 ./pyspark-notebook \
--build-arg spark_version=2.4.6 \
--build-arg hadoop_version=2.7 \
--build-arg spark_checksum=3A9F401EDA9B5749CDAFD246B1D14219229C26387017791C345A23A65782FB8B25A302BF4AC1ED7C16A1FE83108E94E55DAD9639A51C751D81C8C0534A4A9641 \
--build-arg openjdk_version=8 \
--build-arg py4j_version=0.10.7

# Check the newly built image
docker images jupyter/pyspark-notebook:spark-2.4.6

# REPOSITORY TAG IMAGE ID CREATED SIZE
# jupyter/pyspark-notebook spark-2.4.6 7ad7b5a9dbcd 4 minutes ago 3.44GB

# Check the Spark version
docker run -it --rm jupyter/pyspark-notebook:spark-2.4.6 pyspark --version

# Welcome to
# ____ __
# / __/__ ___ _____/ /__
# _\ \/ _ \/ _ `/ __/ '_/
# /___/ .__/\_,_/_/ /_/\_\ version 2.4.6
# /_/
#
# Using Scala version 2.11.12, OpenJDK 64-Bit Server VM, 1.8.0_265
```

**Tip**: to get the version of Py4J shipped with Spark:

* Build a first image without changing `py4j_version` (it will not prevent the image from building, it will just prevent Python from finding the `pyspark` module),
* get the version (`ls /usr/local/spark/python/lib/`),
* set the version with `--build-arg py4j_version=0.10.7`.

*Note: At the time of writing there is an issue preventing the use of Spark `2.4.6` with Python `3.8`; see [this answer on SO](https://stackoverflow.com/a/62173969/4413446) for more information.*

```bash
docker run -it --rm jupyter/pyspark-notebook:spark-2.4.6 ls /usr/local/spark/python/lib/
# py4j-0.10.7-src.zip PY4J_LICENSE.txt pyspark.zip
# You can now set the build-arg
# --build-arg py4j_version=
```
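
Equivalently, the same check can be done from a Python prompt inside a built image; a small sketch, assuming the default install location `/usr/local/spark`:

```python
import glob

# The Py4J archive shipped with the installed Spark distribution; the version
# embedded in the file name is the value to pass as py4j_version.
print(glob.glob("/usr/local/spark/python/lib/py4j-*-src.zip"))
# e.g. ['/usr/local/spark/python/lib/py4j-0.10.7-src.zip']
```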

### Usage Examples

The `jupyter/pyspark-notebook` and `jupyter/all-spark-notebook` images support the use of [Apache Spark](https://spark.apache.org/) in Python, R, and Scala notebooks. The following sections provide some examples of how to get started using them.

### Using Spark Local Mode
#### Using Spark Local Mode

Spark **local mode** is useful for experimentation on small data when you do not have a Spark cluster available.

#### In Python
##### In Python

In a Python notebook.

@@ -33,7 +93,7 @@ rdd.sum()
# 5050
```
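
Only the tail of this example is visible here; a minimal local-mode sketch in the same spirit (assuming the image's stock PySpark installation, not necessarily the exact code hidden by the collapsed lines) is:

```python
from pyspark.sql import SparkSession

# Spark session & context in local mode, using all available cores
spark = SparkSession.builder.appName("LocalExample").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Sum of the first 100 whole numbers
rdd = sc.parallelize(range(101))
print(rdd.sum())
# 5050

spark.stop()
```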

#### In R
##### In R

In an R notebook with [SparkR][sparkr].

@@ -71,9 +131,7 @@ sdf_len(sc, 100, repartition = 1) %>%
# 5050
```

#### In Scala

##### In a Spylon Kernel
##### In Scala

Spylon kernel instantiates a `SparkContext` for you in variable `sc` after you configure Spark
options in a `%%init_spark` magic cell.
@@ -91,18 +149,7 @@ rdd.sum()
// 5050
```

##### In an Apache Toree Kernel

Apache Toree instantiates a local `SparkContext` for you in variable `sc` when the kernel starts.

```scala
// Sum of the first 100 whole numbers
val rdd = sc.parallelize(0 to 100)
rdd.sum()
// 5050
```

### Connecting to a Spark Cluster in Standalone Mode
#### Connecting to a Spark Cluster in Standalone Mode

Connection to a Spark cluster in **[Standalone Mode](https://spark.apache.org/docs/latest/spark-standalone.html)** requires the following set of steps:

@@ -117,7 +164,7 @@ Connection to Spark Cluster on **[Standalone Mode](https://spark.apache.org/docs

**Note**: In the following examples we are using the Spark master URL `spark://master:7077`, which should be replaced by the URL of your Spark master.

#### In Python
##### In Python

The **same Python version** needs to be used on the notebook (where the driver is located) and on the Spark workers.
The Python version used on the driver and worker sides can be adjusted by setting the environment variables `PYSPARK_PYTHON` and/or `PYSPARK_DRIVER_PYTHON`; see [Spark Configuration][spark-conf] for more information.
@@ -135,7 +182,7 @@ rdd.sum()
# 5050
```
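
Only the tail of this example is visible here as well; a minimal sketch of connecting from a Python notebook, assuming a master reachable at `spark://master:7077` and the same Python version on driver and workers, is:

```python
import os
from pyspark.sql import SparkSession

# Pin the worker Python if it differs from the default on the workers
# (the interpreter name here is an assumption, adjust as needed).
os.environ.setdefault("PYSPARK_PYTHON", "python3")

# Replace spark://master:7077 with the URL of your Spark master
spark = SparkSession.builder.appName("StandaloneExample").master("spark://master:7077").getOrCreate()
sc = spark.sparkContext

# Sum of the first 100 whole numbers
rdd = sc.parallelize(range(101))
print(rdd.sum())
# 5050

spark.stop()
```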

#### In R
##### In R

In an R notebook with [SparkR][sparkr].

@@ -172,9 +219,7 @@ sdf_len(sc, 100, repartition = 1) %>%
# 5050
```

#### In Scala

##### In a Spylon Kernel
##### In Scala

Spylon kernel instantiates a `SparkContext` for you in variable `sc` after you configure Spark
options in a `%%init_spark` magic cell.
@@ -192,29 +237,6 @@ rdd.sum()
// 5050
```

##### In an Apache Toree Scala Notebook

The Apache Toree kernel automatically creates a `SparkContext` when it starts based on configuration information from its command line arguments and environment variables. You can pass information about your cluster via the `SPARK_OPTS` environment variable when you spawn a container.

For instance, to pass information about a standalone Spark master, you could start the container like so:

```bash
docker run -d -p 8888:8888 -e SPARK_OPTS='--master=spark://master:7077' \
jupyter/all-spark-notebook
```

Note that this is the same information expressed in a notebook in the Python case above. Once the kernel spec has your cluster information, you can test your cluster in an Apache Toree notebook like so:

```scala
// should print the value of --master in the kernel spec
println(sc.master)

// Sum of the first 100 whole numbers
val rdd = sc.parallelize(0 to 100)
rdd.sum()
// 5050
```

## Tensorflow

The `jupyter/tensorflow-notebook` image supports the use of
24 changes: 17 additions & 7 deletions pyspark-notebook/Dockerfile
@@ -11,20 +11,30 @@ SHELL ["/bin/bash", "-o", "pipefail", "-c"]
USER root

# Spark dependencies
ENV APACHE_SPARK_VERSION=3.0.0 \
HADOOP_VERSION=3.2
# Default values can be overridden at build time
# (ARGS are in lower case to distinguish them from ENV)
ARG spark_version="3.0.0"
ARG hadoop_version="3.2"
ARG spark_checksum="BFE45406C67CC4AE00411AD18CC438F51E7D4B6F14EB61E7BF6B5450897C2E8D3AB020152657C0239F253735C263512FFABF538AC5B9FFFA38B8295736A9C387"
ARG py4j_version="0.10.9"
ARG openjdk_version="11"

ENV APACHE_SPARK_VERSION="${spark_version}" \
HADOOP_VERSION="${hadoop_version}"

RUN apt-get -y update && \
apt-get install --no-install-recommends -y openjdk-11-jre-headless ca-certificates-java && \
apt-get install --no-install-recommends -y \
"openjdk-${openjdk_version}-jre-headless" \
ca-certificates-java && \
rm -rf /var/lib/apt/lists/*

# Using the preferred mirror to download Spark
# Spark installation
WORKDIR /tmp

# Using the preferred mirror to download Spark
# hadolint ignore=SC2046
RUN wget -q $(wget -qO- https://www.apache.org/dyn/closer.lua/spark/spark-${APACHE_SPARK_VERSION}/spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz\?as_json | \
python -c "import sys, json; content=json.load(sys.stdin); print(content['preferred']+content['path_info'])") && \
echo "BFE45406C67CC4AE00411AD18CC438F51E7D4B6F14EB61E7BF6B5450897C2E8D3AB020152657C0239F253735C263512FFABF538AC5B9FFFA38B8295736A9C387 *spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" | sha512sum -c - && \
echo "${spark_checksum} *spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" | sha512sum -c - && \
tar xzf "spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" -C /usr/local --owner root --group root --no-same-owner && \
rm "spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz"

@@ -33,7 +43,7 @@ RUN ln -s "spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}" spark

# Configure Spark
ENV SPARK_HOME=/usr/local/spark
ENV PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip \
ENV PYTHONPATH="${SPARK_HOME}/python:${SPARK_HOME}/python/lib/py4j-${py4j_version}-src.zip" \
SPARK_OPTS="--driver-java-options=-Xms1024M --driver-java-options=-Xmx4096M --driver-java-options=-Dlog4j.logLevel=info" \
PATH=$PATH:$SPARK_HOME/bin

