First, make sure you have JDK 8+ installed. You can download a JDK from Oracle. Currently (January 2019) Spark runs on Java 8+ and Python 3.4+.
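If you are not sure which Java version is active, you can check it from the terminal (JDK 8 reports itself as version `1.8.x`):

```bash
# Verify that a JDK 8+ java is on the PATH.
java -version
```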
- With anaconda: `conda install -yn <env_name> -c conda-forge pyspark`, where `<env_name>` is the name of your conda environment.
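  A quick, optional check that `pyspark` was installed into the environment (this assumes the same `<env_name>` as above):

  ```bash
  # Optional sanity check: import pyspark from the conda environment.
  conda activate <env_name>
  python -c "import pyspark; print(pyspark.__version__)"
  ```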
- Next, download Spark from the Spark website. Currently (January 2019) the latest version is `spark-2.4.0-bin-hadoop2.7`. Unzip the downloaded file and place it in `/opt/spark-2.4.0-bin-hadoop2.7`.
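  For example, on Linux or macOS the download and unpacking can be done from the terminal. The URL below points to the Apache release archive and is only a suggestion; adjust it if you use a different mirror or version:

  ```bash
  # Download the Spark 2.4.0 binary distribution, extract it, and move it to /opt
  # (the move may require sudo).
  curl -LO https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
  tar -xzf spark-2.4.0-bin-hadoop2.7.tgz
  sudo mv spark-2.4.0-bin-hadoop2.7 /opt/
  ```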
- Set the environment variables in your `.bash_profile` or `.bashrc`, depending on your system:

  ```bash
  export SPARK_HOME=/opt/spark-2.4.0-bin-hadoop2.7
  export PATH=$SPARK_HOME/bin:$PATH
  ```
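  A minimal sanity check after reloading your shell configuration (assuming the paths above):

  ```bash
  # Reload the shell configuration and confirm Spark is picked up.
  source ~/.bashrc            # or: source ~/.bash_profile
  echo $SPARK_HOME            # should print /opt/spark-2.4.0-bin-hadoop2.7
  spark-submit --version      # should print the Spark 2.4.0 version banner
  ```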
- Then update the paths in `env_var_setting.sh` based on your own system, and run `source env_var_setting.sh`. This tells Spark to use the Python and Jupyter Lab (or Notebook) of the anaconda environment `<env_name>`. Here are the sample paths in this `env_var_setting.sh`:

  ```bash
  export PYSPARK_PYTHON=/anaconda3/envs/envPy36/bin/python
  export PYSPARK_DRIVER_PYTHON=/anaconda3/envs/envPy36/bin/jupyter
  export PYSPARK_DRIVER_PYTHON_OPTS='lab'
  ```
The name of the environment in this example is `envPy36`. With the second and third exports, Jupyter Lab is launched when you run the `pyspark` command.
From the terminal, type:

```bash
conda activate <env_name>
```

Then:

```bash
pyspark
```

This will open a Jupyter Lab (or Notebook) where you can use Spark. See `data_processing_pyspark.ipynb` in the `src` folder for an example.
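If you want a quick smoke test from the terminal before opening the notebook, the Spark distribution ships with runnable examples (this assumes `SPARK_HOME` is set as above):

```bash
# Run the bundled SparkPi example to confirm the installation works end to end.
$SPARK_HOME/bin/run-example SparkPi 10
```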