For the Spark and Docker module, we need YARN, which ships together with Hadoop, so we need to install Hadoop.
In this document, we'll assume you use Linux. For Windows, use WSL. It should (supposedly) work on macOS as well.
We'll run Hadoop in pseudo-distributed mode.
You need to be able to SSH into your localhost without typing a password. In other words, when you execute
ssh localhost
you should get a shell without being prompted for a password.
If you don't have it, add your id_rsa.pub key to the list of keys authorized to access your computer:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
(This assumes you already have id_rsa.pub in ~/.ssh.)
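If you don't have a key pair at all yet, you can generate one first. This is a standard OpenSSH command; the empty passphrase here is just for convenience:

```bash
# generate an RSA key pair at ~/.ssh/id_rsa with no passphrase
ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa
```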
On WSL, you may need to start the ssh service:
sudo service ssh start
We use a Spark build that expects Hadoop 3.2, so that's the version we'll install.
Go to Hadoop's website to find the closest mirror, then download it:
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.2.3/hadoop-3.2.3.tar.gz
Unpack it and go to the extracted directory:
tar xzfv hadoop-3.2.3.tar.gz
cd hadoop-3.2.3/
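As a quick sanity check of the unpacked distribution, you can print the Hadoop version and build info:

```bash
./bin/hadoop version
```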
Set JAVA_HOME in etc/hadoop/hadoop-env.sh:
echo "export JAVA_HOME=${JAVA_HOME}" >> etc/hadoop/hadoop-env.sh
Start YARN
./sbin/start-yarn.sh
The YARN web UI should now be available on port 8088: http://localhost:8088/
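You can check that the YARN daemons actually started with jps, which ships with the JDK; you should see a ResourceManager and a NodeManager process. To shut YARN down later, use the matching ./sbin/stop-yarn.sh script.

```bash
jps
# expected output includes lines like these (PIDs will differ):
# 12345 ResourceManager
# 12346 NodeManager
```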
For submitting Spark jobs, we'll need to use master="yarn".
Spark needs to know where to look for the YARN config files, so we need to set these environment variables:
export HADOOP_HOME="${HOME}/spark/hadoop-3.2.3"
export YARN_CONF_DIR="${HADOOP_HOME}/etc/hadoop"
Then run Jupyter or use spark-submit.
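As a quick smoke test before running anything heavier, you can submit the Pi example that ships with Spark. The path below assumes a standard Spark binary distribution under ${SPARK_HOME}:

```bash
# run the bundled Pi example on YARN with 10 partitions
spark-submit \
    --master yarn \
    ${SPARK_HOME}/examples/src/main/python/pi.py 10
```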
Download the GCS connector:
gsutil cp gs://hadoop-lib/gcs/gcs-connector-hadoop3-2.2.5.jar .
Config changes:

- Change ${SPARK_HOME}/conf/spark-defaults.conf (see here); a sketch of a typical change follows right after this list.
- Change ${YARN_CONF_DIR}/core-site.xml (see here); an example of the properties to add is shown after the template below.
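For spark-defaults.conf, one common change is to put the GCS connector jar on the driver and executor classpaths. A minimal sketch, assuming you keep the jar downloaded above in ${HOME}/spark (adjust the path to wherever you actually put it):

```bash
# add the GCS connector to the driver and executor classpaths (jar location is an assumption)
echo "spark.driver.extraClassPath ${HOME}/spark/gcs-connector-hadoop3-2.2.5.jar" >> ${SPARK_HOME}/conf/spark-defaults.conf
echo "spark.executor.extraClassPath ${HOME}/spark/gcs-connector-hadoop3-2.2.5.jar" >> ${SPARK_HOME}/conf/spark-defaults.conf
```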
Template for Hadoop properties:
<property>
<name></name>
<value></value>
</property>
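As an illustration, the properties you would typically add to core-site.xml so Hadoop can talk to GCS look roughly like this. The property names come from the GCS connector; the keyfile path is an assumption you should replace with your own service account key:

```xml
<property>
  <name>fs.AbstractFileSystem.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
</property>
<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>google.cloud.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <!-- assumption: replace with the path to your own service account key -->
  <name>google.cloud.auth.service.account.json.keyfile</name>
  <value>/path/to/service-account-key.json</value>
</property>
```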
Copy the config from here
Running spark-submit:
# MOUNTS makes the Hadoop config and the host's user/group databases visible (read-only) inside the containers
MOUNTS="$HADOOP_HOME:$HADOOP_HOME:ro,/etc/passwd:/etc/passwd:ro,/etc/group:/etc/group:ro"
IMAGE_ID="pyspark-docker:test"

spark-submit \
    --master yarn \
    --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
    --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${IMAGE_ID} \
    --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=${MOUNTS} \
    --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
    --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${IMAGE_ID} \
    --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=${MOUNTS} \
    06_spark_sql.py \
        --input_green=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/green/2021/*/ \
        --input_yellow=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/yellow/2021/*/ \
        --output=gs://dtc_data_lake_de-zoomcamp-nytaxi/report-2021
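Once submitted, the application shows up in the YARN UI on port 8088. You can also pull its logs from the command line; the application id below is a placeholder for whatever id YARN assigned to your job:

```bash
# fetch aggregated logs for a finished application (replace the id with your own)
${HADOOP_HOME}/bin/yarn logs -applicationId application_1234567890123_0001
```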