This repo describes how to build a Hadoop ecosystem using Oracle VM VirtualBox.
Auto Setup - Coming soon -
1. Setting VM
3. Test HDFS
5. Error
We will create 4 VMs.
VM Spec.
master-node
- Processor : 2
- System memory : 4096 MB
- Video memory : 16 MB
worker-node1~3
- Processor : 2
- System memory : 4096 MB
- Video memory : 16 MB
We assign two processors so that we can run Kubernetes later; it doesn't matter if you leave this at the default.
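If you prefer to script these settings instead of clicking through the VirtualBox GUI, something like the following VBoxManage commands should work (the VM names are just placeholders for whatever you called your machines, and the VM must be powered off first):
$ VBoxManage modifyvm "master-node" --cpus 2 --memory 4096 --vram 16
$ VBoxManage modifyvm "worker-node1" --cpus 2 --memory 4096 --vram 16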
Now let's assign a static IP to each virtual machine.
I used a static IP like this:
master-node
- IP : 192.168.1.10
- Port : 10
worker-node1
- IP : 192.168.1.11
- Port : 11
worker-node2
- IP : 192.168.1.12
- Port : 12
worker-node3
- IP : 192.168.1.13
- Port : 13
This is the network config file path.
$ sudo vi /etc/netplan/00-installer-config.yaml
Open the configuration file and enter the IP values for each node. The following is an example master-node setup.
# This is the network config written by 'subiquity'
network:
  ethernets:
    enp0s3:
      addresses:
        - 192.168.1.10/24
      nameservers:
        addresses: [8.8.8.8, 8.8.4.4]
      routes:
        - to: default
          via: 192.168.1.1
  version: 2
$ sudo netplan apply
Complete the setup for the remaining virtual machines and apply your changes.
You can check the changed IP using the ifconfig command.
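On a fresh Ubuntu Server install, ifconfig may not be present because it belongs to the net-tools package; the ip command works out of the box:
$ ip addr show enp0s3        # built in
$ sudo apt install net-tools # only if you really want ifconfig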
Next, set up the hosts file. (All nodes)
$ sudo vi /etc/hosts
127.0.0.1 localhost
192.168.1.10 master-node
192.168.1.11 worker-node1
192.168.1.12 worker-node2
192.168.1.13 worker-node3
...
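A quick sanity check that the names resolve and the nodes can reach each other:
$ ping -c 1 master-node
$ ping -c 1 worker-node1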
Next, install OpenJDK 8 and Python 3.
$ sudo apt-get update
$ sudo apt-get install openjdk-8-jdk
$ sudo apt install python3-pip
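You can confirm the installs worked before moving on:
$ java -version      # should report openjdk version "1.8.0_..."
$ python3 --version
$ pip3 --version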
Change each worker node's hostname. ( worker-node1, 2, 3 )
$ sudo vi /etc/hostname
worker-node1
$ sudo hostname -F /etc/hostname   # all nodes
$ sudo reboot
Uncomment PubkeyAuthentication in the SSH daemon config.
$ sudo vi /etc/ssh/sshd_config
...
PubkeyAuthentication yes
...
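After editing sshd_config, restart the SSH service so the change takes effect (on Ubuntu the unit is called ssh):
$ sudo systemctl restart ssh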
The Hadoop master node needs passwordless SSH access to all worker nodes. Generate a key pair on every node. ALL NODES!
$ chmod 700 ~/.ssh
$ ssh-keygen -t rsa -P ""
# Press Enter at each prompt
Copy the master node's ssh key to the worker nodes.
# master node only
$ ssh-copy-id -i ~/.ssh/id_rsa.pub master-node
$ ssh-copy-id -i ~/.ssh/id_rsa.pub worker-node1
$ ssh-copy-id -i ~/.ssh/id_rsa.pub worker-node2
$ ssh-copy-id -i ~/.ssh/id_rsa.pub worker-node3
# Check if ssh is connected properly
$ ssh worker-node1
Install Hadoop. ALL NODES!
$ wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
$ tar -xvf hadoop-3.3.6.tar.gz
$ mv hadoop-3.3.6 hadoop
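Optionally, verify the download. Apache publishes a .sha512 file, assumed here to sit next to the tarball on the mirror (the usual Apache layout):
$ wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz.sha512
$ sha512sum -c hadoop-3.3.6.tar.gz.sha512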
List the worker and master hostnames.
$ vi hadoop/etc/hadoop/workers
# Workers File
worker-node1
worker-node2
worker-node3
# :wq!
$ vi hadoop/etc/hadoop/masters
# Masters File
master-node
# :wq!
Create the basic directory layout.
Caution: create the namenode directory only on the master node.
# master node
$ mkdir data
$ cd data
$ mkdir namenode datanode tmp userlogs
# worker nodes
$ mkdir data
$ cd data
$ mkdir datanode tmp userlogs
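Assuming the data directory lives directly in the home directory (as the paths in ~/.bashrc and the XML configs below suggest), the same layout can be created in one line:
# master node
$ mkdir -p ~/data/{namenode,datanode,tmp,userlogs}
# worker nodes
$ mkdir -p ~/data/{datanode,tmp,userlogs}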
Add the paths at the end of the ~/.bashrc file.
$ sudo vi ~/.bashrc
# .bashrc
...
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/home/master-node/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
# :wq!
$ source ~/.bashrc
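A quick check that the environment variables took effect:
$ echo $HADOOP_HOME
$ hadoop version     # should print Hadoop 3.3.6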
Set JAVA_HOME in hadoop-env.sh.
$ vi hadoop/etc/hadoop/hadoop-env.sh
# hadoop-env.sh
...
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
# :wq!
Setting core-site.xml, yarn-site.xml, and hdfs-site.xml.
- core-site.xml
<!-- all node are the same -->
...
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master-node:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/master-node/data/tmp</value>
</property>
</configuration>
- yarn-site.xml
<!-- master node -->
...
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master-node</value>
</property>
</configuration>
<!-- worker nodes -->
...
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
- hdfs-site.xml
<!-- master node -->
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/master-node/data/namenode</value>
</property>
</configuration>
<!-- worker nodes -->
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/master-node/data/datanode</value>
</property>
</configuration>
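Since core-site.xml is identical on every node, you can also edit it once on the master and copy it out instead of retyping it, assuming the same ~/hadoop layout on the workers:
$ scp ~/hadoop/etc/hadoop/core-site.xml worker-node1:~/hadoop/etc/hadoop/
$ scp ~/hadoop/etc/hadoop/core-site.xml worker-node2:~/hadoop/etc/hadoop/
$ scp ~/hadoop/etc/hadoop/core-site.xml worker-node3:~/hadoop/etc/hadoop/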
Finally, the last step!!!
Initialize the namenode.
# master node
$ hdfs namenode -format
$ start-all.sh
# Check if this works!
$ jps
# If it works correctly, you will see output like the following:
1380 SecondaryNameNode
2391 Jps
1480 ResourceManager
1840 NameNode
# The same behavior can be seen on worker nodes.
4801 DataNode
4248 Jps
3059 NodeManager
$ hdfs dfs -mkdir /test
$ hdfs dfs -ls /
# Check the command output!!!
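To go a step further than -mkdir, write a small file into HDFS and read it back:
$ echo "hello hdfs" > hello.txt
$ hdfs dfs -put hello.txt /test/
$ hdfs dfs -cat /test/hello.txt
You can also open the NameNode web UI at http://master-node:9870 and the ResourceManager UI at http://master-node:8088 (the default Hadoop 3.x ports).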
The next step is to install SPARK!
Download spark.
This time, only on the master node.
$ wget https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
$ tar -xvf spark-3.5.0-bin-hadoop3.tgz
$ mv spark-3.5.0-bin-hadoop3 spark
Edit the ~/.bashrc file.
...
export SPARK_HOME=/home/master/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Apply the changes.
$ source ~/.bashrc
Configure Spark cluster.
$ cd $SPARK_HOME/conf
$ cp spark-env.sh.template spark-env.sh
$ vi spark-env.sh
Edit the spark/conf/spark-env.sh file.
...
export SPARK_MASTER_HOST=master-node
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=4g
export SPARK_WORKER_INSTANCES=1
export JAVA_HOME=${JAVA_HOME}
export HADOOP_HOME=${HADOOP_HOME}
export YARN_CONF_DIR=${YARN_CONF_DIR}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}
Worker settings. In Spark 3.x the worker list lives in conf/workers (the old conf/slaves name is deprecated).
$ cp ~/spark/conf/workers.template ~/spark/conf/workers
$ vi ~/spark/conf/workers
# workers file
worker-node1
worker-node2
worker-node3
Now send it to each worker node.
$ scp -r spark master@worker-node1:/home/master/
$ scp -r spark master@worker-node2:/home/master/
$ scp -r spark master@worker-node3:/home/master/
Now let's start the Spark cluster.
$ ~/spark/sbin/start-all.sh
$ ~/spark/sbin/start-history-server.sh
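With the cluster up, you can run the bundled SparkPi example against the standalone master as a smoke test (the examples jar ships inside the Spark 3.5.0 distribution):
$ spark-submit --master spark://master-node:7077 \
    --class org.apache.spark.examples.SparkPi \
    $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.0.jar 100
The Spark master web UI at http://master-node:8080 should show three workers and the completed application.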
How to stop the Spark cluster:
$ ~/spark/sbin/stop-all.sh
$ ~/spark/sbin/stop-history-server.sh