Hadoop Cluster Setup
This page contains the installation instructions for a Hadoop cluster. It is necessary to get access to an efficient distributed file system. You need it when you want to use the Spark SQL or Spark Streaming back end.
Do the following steps only on the master node (also if you want to use only one node):
Download Apache Hadoop. You can use this link in your browser; please select the 2.7.3 binary.
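If you prefer the command line, you can also fetch the binary release directly from the Apache archive (this URL is an assumption; adjust it if the archive layout has changed):
user@master:~$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz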
Extract it and move it to /opt/:
user@master:~$ tar -xzf hadoop-2.7.3.tar.gz
user@master:~$ sudo mv hadoop-2.7.3 /opt/
Switch to the directory /opt/:
user@master:~$ cd /opt/
Create folders for the namenode and the datanode:
user@master:~$ sudo mkdir -p /opt/hadoop_tmp/hdfs/namenode
user@master:~$ sudo mkdir -p /opt/hadoop_tmp/hdfs/datanode
Change the directory permissions:
user@master:~$ sudo chown starql:cluster -R /opt/hadoop-2.7.3/
user@master:~$ sudo chown starql:cluster -R /opt/hadoop_tmp/
Log in as the new user:
user@master:~$ sudo su starql
Switch to the directory /opt/hadoop-2.7.3/etc/hadoop/:
starql@master:~$ cd /opt/hadoop-2.7.3/etc/hadoop/
You can use nano to modify the configuration files (use [Ctrl] + [o] followed by [Enter] to save the file and [Ctrl] + [x] to exit). Open the file hadoop-env.sh with nano hadoop-env.sh and modify the following line:
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
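If you are not sure where Java 8 is installed, one way to find out (assuming a Debian/Ubuntu system where /usr/bin/java is a symlink managed by update-alternatives) is:
starql@master:~$ readlink -f /usr/bin/java
This prints something like /usr/lib/jvm/java-8-oracle/jre/bin/java; everything before /jre/bin/java is the value to use for JAVA_HOME.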
Make sure the path points to your Java 8 installation. Afterwards, modify the file core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
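The hostname master (and later the slave names) must resolve to the correct IP addresses on all nodes. If this is not already set up via DNS or /etc/hosts, you can add entries like the following to /etc/hosts on every node (the addresses are only examples, adjust them to your network):
192.168.1.100   master
192.168.1.101   slave00
192.168.1.102   slave01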
Then modify the file hdfs-site.xml. You can set dfs.replication to more than 1 for better reliability; this only makes sense when you use more than one node (see the example after the following configuration).
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/opt/hadoop_tmp/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/opt/hadoop_tmp/hdfs/datanode</value>
  </property>
</configuration>
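For example, with three or more datanodes you could use the HDFS default replication factor of 3 instead; only the value of the property changes:
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>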
If you want to use YARN too, then you have to edit the file yarn-site.xml with nano yarn-site.xml:
<configuration>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>master:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master:8035</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master:8050</value>
  </property>
</configuration>
Rename the file mapred-site.xml.template to mapred-site.xml with the command:
starql@master:~$ mv mapred-site.xml.template mapred-site.xml
Then edit mapred-site.xml with nano mapred-site.xml:
<configuration>
  <property>
    <name>mapreduce.job.tracker</name>
    <value>master:5431</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
Create the file masters with the command nano masters and edit it:
## Name of the master node
master
Modify the slaves file with nano slaves and edit it:
## List with names of the slave nodes
master
Do the following steps only if you want to use more than one node:
Add the names of the slaves to the slaves file with nano slaves and remove the master. It should look like this:
## List with names of the slave nodes
slave00
slave01
...
Use scp to copy the /opt/hadoop-2.7.3/ directory to the slave nodes. Repeat this command for each slave (or use a small loop, see the sketch below):
starql@master:~$ scp -r /opt/hadoop-2.7.3 slave00:/home/starql/
starql@master:~$ scp -r /opt/hadoop-2.7.3 slave01:/home/starql/
...
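If you have several slaves, a small shell loop saves some typing (the host names are just the examples from above, adjust them to your cluster):
starql@master:~$ for node in slave00 slave01; do scp -r /opt/hadoop-2.7.3 ${node}:/home/starql/; done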
Now create the folder datanode on every slave node and set the correct directory permissions:
user@master:~$ ssh slave00
user@slave00:~$ sudo mv /home/starql/hadoop-2.7.3 /opt/
user@slave00:~$ cd /opt/
user@slave00:~$ sudo mkdir -p /opt/hadoop_tmp/hdfs/datanode
user@slave00:~$ sudo chown starql:cluster -R hadoop_tmp/
user@slave00:~$ sudo chown starql:cluster -R hadoop-2.7.3/
...
Now switch back to /opt/hadoop-2.7.3/ on the master node with:
starql@master:~$ cd /opt/hadoop-2.7.3/
Format the Hadoop File System with the command:
starql@master:~$ ./bin/hadoop namenode -format
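On Hadoop 2.x this command still works but prints a deprecation warning; the equivalent current form is:
starql@master:~$ ./bin/hdfs namenode -format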
You can now start the Hadoop File System with:
starql@master:~$ ./sbin/start-dfs.sh
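If you configured YARN above, you can also start it now. This additionally starts a ResourceManager on the master and NodeManagers on the slaves, which will then show up in jps in addition to the processes listed below:
starql@master:~$ ./sbin/start-yarn.sh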
Run the command jps to check that everything is working fine. You should get output like the following when you use only one node:
starql@master:~$ jps
8113 Jps
7818 DataNode
7677 NameNode
7997 SecondaryNameNode
...
When you use more than one node, it should look like this:
starql@master:~$ jps
8113 Jps
7677 NameNode
7997 SecondaryNameNode
...
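In the multi-node case the DataNode processes run on the slave nodes, so jps on a slave should show a DataNode instead. You can also ask the NameNode for a report of the connected datanodes:
starql@master:~$ ./bin/hdfs dfsadmin -report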
If not, then something went wrong:
- Check that you have set the correct directory permissions.
- In the configuration files you have to set the names of the master and the slaves correctly. Do not use master when your master node has a different name.
When everything is running fine, you can visit http://localhost:50070 in the browser on your master node for an overview of your NameNode and the connected datanodes.
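As a quick test you can create a directory in HDFS (here a home directory for the starql user, as an example) and list the root of the file system:
starql@master:~$ ./bin/hdfs dfs -mkdir -p /user/starql
starql@master:~$ ./bin/hdfs dfs -ls /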
And Now?
Go to First Run and continue with the next steps.