Hadoop Cluster Setup

This page contains the installation instructions for a Hadoop cluster. A cluster gives you access to an efficient distributed file system, which you need when you want to use the Spark SQL or Spark Streaming back end.

Cluster example visualization

The picture above shows an example of a computer cluster consisting of several nodes: one master node and several slaves (slave00, slave01, ...). The master node is connected to the slave nodes over SSH. These instructions use the node names from the picture; your nodes may have different names. You can look up the names under `/etc/hosts`.

Master

Perform the following steps only on the master node, and also if you want to use just a single node:

Download Apache Hadoop. You can use this link in your browser. Please select the 2.7.3 binary.
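If you prefer the command line, you can also fetch the archive directly, for example from the Apache release archive (the URL below is just one possibility; any Apache mirror that carries the 2.7.3 release works):

user@master:~$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz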

Extract the archive and move it to `/opt/`:

user@master:~$ tar -xzf hadoop-2.7.3.tar.gz
user@master:~$ sudo mv hadoop-2.7.3 /opt/

Switch to the directory /opt/:

user@master:~$ cd /opt/

Create folders for the namenode and the datanode:

user@master:~$ sudo mkdir -p /opt/hadoop_tmp/hdfs/namenode
user@master:~$ sudo mkdir -p /opt/hadoop_tmp/hdfs/datanode

Change the directory ownership (the user starql and the group cluster are used here; adapt them to your own user and group):

user@master:~$ sudo chown starql:cluster -R /opt/hadoop-2.7.3/
user@master:~$ sudo chown starql:cluster -R /opt/hadoop_tmp/

Log in as that user:

user@master:~$ sudo su starql

Switch to the directory /opt/hadoop-2.7.3/etc/hadoop/:

starql@master:~$ cd /opt/hadoop-2.7.3/etc/hadoop/

You can use nano to modify the configuration files (use [Ctrl] + [O] followed by [Enter] to save the file and [Ctrl] + [X] to exit). Open the file hadoop-env.sh with nano hadoop-env.sh and modify the following line:

export JAVA_HOME=/usr/lib/jvm/java-8-oracle
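If you are not sure where Java is installed, the following commands can help you find the right path (a sketch for Debian/Ubuntu systems; the java-8-oracle path above is only an example). Strip the trailing /jre/bin/java or /bin/java from the result to get the value for JAVA_HOME:

starql@master:~$ update-alternatives --list java
starql@master:~$ readlink -f $(which java)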

Make sure the path points to your Java 8 installation, then modify the file core-site.xml:

<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://master:9000</value>
   </property>
</configuration>
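Note that fs.default.name is the older, deprecated key for this setting; on Hadoop 2.x you can equivalently use fs.defaultFS (a sketch of the same configuration with the newer key):

<configuration>
   <property>
      <name>fs.defaultFS</name>
      <value>hdfs://master:9000</value>
   </property>
</configuration>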

Then modify the file hdfs-site.xml. You can set dfs.replication to more than 1 for better reliability; this only makes sense when you use more than one node.

<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
   <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:/opt/hadoop_tmp/hdfs/namenode</value>
   </property>
   <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:/opt/hadoop_tmp/hdfs/datanode</value>
   </property>
</configuration>
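Once HDFS is running (see the Start section below), you can check which replication value was actually picked up, for example with the hdfs getconf tool that ships with Hadoop:

starql@master:~$ cd /opt/hadoop-2.7.3/
starql@master:~$ ./bin/hdfs getconf -confKey dfs.replication
1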

If you also want to use YARN, you have to edit the file yarn-site.xml with nano yarn-site.xml:

<configuration>
   <property>
      <name>yarn.resourcemanager.resource-tracker.address</name>
      <value>master:8025</value>
   </property>
   <property>
      <name>yarn.resourcemanager.scheduler.address</name>
      <value>master:8035</value>
   </property>
   <property>
      <name>yarn.resourcemanager.address</name>
      <value>master:8050</value>
   </property>
</configuration>
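If you configure YARN, you will later also have to start it. This is done after HDFS has been started (see the Start section) with the script that ships with Hadoop:

starql@master:~$ cd /opt/hadoop-2.7.3/
starql@master:~$ ./sbin/start-yarn.sh

After that, jps should additionally list a ResourceManager process on the master.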

Rename the file mapred-site.xml.template to mapred-site.xml with the command:

starql@master:~$ mv mapred-site.xml.template mapred-site.xml

Then edit mapred-site.xml with nano mapred-site.xml:

<configuration>
   <property>
      <name>mapreduce.job.tracker</name>
      <value>master:5431</value>
   </property>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>
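Once HDFS and YARN are running (see the Start section), you can test the MapReduce configuration with one of the example jobs that ship with Hadoop. A sketch, assuming you run it from /opt/hadoop-2.7.3/ (the pi arguments 2 and 5 are just small sample values):

starql@master:~$ ./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar pi 2 5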

Create the file masters with the command nano masters and add the name of the master node:

## Name of the master node
master

Then edit the slaves file with nano slaves:

## List with names of the slave nodes
master

Slaves

Do the following steps only if you want to use more than one node:

Add the names of the slaves to the slaves file with nano slaves and remove the master entry. It should look like this:

## List with names of the slave nodes
slave00
slave01
...

Use scp to copy the /opt/hadoop-2.7.3/ directory to the slave nodes when you want to use more than one node. Repeat this command for all the slaves:

starql@master:~$ scp -r /opt/hadoop-2.7.3 slave00:/home/starql/
starql@master:~$ scp -r /opt/hadoop-2.7.3 slave01:/home/starql/
...
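If you have many slaves, a small shell loop saves some typing (a sketch; it assumes the slave names slave00, slave01, ... from the picture and password-less SSH for the user starql):

starql@master:~$ for host in slave00 slave01; do scp -r /opt/hadoop-2.7.3 "$host":/home/starql/; done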

Now create the folder datanode on every slave node and set the correct directory ownership:

user@master:~$ ssh slave00
user@slave00:~$ sudo mv /home/starql/hadoop-2.7.3 /opt/
user@slave00:~$ cd /opt/
user@slave00:~$ sudo mkdir -p /opt/hadoop_tmp/hdfs/datanode
user@slave00:~$ sudo chown starql:cluster -R hadoop_tmp/
user@slave00:~$ sudo chown starql:cluster -R hadoop-2.7.3/
...

Start

Now switch to /opt/hadoop-2.7.3/ with:

starql@master:~$ cd /opt/hadoop-2.7.3/

Format the Hadoop File System with the command:

starql@master:~$ ./bin/hadoop namenode -format
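On Hadoop 2.x the hadoop namenode command is deprecated in favour of the hdfs command; the equivalent call is:

starql@master:~$ ./bin/hdfs namenode -format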

You can now start the Hadoop File System with:

starql@master:~$ ./sbin/start-dfs.sh

Run the command jps to check whether everything is working fine. You should see the following output when you use only one node:

starql@master:~$ jps
8113 Jps
7818 DataNode
7677 NameNode
7997 SecondaryNameNode
...

When you use more than one node, it should look like this:

starql@master:~$ jps
8113 Jps
7677 NameNode
7997 SecondaryNameNode
...
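With more than one node, the DataNode processes run on the slaves. You can check them over SSH, for example like this (the output should contain a DataNode entry; the process IDs will differ):

starql@master:~$ ssh slave00 jps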

If not, then something went wrong:

  • Check that you have set the correct directory ownership and permissions.
  • In the configuration files you have to set the names of the master and the slaves correctly. Do not use master when your master has a different name.

When everything is running fine, you can visit http://localhost:50070 in the browser on your master node for an overview of your namenode.
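As a quick smoke test you can also copy a file into HDFS and list it (a sketch, assuming you are still in /opt/hadoop-2.7.3/; the directory /user/starql is just an example path):

starql@master:~$ ./bin/hdfs dfs -mkdir -p /user/starql
starql@master:~$ ./bin/hdfs dfs -put etc/hadoop/core-site.xml /user/starql/
starql@master:~$ ./bin/hdfs dfs -ls /user/starql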

And Now?
Go to First Run and continue with the next steps.
