
Commit 43993a0

Initial commit
0 parents  commit 43993a0

10 files changed, +281 -0 lines

.env

+4

hadoop_version=2.8.1
hadoop_root=/hadoop
hadoop_mirror=http://mirror.dkd.de/apache/hadoop/common
image_version=v0.1

.gitignore

+1

.idea

LICENSE

+27

Copyright (c) 2017 - present, Georg Walther
All rights reserved.

Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice, this
  list of conditions and the following disclaimer in the documentation and/or
  other materials provided with the distribution.

* Neither the name of the {organization} nor the names of its
  contributors may be used to endorse or promote products derived from
  this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

README

+82

# Distributed Docker Hadoop

This repository demonstrates how to spin up a distributed Hadoop system.

## Prerequisites

Ensure you have Python Anaconda (the Python 3.6 flavor) installed: https://www.anaconda.com/download/.
Further ensure you have a recent version of Docker installed.
The Docker version I developed this example on is:

    $ docker --version
    Docker version 17.05.0-ce, build 89658be

## Setup

We will use Docker Compose to spin up the various Docker containers constituting
our Hadoop system.
To this end, let us create a clean Anaconda Python virtual environment and install
a current version of Docker Compose into it:

    $ conda create --name distributed_docker_hadoop python=3.6 --yes
    $ source activate distributed_docker_hadoop
    $ pip install -r requirements.txt

Make certain `docker-compose` points to this newly installed version in the virtual
environment:

    $ which docker-compose

In case this does not point to the `docker-compose` binary in your virtual environment,
reload the virtual environment and check again:

    $ source deactivate
    $ source activate distributed_docker_hadoop
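
As a final sanity check, the version reported here should match the one pinned
in `requirements.txt` (1.15.0):

    $ docker-compose --version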

## Start cluster

To start up the cluster:

    $ docker-compose up --force-recreate

Once all Docker services are up you can visit a couple of GUIs in your browser
to study the overall status of your cluster:

* [The name node](http://localhost:50070)
* [The resource manager](http://localhost:8088)
* [The MapReduce job history server](http://localhost:19888)

## Scaling out

Hadoop is well known for its ability to scale out, i.e. to run across numerous hosts with ease.
Since we are using Docker Compose to spin up our virtual hosts in this toy example, we can
play around with scaling out by using Docker's ability to scale up individual services.

### Data nodes

Bring up the Hadoop cluster as described above.
Browse the current list of data nodes by visiting the web interface of the name node:

`http://localhost:50070/dfshealth.html#tab-datanode`

You should see a single data node, like so:

![image](https://user-images.githubusercontent.com/3273502/29886791-98f586f4-8dbb-11e7-9bbb-ca6d8314de2f.png)

In a separate terminal window, activate the Python virtual environment and scale
up the data node service as follows:

    $ source activate distributed_docker_hadoop
    $ docker-compose up --scale data-node=2

Back in the name node web interface you should now see two data nodes:

![image](https://user-images.githubusercontent.com/3273502/29886878-e00ef7a0-8dbb-11e7-8e91-54117244b115.png)
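
To double-check the same information from the command line, you can query the
name node directly; the path below assumes the defaults from `.env`
(`hadoop_root=/hadoop`, `hadoop_version=2.8.1`):

    $ docker-compose exec name-node /hadoop/hadoop-2.8.1/bin/hdfs dfsadmin -report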

## Notes

### Hostnames

Hostnames are not allowed to contain underscores (`_`), so make certain to
spell longer hostnames with dashes (`-`) instead.
In this example we ensure this by using dashes in the names of our Docker services.

docker-compose.yml

+89

version: '3.3'
services:

  base:
    build:
      context: ./images/base
      dockerfile: Dockerfile
      args:
        hadoop_version: ${hadoop_version}
        hadoop_root: ${hadoop_root}
        hadoop_mirror: ${hadoop_mirror}
    image: distributed_docker_hadoop:${image_version}

  name-node:
    image: distributed_docker_hadoop:${image_version}
    networks:
      - hadoop_net
    command: >
      bash -c '
      yes | ${hadoop_root}/hadoop-${hadoop_version}/bin/hdfs namenode -format &&
      ${hadoop_root}/hadoop-${hadoop_version}/bin/hdfs namenode
      '
    ports:
      - "50070:50070"
    depends_on:
      - base

  data-node:
    image: distributed_docker_hadoop:${image_version}
    networks:
      - hadoop_net
    command: >
      bash -c '
      ${hadoop_root}/hadoop-${hadoop_version}/bin/hdfs datanode
      '
    depends_on:
      - base

  resource-manager:
    image: distributed_docker_hadoop:${image_version}
    networks:
      - hadoop_net
    command: >
      bash -c '
      ${hadoop_root}/hadoop-${hadoop_version}/bin/yarn resourcemanager
      '
    ports:
      - "8088:8088"
    depends_on:
      - base

  node-manager:
    image: distributed_docker_hadoop:${image_version}
    networks:
      - hadoop_net
    command: >
      bash -c '
      ${hadoop_root}/hadoop-${hadoop_version}/bin/yarn nodemanager
      '
    depends_on:
      - base

  web-app-proxy:
    image: distributed_docker_hadoop:${image_version}
    networks:
      - hadoop_net
    command: >
      bash -c '
      ${hadoop_root}/hadoop-${hadoop_version}/bin/yarn proxyserver
      '
    depends_on:
      - base

  map-reduce-job-history:
    image: distributed_docker_hadoop:${image_version}
    networks:
      - hadoop_net
    command: >
      bash -c '
      ${hadoop_root}/hadoop-${hadoop_version}/bin/mapred historyserver
      '
    ports:
      - "19888:19888"
    depends_on:
      - base

networks:
  hadoop_net:
    driver: bridge
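
One way to sanity-check this file, together with the variables defined in `.env`,
is to render the fully interpolated configuration:

    $ docker-compose config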

images/base/Dockerfile

+42

FROM ubuntu:17.10
MAINTAINER Georg Walther (contact@georg.io)

ARG hadoop_version
ARG hadoop_root
ARG hadoop_mirror

ENV HADOOP_PREFIX=$hadoop_root/hadoop-$hadoop_version
ENV HADOOP_HOME=$HADOOP_PREFIX
ENV HADOOP_COMMON_HOME=$HADOOP_PREFIX
ENV HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
ENV HADOOP_HDFS_HOME=$HADOOP_PREFIX
ENV HADOOP_MAPRED_HOME=$HADOOP_PREFIX
ENV HADOOP_YARN_HOME=$HADOOP_PREFIX

ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/

USER root

RUN echo "Install required system packages ..." \
    && apt-get update \
    && apt-get --yes install \
        openjdk-8-jre \
        openssh-client \
        wget

RUN echo "Download and extract Hadoop source ..." \
    && mkdir -p ${hadoop_root} \
    && cd ${hadoop_root} \
    && wget ${hadoop_mirror}/hadoop-${hadoop_version}/hadoop-${hadoop_version}.tar.gz \
    && tar xvf hadoop-${hadoop_version}.tar.gz --gzip \
    && rm hadoop-${hadoop_version}.tar.gz

RUN echo "Create directories for HDFS nodes and logging ..." \
    && mkdir -p /hdfs_logs \
    && mkdir -p /hdfs_data \
    && mkdir -p ${hadoop_root}/hadoop-${hadoop_version}/logs \
    && chmod -R 755 ${hadoop_root}/hadoop-${hadoop_version}/logs

ADD ./configurations/core-site.xml ${hadoop_root}/hadoop-${hadoop_version}/etc/hadoop/core-site.xml
ADD ./configurations/hdfs-site.xml ${hadoop_root}/hadoop-${hadoop_version}/etc/hadoop/hdfs-site.xml
ADD ./configurations/yarn-site.xml ${hadoop_root}/hadoop-${hadoop_version}/etc/hadoop/yarn-site.xml
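
If you just want to (re)build this image without bringing up the whole cluster,
you can target the `base` service defined in `docker-compose.yml`:

    $ docker-compose build base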

images/base/configurations/core-site.xml

+9

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://name-node</value>
    </property>
</configuration>

images/base/configurations/hdfs-site.xml

+13

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/hdfs_logs</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/hdfs_data</value>
    </property>
</configuration>

images/base/configurations/yarn-site.xml

+13

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>resource-manager</value>
    </property>
    <property>
        <name>yarn.web-proxy.address</name>
        <value>web-app-proxy</value>
    </property>
</configuration>

requirements.txt

+1

docker-compose==1.15.0
