# Distributed Docker Hadoop

This repository demonstrates how to spin up a distributed Hadoop system.

## Prerequisites

Ensure you have Python Anaconda (the Python 3.6 flavor) installed: https://www.anaconda.com/download/.
Further ensure you have a recent version of Docker installed.
The Docker version this example was developed against is:

    $ docker --version
    Docker version 17.05.0-ce, build 89658be

## Setup

We will use Docker Compose to spin up the various Docker containers constituting
our Hadoop system.
To this end let us create a clean Anaconda Python virtual environment and install
a current version of Docker Compose in it:

    $ conda create --name distributed_docker_hadoop python=3.6 --yes
    $ source activate distributed_docker_hadoop
    $ pip install -r requirements.txt
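
If for some reason the repository's `requirements.txt` is not at hand, installing
Docker Compose into the environment directly should work just as well (note this
grabs the latest release rather than whatever version the file pins):

    $ pip install docker-compose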

Make certain `docker-compose` points to this newly installed version in the virtual
environment:

    $ which docker-compose
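
On a typical Anaconda setup this should resolve to a path inside the environment,
along the lines of the following (illustrative only; the exact prefix depends on
where Anaconda is installed):

    /home/<user>/anaconda3/envs/distributed_docker_hadoop/bin/docker-compose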

If this does not point to the `docker-compose` binary in your virtual environment,
reload the virtual environment and check again:

    $ source deactivate
    $ source activate distributed_docker_hadoop
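
As a final sanity check, confirm that the freshly installed binary runs:

    $ docker-compose --version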

## Start cluster

To start up the cluster:

    $ docker-compose up --force-recreate
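
If you prefer to keep the terminal free, you can instead start the services in
detached mode and tail the logs of individual services as needed (the service
name `name-node` below is an assumption; use the names from this repository's
`docker-compose.yml`):

    $ docker-compose up --force-recreate -d
    $ docker-compose logs -f name-node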

Once all Docker services are up you can visit a couple of web UIs in your browser
to study the overall status of your cluster (or probe them from the command line,
as shown below):

* [The name node](http://localhost:50070)
* [The resource manager](http://localhost:8088)
* [The MapReduce job history server](http://localhost:19888)
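
The same endpoints also lend themselves to a scripted readiness check (assuming
the port mappings listed above; the command below should print `200` once the
name node is serving):

    $ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50070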

## Scaling out

Hadoop is well known for its ability to scale out, i.e. to run easily across numerous hosts.
Since we are using Docker Compose to spin up our virtual hosts in this toy example, we can
play around with scaling out by using Docker Compose's ability to scale up individual services.

### Data nodes

Bring up the Hadoop cluster as described above.
Browse the current list of data nodes by visiting the web interface of the name node:

`http://localhost:50070/dfshealth.html#tab-datanode`

You should see a single data node listed.
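
Alternatively, the same information is available from HDFS's admin report (a sketch;
it assumes the name node service is called `name-node` and has the Hadoop binaries
on its `PATH`):

    $ docker-compose exec name-node hdfs dfsadmin -report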

In a separate terminal window activate the Python virtual environment and scale
up the data node service as follows:

    $ source activate distributed_docker_hadoop
    $ docker-compose up --scale data-node=2
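
To double-check from the command line that two containers now back the `data-node`
service:

    $ docker-compose ps data-node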

Back in the name node web interface you should now see two data nodes listed.

## Notes

### Hostnames

Hostnames are not allowed to contain underscores (`_`), so make certain to spell
out longer hostnames with dashes (`-`) instead, e.g. `data-node` rather than `data_node`.
In this example we ensure this by using dashes in the names of our Docker services.