The project is to create and deploy a virtual environment that can be used for machine learning with Python and Scala. The coding is done locally in a Jupyter Notebook.
The setup is intended to be as automatic as possible.
Puppet was chosen to provision the Virtual Machine (VM) with the packages needed for the machine learning task.
The task was solved using two approaches:
- running calculations on the local machine, with VirtualBox and Vagrant as additional packages
- utilizing the Cloud resources
Given more time for the task, I would write the Puppet code more carefully and make it more versatile, for example by supporting different Linux distributions. Currently it works on Ubuntu/Trusty64. Another option is to create a mirror for the packages and set up a proxy server, in case the public repositories become unavailable.
The Virtual Machine (VM) provisioning can be done in several ways. The Puppet scripts provided will work with the two following scenarios.
Scenario I. One could use a local workstation with VirtualBox and Vagrant installed.
Vagrant must be installed from the Vagrant site, not from the Linux distribution repository. Follow the installation instructions on that site.
This setup was tested with Vagrant 1.8.6 and VirtualBox 5.1.8 on Ubuntu Xenial as the host.
Install VirtualBox according to its own installation instructions.
Then, in order to launch the VM, execute the following:
git clone https://github.com/raalesir/automated_environment.git
That clones the repo into the automated_environment directory. The repo contains:
alexey@alexey-iMac:~/Projects/combient$ tree -L 3 manifests/ modules/
├── manifests
│   └── site.pp
├── modules
│   └── dependencies
│       └── manifests
├── README.md
└── Vagrantfile
- The Vagrantfile from the repo should contain the following entries:
Vagrant.configure(2) do |config|
  config.vm.box = "ubuntu/trusty64"
  config.vm.provider "virtualbox" do |vb|
    vb.memory = "4096" # the more the better...
  end
  # Install Puppet first, so that the puppet provisioner below can run.
  config.vm.provision "shell",
    inline: "sudo apt-get install -y puppet-common"
  config.vm.provision "puppet" do |puppet|
    puppet.manifests_path = "manifests"
    puppet.module_path = "modules"
    puppet.manifest_file = "site.pp"
  end
end
- Download the Vagrant box:
vagrant box add ubuntu/trusty64
- Bring the VM up:
vagrant up
That will provision the VM and install all dependencies according to the instructions.
- After the provisioning has finished, log in to the VM:
vagrant ssh
and start Jupyter there:
jupyter notebook
- Decide how to handle the firewall on the host machine: either turn it off or open at least port 8887. Port 8888 inside the VM should be open by default.
- On the host machine open another terminal tab, go to the directory containing the Vagrantfile and issue:
ssh -i .vagrant/machines/default/virtualbox/private_key -N -f -L localhost:8887:localhost:8888 -p 2200 vagrant@localhost
where:
  - .vagrant/machines/default/virtualbox/private_key is the SSH key created by Vagrant,
  - the option -p 2200 gives the port used to SSH into the VM (it could be 2222, depending on your situation; the vagrant ssh-config command prints the actual value),
  - 8888 and 8887 are the ports Jupyter is reached on inside the VM and on the host machine, respectively.
- Launch a web browser on the host machine and point it to
http://localhost:8887
- You should find yourself in the Jupyter GUI, so you can start uploading the Notebook and source files.
I used this approach until I reached line 4 or 5 in the Notebook. After that the memory demands became too severe for my workstation. Use this approach only if you have 8-16 GB of RAM. Otherwise switch to Scenario II.
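The memory requirement above can be checked before starting the VM. The following is a minimal sketch, assuming a Linux host (it relies on os.sysconf) and Python 3; the function names and the 7.5 GiB default threshold are illustrative assumptions, not part of the repo. The threshold sits slightly below 8 because the operating system usually reports a little less than the nominal RAM size.

```python
import os


def total_ram_gib():
    """Return the total physical memory of this host in GiB (Linux)."""
    page_size = os.sysconf("SC_PAGE_SIZE")    # bytes per memory page
    page_count = os.sysconf("SC_PHYS_PAGES")  # number of physical pages
    return page_size * page_count / 1024 ** 3


def enough_ram_for_local_vm(total_gib, minimum_gib=7.5):
    """True if the host reaches the (assumed) lower bound for Scenario I."""
    return total_gib >= minimum_gib


if __name__ == "__main__":
    ram = total_ram_gib()
    if enough_ram_for_local_vm(ram):
        print(f"{ram:.1f} GiB RAM detected: the local VM should be feasible")
    else:
        print(f"{ram:.1f} GiB RAM detected: consider Scenario II")
```

Keep in mind that the Vagrantfile gives the VM 4096 MB, so the host needs noticeably more than that to stay responsive.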
Scenario II. Here you will use an already existing VM at EC2. EC2 uses Xen as its hypervisor, so VirtualBox will not work there. This means that the local workstation from Scenario I cannot simply be replaced with the EC2 node.
Luckily, we can still go with the repo you just cloned.
- Log in to the EC2 node:
ssh -X -i ACE_Challenge.pem ubuntu@ec2-52-212-62-56.eu-west-1.compute.amazonaws.com
- After logging in to /home/ubuntu, install git and clone the repo:
sudo apt-get install -y git && git clone https://github.com/combient/Challenge_Alexey_S.git
(That asks for a username and password, presumably because the repo is private.)
- Install Puppet:
sudo apt-get install -y puppet-common
- Apply the Puppet manifest:
cd Challenge_Alexey_S && sudo puppet apply --modulepath=/home/ubuntu/Challenge_Alexey_S/modules manifests/site.pp
This tells Puppet to apply the rules from its scripts to the current machine, i.e. to the EC2 node.
- Copy the source files and the test notebook to $HOME/Challenge_Alexey_S:
scp -i ACE_Challenge.pem test.csv.gz train.csv.gz ubuntu@ec2-52-212-62-56.eu-west-1.compute.amazonaws.com:/home/ubuntu/Challenge_Alexey_S
or use the Jupyter notebook GUI later.
- Decide how to handle the firewall: either turn it off or open at least port 8888.
- If SPARK_HOME is not set, which can be checked with
env | grep SPARK_HOME
log out from the VM and repeat step 1 (the SSH login):
exit
ssh -X -i ACE_Challenge.pem ubuntu@ec2-52-212-62-56.eu-west-1.compute.amazonaws.com
or execute
source /bin/profile
inside the VM.
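The SPARK_HOME check from the last step can also be done from Python, for example in the first cell of a notebook. This is only a sketch under the assumption that Python 3 is used; the function name spark_home_status is made up for illustration and does not come from the repo.

```python
import os


def spark_home_status(env=None):
    """Describe whether SPARK_HOME is set and points to an existing directory.

    `env` defaults to the environment of the current process; passing a
    plain dict makes the function easy to try out.
    """
    env = os.environ if env is None else env
    path = env.get("SPARK_HOME")
    if not path:
        return "SPARK_HOME is not set"
    if not os.path.isdir(path):
        return "SPARK_HOME points to a missing directory: " + path
    return "SPARK_HOME=" + path


if __name__ == "__main__":
    print(spark_home_status())
```

If the function reports that the variable is not set, logging out and in again as described above is the first thing to try.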
If the Jupyter installation went well, we can try to launch the notebook:
ubuntu@ip-172-31-20-22:~$ jupyter notebook
That should bring you to a text-mode (ASCII) interface.
- Make sure that nothing is listening on port 8887 of the local machine, i.e. that
lsof -i :8887
prints nothing. Otherwise choose another port above 1024 and test again.
- If everything went well, the Jupyter server is running, and we would like to connect to it from the local machine. To do that, use local port forwarding. By default Jupyter runs at http://localhost:8888/ (inside the EC2 node), so we can use 8887 on our local machine:
ssh -i /home/alexey/Downloads/ACE_Challenge.pem -N -f -L localhost:8887:localhost:8888 ubuntu@ec2-52-212-62-56.eu-west-1.compute.amazonaws.com
- Now open the browser on the local machine and enter:
localhost:8887
That should bring us to http://localhost:8887/tree#notebooks.
- Upload a notebook or create a new one as needed. Upload or delete data files with the GUI, etc.
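The port check and the port-forwarding command from the steps above can be sketched in Python as follows. This is an illustration under assumptions, not part of the repo: the helper names are hypothetical, Python 3.8+ is assumed (for shlex.join), and the host name in the usage part is a placeholder. The script only prints the ssh command; it does not open the tunnel.

```python
import shlex
import socket


def local_port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP connection to host:port fails, i.e. nothing listens there."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        return sock.connect_ex((host, port)) != 0


def tunnel_command(key_path, remote, local_port=8887, remote_port=8888, ssh_port=None):
    """Build the ssh local port forwarding command as an argument list."""
    cmd = ["ssh", "-i", key_path, "-N", "-f",
           "-L", f"localhost:{local_port}:localhost:{remote_port}"]
    if ssh_port is not None:
        # Needed for the Vagrant VM, where SSH listens on e.g. 2200 or 2222.
        cmd += ["-p", str(ssh_port)]
    cmd.append(remote)
    return cmd


if __name__ == "__main__":
    port = 8887
    # Walk upwards until a free local port is found (all candidates are above 1024).
    while not local_port_is_free(port):
        port += 1
    # "ubuntu@ec2-host.example.com" is a placeholder, not a real address.
    cmd = tunnel_command("ACE_Challenge.pem", "ubuntu@ec2-host.example.com", local_port=port)
    print(shlex.join(cmd))
```

For the Vagrant setup the same helper can be called with the Vagrant key, vagrant@localhost as the remote and ssh_port=2200.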
Please estimate the memory consumption of the tasks and pass that estimate to the data engineers, so that they know what resources to allocate.
It would also be great if you listed the packages that will be needed, together with their versions, especially for packages installed by pip. Versions can be a source of trouble, both because the syntax may differ between versions and because dependencies between packages have to be satisfied.
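One possible way to collect both pieces of information from inside a notebook is sketched below. It assumes Python 3.8+ (for importlib.metadata) on Linux or macOS (the resource module is Unix-only); the function names are hypothetical and the package names in the usage part are placeholders for whatever the notebook actually imports.

```python
import resource
import sys
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def peak_memory_mb():
    """Peak resident memory of the current process in MB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 ** 2 if sys.platform == "darwin" else 1024
    return peak / divisor


if __name__ == "__main__":
    # Placeholder package names; replace them with the ones the notebook uses.
    for name, version in installed_versions(["numpy", "pandas"]).items():
        print(name, version if version is not None else "not installed")
    print(f"peak memory so far: {peak_memory_mb():.0f} MB")
```

Calling peak_memory_mb() at the end of the notebook gives a rough lower bound for the memory the data engineers should allocate; pip freeze gives the complete list of pip-installed packages.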
As soon as the infrastructure is ready, you should do the following:
- Ask how to access the machine with the installed infrastructure, i.e. something like:
ssh -X -i ACE_Challenge.pem ubuntu@ec2-52-212-62-56.eu-west-1.compute.amazonaws.com
or
vagrant ssh
depending on the setup.
- After logging in, start Jupyter:
ubuntu@ip-172-31-20-22:~$ jupyter notebook
That should bring you to a text-mode (ASCII) interface. There is no need to interact with it; just leave it running.
- From another terminal tab on the local machine execute:
ssh -i ACE_Challenge.pem -N -f -L localhost:8887:localhost:8888 ubuntu@ec2-52-212-62-56.eu-west-1.compute.amazonaws.com
or something like:
ssh -i .vagrant/machines/default/virtualbox/private_key -N -f -L localhost:8887:localhost:8888 -p 2200 vagrant@localhost
depending on the setup.
- Start your local web browser and point it to:
http://localhost:8887/tree#notebooks
- The notebook has an intuitive GUI for creating and modifying files and notebooks.
- One can upload source files, like test.csv.gz and train.csv.gz, from the local machine to the EC2 node directly from the Jupyter web interface.