Builds a data science work environment for Russell Jurney's book Agile Data Science.
You will need VirtualBox and Vagrant installed and working. If you are using a version of Vagrant older than 1.3.0, you will also need Salty Vagrant. Salty Vagrant requires Ruby development libraries (ruby-dev on Ubuntu/Debian or ruby-devel on RHEL/Fedora/CentOS).
- Clone this repo and edit the Vagrantfile to customize your VM to taste.
- Edit pillar/data.sls and change accept_oracle_download_terms to true.
- Run vagrant up
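The edit to pillar/data.sls can also be scripted. The following is a minimal sketch, assuming the file contains a YAML line of the form accept_oracle_download_terms: false (the exact layout of that file is an assumption, and accept_oracle_terms is a hypothetical helper name, not part of this repo):

```shell
# Sketch only: assumes pillar/data.sls has a line such as
#   accept_oracle_download_terms: false
# The helper name accept_oracle_terms is hypothetical.

# accept_oracle_terms FILE: set accept_oracle_download_terms to true in FILE.
accept_oracle_terms() {
    file=$1
    # Keep the key and its indentation; replace whatever follows the colon.
    sed 's/^\([[:space:]]*accept_oracle_download_terms:\).*/\1 true/' "$file" > "$file.tmp" &&
        mv "$file.tmp" "$file"
}
```

From the root of the cloned repo, this would be called as accept_oracle_terms pillar/data.sls, followed by vagrant up. Editing the file by hand achieves the same result.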
See the Installation notes section below for comments on Java versions, operating systems, and details on a misleading error message you may receive.
The method for agreeing to the Oracle terms and downloading Java is based on the Chef Java Cookbook.
During the initial run the components are downloaded, installed, and in some cases built. During subsequent runs only package/git updates (if any) are applied. On my machine with two CPUs assigned to the VM the initial run takes 21 minutes and subsequent runs take 1.5 minutes.
The VM environment includes the following major components:
Please be aware that Oracle JDK 6u45 is known to contain several security vulnerabilities, so be careful if you access the internet from the virtual machine. See the Java versions section below for further comments on choosing a different version.
Also included are many libraries, dependencies, and build tools. For a more complete list see data.sls in this repo, and Russell Jurney's requirements.txt.
The book Agile Data Science contains instructions for the tools. This section documents small differences between the book and this environment.
The default base directory is /home/vagrant/agiledata, which contains the following:
- book-code: a clone of the Agile_Data_Code repo.
- downloads: tarfiles downloaded during installation.
- env.sh: source this script to set JAVA_HOME and add all tool binaries to your PATH.
- linkjars.sh: see the Registering jarfiles in pig section below.
- software: tools and libraries are installed in this directory.
- venv: the python virtualenv used in the book.
The installation process creates and runs the script linkjars.sh. This script finds all jarfiles in the software directory and creates symlinks to them in software/lib. The symlinks make it easier to register jarfiles in pig scripts. For example, to register MongoDB jars in your pig script, you can use
REGISTER /home/vagrant/agiledata/software/lib/mongo*.jar
rather than
REGISTER /home/vagrant/agiledata/software/mongo-hadoop/flume/target/mongo-flume-1.1.0-SNAPSHOT.jar
REGISTER /home/vagrant/agiledata/software/mongo-hadoop/target/mongo-hadoop-1.1.0-SNAPSHOT.jar
REGISTER /home/vagrant/agiledata/software/mongo-hadoop/core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar
REGISTER /home/vagrant/agiledata/software/mongo-hadoop/pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar
REGISTER /home/vagrant/agiledata/software/lib/mongo-java-driver-2.11.1.jar
The linkjars.sh script is run during installation and each time the VM is rebooted. It is unlikely you will need to run it manually, but the script is provided just in case. Please note that the following jarfiles are actual files rather than symlinks, and will not be affected by running the script:
mongo-java-driver-2.11.1.jar
avro-1.7.4.jar
json-simple-1.1.1.jar
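To illustrate the idea behind the symlinking, here is a minimal sketch of how such a script could work. It is not the actual linkjars.sh from this repo; the function name link_jars and its behavior of skipping real files already present in lib are assumptions made for illustration:

```shell
# Sketch only: NOT the real linkjars.sh. link_jars is a hypothetical helper.

# link_jars SOFTWARE_DIR: symlink every *.jar under SOFTWARE_DIR into SOFTWARE_DIR/lib.
link_jars() {
    # Resolve to an absolute path so the symlinks stay valid from any directory.
    software_dir=$(cd "$1" && pwd) || return 1
    lib_dir=$software_dir/lib
    mkdir -p "$lib_dir"
    # Prune lib itself so existing links and files there are not re-linked.
    find "$software_dir" -path "$lib_dir" -prune -o -type f -name '*.jar' -print |
    while IFS= read -r jar; do
        target=$lib_dir/$(basename "$jar")
        # Leave actual files (not symlinks) in lib untouched.
        if [ -e "$target" ] && [ ! -L "$target" ]; then
            continue
        fi
        ln -sf "$jar" "$target"
    done
}
```

Inside the VM, this sketch would be invoked as link_jars /home/vagrant/agiledata/software. If two jarfiles share a basename, the one found last wins.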
Many factors can influence your choice of Java version. Recommending a specific Java version is a dubious proposition, like providing health advice to strangers.
This project conservatively uses Oracle JDK 1.6, the version specified in Pig's Getting Started doc and historically used by enterprise Hadoop installations.
However, you do have other options:
- Pig has been compatible with 1.7 for a while
- CDH4 works with 1.7
- MapR and Hortonworks work with 1.7 and will even work with OpenJDK
You may need to consult your organization, your sysadmin, your vendor, and/or your conscience before making this decision.
This environment should work on any system that can run VirtualBox and Vagrant. If you experience problems installing on Windows related to changing file permissions (look for Failed to change mode to 755 in the output from the installation process), you could try deleting line 13 in oracle_java.sls, which reads:
- mode: 755
Windows does not have the same concept of file permissions as Unix-like and POSIX-compliant operating systems.
The default VM (configured in the Vagrantfile) is Ubuntu Precise x64. I have also tested with Fedora 18. The environment may work using other Red Hat- or Debian-based distros as well.
Salt 0.15.x is affected by issue saltstack/salt#4904, causing it to exit with code 2 rather than code 0 on a successful run. Vagrant interprets this code as an error, and displays the following message:
The following SSH command responded with a non-zero exit status.
Vagrant assumes that this means the command failed!
salt-call state.highstate -l debug
True errors in building the agiledata environment are much uglier than this. However, if you'd like to verify the installation, ssh into the VM with vagrant ssh and then run sudo salt-call state.highstate -l debug. This is a subsequent run, so it should take only a minute or two to complete. Since you are running the state directly rather than through Vagrant, you should see a true return code on success.
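To see the return code explicitly, the command can be wrapped in a small helper that prints the exit status. This is a minimal sketch; run_and_report is a hypothetical name, not something provided by Salt, Vagrant, or this repo:

```shell
# Sketch only: run_and_report is a hypothetical helper.

# run_and_report CMD [ARGS...]: run a command, print its exit status, and return it.
run_and_report() {
    status=0
    # Capture a non-zero status without aborting under `set -e`.
    "$@" || status=$?
    echo "exit status: $status"
    return "$status"
}
```

Inside the VM (after vagrant ssh), it could be used as run_and_report sudo salt-call state.highstate -l debug. The last line of output then shows the exit status that Vagrant would have seen.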