instant (almost) gatk
This page outlines how to create a Cloudera cluster in the cloud. You would do this if you want to run analyses with GATK but lack a compute environment of your own. The example here creates a 5-node Cloudera Hadoop cluster. If you have a lot of data or want results faster, you will probably want more nodes. If you are new to the cloud environment, you will likely need to ask your cloud provider to increase your limits for CPU, disk, network, and memory, because genomic files and processing can be very resource intensive.
The steps are as follows:
- Create an account with the cloud provider of your choice, for example Google Compute Engine (GCE): https://cloud.google.com/compute/ or Amazon Web Services (AWS): https://aws.amazon.com/
- Prepare the configuration files
- Create a virtual computer in the cloud
- Log into that node and run the appropriate cluster creation script
- Install GATK4
- Solve your genomic challenge with GATK :-)
And now for the details...
- Prepare the Configuration files:
Let's assume you've provided your charge card to the cloud provider and you have an account.
-GCE
For GCE you'll need your ProjectID, serviceID, JSON key, and ssh.key file. The documentation is found here: https://www.cloudera.com/documentation/director/latest/topics/director_gcp_config_tools.html
The three critical pieces for generating the cluster are:
The projectID is found here:
The serviceID is generated here. Be sure to save the JSON file; we will use it when generating the cluster.
The ssh.key file is generated with the following command:
ssh-keygen -f ~/.ssh/my_google_ssh_key -t rsa
If you have multiple ssh keys and need to make sure you're using the correct key, see this GCE documentation: https://cloud.google.com/compute/docs/instances/adding-removing-ssh-keys#instance-only
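For scripted setups, the same key can be generated without interactive prompts. This is a minimal sketch, not part of the original instructions: the helper name `make_cluster_key`, the empty passphrase (`-N ""`), and the 4096-bit size are all assumptions you may want to change.

```shell
#!/bin/sh
# Hypothetical helper: create an RSA key pair at the given path with no prompts.
# Assumption: an empty passphrase (-N "") is acceptable for this cluster key.
make_cluster_key() {
    key_path="$1"
    if [ -e "$key_path" ]; then
        echo "refusing to overwrite existing key: $key_path" >&2
        return 1
    fi
    mkdir -p "$(dirname "$key_path")"
    ssh-keygen -q -t rsa -b 4096 -N "" -f "$key_path" || return 1
    # ssh refuses private keys that other users can read.
    chmod 600 "$key_path"
}

# Usage (commented out so that sourcing this file has no side effects):
# make_cluster_key "$HOME/.ssh/my_google_ssh_key"
# cat "$HOME/.ssh/my_google_ssh_key.pub"   # public half of the key pair
```

The overwrite guard is there because silently replacing a key that is already registered with running instances would lock you out of them.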
You'll also have to decide which region you want to use; the example uses us-east1. With these parameters, copy the configuration file found here and update it with your values: https://github.com/git4impatient/GATK4onCloudera/blob/master/ClouderaDirectorGCE_Cluster.conf
-AWS
For AWS you'll need your accessKeyId, secretAccessKey, subnetId, region, securityGroupsIds, and ssh.pem file. Please note: when creating a VPC, make sure you use the "VPC Wizard." Do not select "create vpc", as this will cause routing problems as you work on your cluster. With these parameters, copy the configuration file found here and update it with your values: https://github.com/git4impatient/GATK4onCloudera/blob/master/ClouderaDirectorAWS_Cluster.conf
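Either configuration file linked above (GCE or AWS) can also be fetched from the command line instead of copied from the browser. The sketch below assumes GitHub's usual mapping from a "blob" page URL to a raw-file URL on raw.githubusercontent.com; the helper name `blob_to_raw_url` and the output file name `my_cluster.conf` are illustrative only.

```shell
#!/bin/sh
# Hypothetical helper: turn a github.com "blob" page URL into its raw-file URL.
blob_to_raw_url() {
    printf '%s\n' "$1" | sed \
        -e 's#^https://github.com/#https://raw.githubusercontent.com/#' \
        -e 's#/blob/#/#'
}

# Usage (needs network access; my_cluster.conf is just an example file name):
# curl -fsSL -o my_cluster.conf \
#     "$(blob_to_raw_url https://github.com/git4impatient/GATK4onCloudera/blob/master/ClouderaDirectorGCE_Cluster.conf)"
```

After downloading, edit the file with your own values as described above.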
- Create a virtual computer in the cloud
This doesn't need to be a large instance: two cores, 8 GB of RAM, and a 15 GB root filesystem are enough. You can find the supported Linux OS list here: https://www.cloudera.com/documentation/director/latest/topics/director_deployment_requirements.html
Create this computer using the admin console of GCE or AWS.
For GCE your screen will look like this. This example shows setting the following:
- hostname
- zone
- size of the node at 2 vCPUs and 7.5 GB RAM
- a 19 GB hard disk of standard storage (not SSD)
- Linux CentOS 6 for the OS
Click on the blue "create" button on the bottom of the screen and wait for the node to be created by GCE.
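If you prefer the command line to the console, the same settings can be expressed with the gcloud tool. This is a rough sketch under several assumptions that are not in the original: `n1-standard-2` as the 2 vCPU / 7.5 GB machine type, the `centos-6` image family in the `centos-cloud` project (which may no longer be offered), and the example node name and zone. By default the helper only prints the command so you can review it first.

```shell
#!/bin/sh
# Hypothetical helper: print (or run) a gcloud command matching the console settings above.
# Assumptions: n1-standard-2 machine type; centos-6 image family in centos-cloud.
create_director_node() {
    name="$1"
    zone="$2"
    # Build the command as this function's positional parameters.
    set -- gcloud compute instances create "$name" \
        --zone="$zone" \
        --machine-type=n1-standard-2 \
        --boot-disk-size=19GB \
        --boot-disk-type=pd-standard \
        --image-family=centos-6 \
        --image-project=centos-cloud
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"      # default: only show the command
    else
        "$@"           # DRY_RUN=0: actually create the node
    fi
}

# Example node name and zone; with DRY_RUN unset this only prints the command.
create_director_node director-node us-east1-b
```

Run it with `DRY_RUN=0` in the environment once the printed command looks right for your project.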
If you are running Linux locally, you don't have to create the remote node to instantiate the cluster. Just install Cloudera Director locally and then run the bootstrap command. This is documented in the "go" script; effectively, you are running "go" from your local machine. https://github.com/git4impatient/GATK4onCloudera/blob/master/go_script_to_create_cluster.sh
- Log into that node and run the appropriate cluster creation script
You'll need to upload the cluster.conf file you edited with your values above, the script called "go", and the private SSH key. The "go" script is found here: https://github.com/git4impatient/GATK4onCloudera/blob/master/go_script_to_create_cluster.sh
Once you've edited the .conf file, copy all three files to the cloud node. The commands will look something like this:
scp -i my_google_ssh_key go_script_to_create_cluster.sh ip_address_of_director_node:
scp -i my_google_ssh_key my_cluster.conf ip_address_of_director_node:
scp -i my_google_ssh_key my_google_ssh_key ip_address_of_director_node:
Be sure the cluster.conf points to the location where you uploaded my_google_ssh_key on the director node, e.g. /home/myuser/my_google_ssh_key, and not to the location of my_google_ssh_key on your local workstation.
With the three files uploaded via scp to the director node and the .conf file edited to point to the location of the SSH private key, just type in:
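A quick way to catch a wrong key path before starting a long cluster build is to search the .conf file for the path you expect. The sketch below is only a literal text search; the helper name `conf_references_key` is hypothetical, and it does not parse the configuration format or know which setting holds the key.

```shell
#!/bin/sh
# Hypothetical check: does the .conf file mention the key's path on the director node?
conf_references_key() {
    conf_file="$1"
    remote_key_path="$2"
    # -F treats the path as a literal string, -q suppresses output.
    grep -F -q -- "$remote_key_path" "$conf_file"
}

# Usage:
# conf_references_key my_cluster.conf /home/myuser/my_google_ssh_key \
#     || echo "my_cluster.conf does not mention the remote key path" >&2
```

Because it is a substring match, it will also succeed if the file names a longer path that starts the same way, so treat a pass as a sanity check rather than proof.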
# to log in to the remote node
ssh -i my_google_ssh_key ip_address_of_director_node
# to run the creation script
./go_script_to_create_cluster.sh
You will have time for the libation of your choice. GCE finishes in 30 minutes to an hour. AWS is slower, typically around an hour. Yes, this isn't quite "instant", but compared to installing your own hardware it is amazingly fast.
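Since the creation script runs for a long time, a dropped SSH session could interrupt it. One option, offered here as a suggestion rather than part of the original procedure, is to start the script with `nohup` and follow a log file. The helper name `run_detached` and the log file name are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: run a long script detached from the terminal, logging to a file.
run_detached() {
    script="$1"
    log="$2"
    chmod +x "$script"                  # in case the execute bit is missing
    nohup "$script" > "$log" 2>&1 &     # keeps running if the SSH session drops
    echo $!                             # PID of the background job
}

# Usage on the director node:
# run_detached ./go_script_to_create_cluster.sh create_cluster.log
# tail -f create_cluster.log
```

You can log out and back in while it runs; `tail -f` on the log shows the progress so far.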
- Install GATK4
Log into one of the worker nodes in the cluster; I typically use the last-created node shown in the cloud console. In the image below you can see the highest value of the internal IP address and the corresponding external address. For the cluster below you would log in with:
ssh -i martygooglecloud.key 104.196.33.15
You could also use the console or the GCP command line tools to log into the node. Click on the button labeled "SSH" and you'll get a command prompt on that node.
From the command prompt, clone this git project to your current directory and run the install script to install GATK4:
git clone https://github.com/git4impatient/GATK4onCloudera
cd GATK4onCloudera/
bash ./goInstallGATK4.sh
Cloudera help is available here: http://community.cloudera.com/ and of course there is subscription support: http://www.cloudera.com/services-support.html
- Solve your genomic challenge with GATK :-)
This is left as an exercise for the reader (sorry, this is Marty humor)
- What's next for this post?
Azure documentation; the process is very similar.
- Reference Materials
Cloudera Director Documentation: https://www.cloudera.com/documentation/director/latest/topics/director_intro.html