Add basic Documentation for setting up on Google Cloud #340
Conversation
Test PASSed.
docs/cloud/google-cloud.rst
Outdated
.. code:: bash

    wget https://gist.githubusercontent.com/Georgehe4/6bb1c142a9f68f30f38d80cd9407120a/raw/9b903e3b8746ee8f25911fe98925b53e9777002f/mango_install.sh
Can we put this script in Mango scripts?
Added
docs/cloud/google-cloud.rst
Outdated
gs://mango-initialization-bucket/mango_install.sh

Once the above steps are completed, simply ssh into the master node to run Mango.
Do you have the scripts for running docker?
Added
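For reference, a minimal sketch of ssh-ing into the Dataproc master node, assuming the default `<cluster-name>-m` naming and a locally configured `gcloud`; the cluster name and zone below are placeholders:

```bash
# Placeholders; substitute your own cluster name and zone.
CLUSTER_NAME=mango-cluster
ZONE=us-west1-b

# Dataproc names the master node "<cluster-name>-m" by default.
gcloud compute ssh ${CLUSTER_NAME}-m --zone=${ZONE}
```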
docs/index.rst
Outdated
@@ -31,6 +31,12 @@ variety of platforms.

   docker/docker-examples

.. toctree::
   :caption: Cloud
Google Cloud
We might add an AWS Section later?
Test PASSed.
Test PASSed.
# update the apt package index:
sudo apt-get -y update
# finally, install docker
sudo apt-get -y install docker-ce
newline
done
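For context, the snippet above assumes Docker's apt repository has already been configured earlier in the script; on a Debian-based Dataproc image the full sequence typically looks roughly like the sketch below (standard Docker CE setup, not copied from the install script itself):

```bash
# Prerequisites for fetching Docker's repository over HTTPS
sudo apt-get -y install apt-transport-https ca-certificates curl gnupg2 software-properties-common

# Add Docker's official GPG key and the stable apt repository for Debian
curl -fsSL https://download.docker.com/linux/debian/gpg | sudo apt-key add -
sudo add-apt-repository \
  "deb [arch=amd64] https://download.docker.com/linux/debian $(lsb_release -cs) stable"

# Update the apt package index and install Docker CE
sudo apt-get -y update
sudo apt-get -y install docker-ce
```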
scripts/google_cloud_docker_run.sh
Outdated
-e HIVE_DIR=${HIVE_DIR} \
-e HIVE_CONF_DIR=${HIVE_CONF_DIR} \
-e SPARK_DIST_CLASSPATH="/usr/lib/hadoop/etc/hadoop:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*" \
--entrypoint=/opt/cgl-docker-lib/mango/bin/mango-notebook \
when you were accessing google storage through docker, did it need any access keys?
Yes, I still haven't resolved the access-key issue among the Spark workers for BAM files in gs. For now, accessing HDFS works fine, but gs still needs to be investigated.
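One workaround, assuming the Dataproc GCS connector is available on the master node, is to stage the data from gs into HDFS before launching the notebook; a sketch using the public 1000 Genomes BAM referenced elsewhere in this PR:

```bash
# Copy a BAM from Google Cloud Storage into cluster HDFS so the Spark
# workers read it over HDFS instead of gs://.
hadoop distcp \
  gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/150140/alignment/150140.chrom20.ILLUMINA.bwa.CHM1.20131218.bam \
  hdfs:///user/$(whoami)/150140.chrom20.bam
```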
@@ -0,0 +1,59 @@
#!/usr/bin/env bash
let's put these in bin/GCE/
done
fi

if [[ "${ROLE}" == 'Master' ]]; then
    conda install jupyter
Why are you installing docker here? docker is installed in the container, right?
Docker is not provided on a default GCE VM.
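For context, the `ROLE` check in the diff above follows the usual Dataproc initialization-action pattern of reading the node's role from instance metadata; a minimal sketch, assuming the standard Dataproc metadata helper is present on the image:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Dataproc exposes the node role ("Master" or "Worker") via instance metadata.
ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)

if [[ "${ROLE}" == 'Master' ]]; then
    # Master-only setup, e.g. Jupyter for mango-notebook.
    conda install -y jupyter
fi
```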
Test PASSed.
Test PASSed.
docs/cloud/google-cloud.rst
Outdated
hdfs dfs -put /<local machine path> /<hdfs path>

An example docker startup script is available in the Mango `scripts directory <https://github.com/bigdatagenomics/mango/blob/master/bin/gce/google_cloud_docker_run.sh>`__ for running mango-notebook (run with root permissions to work with docker).
Can you give the code to copy this from docker so a user can run the script?
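A possible snippet for the docs, assuming the script ends up at `bin/gce/google_cloud_docker_run.sh` on the master branch (path taken from the link above):

```bash
# Fetch the example startup script from the Mango repository and run it
# with root permissions so it can talk to the Docker daemon.
wget https://raw.githubusercontent.com/bigdatagenomics/mango/master/bin/gce/google_cloud_docker_run.sh
sudo bash google_cloud_docker_run.sh
```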
Test PASSed.
bin/gce/google_cloud_docker_run.sh
Outdated
@@ -0,0 +1,37 @@
export SPARK_HOME=/usr/lib/spark
export SPARK_CONF_DIR=/usr/lib/spark/conf
This file is specifically for mango-notebook. We could pass a user variable here that states whether to run the notebook or the browser, based on the --entrypoint flag.
separate into run-notebook.sh and run-browser.sh
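Before that split, one way to do the toggle would be an environment variable that selects the `--entrypoint`; this is purely illustrative, and `MANGO_ENTRYPOINT` plus the browser launcher path are hypothetical, not names from the repo:

```bash
# Hypothetical toggle: MANGO_ENTRYPOINT=notebook (default) or browser.
MANGO_ENTRYPOINT=${MANGO_ENTRYPOINT:-notebook}

if [[ "${MANGO_ENTRYPOINT}" == "notebook" ]]; then
    ENTRYPOINT=/opt/cgl-docker-lib/mango/bin/mango-notebook
else
    # Assumed path for the browser launcher; adjust to the actual script name.
    ENTRYPOINT=/opt/cgl-docker-lib/mango/bin/mango-submit
fi

# ${ENTRYPOINT} would then be passed to docker run via --entrypoint=${ENTRYPOINT}.
```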
echo "Jupyter Notebook extensions installed!" | ||
fi | ||
|
||
pip install cigar |
shouldn't cigar be installed in the virtual env?
This is so cigar is present on the worker nodes (it is also installed in the venv).
What configurations have you been setting to port python packages over to the worker nodes on the cluster? It looks like the python binary for the worker nodes is not being overwritten by the flags sent to Spark, which leads to "cigar not found" issues if it is not installed this way.
won't this be installed outside of the virtual env? Do you know why cigar is causing an issue and not other imports?
Cigar will be installed in the virtual env, just not through this process.
Since the pyspark jobs depend on cigar on the worker nodes, the Python distribution on the worker nodes needs to have cigar present. I don't believe we use any other imports in pyspark for mango-notebook.
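For reference, the usual way to point the Spark executors at a specific Python (and its installed packages) is the PYSPARK_PYTHON environment variable or the equivalent Spark conf; a sketch, with the /opt/conda path assumed rather than taken from this PR:

```bash
# Make driver and executors use the same interpreter so packages such as
# cigar resolve on the worker nodes.
export PYSPARK_PYTHON=/opt/conda/bin/python
export PYSPARK_DRIVER_PYTHON=/opt/conda/bin/python

# Equivalent per-job settings on spark-submit (Spark 2.1+):
#   --conf spark.pyspark.python=/opt/conda/bin/python
#   --conf spark.pyspark.driver.python=/opt/conda/bin/python
```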
Test PASSed.
Test PASSed.
Test PASSed.
Hey @akmorrow13, I've added an example file and fixed an issue with block data reads (https://gist.github.com/Georgehe4/043ae9079349ac7b05bee1e0b3b4b4fa) - I believe this should be good to go.
bin/gce/google_cloud_docker_run.sh
Outdated
export HADOOP_HDFS=/usr/lib/hadoop-hdfs
export HADOOP_YARN=/usr/lib/hadoop-yarn
export HADOOP_MAPREDUCE=/usr/lib/hadoop-mapreduce
export HIVE_DIR=/usr/lib/hive
Do you need hive export?
Nope - removed
bin/gce/google_cloud_docker_run.sh
Outdated
-e HADOOP_CONF_DIR=${HADOOP_CONF_DIR} \
-e HIVE_DIR=${HIVE_DIR} \
-e HIVE_CONF_DIR=${HIVE_CONF_DIR} \
-e SPARK_DIST_CLASSPATH="/usr/lib/hadoop/etc/hadoop:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*" \
can you comment the use of this line?
Added - this is necessary to specify the hadoop version (and other dependencies) to match the cluster environment in google cloud.
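A common alternative to hard-coding that list is to derive it from the host's own Hadoop install; a sketch, assuming the `hadoop` CLI is on the PATH of the Dataproc node:

```bash
# Derive the classpath from the host's Hadoop installation rather than
# hard-coding the Dataproc paths; the value is then handed to the container
# via -e SPARK_DIST_CLASSPATH="${SPARK_DIST_CLASSPATH}".
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
```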
@@ -0,0 +1,60 @@
#!/usr/bin/env bash
can you simplify into 1 install script?
What do you mean? This script will be automatically run upon cluster creation.
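For context, initialization actions are attached at cluster-creation time; a sketch using the bucket path from the docs, with the cluster name and region as placeholders:

```bash
# Create a Dataproc cluster that runs the install script on every node at startup.
gcloud dataproc clusters create mango-cluster \
  --region us-west1 \
  --initialization-actions gs://mango-initialization-bucket/mango_install.sh
```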
@@ -0,0 +1,84 @@
Running Mango on Google Cloud
=============================
how do you make a google cloud cluster?
Refer to the section below; there's a "Create the Cloud Dataproc Cluster" step.
docs/cloud/google-cloud.rst
Outdated
More information about available public datasets on Google Cloud can be found `online <https://cloud.google.com/genomics/v1/public-data>`__.

More information on using the Dataproc cluster's Spark interface is available through the `Google Cloud documentation <https://cloud.google.com/dataproc/docs/concepts/accessing/cluster-web-interfaces>`__.
is there an example notebook you can push and reference on a non-1000g datasource?
I will make some changes to mango-google-cloud.ipynb to reference a non-1000g datasource
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"alignmentFile = \"gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/150140/alignment/150140.chrom20.ILLUMINA.bwa.CHM1.20131218.bam\"" |
what other datasets are hosted besides 1000g? I ask because 1000g is the only datasource on AWS, so it would be nice to reference a different one in the GCE examples
Test PASSed.
install.sh, run-notebook.sh, run-browser.sh
Test PASSed.
bin/gce/run-submit.sh
Outdated
@@ -0,0 +1,68 @@
set -ex
change file to run-browser
Test PASSed.
looks good @Georgehe4 can you rebase?
Moved to #360. Can you enable squash merges for this repo?