
Add basic Documentation for setting up on Google Cloud #340

Closed
Georgehe4 wants to merge 18 commits

Conversation

@Georgehe4 (Contributor)

No description provided.

@AmplabJenkins

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/mango-prb/515/


.. code:: bash

   wget https://gist.githubusercontent.com/Georgehe4/6bb1c142a9f68f30f38d80cd9407120a/raw/9b903e3b8746ee8f25911fe98925b53e9777002f/mango_install.sh
Contributor: Can we put this script in Mango scripts?

Contributor Author: Added

gs://mango-initialization-bucket/mango_install.sh
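As an aside for readers, a minimal sketch of fetching the hosted install script for inspection, assuming the bucket is publicly readable (not stated in this PR):

.. code:: bash

   # copy the hosted initialization script locally and inspect it
   gsutil cp gs://mango-initialization-bucket/mango_install.sh .
   less mango_install.sh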


Once the above steps are completed, simply ssh into the master node to run Mango.
Contributor: Do you have the scripts for running docker?

Contributor Author: Added
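As a sketch of the ssh step quoted above (cluster name and zone are placeholders, not taken from this PR; Dataproc names the master node <cluster-name>-m):

.. code:: bash

   # ssh into the Dataproc master node; CLUSTER_NAME and ZONE are placeholders
   gcloud compute ssh ${CLUSTER_NAME}-m --zone=${ZONE}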

docs/index.rst Outdated
@@ -31,6 +31,12 @@ variety of platforms.

   docker/docker-examples

.. toctree::
   :caption: Cloud
Contributor: Google Cloud

@Georgehe4 (Contributor Author), Dec 13, 2017: We might add an AWS section later?

@AmplabJenkins

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/mango-prb/517/

@AmplabJenkins

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/mango-prb/523/

# update the apt package index:
sudo apt-get -y update
# finally, install docker
sudo apt-get -y install docker-ce
Contributor: newline

@Georgehe4 (Contributor Author), Dec 15, 2017: done
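One note on the quoted hunk above: installing docker-ce from apt generally requires Docker's own apt repository to be configured first. A minimal sketch of that prerequisite, assuming a Debian-based Dataproc image (these exact commands are not from this PR):

.. code:: bash

   # prerequisites for pulling docker-ce from Docker's apt repository
   sudo apt-get -y install apt-transport-https ca-certificates curl gnupg2 software-properties-common
   # add Docker's GPG key and the stable repository for the current Debian release
   curl -fsSL https://download.docker.com/linux/debian/gpg | sudo apt-key add -
   sudo add-apt-repository \
      "deb [arch=amd64] https://download.docker.com/linux/debian $(lsb_release -cs) stable"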

-e HIVE_DIR=${HIVE_DIR} \
-e HIVE_CONF_DIR=${HIVE_CONF_DIR} \
-e SPARK_DIST_CLASSPATH="/usr/lib/hadoop/etc/hadoop:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*" \
--entrypoint=/opt/cgl-docker-lib/mango/bin/mango-notebook \
Contributor: When you were accessing Google Cloud Storage through Docker, did it need any access keys?

Contributor Author: Yes, I still haven't resolved the access-key issue among the Spark workers for BAM files in gs. For now, accessing HDFS works fine, but gs still needs to be investigated.
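Given that reply, a hedged workaround sketch (the bucket and paths are illustrative, not from this PR): stage the BAM onto the master node and push it into HDFS, which the workers can read without extra credentials.

.. code:: bash

   # stage a BAM locally, then push it into HDFS so the Spark workers can read it
   gsutil cp gs://<bucket>/<path>/sample.bam /tmp/sample.bam
   hdfs dfs -mkdir -p /data
   hdfs dfs -put /tmp/sample.bam /data/sample.bam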

@@ -0,0 +1,59 @@
#!/usr/bin/env bash
Contributor: let's put these in bin/GCE/

Contributor Author: done

fi

if [[ "${ROLE}" == 'Master' ]]; then
  conda install jupyter
Contributor: Why are you installing docker here? docker is installed in the container, right?

Contributor Author: Docker is not provided on a default GCE VM.
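For context on the ${ROLE} check in the quoted hunk: Dataproc initialization actions commonly read the node's role from instance metadata. A minimal sketch of that pattern, assuming the standard Dataproc metadata helper is present on the image (not copied from this PR's script):

.. code:: bash

   # determine whether this node is the master or a worker
   ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
   if [[ "${ROLE}" == 'Master' ]]; then
     # master-only setup, e.g. the notebook dependencies
     conda install -y jupyter
   fi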

@AmplabJenkins

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/mango-prb/525/

@AmplabJenkins

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/mango-prb/526/


hdfs dfs -put /<local machine path> /<hdfs path>

An example Docker startup script is available in the Mango `scripts directory <https://github.com/bigdatagenomics/mango/blob/master/bin/gce/google_cloud_docker_run.sh>`__ for running the Mango notebook (run it with root permissions so it can work with Docker).
Contributor: Can you give the code to copy this from docker so a user can run the script?
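In response to that request, a hedged sketch of one way a user might obtain and run the startup script on the master node (assumes git is available; the repository URL and script path come from the scripts-directory link above):

.. code:: bash

   # fetch the Mango repository and run the example docker startup script as root
   git clone https://github.com/bigdatagenomics/mango.git
   cd mango
   sudo bash bin/gce/google_cloud_docker_run.sh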

@AmplabJenkins

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/mango-prb/533/

@@ -0,0 +1,37 @@
export SPARK_HOME=/usr/lib/spark
export SPARK_CONF_DIR=/usr/lib/spark/conf
Contributor: This file is specifically for the Mango notebook. We can pass a user variable here that states whether to run the notebook or the browser, based on the --entrypoint flag.

Contributor: Separate this into run-notebook.sh and run-browser.sh.
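Before that split, a minimal sketch of the variable-based suggestion (the ENTRYPOINT argument and the <mango-docker-image> name are hypothetical; the entrypoint path and environment variables are taken from the quoted hunks):

.. code:: bash

   # choose the container entrypoint from the first argument, defaulting to the notebook
   ENTRYPOINT=${1:-/opt/cgl-docker-lib/mango/bin/mango-notebook}
   sudo docker run \
      -e SPARK_HOME=${SPARK_HOME} \
      -e HADOOP_CONF_DIR=${HADOOP_CONF_DIR} \
      --entrypoint=${ENTRYPOINT} \
      <mango-docker-image>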

echo "Jupyter Notebook extensions installed!"
fi

pip install cigar
Contributor: Shouldn't cigar be installed in the virtual env?

Contributor Author: This is so cigar is present on the worker nodes (it is also installed in the venv).

What configurations have you been setting to port python packages over to the worker nodes on the cluster? It looks like the python binary for the worker nodes is not being overwritten by the flags sent to Spark, which leads to "cigar not found" issues if it is not installed this way.

Contributor: Won't this be installed outside of the virtual env? Do you know why cigar is causing an issue and not other imports?

Contributor Author: Cigar will be installed in the virtual env, just not through this process.

Since the PySpark jobs depend on cigar on the worker nodes, the Python distribution on the worker nodes needs to have cigar present. I don't believe we use any other imports in PySpark for mango-notebook.
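A hedged sketch of the usual knobs for pointing Spark executors at a specific interpreter, which is what the exchange above is about (the interpreter path and job name are placeholders; this is not claiming the PR uses these exact settings):

.. code:: bash

   # point both the driver and the executors at the same Python interpreter
   export PYSPARK_PYTHON=/path/to/venv/bin/python
   export PYSPARK_DRIVER_PYTHON=/path/to/venv/bin/python
   # or, equivalently, pass it as a Spark configuration property
   spark-submit --conf spark.pyspark.python=/path/to/venv/bin/python my_job.py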

@AmplabJenkins

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/mango-prb/545/

@AmplabJenkins

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/mango-prb/547/

@AmplabJenkins

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/mango-prb/555/

@Georgehe4 (Contributor Author)

Hey @akmorrow13, I've added an example file and fixed an issue with block data reads (https://gist.github.com/Georgehe4/043ae9079349ac7b05bee1e0b3b4b4fa) - I believe this should be good to go.

export HADOOP_HDFS=/usr/lib/hadoop-hdfs
export HADOOP_YARN=/usr/lib/hadoop-yarn
export HADOOP_MAPREDUCE=/usr/lib/hadoop-mapreduce
export HIVE_DIR=/usr/lib/hive
Contributor: Do you need hive export?

Contributor Author: Nope - removed

-e HADOOP_CONF_DIR=${HADOOP_CONF_DIR} \
-e HIVE_DIR=${HIVE_DIR} \
-e HIVE_CONF_DIR=${HIVE_CONF_DIR} \
-e SPARK_DIST_CLASSPATH="/usr/lib/hadoop/etc/hadoop:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*" \
Contributor: Can you comment the use of this line?

Contributor Author: Added. This is necessary to specify the Hadoop version (and other dependencies) so that it matches the cluster environment in Google Cloud.
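As a possible alternative to the hard-coded path list (a sketch, not what this PR does): Spark's Hadoop-free builds document deriving the value from the cluster's own hadoop command, which keeps it in sync with whatever the Dataproc image ships.

.. code:: bash

   # derive the classpath from the Hadoop installation already on the cluster
   export SPARK_DIST_CLASSPATH=$(hadoop classpath)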

@@ -0,0 +1,60 @@
#!/usr/bin/env bash
Contributor: Can you simplify this into one install script?

Contributor Author: What do you mean? This script will be automatically run upon cluster creation.
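For readers following along, a hedged sketch of what "run automatically upon cluster creation" refers to: passing the hosted script as a Dataproc initialization action. The cluster name and region are placeholders, not taken from this PR.

.. code:: bash

   # create a Dataproc cluster that runs the install script on every node at startup
   gcloud dataproc clusters create ${CLUSTER_NAME} \
      --region=${REGION} \
      --initialization-actions=gs://mango-initialization-bucket/mango_install.sh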

@@ -0,0 +1,84 @@
Running Mango on Google Cloud
=============================

Contributor: How do you make a Google Cloud cluster?

Contributor Author: Refer to the section below; there is a step to "Create the Cloud Dataproc Cluster".


More information about available public datasets on Google Cloud can be found `online <https://cloud.google.com/genomics/v1/public-data>`__.

More information on using the Dataproc cluster's Spark interface is available through the `Google Cloud documentation <https://cloud.google.com/dataproc/docs/concepts/accessing/cluster-web-interfaces>`__.
Contributor: Is there an example notebook you can push and reference on a non-1000g datasource?

Contributor Author: I will make some changes to mango-google-cloud.ipynb to reference a non-1000g datasource.

"metadata": {},
"outputs": [],
"source": [
"alignmentFile = \"gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/150140/alignment/150140.chrom20.ILLUMINA.bwa.CHM1.20131218.bam\""
Contributor: What other datasets are hosted besides 1000g? I ask because 1000g is the only datasource on AWS, so it would be nice to reference a different one in the GCE examples.

@AmplabJenkins

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/mango-prb/556/

@akmorrow13 (Contributor)

install.sh, run-notebook.sh, run-browser.sh

@AmplabJenkins

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/mango-prb/562/

@@ -0,0 +1,68 @@
set -ex
Contributor: Change this file to run-browser.

@AmplabJenkins

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/mango-prb/564/

@akmorrow13 (Contributor)

Looks good @Georgehe4, can you rebase?

@Georgehe4 mentioned this pull request on Feb 8, 2018.
@Georgehe4 (Contributor Author)

Moved to #360.

Can you enable squash merges for this repo?

@Georgehe4 closed this on Apr 13, 2018.