Add basic Documentation for setting up on Google Cloud #340
Closed

Changes from all commits (18 commits, all by Georgehe4):
4173ca3  Fix index underline warnings
1817e1c  Fix github repo
05fd42f  Add basic google cloud documentation
750dd09  Update google cloud instructions for running mango in docker
4f53b10  Add docker install scripts
6a65f77  Update scripts directory for gce
505705d  merge with master
0209b33  update gce scripts
770a02c  Add command for running docker
517c746  Add docker_run options to run from both mango-notebook and mango-submit
2916db0  Update documentation to support gs
11843f6  Merge branch 'master' of https://github.com/bigdatagenomics/mango int…
90dd129  Update documentation and add example file
00b4d54  Streamline installation, updated documentation and scripts
49272a8  Merge branch 'master' of https://github.com/bigdatagenomics/mango int…
872bb02  Rename script names
f214003  Remove example file
0ab3d01  Update docs and scripts
bin/gce/install.sh
@@ -0,0 +1,81 @@
#!/usr/bin/env bash
# Based on gs://dataproc-initialization-actions/jupyter/jupyter.sh
set -e

ROLE=$(curl -f -s -H Metadata-Flavor:Google http://metadata/computeMetadata/v1/instance/attributes/dataproc-role)
INIT_ACTIONS_REPO=$(curl -f -s -H Metadata-Flavor:Google http://metadata/computeMetadata/v1/instance/attributes/INIT_ACTIONS_REPO || true)
INIT_ACTIONS_REPO="${INIT_ACTIONS_REPO:-https://github.com/GoogleCloudPlatform/dataproc-initialization-actions.git}"
INIT_ACTIONS_BRANCH=$(curl -f -s -H Metadata-Flavor:Google http://metadata/computeMetadata/v1/instance/attributes/INIT_ACTIONS_BRANCH || true)
INIT_ACTIONS_BRANCH="${INIT_ACTIONS_BRANCH:-master}"
DATAPROC_BUCKET=$(curl -f -s -H Metadata-Flavor:Google http://metadata/computeMetadata/v1/instance/attributes/dataproc-bucket)
# Colon-separated list of conda channels to add before installing packages
JUPYTER_CONDA_CHANNELS=$(curl -f -s -H Metadata-Flavor:Google http://metadata/computeMetadata/v1/instance/attributes/JUPYTER_CONDA_CHANNELS || true)
# Colon-separated list of conda packages to install, for example 'numpy:pandas'
JUPYTER_CONDA_PACKAGES=$(curl -f -s -H Metadata-Flavor:Google http://metadata/computeMetadata/v1/instance/attributes/JUPYTER_CONDA_PACKAGES || true)
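# Hypothetical usage (not part of this PR): both values are read from instance
# metadata, which can be set when creating the cluster, e.g.
#   gcloud dataproc clusters create <cluster-name> \
#     --metadata JUPYTER_CONDA_CHANNELS=bioconda:conda-forge,JUPYTER_CONDA_PACKAGES=numpy:pandas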
echo "Cloning fresh dataproc-initialization-actions from repo $INIT_ACTIONS_REPO and branch $INIT_ACTIONS_BRANCH..." | ||
git clone -b "$INIT_ACTIONS_BRANCH" --single-branch $INIT_ACTIONS_REPO | ||
# Ensure we have conda installed. | ||
./dataproc-initialization-actions/conda/bootstrap-conda.sh | ||
|
||
source /etc/profile.d/conda.sh | ||
|
||
if [ -n "${JUPYTER_CONDA_CHANNELS}" ]; then | ||
echo "Adding custom conda channels '$(echo ${JUPYTER_CONDA_CHANNELS} | tr ':' ' ')'" | ||
conda config --add channels $(echo ${JUPYTER_CONDA_CHANNELS} | tr ':' ',') | ||
fi | ||
|
||
if [ -n "${JUPYTER_CONDA_PACKAGES}" ]; then | ||
echo "Installing custom conda packages '$(echo ${JUPYTER_CONDA_PACKAGES} | tr ':' ' ')'" | ||
conda install $(echo ${JUPYTER_CONDA_PACKAGES} | tr ':' ' ') | ||
fi | ||
|
||
if [[ "${ROLE}" == 'Master' ]]; then | ||
conda install jupyter | ||
pip install google_compute_engine | ||
|
||
if gsutil -q stat "gs://$DATAPROC_BUCKET/notebooks/**"; then | ||
echo "Pulling notebooks directory to cluster master node..." | ||
gsutil -m cp -r gs://$DATAPROC_BUCKET/notebooks /root/ | ||
fi | ||
./dataproc-initialization-actions/jupyter/internal/setup-jupyter-kernel.sh | ||
./dataproc-initialization-actions/jupyter/internal/launch-jupyter-kernel.sh | ||
|
  # Install Docker.
  # Install packages to allow apt to use a repository over HTTPS:
  apt-get -y install \
    apt-transport-https ca-certificates curl software-properties-common
  # Add Docker's GPG key:
  curl -fsSL https://download.docker.com/linux/debian/gpg | apt-key add -
  # Set up the Docker stable repository:
  add-apt-repository \
    "deb [arch=amd64] https://download.docker.com/linux/debian \
    $(lsb_release -cs) \
    stable"
  # Update the apt package index:
  apt-get -y update
  # Finally, install Docker:
  apt-get -y install docker-ce
  # Install the google-cloud-nio shaded jar, staging it in the initialization
  # bucket so it can be passed to Spark jobs that read gs:// paths.
  curl -L https://oss.sonatype.org/content/repositories/releases/com/google/cloud/google-cloud-nio/0.22.0-alpha/google-cloud-nio-0.22.0-alpha-shaded.jar | gsutil cp - gs://mango-initialization-bucket/google-cloud-nio-0.22.0-alpha-shaded.jar

fi
echo "Completed installing Jupyter!"
# Install Jupyter extensions (if desired)
# TODO: document this in readme
if [[ ! -v INSTALL_JUPYTER_EXT ]]
then
  INSTALL_JUPYTER_EXT=false
fi
if [[ "$INSTALL_JUPYTER_EXT" = true ]]
then
  echo "Installing Jupyter Notebook extensions..."
  ./dataproc-initialization-actions/jupyter/internal/bootstrap-jupyter-ext.sh
  echo "Jupyter Notebook extensions installed!"
fi
pip install cigar
bin/gce/run-browser.sh
@@ -0,0 +1,68 @@
#!/usr/bin/env bash
set -ex

# Split args into Spark and notebook args
DD=False # DD is "double dash"
PRE_DD=()
POST_DD=()
# By default, runs the mango browser (mango-submit).
# To override to mango-notebook, run this script with
# --entrypoint=/opt/cgl-docker-lib/mango/bin/mango-notebook
ENTRYPOINT="--entrypoint=/opt/cgl-docker-lib/mango/bin/mango-submit"
for ARG in "$@"; do
  shift
  if [[ $ARG == "--" ]]; then
    DD=True
    POST_DD=( "$@" )
    break
  fi
  if [[ $ARG == '--entrypoint='* ]]; then
    # Pass a user-supplied --entrypoint flag through to docker as-is.
    ENTRYPOINT="$ARG"
  else
    PRE_DD+=("$ARG")
  fi
done
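# Usage sketch (hypothetical arguments): args before "--" become Spark options
# for the entrypoint; args after "--" are forwarded to the underlying app, e.g.
#   bash run-browser.sh --executor-memory 2g -- <browser args>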
PRE_DD_ARGS="${PRE_DD[@]}"
POST_DD_ARGS="${POST_DD[@]}"
export SPARK_HOME=/usr/lib/spark
export SPARK_CONF_DIR=/usr/lib/spark/conf
export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_LIBEXEC_DIR=$HADOOP_HOME/libexec
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_HDFS=/usr/lib/hadoop-hdfs
export HADOOP_YARN=/usr/lib/hadoop-yarn
export HADOOP_MAPREDUCE=/usr/lib/hadoop-mapreduce
export CONDA_DIR=/opt/conda
export PYSPARK_DRIVER_PYTHON=/opt/conda/bin/jupyter
# HIVE_DIR is expected to be set in the surrounding environment.
export HIVE_CONF_DIR=$HIVE_DIR/conf
export TARGET_MANGO_ASSEMBLY=/opt/cgl-docker-lib/mango/mango-assembly/target/mango-assembly-0.0.1-SNAPSHOT.jar
# Sets the Java classes required for the Hadoop YARN, MapReduce, and Hive
# installations associated with the cluster.
export SPARK_DIST_CLASSPATH="/usr/lib/hadoop/etc/hadoop:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*"
# Note: with --net=host, container ports are already reachable on the host,
# so the -p mapping below is effectively a no-op.
sudo docker run \
  --net=host \
  -v ${SPARK_HOME}:${SPARK_HOME} \
  -v ${SPARK_CONF_DIR}:${SPARK_CONF_DIR} \
  -v ${HADOOP_HOME}:${HADOOP_HOME} \
  -v ${HADOOP_CONF_DIR}:${HADOOP_CONF_DIR} \
  -v ${HADOOP_HDFS}:${HADOOP_HDFS} \
  -v ${HADOOP_YARN}:${HADOOP_YARN} \
  -v ${CONDA_DIR}:${CONDA_DIR} \
  -v ${HADOOP_MAPREDUCE}:${HADOOP_MAPREDUCE} \
  -e SPARK_HOME=${SPARK_HOME} \
  -e HADOOP_HOME=${HADOOP_HOME} \
  -e SPARK_CONF_DIR=${SPARK_CONF_DIR} \
  -e HADOOP_CONF_DIR=${HADOOP_CONF_DIR} \
  -e HIVE_CONF_DIR=${HIVE_CONF_DIR} \
  -e SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH} \
  $ENTRYPOINT \
  -p 8888:8888 \
  quay.io/ucsc_cgl/mango:latest \
  --master yarn \
  --jars ${TARGET_MANGO_ASSEMBLY},gs://mango-initialization-bucket/google-cloud-nio-0.22.0-alpha-shaded.jar \
  $PRE_DD_ARGS \
  -- --ip=0.0.0.0 --allow-root \
  $POST_DD_ARGS
bin/gce/run-notebook.sh
@@ -0,0 +1,68 @@
#!/usr/bin/env bash
set -ex

# Split args into Spark and notebook args
DD=False # DD is "double dash"
PRE_DD=()
POST_DD=()
# By default, runs the mango notebook.
# To override to mango-submit, run this script with
# --entrypoint=/opt/cgl-docker-lib/mango/bin/mango-submit
ENTRYPOINT="--entrypoint=/opt/cgl-docker-lib/mango/bin/mango-notebook"
for ARG in "$@"; do
  shift
  if [[ $ARG == "--" ]]; then
    DD=True
    POST_DD=( "$@" )
    break
  fi
  if [[ $ARG == '--entrypoint='* ]]; then
    # Pass a user-supplied --entrypoint flag through to docker as-is.
    ENTRYPOINT="$ARG"
  else
    PRE_DD+=("$ARG")
  fi
done
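# Usage sketch (hypothetical arguments): args before "--" become Spark options
# for the entrypoint; args after "--" are forwarded to the notebook, e.g.
#   bash run-notebook.sh --executor-memory 2g -- --no-browser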
PRE_DD_ARGS="${PRE_DD[@]}"
POST_DD_ARGS="${POST_DD[@]}"
export SPARK_HOME=/usr/lib/spark
export SPARK_CONF_DIR=/usr/lib/spark/conf
export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_LIBEXEC_DIR=$HADOOP_HOME/libexec
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_HDFS=/usr/lib/hadoop-hdfs
export HADOOP_YARN=/usr/lib/hadoop-yarn
export HADOOP_MAPREDUCE=/usr/lib/hadoop-mapreduce
export CONDA_DIR=/opt/conda
export PYSPARK_DRIVER_PYTHON=/opt/conda/bin/jupyter
# HIVE_DIR is expected to be set in the surrounding environment.
export HIVE_CONF_DIR=$HIVE_DIR/conf
export TARGET_MANGO_ASSEMBLY=/opt/cgl-docker-lib/mango/mango-assembly/target/mango-assembly-0.0.1-SNAPSHOT.jar
# Sets the Java classes required for the Hadoop YARN, MapReduce, and Hive
# installations associated with the cluster.
export SPARK_DIST_CLASSPATH="/usr/lib/hadoop/etc/hadoop:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*"
# Note: with --net=host, container ports are already reachable on the host,
# so the -p mapping below is effectively a no-op.
sudo docker run \
  --net=host \
  -v ${SPARK_HOME}:${SPARK_HOME} \
  -v ${SPARK_CONF_DIR}:${SPARK_CONF_DIR} \
  -v ${HADOOP_HOME}:${HADOOP_HOME} \
  -v ${HADOOP_CONF_DIR}:${HADOOP_CONF_DIR} \
  -v ${HADOOP_HDFS}:${HADOOP_HDFS} \
  -v ${HADOOP_YARN}:${HADOOP_YARN} \
  -v ${CONDA_DIR}:${CONDA_DIR} \
  -v ${HADOOP_MAPREDUCE}:${HADOOP_MAPREDUCE} \
  -e SPARK_HOME=${SPARK_HOME} \
  -e HADOOP_HOME=${HADOOP_HOME} \
  -e SPARK_CONF_DIR=${SPARK_CONF_DIR} \
  -e HADOOP_CONF_DIR=${HADOOP_CONF_DIR} \
  -e HIVE_CONF_DIR=${HIVE_CONF_DIR} \
  -e SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH} \
  $ENTRYPOINT \
  -p 8888:8888 \
  quay.io/ucsc_cgl/mango:latest \
  --master yarn \
  --jars ${TARGET_MANGO_ASSEMBLY},gs://mango-initialization-bucket/google-cloud-nio-0.22.0-alpha-shaded.jar \
  $PRE_DD_ARGS \
  -- --ip=0.0.0.0 --allow-root \
  $POST_DD_ARGS
Google Cloud documentation (RST)
@@ -0,0 +1,78 @@
Running Mango on Google Cloud
=============================

`Cloud Dataproc <https://cloud.google.com/dataproc/>`__ sets up an environment with HDFS and Spark preconfigured, providing a simple way to run Mango.

The commands in this section require an account on `Google Cloud <https://cloud.google.com/>`__ and an installation of the `gcloud CLI <https://cloud.google.com/sdk/gcloud/>`__.

Creating a Dataproc Cluster
---------------------------
Download the necessary initialization script:

.. code:: bash

    wget https://raw.githubusercontent.com/bigdatagenomics/mango/master/bin/gce/install.sh
Initialize a Google Cloud Storage bucket:

.. code:: bash

    gsutil mb gs://mango-initialization-bucket/
Copy the installation script into the bucket so Cloud Dataproc can use it:

.. code:: bash

    gsutil cp install.sh gs://mango-initialization-bucket
Create the Cloud Dataproc cluster with the installation script below, modifying the fields as appropriate:

.. code:: bash

    gcloud dataproc clusters create <cluster-name> \
        --project <project_id> \
        --bucket <optional_bucket_name> \
        --metadata MINICONDA_VARIANT=2 \
        --master-machine-type=n1-standard-1 \
        --worker-machine-type=n1-standard-1 \
        --master-boot-disk-size=50GB \
        --worker-boot-disk-size=10GB \
        --initialization-actions \
            gs://mango-initialization-bucket/install.sh
After the above steps are complete, SSH into the master node:

.. code:: bash

    gcloud compute ssh <cluster-name>-m
Running Mango Notebook on a Dataproc Cluster
--------------------------------------------

Before running Mango, it is recommended to stage any datasets you want to view into HDFS. The container created below shares the Hadoop file system with the root user on the master node.

.. code:: bash

    hdfs dfs -put /<local machine path> /<hdfs path>
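For example, with a hypothetical local BAM file:

.. code:: bash

    hdfs dfs -put /home/user/sample.bam /sample.bam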
An example Docker startup script is available in the Mango gce `scripts directory <https://github.com/bigdatagenomics/mango/blob/master/bin/gce>`__ for running the `mango notebook <https://github.com/bigdatagenomics/mango/blob/master/bin/gce/run-notebook.sh>`__ or the `mango browser <https://github.com/bigdatagenomics/mango/blob/master/bin/gce/run-browser.sh>`__ (root permissions may be necessary for Docker).

.. code:: bash

    wget https://raw.githubusercontent.com/bigdatagenomics/mango/master/bin/gce/run-notebook.sh

    bash run-notebook.sh --entrypoint=/opt/cgl-docker-lib/mango/bin/mango-notebook
Once the notebook is running, connect to Mango by tunneling the exposed port on the master node to your local machine:

.. code:: bash

    gcloud compute ssh <cluster-name>-m -- -N -L localhost:<local_port>:localhost:8888
You can then navigate to the notebook in your local browser at http://localhost:<local_port>/. Once in the notebook environment, navigate to /opt/cgl-docker-lib/mango/example-files/ to try the example files, after configuring the file paths to be read relative to the home directory in HDFS. Public datasets can be accessed by referencing Google Cloud Storage at gs://genomics-public-data/, as sketched below.
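For example, you can list the public bucket and stage a file into HDFS (a sketch using standard gsutil and hdfs commands; the dataset path is a placeholder):

.. code:: bash

    # list the available public genomics datasets
    gsutil ls gs://genomics-public-data/

    # optionally stage one into HDFS for the notebook to read
    gsutil cp gs://genomics-public-data/<dataset path> /tmp/<file>
    hdfs dfs -put /tmp/<file> /<hdfs path>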
More information about available public datasets on Google Cloud can be found `online <https://cloud.google.com/genomics/v1/public-data>`__.

More information on using the Dataproc cluster's Spark interface is available in the `Google Cloud documentation <https://cloud.google.com/dataproc/docs/concepts/accessing/cluster-web-interfaces>`__.
Comment: how do you make a google cloud cluster?
Reply: refer to the section below; there's a step to "Create the Cloud Dataproc Cluster".