Add basic Documentation for setting up on Google Cloud #340

Closed · wants to merge 18 commits
81 changes: 81 additions & 0 deletions bin/gce/install.sh
@@ -0,0 +1,81 @@
#!/usr/bin/env bash
# Based on gs://dataproc-initialization-actions/jupyter/jupyter.sh
set -e

ROLE=$(curl -f -s -H Metadata-Flavor:Google http://metadata/computeMetadata/v1/instance/attributes/dataproc-role)
INIT_ACTIONS_REPO=$(curl -f -s -H Metadata-Flavor:Google http://metadata/computeMetadata/v1/instance/attributes/INIT_ACTIONS_REPO || true)
INIT_ACTIONS_REPO="${INIT_ACTIONS_REPO:-https://github.com/GoogleCloudPlatform/dataproc-initialization-actions.git}"
INIT_ACTIONS_BRANCH=$(curl -f -s -H Metadata-Flavor:Google http://metadata/computeMetadata/v1/instance/attributes/INIT_ACTIONS_BRANCH || true)
INIT_ACTIONS_BRANCH="${INIT_ACTIONS_BRANCH:-master}"
DATAPROC_BUCKET=$(curl -f -s -H Metadata-Flavor:Google http://metadata/computeMetadata/v1/instance/attributes/dataproc-bucket)

# Colon-separated list of conda channels to add before installing packages
JUPYTER_CONDA_CHANNELS=$(curl -f -s -H Metadata-Flavor:Google http://metadata/computeMetadata/v1/instance/attributes/JUPYTER_CONDA_CHANNELS || true)
# Colon-separated list of conda packages to install, for example 'numpy:pandas'
JUPYTER_CONDA_PACKAGES=$(curl -f -s -H Metadata-Flavor:Google http://metadata/computeMetadata/v1/instance/attributes/JUPYTER_CONDA_PACKAGES || true)

echo "Cloning fresh dataproc-initialization-actions from repo $INIT_ACTIONS_REPO and branch $INIT_ACTIONS_BRANCH..."
git clone -b "$INIT_ACTIONS_BRANCH" --single-branch $INIT_ACTIONS_REPO
# Ensure we have conda installed.
./dataproc-initialization-actions/conda/bootstrap-conda.sh

source /etc/profile.d/conda.sh

if [ -n "${JUPYTER_CONDA_CHANNELS}" ]; then
echo "Adding custom conda channels '$(echo ${JUPYTER_CONDA_CHANNELS} | tr ':' ' ')'"
conda config --add channels $(echo ${JUPYTER_CONDA_CHANNELS} | tr ':' ',')
fi

if [ -n "${JUPYTER_CONDA_PACKAGES}" ]; then
echo "Installing custom conda packages '$(echo ${JUPYTER_CONDA_PACKAGES} | tr ':' ' ')'"
conda install $(echo ${JUPYTER_CONDA_PACKAGES} | tr ':' ' ')
fi

if [[ "${ROLE}" == 'Master' ]]; then
conda install -y jupyter
pip install google_compute_engine

if gsutil -q stat "gs://$DATAPROC_BUCKET/notebooks/**"; then
echo "Pulling notebooks directory to cluster master node..."
gsutil -m cp -r gs://$DATAPROC_BUCKET/notebooks /root/
fi
./dataproc-initialization-actions/jupyter/internal/setup-jupyter-kernel.sh
./dataproc-initialization-actions/jupyter/internal/launch-jupyter-kernel.sh

# Install docker
# install packages to allow apt to use a repository over HTTPS:
apt-get -y install \
apt-transport-https ca-certificates curl software-properties-common
# add Docker's GPG key:
curl -fsSL https://download.docker.com/linux/debian/gpg | apt-key add -
# set up the Docker stable repository.
add-apt-repository \
"deb [arch=amd64] https://download.docker.com/linux/debian \
$(lsb_release -cs) \
stable"
# update the apt package index:
apt-get -y update
# finally, install docker
apt-get -y install docker-ce

# Install google cloud nio
curl -L https://oss.sonatype.org/content/repositories/releases/com/google/cloud/google-cloud-nio/0.22.0-alpha/google-cloud-nio-0.22.0-alpha-shaded.jar | gsutil cp - gs://mango-initialization-bucket/google-cloud-nio-0.22.0-alpha-shaded.jar

fi
echo "Completed installing Jupyter!"

# Install Jupyter extensions (if desired)
# TODO: document this in readme
if [[ ! -v INSTALL_JUPYTER_EXT ]]
then
INSTALL_JUPYTER_EXT=false
fi
if [[ "$INSTALL_JUPYTER_EXT" = true ]]
then
echo "Installing Jupyter Notebook extensions..."
./dataproc-initialization-actions/jupyter/internal/bootstrap-jupyter-ext.sh
echo "Jupyter Notebook extensions installed!"
fi

pip install cigar

68 changes: 68 additions & 0 deletions bin/gce/run-browser.sh
@@ -0,0 +1,68 @@
#!/usr/bin/env bash
set -ex

# Split args into Spark and notebook args
DD=False # DD is "double dash"
ENTRYPOINT=TRUE
PRE_DD=()
POST_DD=()

# by default, runs mango browser (mango-submit)
# to override to mango-notebook,
# run docker with --entrypoint=/opt/cgl-docker-lib/mango/bin/mango-notebook
ENTRYPOINT="--entrypoint=/opt/cgl-docker-lib/mango/bin/mango-submit"
for ARG in "$@"; do
shift
if [[ $ARG == "--" ]]; then
DD=True
POST_DD=( "$@" )
break
fi
if [[ $ARG == '--entrypoint='* ]]; then
# pass a user-supplied --entrypoint flag straight through to docker
ENTRYPOINT="$ARG"
else
PRE_DD+=("$ARG")
fi
done

PRE_DD_ARGS="${PRE_DD[@]}"
POST_DD_ARGS="${POST_DD[@]}"

export SPARK_HOME=/usr/lib/spark
export SPARK_CONF_DIR=/usr/lib/spark/conf
export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_LIBEXEC_DIR=$HADOOP_HOME/libexec
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_HDFS=/usr/lib/hadoop-hdfs
export HADOOP_YARN=/usr/lib/hadoop-yarn
export HADOOP_MAPREDUCE=/usr/lib/hadoop-mapreduce
export CONDA_DIR=/opt/conda
export PYSPARK_DRIVER_PYTHON=/opt/conda/bin/jupyter
export HIVE_CONF_DIR=$HIVE_DIR/conf
export TARGET_MANGO_ASSEMBLY=/opt/cgl-docker-lib/mango/mango-assembly/target/mango-assembly-0.0.1-SNAPSHOT.jar
# Java classpath for the cluster's Hadoop, HDFS, YARN, and MapReduce installs
export SPARK_DIST_CLASSPATH="/usr/lib/hadoop/etc/hadoop:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*"

sudo docker run \
--net=host \
-v ${SPARK_HOME}:${SPARK_HOME} \
-v ${SPARK_CONF_DIR}:${SPARK_CONF_DIR} \
-v ${HADOOP_HOME}:${HADOOP_HOME} \
-v ${HADOOP_CONF_DIR}:${HADOOP_CONF_DIR} \
-v ${HADOOP_HDFS}:${HADOOP_HDFS} \
-v ${HADOOP_YARN}:${HADOOP_YARN} \
-v ${CONDA_DIR}:${CONDA_DIR} \
-v ${HADOOP_MAPREDUCE}:${HADOOP_MAPREDUCE} \
-e SPARK_HOME=${SPARK_HOME} \
-e HADOOP_HOME=${HADOOP_HOME} \
-e SPARK_CONF_DIR=${SPARK_CONF_DIR} \
-e HADOOP_CONF_DIR=${HADOOP_CONF_DIR} \
-e HIVE_CONF_DIR=${HIVE_CONF_DIR} \
-e SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH} \
$ENTRYPOINT \
-p 8888:8888 \
quay.io/ucsc_cgl/mango:latest \
--master yarn \
--jars ${TARGET_MANGO_ASSEMBLY},gs://mango-initialization-bucket/google-cloud-nio-0.22.0-alpha-shaded.jar \
$PRE_DD_ARGS \
-- --ip=0.0.0.0 --allow-root \
$POST_DD_ARGS
68 changes: 68 additions & 0 deletions bin/gce/run-notebook.sh
@@ -0,0 +1,68 @@
#!/usr/bin/env bash
set -ex

# Split args into Spark and notebook args
DD=False # DD is "double dash"
ENTRYPOINT=TRUE
PRE_DD=()
POST_DD=()

# by default, runs mango notebook
# to override to mango-submit,
# run docker with --entrypoint=/opt/cgl-docker-lib/mango/bin/mango-submit
ENTRYPOINT="--entrypoint=/opt/cgl-docker-lib/mango/bin/mango-notebook"
for ARG in "$@"; do
shift
if [[ $ARG == "--" ]]; then
DD=True
POST_DD=( "$@" )
break
fi
if [[ $ARG == '--entrypoint='* ]]; then
# pass a user-supplied --entrypoint flag straight through to docker
ENTRYPOINT="$ARG"
else
PRE_DD+=("$ARG")
fi
done

PRE_DD_ARGS="${PRE_DD[@]}"
POST_DD_ARGS="${POST_DD[@]}"

export SPARK_HOME=/usr/lib/spark
export SPARK_CONF_DIR=/usr/lib/spark/conf
export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_LIBEXEC_DIR=$HADOOP_HOME/libexec
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_HDFS=/usr/lib/hadoop-hdfs
export HADOOP_YARN=/usr/lib/hadoop-yarn
export HADOOP_MAPREDUCE=/usr/lib/hadoop-mapreduce
export CONDA_DIR=/opt/conda
export PYSPARK_DRIVER_PYTHON=/opt/conda/bin/jupyter
export HIVE_CONF_DIR=$HIVE_DIR/conf
export TARGET_MANGO_ASSEMBLY=/opt/cgl-docker-lib/mango/mango-assembly/target/mango-assembly-0.0.1-SNAPSHOT.jar
# Java classpath for the cluster's Hadoop, HDFS, YARN, and MapReduce installs
export SPARK_DIST_CLASSPATH="/usr/lib/hadoop/etc/hadoop:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*"

sudo docker run \
--net=host \
-v ${SPARK_HOME}:${SPARK_HOME} \
-v ${SPARK_CONF_DIR}:${SPARK_CONF_DIR} \
-v ${HADOOP_HOME}:${HADOOP_HOME} \
-v ${HADOOP_CONF_DIR}:${HADOOP_CONF_DIR} \
-v ${HADOOP_HDFS}:${HADOOP_HDFS} \
-v ${HADOOP_YARN}:${HADOOP_YARN} \
-v ${CONDA_DIR}:${CONDA_DIR} \
-v ${HADOOP_MAPREDUCE}:${HADOOP_MAPREDUCE} \
-e SPARK_HOME=${SPARK_HOME} \
-e HADOOP_HOME=${HADOOP_HOME} \
-e SPARK_CONF_DIR=${SPARK_CONF_DIR} \
-e HADOOP_CONF_DIR=${HADOOP_CONF_DIR} \
-e HIVE_CONF_DIR=${HIVE_CONF_DIR} \
-e SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH} \
$ENTRYPOINT \
-p 8888:8888 \
quay.io/ucsc_cgl/mango:latest \
--master yarn \
--jars ${TARGET_MANGO_ASSEMBLY},gs://mango-initialization-bucket/google-cloud-nio-0.22.0-alpha-shaded.jar \
$PRE_DD_ARGS \
-- --ip=0.0.0.0 --allow-root \
$POST_DD_ARGS
78 changes: 78 additions & 0 deletions docs/cloud/google-cloud.rst
@@ -0,0 +1,78 @@
Running Mango on Google Cloud
=============================

Contributor: how do you make a google cloud cluster?

Contributor Author: refer to the section below; there is a step to "Create the Cloud Dataproc Cluster".

`Cloud Dataproc <https://cloud.google.com/dataproc/>`__ provisions a cluster with HDFS and Spark preconfigured, providing a simple environment in which to run Mango.

The commands in this section require an account on `Google Cloud <https://cloud.google.com/>`__ and a working installation of the `gcloud CLI <https://cloud.google.com/sdk/gcloud/>`__.
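
If the gcloud CLI has not been configured yet, a minimal setup might look like the following (the project ID is a placeholder):

.. code:: bash

    # authenticate and point gcloud at the project that will own the cluster
    gcloud auth login
    gcloud config set project <project_id>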

Creating a Dataproc Cluster
---------------------------
Download the initialization script:

.. code:: bash

wget https://raw.githubusercontent.com/bigdatagenomics/mango/master/bin/gce/install.sh

Create a Google Cloud Storage bucket to hold the initialization script:

.. code:: bash

gsutil mb gs://mango-initialization-bucket/

Copy the installation script into the bucket so that Cloud Dataproc can run it as an initialization action:

.. code:: bash

gsutil cp install.sh gs://mango-initialization-bucket


Create the Cloud Dataproc cluster (modify the fields as appropriate), pointing it at the installation script staged above:

.. code:: bash

gcloud dataproc clusters create <cluster-name> \
--project <project_id> \
--bucket <optional_bucket_name> \
--metadata MINICONDA_VARIANT=2 \
--master-machine-type=n1-standard-1 \
--worker-machine-type=n1-standard-1 \
--master-boot-disk-size=50GB \
--worker-boot-disk-size=10GB \
--initialization-actions \
gs://mango-initialization-bucket/install.sh
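
Before moving on, you can confirm that the cluster came up and that the initialization action completed; a quick check (assuming the project configured above and gcloud's default region settings):

.. code:: bash

    # list clusters in the current project and inspect the new one
    gcloud dataproc clusters list
    gcloud dataproc clusters describe <cluster-name>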


After the cluster has been created, SSH into the master node:

.. code:: bash

gcloud compute ssh <cluster-name>-m

Running Mango Notebook on a Dataproc Cluster
--------------------------------------------

Before running Mango, it is recommended to stage datasets into HDFS if you want to view specific files. The Docker container started by the run scripts shares the same Hadoop file system as the root user on the master node.

.. code:: bash

hdfs dfs -put /<local machine path> /<hdfs path>
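
As a concrete sketch (the bucket name, file name, and HDFS directory below are illustrative), a file that already lives in Cloud Storage could be staged like this:

.. code:: bash

    # copy from Cloud Storage to local disk on the master node, then into HDFS
    gsutil cp gs://my-bucket/sample.bam /tmp/sample.bam
    hdfs dfs -mkdir -p /user/root
    hdfs dfs -put /tmp/sample.bam /user/root/sample.bam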

Example Docker startup scripts are available in the Mango gce `scripts directory <https://github.com/bigdatagenomics/mango/blob/master/bin/gce>`__ for running the `mango notebook <https://github.com/bigdatagenomics/mango/blob/master/bin/gce/run-notebook.sh>`__ or the `mango browser <https://github.com/bigdatagenomics/mango/blob/master/bin/gce/run-browser.sh>`__ (root permissions may be required to run Docker).

.. code:: bash

wget https://raw.githubusercontent.com/bigdatagenomics/mango/master/bin/gce/run-notebook.sh

bash run-notebook.sh --entrypoint=/opt/cgl-docker-lib/mango/bin/mango-notebook
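
The browser startup script referenced above can be fetched and launched in the same way; a sketch (mango-submit is its default entrypoint, and any browser-specific arguments can be appended after a ``--``):

.. code:: bash

    wget https://raw.githubusercontent.com/bigdatagenomics/mango/master/bin/gce/run-browser.sh

    bash run-browser.sh --entrypoint=/opt/cgl-docker-lib/mango/bin/mango-submit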

Once the notebook is running, connect to it by opening an SSH tunnel from your local machine to the port exposed on the master node:

.. code:: bash

gcloud compute ssh <cluster-name>-m -- -N -L localhost:<local_port>:localhost:8888

You can then reach the notebook in your local browser at http://localhost:<local_port>/. Once in the notebook environment, navigate to /opt/cgl-docker-lib/mango/example-files/ to try the example files, after configuring the file paths to be read relative to the home directory in HDFS. Public datasets can be accessed through Google Cloud Storage at gs://genomics-public-data/.
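
To see what is available there, the public bucket can be browsed directly from the master node; for example:

.. code:: bash

    # list the top level of the public genomics datasets bucket
    gsutil ls gs://genomics-public-data/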

More information about the public datasets available on Google Cloud can be found `online <https://cloud.google.com/genomics/v1/public-data>`__.

More information on accessing the Dataproc cluster's web interfaces, including the Spark UI, is available in the `Google Cloud documentation <https://cloud.google.com/dataproc/docs/concepts/accessing/cluster-web-interfaces>`__.
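
As a quick illustration of the approach described there, the cluster web UIs (for example the YARN ResourceManager on port 8088) can be reached through an SSH-based SOCKS proxy; a sketch, using an arbitrary local proxy port:

.. code:: bash

    # open a SOCKS proxy on local port 1080 to the cluster's master node
    gcloud compute ssh <cluster-name>-m -- -D 1080 -N

A browser configured to use localhost:1080 as a SOCKS proxy can then open http://<cluster-name>-m:8088.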
6 changes: 6 additions & 0 deletions docs/index.rst
@@ -31,6 +31,12 @@ variety of platforms.

docker/docker-examples

.. toctree::
:caption: Google Cloud
:maxdepth: 2

cloud/google-cloud

.. toctree::
:caption: Supported File Types
:maxdepth: 2