
Setting up on Google Cloud #356

Open
arokem opened this issue Oct 26, 2018 · 8 comments

arokem commented Oct 26, 2018

Hello! We are interested in setting up a Reana cluster on Google Cloud Platform (GCP).

We followed the instructions in the zero-to-jupyterhub documentation (https://zero-to-jupyterhub.readthedocs.io/en/stable/) to set up a Kubernetes cluster, and then followed the instructions here: https://reana-cluster.readthedocs.io/en/latest/gettingstarted.html#deploy-locally, but instead of using minikube, we pointed it at our cluster-in-the-clouds. Pretty quickly, we discovered that we can't write to /reana on these cloud machines (see: https://cloud.google.com/container-optimized-os/docs/concepts/security). All the pods come crashing down as soon as they try to write into this directory. So, we edited the provided default configuration (https://reana-cluster.readthedocs.io/en/latest/userguide.html#configure-reana-cluster) to point to /etc/reana, which is writable (see the sketch at the end of this comment). This solved most of the problems. The one remaining issue is that the database pod is still crashing. The logs in this pod are:

The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /var/lib/postgresql/data ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
running bootstrap script ... FATAL:  could not write to file "pg_xlog/xlogtemp.29": No space left on device
child process exited with exit code 1
initdb: removing contents of data directory "/var/lib/postgresql/data"

Which suggests that maybe it's still trying to write to a disallowed location.

We're not necessarily expecting you to fix this if it's not currently on your roadmap, but we thought it would be good to raise it, and at least document our experiments for future experimenters seeking guidance.

But of course: your thoughts would be appreciated. Thanks!
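
P.S. For reference, a minimal sketch of the kind of edit we made (the key name below is an assumption about the reana-cluster.yaml schema; check your own configuration file for the actual one):

# reana-cluster.yaml (sketch): point REANA's shared storage at a
# writable location instead of /reana, which is read-only on COS nodes.
cluster:
  root_path: "/etc/reana"   # assumed key name; the default value was "/reana"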

lukasheinrich (Member) commented Oct 26, 2018

I've not been involved lately in the development, so I might not be of much help, but I'm pretty sure you need to have distributed storage available. At CERN we use volumes provided by CephFS, which support the ReadWriteMany access mode (see this table: https://kubernetes.io/docs/concepts/storage/persistent-volumes/). I think on GCP the only option available is Cloud Filestore (https://cloud.google.com/filestore/docs/accessing-fileshares), but I haven't tried it yet (see the sketch at the end of this comment). Maybe @diegodelemos or @tiborsimko can comment on whether a shared fs (or even Ceph) is still a hard requirement.

In any case: happy to see people interested in deploying REANA, we'll try to help as much we can!
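
For what it's worth, here is a minimal sketch of exposing a pre-provisioned Filestore share to the cluster as a ReadWriteMany volume (the server IP and share name are placeholders for your own Filestore instance):

# PersistentVolume backed by a Cloud Filestore NFS share (sketch).
apiVersion: v1
kind: PersistentVolume
metadata:
  name: reana-shared-volume
spec:
  capacity:
    storage: 1Ti
  accessModes:
    - ReadWriteMany        # the access mode REANA's shared storage needs
  nfs:
    server: 10.0.0.2       # placeholder: Filestore instance IP address
    path: /reana_share     # placeholder: Filestore file share name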

tiborsimko (Member) commented

> Which suggests that maybe it's still trying to write to a disallowed location.

@arokem The DB pod error might indeed be connected to writing to a disallowed location... We use the /reana and /reanadb locations in the default configurations. Perhaps you have changed the former but not the latter?

$ git grep reanadb
reana_cluster/configurations/reana-cluster-dev.yaml:  db_persistence_path: "/reanadb"
reana_cluster/configurations/reana-cluster-latest.yaml:  db_persistence_path: "/reanadb"
reana_cluster/configurations/reana-cluster.yaml:  db_persistence_path: "/reanadb"
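
So the fix could look like this (the /etc/reana/db value is only an illustration; any writable path should do):

# reana-cluster.yaml (sketch): move the DB persistence path to a
# writable location as well.
cluster:
  db_persistence_path: "/etc/reana/db"   # default is "/reanadb"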

Alternatively, you could also switch to using a DB instance outside of the cluster.

P.S. We should perhaps switch /reana and /reanadb to some more reasonable defaults...

tiborsimko (Member) commented

> I think on GCP the only option available is Cloud Filestore

Indeed, REANA needs a shared filesystem at this stage. Support for distributed storage back-ends, say S3, is planned for later on.

We have not yet tried the installation on GCP, but it would definitely be interesting to provide runnable configurations out of the box!

@diegodelemos transferred this issue from reanahub/reana-cluster on Jul 29, 2020

elibixby commented Feb 8, 2022

FYI, I am currently trying to get this running on GKE (v1.22).

Currently trying to get the bare bones running (ingress, quota, etc. turned off).

Some sticking points:

  • It's rather strange to use an unconfigurable storage class name. I already had RWX dynamic provisioning support in my cluster and ended up having to copy the StorageClass config and change its name, rather than deploying a duplicate provisioner (see the sketch after this list). It would be much better to allow the user to specify a storage class name and use the current templated one as the default.
  • hostPaths can't be used safely on GKE (and are generally discouraged for security reasons), so the reana-workflow-controller pods are crashing due to the reana-code hostPath volume. This should be an easy fix: allow users to specify a storage class for this as well, defaulting to hostPath. EDIT: Looks like I need to turn off debug mode to fix this.
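
For the record, a sketch of the workaround from the first bullet: cloning an RWX-capable StorageClass under the name the chart expects (the class name, provisioner, and parameters below are illustrative; copy them from your existing class):

# Renamed clone of an existing RWX-capable StorageClass (sketch).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: reana-shared-volume-storage-class   # whatever name the chart hardcodes
provisioner: filestore.csi.storage.gke.io   # GKE Filestore CSI driver
parameters:
  tier: standard
allowVolumeExpansion: true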

I might have missed options in the config that allow for this. If you're interested in contributions, I'd happily contribute some documentation etc. if people can help me with PRs as I work out these issues.

Some far future things I'm interested in:

  • More flexible auth (e.g. token-based login)
  • Moving the PostgreSQL server out of the cluster (e.g. using the Cloud SQL proxy)
  • More flexible quota management (e.g. with RBAC for externalized quota)
  • Allowing use of existing volumes as workspaces (particularly useful for interacting with a customized JupyterHub running in the same cluster)

tiborsimko (Member) commented

@elibixby Thanks for reaching out! This issue is quite old, so let me share a short update on the REANA-on-GKE status since 2018.

About a year or two ago we tested a small REANA deployment on GKE, targeting mostly a single-node setup. The aim was just to test the general applicability of our Helm charts on various platforms. Everything worked well. This year we are just about to start work on a bigger GKE deployment for an ATLAS physics use case (CC @lukasheinrich), which will need many nodes. So your message is very timely!

Here are a few technical notes:

  • WRT Kubernetes 1.22, the current REANA does not support it yet, because we use older K8s APIs that were deprecated in 1.22. However, we have a fully working PR set that we should get merged very soon, most probably next week after 0.8.1 is released.

  • WRT storage, we are definitely open to changes. The GKE deployment last year was single-node only, so it used ephemeral/local storage. We would definitely need a shared filesystem for multi-node deployments, so it would be interesting to hear your plans regarding the best GKE shared storage options.

  • WRT auth, REANA currently offers either local accounts or CERN-specific SSO. However, CERN has a new OIDC-based authn/authz system in place, which we were thinking of migrating towards later in the year. If OIDC would be OK for your needs, there could be some synergy there, just as for the storage needs.

  • WRT PostgreSQL, it is already possible to use a DB instance living outside the cluster; that's actually our primary mode of deployment at CERN. It should be sufficient to set the Helm values.yaml variables in db_env_config, for example:

db_env_config:
  REANA_DB_NAME: "reana"
  REANA_DB_HOST: "db.example.org"
  REANA_DB_PORT: "5432"

and then disable the "internal" reana-db component:

components:
  reana_db:
    enabled: false

and introduce corresponding secrets for REANA_DB_USERNAME and REANA_DB_PASSWORD:

secrets:
  database:
    user: *******
    password: *********

This should be enough to make DB-as-external-service usable. We can update our documentation with a more detailed recipe if you are interested.

(BTW, FWIW, we have been using both DB-as-external-service and DB-as-internal-pod, and the latter technique has worked quite well for some of our deployments. But our primary mode of operation is DB-as-external-service as well.)

  • WRT quota management, we have not planned any concrete work on this in the near future, but we are definitely open to making it more flexible.

  • WRT using different volumes as workspaces for different users, we did some preliminary work on abstracting the workspace concept last summer. Achieving that would, however, still require quite a lot of work.

If you have some GKE documentation recipes and/or code to contribute, we'll be naturally happy to collaborate!

elibixby commented

> WRT auth, REANA currently offers either local accounts or CERN-specific SSO. However, CERN has a new OIDC-based authn/authz system in place, which we were thinking of migrating towards later in the year. If OIDC would be OK for your needs, there could be some synergy there, just as for the storage needs.

My ideal solution is an "authless" mode where I can put something like https://github.com/travisghansen/external-auth-server/ in front of the API/UI and manage user quotas and auth myself.

A "nice to have" would be the ability to map forwarded user IDs to namespaces and service accounts, to better isolate workflows from each other, and then use cluster quotas (see the sketch below).
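
To make that concrete, a hypothetical sketch using Traefik's forwardAuth middleware (the names and address are placeholders, and this is not something REANA supports out of the box today):

# Hypothetical: delegate authentication of REANA API/UI traffic to an
# external verifier such as external-auth-server (placeholder address).
apiVersion: traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: reana-external-auth
spec:
  forwardAuth:
    address: http://external-auth-server.auth.svc.cluster.local/verify
    authResponseHeaders:
      - X-Forwarded-User   # forwarded user ID for downstream quota mapping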

lukasheinrich (Member) commented

Hi @elibixby - thanks for your interest. As @tiborsimko said, we're in the process of working with some folks at Google to deploy REANA on GCP, and it'd be great to learn more about your use case. Would you be interested in sharing a short slide deck or similar in a call? (Feel free to reach out at lukas.heinrich at cern dot ch.)


elibixby commented Feb 26, 2022

Hey @lukasheinrich, I got it working without much trouble in the end.

Main hiccups besides those above were:

  • The easiest path to auth seems to be a bunch of cron jobs that sync REANA users with my users, plus a reverse proxy that adds username + password auth headers. It feels gross, but it's the only option short of rebuilding most of the REANA frontend API.
  • The Python client was broken between releases, but I appreciate the prompt responses :)
  • The Helm chart somewhat awkwardly couples Ingress and intra-cluster communication. I tried to use GKE Ingress (managed cert + global load balancer), and it broke worker/master/webserver communication; that took a while to debug, as the service targeted by the Ingress also serves, for some reason, as the service used for intra-cluster communication. It would be nicer to have a separate ClusterIP service with cluster DNS addresses for communication between components, plus an external load balancer targeted by the Ingress (see the sketch after this list). I ended up just going back to Traefik.
  • It would be nice to update to a newer version of the Traefik chart; I believe the current one is very out of date, and it took me a while to track down the values file.
  • The longest amount of time was spent trying to get Ceph to work on GKE. I finally realized that Container-Optimized OS doesn't ship the RBD driver, and autoscaling node pools aren't available for the Ubuntu image, so I'm stuck with NFS for now. This looks to be more Ceph's fault than Google's or the CNCF's, as Ceph should really containerize the RBD driver as part of their CSI implementation (see "Containerized mounts", kubernetes/enhancements#278), but that may be a ton of work.
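
A sketch of the service decoupling suggested in the third bullet (names, selectors, and ports are illustrative, not the chart's actual ones):

# One Service for component-to-component traffic via cluster DNS...
apiVersion: v1
kind: Service
metadata:
  name: reana-server-internal
spec:
  type: ClusterIP
  selector:
    app: reana-server
  ports:
    - port: 80
      targetPort: 5000
---
# ...and a separate Service targeted only by the Ingress.
apiVersion: v1
kind: Service
metadata:
  name: reana-server-external
spec:
  type: NodePort   # GKE Ingress wants NodePort (or NEG-annotated) backends
  selector:
    app: reana-server
  ports:
    - port: 80
      targetPort: 5000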

I'd be happy to get on a call and discuss my use cases if you're interested. I'll shoot you an email from eli at cradle dot bio.
