Run hooks as root before starting user server.
When running JupyterHub on Kubernetes, you want user pods to
run as non-root users. This is good security practice, and can seriously reduce the blast
radius if a user server is compromised. For example, if you run your containers with
privileged: True, a compromised user server will likely be able to take control of your
entire Kubernetes cluster and, depending on how it's configured, your cloud account!
Nobody wants that.
However, people often do want to run some commands as root before the user server starts. Usually this is to mount filesystems, although there are other use cases too.
So the goals are:
- Run some commands as root before the user server starts.
- Keep going if these commands fail: a failing command should not prevent the server from starting. A failed start mostly just shows the user an unhelpful 'your server has failed to start' error. In most cases, it is better to start the server and provide some logging so the user can investigate what went wrong.
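The second goal can be sketched in a few lines of shell. This is only an illustration of the intended behaviour, not how jupyterhub-roothooks is implemented; run_hooks is a hypothetical name chosen for this sketch.

```bash
#!/bin/bash
# Illustrative sketch only: NOT the jupyterhub-roothooks implementation.
# run_hooks is a hypothetical helper name.
run_hooks() {
    local hooks_dir="$1"
    local hook
    # Glob expansion is sorted by name, so 01-... runs before 02-...
    for hook in "$hooks_dir"/*; do
        # Skip anything that is not an executable regular file.
        [ -f "$hook" ] && [ -x "$hook" ] || continue
        if ! "$hook"; then
            # Log the failure for later debugging, but keep going.
            echo "hook failed: $hook" >&2
        fi
    done
    # Always report success so the caller goes on to start the server.
    return 0
}
```

The important design choice is that a failing hook is logged rather than treated as fatal, so the user still gets a running server to debug from.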
jupyterhub-roothooks is designed to solve this very specific problem.
repo2docker is a common way to build images for use with JupyterHub, so
jupyterhub-roothooks specifies some defaults that make it easy to integrate with
repo2docker.
- Install jupyterhub-roothooks into your container image, by adding it to your requirements.txt file or under pip: in your environment.yml file.
- Add a roothooks.d directory to your repo.
- Add scripts you want executed as root inside the roothooks.d directory. These will be executed in sorted order, so you can clarify the ordering by prefixing them with numbers like 01-first-script.sh, 02-second-script.sh.
- Make sure these scripts are marked as executable (with chmod +x <script-name>), and have an appropriate shebang.
- Add a start script that looks like this:

      #!/bin/bash -l
      exec jupyterhub-roothooks --uid 1000 --gid 1000 -- "$@"

  This will start jupyterhub-roothooks, which will execute any executable scripts it finds in roothooks.d, and then run the appropriate command to start the user server (passed in via "$@") with the non-root uid 1000 and gid 1000.
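As a rough idea of what a file in roothooks.d might look like, here is a minimal hook. The file name, the SCRATCH_DIR variable and the directory path are assumptions made for this sketch; a real hook would typically do something that genuinely needs root, such as a mount.

```bash
#!/bin/bash
# Hypothetical roothooks.d/01-first-script.sh. The path below is an assumption.
# Stop at the first failing command or unset variable.
set -eu

# Directory to prepare; overridable so the script can be tried outside a container.
SCRATCH_DIR="${SCRATCH_DIR:-/tmp/roothooks-scratch}"

# Create a world-writable directory with the sticky bit set (like /tmp),
# so the non-root user server can use it later.
mkdir -p "$SCRATCH_DIR"
chmod 1777 "$SCRATCH_DIR"

# Leave a trace in the container logs to help with debugging.
echo "01-first-script: prepared $SCRATCH_DIR as uid $(id -u)"
```

The shebang on the first line and the executable bit are what make the script eligible to run as a hook.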
If you build your image from your own Dockerfile instead:
- Install jupyterhub-roothooks into your container image.

      RUN pip install --no-cache jupyterhub-roothooks

- Add a roothooks.d directory to your repo, and copy it over to your container image.

      COPY roothooks.d /srv/roothooks.d

- Add scripts you want executed as root inside the roothooks.d directory. These will be executed in sorted order, so you can clarify the ordering by prefixing them with numbers like 01-first-script.sh, 02-second-script.sh.
- Make sure these scripts are marked as executable (with chmod +x <script-name>), and have an appropriate shebang.
- Add an ENTRYPOINT to your Dockerfile that invokes jupyterhub-roothooks:

      ENTRYPOINT ["jupyterhub-roothooks", "--uid", "1000", "--gid", "1000", "--hooks-dir", "/srv/roothooks.d", "--"]

  This will start jupyterhub-roothooks, which will execute any executable scripts it finds in roothooks.d, and then run the appropriate command to start the user server (passed in via args) with the non-root uid 1000 and gid 1000.
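Because hooks need both an executable bit and a shebang, it can help to check roothooks.d before building the image. The helper below is a sketch of such a check; check_hooks is a hypothetical name and is not part of jupyterhub-roothooks.

```bash
#!/bin/bash
# check_hooks is a hypothetical helper, not part of jupyterhub-roothooks.
# It prints one line per problem and returns non-zero if any were found.
check_hooks() {
    local hooks_dir="$1"
    local status=0
    local hook first_line
    for hook in "$hooks_dir"/*; do
        [ -f "$hook" ] || continue
        if [ ! -x "$hook" ]; then
            echo "not executable: $hook"
            status=1
        fi
        # Read only the first line and check that it starts with '#!'.
        first_line=""
        IFS= read -r first_line < "$hook" || true
        case "$first_line" in
            '#!'*) ;;
            *) echo "missing shebang: $hook"; status=1 ;;
        esac
    done
    return "$status"
}
```

For example, `check_hooks roothooks.d` could be run locally or in CI before the image is built.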
Now that the image is prepared, you can grant elevated root capabilities to the user pod
via z2jh (Zero to JupyterHub) config. Note that while the container will have these
capabilities, the user server itself will not: jupyterhub-roothooks will drop them before
starting the user server.
    hub:
      config:
        KubeSpawner:
          container_security_context:
            # Run the container *truly* as privileged. This can be very dangerous,
            # but is required for doing most filesystem mounts
            privileged: true
            runAsUser: 0
            allowPrivilegeEscalation: true
            capabilities:
              add:
                - SYS_ADMIN
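One way to sanity-check the privilege drop is to inspect the process from a terminal inside the running user server. The snippet below is a sketch that assumes a Linux container with /proc mounted; effective_caps is a hypothetical helper name. If privileges were dropped as described above, you would expect a non-root uid and an all-zero capability mask.

```bash
#!/bin/bash
# effective_caps is a hypothetical helper name. Linux only: it reads /proc.
effective_caps() {
    # CapEff is the process's effective capability set, as a hex bitmask.
    awk '/^CapEff:/ { print $2 }' /proc/self/status
}

echo "uid: $(id -u)"
echo "effective capabilities: $(effective_caps)"
```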
There are projects like gcsfuse or s3fuse that allow users to access object storage (like GCS or S3) via a traditional POSIX filesystem-like interface. This has serious performance and reliability disadvantages, as an extra layer of complexity is added to each data access. If your code accesses object storage directly, the path is:
    flowchart LR
        code[Your Code]-- HTTP -->obj[Object Storage]

With FUSE, it becomes:

    flowchart LR
        code[Your Code]-- Filesystem Call-->kernel[Linux Kernel]-->fuse[User space fuse driver]--HTTP -->obj[Object Storage]
The extra hops add complexity as well as performance degradation that could be avoided. Some aspects don't translate very well (for example, RANGE requests), and you might lose performance there too.
So you should just get your data directly from object storage when possible! However,
if you really need to, you can use jupyterhub-roothooks to set up FUSE.
This hook script provides an example of how you would do this on S3.
You need to make sure the s3fs apt package is installed in the image.
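For orientation, a hook of this kind might look roughly like the sketch below. It is not the script referred to above: the file name, the S3_BUCKET and S3_MOUNTPOINT variables, the bucket name and the mount point are all placeholders, and it assumes s3fs can find credentials by its usual means.

```bash
#!/bin/bash
# Hypothetical roothooks.d/10-mount-s3.sh. The bucket name, mount point and
# environment variable names are placeholders for this sketch.
mount_bucket() {
    local bucket="$1"
    local mountpoint="$2"
    if ! command -v s3fs >/dev/null 2>&1; then
        echo "s3fs is not installed; skipping mount of $bucket" >&2
        return 1
    fi
    mkdir -p "$mountpoint" || return 1
    # allow_other lets the non-root user server read a mount created by root;
    # uid/gid make the files appear owned by the user the server runs as.
    s3fs "$bucket" "$mountpoint" -o allow_other -o uid=1000 -o gid=1000
}

# Log a failure instead of aborting, so the problem shows up in the logs.
mount_bucket "${S3_BUCKET:-example-bucket}" "${S3_MOUNTPOINT:-/mnt/s3}" \
    || echo "could not mount S3 bucket; the server will start without it" >&2
```

The uid and gid values match the 1000/1000 passed to jupyterhub-roothooks earlier; adjust them if your image uses a different user.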