In this demo we will learn how we can work with capabilities inside our containers.
We will start by restricting the set of capabilities available in a local container with podman, and then we will see how we can manage capabilities for pods running on OpenShift.
We will see how we can add/drop capabilities in a given container using Podman and the implications of doing so.
NOTE: The tests below were run on a Fedora 35 machine with Podman v3.4.4; results may vary with other operating systems or Podman versions.
-
Install podman on your system in case you don't have it yet.
-
Let's run an nginx container and see which capabilities are added to the container.
podman run -d --rm --name nginx-cap-test nginx:latest
-
Let's check the capabilities assigned to that process.
-
We can do it using the /proc filesystem:
CONTAINER_PID=$(podman inspect nginx-cap-test --format {{.State.Pid}})
cat /proc/${CONTAINER_PID}/status | grep Cap
CapInh: 00000000a80425fb
CapPrm: 00000000a80425fb
CapEff: 00000000a80425fb
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000
-
We got a list of capabilities. If we want to decode the value we can use capsh:
capsh --decode=00000000a80425fb
0x00000000a80425fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap
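If capsh is not at hand, the same decoding can be sketched in plain shell, since each bit of the mask maps to one capability number. The function name decode_caps and the CAP_NAMES variable are ours (hypothetical helpers); the bit order is assumed to follow <linux/capability.h> up to cap_checkpoint_restore.

```shell
#!/bin/sh
# Capability names indexed by bit position (bit 0 = chown), assumed to follow
# the numbering in <linux/capability.h>.
CAP_NAMES="chown dac_override dac_read_search fowner fsetid kill setgid setuid
setpcap linux_immutable net_bind_service net_broadcast net_admin net_raw
ipc_lock ipc_owner sys_module sys_rawio sys_chroot sys_ptrace sys_pacct
sys_admin sys_boot sys_nice sys_resource sys_time sys_tty_config mknod lease
audit_write audit_control setfcap mac_override mac_admin syslog wake_alarm
block_suspend audit_read perfmon bpf checkpoint_restore"

# decode_caps HEXMASK: print the comma-separated capabilities set in HEXMASK.
decode_caps() {
  mask=$((0x$1))
  bit=0
  out=""
  for name in $CAP_NAMES; do
    # Test one bit at a time and append the matching name.
    if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
      out="${out:+$out,}cap_$name"
    fi
    bit=$((bit + 1))
  done
  printf '%s\n' "$out"
}

decode_caps 00000000a80425fb
```

Run against the mask from the /proc output, this should print the same list that capsh returned above.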
-
We can get them with podman inspect as well:
podman inspect nginx-cap-test --format {{.EffectiveCaps}}
[CAP_AUDIT_WRITE CAP_CHOWN CAP_DAC_OVERRIDE CAP_FOWNER CAP_FSETID CAP_KILL CAP_MKNOD CAP_NET_BIND_SERVICE CAP_NET_RAW CAP_SETFCAP CAP_SETGID CAP_SETPCAP CAP_SETUID CAP_SYS_CHROOT]
-
A third tool to get them is getpcaps:
CONTAINER_PID=$(podman inspect nginx-cap-test --format {{.State.Pid}})
getpcaps ${CONTAINER_PID}
-
We can stop the container now:
podman stop nginx-cap-test
-
-
So by default, Podman grants its built-in default set of capabilities or, alternatively, the ones configured in its configuration file.
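To see whether a configuration file overrides that default set, we can look for the default_capabilities array that containers.conf supports. The sketch below is a hypothetical helper (list_default_caps is our name), and it assumes the array is written in the usual TOML form with upper-case names in double quotes.

```shell
#!/bin/sh
# list_default_caps FILE: print one capability per line from an uncommented
# default_capabilities = [ ... ] array in a containers.conf-style file.
list_default_caps() {
  awk '
    /^[ \t]*default_capabilities[ \t]*=/ { inlist = 1 }
    inlist {
      line = $0
      # Pull every quoted upper-case name out of the current line.
      while (match(line, /"[A-Z_]+"/)) {
        print substr(line, RSTART + 1, RLENGTH - 2)
        line = substr(line, RSTART + RLENGTH)
      }
      # A closing bracket ends the array.
      if (line ~ /\]/) inlist = 0
    }
  ' "$1"
}
```

On a Fedora machine it could be called as list_default_caps /etc/containers/containers.conf; the path is an assumption, and an empty result just means the built-in defaults apply.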
We can also drop the capabilities our application doesn't need, which reduces the attack surface. Let's see.
-
We know that our demo application doesn't require any capability to run by default, so we can drop all capabilities:
podman run -d --rm --name reverse-words-app --cap-drop=all quay.io/mavazque/reversewords:latest
-
If we check the capabilities, we will see that the container has no capabilities:
podman inspect reverse-words-app --format {{.EffectiveCaps}}
[]
-
Now we want our application to bind to port 80, which requires the NET_BIND_SERVICE capability. Let's see what happens if we try to run the app without that capability:
podman run --rm --name reverse-words-app-80 --cap-drop=all -e APP_PORT=80 quay.io/mavazque/reversewords:latest
-
The container failed to start because it couldn't bind to port 80 inside the container's namespace; it required more privileges:
2021/01/19 16:12:53 Starting Reverse Api v0.0.17 Release: NotSet
2021/01/19 16:12:53 Listening on port 80
2021/01/19 16:12:53 listen tcp :80: bind: permission denied
-
-
If we add that capability, we will see how the container now starts properly and binds to port 80:
podman run --rm --name reverse-words-app-80 --cap-drop=all --cap-add=cap_net_bind_service -e APP_PORT=80 quay.io/mavazque/reversewords:latest
2021/01/19 16:14:34 Starting Reverse Api v0.0.17 Release: NotSet
2021/01/19 16:14:34 Listening on port 80
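To relate these flags to the masks seen in /proc, here is a small sketch that builds a mask from capability numbers and checks a mask for a given bit. The helper names has_cap and cap_mask are ours; the bit number 10 for NET_BIND_SERVICE is taken from <linux/capability.h>.

```shell
#!/bin/sh
CAP_NET_BIND_SERVICE=10   # bit position, from <linux/capability.h>

# has_cap HEXMASK BIT: succeed if bit number BIT is set in HEXMASK.
has_cap() {
  [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# cap_mask BIT...: print the 16-digit hex mask with the given bits set.
cap_mask() {
  mask=0
  for bit in "$@"; do
    mask=$(( mask | (1 << bit) ))
  done
  printf '%016x\n' "$mask"
}

# Mask with only NET_BIND_SERVICE set.
cap_mask "$CAP_NET_BIND_SERVICE"
```

For a container started with --cap-drop=all --cap-add=cap_net_bind_service we would expect CapEff in /proc to match this single-bit mask, while has_cap 00000000a80425fb 10 confirms the default set also includes it.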
Now it's time to see how capabilities can be managed on OpenShift.
Before we start, it is worth mentioning that a limitation in Kubernetes prevents capabilities from working as one would expect when pods do not run with UID 0. That's why we will allow our pods to run with any UID in the following examples.
There are tools such as capabilities_tracker which can help you to understand which capabilities are being used by your apps.
In this demo we are going to see how we can drop all capabilities but NET_BIND_SERVICE on the pod running our application.
-
Create a new namespace for running our tests:
NAMESPACE=test-capabilities
oc create ns ${NAMESPACE}
-
Create a service account and give it the edit role on the namespace:
oc -n ${NAMESPACE} create sa testuser
oc -n ${NAMESPACE} adm policy add-role-to-user edit system:serviceaccount:${NAMESPACE}:testuser
-
The default SCC restricted-v2 has the following settings for capabilities:
NOTE: The configuration below basically means that pods running with the restricted-v2 SCC can only gain the NET_BIND_SERVICE capability. Every other capability will be dropped.
defaultAddCapabilities: null
allowedCapabilities:
- NET_BIND_SERVICE
requiredDropCapabilities:
- ALL
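To read these three fields from any SCC on a live cluster, a jsonpath query can be wrapped in a small helper. This is only a sketch: scc_caps is a hypothetical name, and it assumes the logged-in user is allowed to read SCCs.

```shell
#!/bin/sh
# scc_caps NAME: print defaultAddCapabilities, allowedCapabilities and
# requiredDropCapabilities of the SCC called NAME, one field per line.
scc_caps() {
  oc get scc "$1" \
    -o jsonpath='{.defaultAddCapabilities}{"\n"}{.allowedCapabilities}{"\n"}{.requiredDropCapabilities}{"\n"}'
}
```

For example, scc_caps restricted-v2 queries the default SCC discussed here.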
-
We are going to create our own SCC based on the restricted SCC (NOT restricted-v2). On top of that, we need our container to run with anyuid (due to the issue mentioned earlier) and we want to be able to use the NET_BIND_SERVICE capability.
cat <<EOF | oc create -f -
kind: SecurityContextConstraints
metadata:
  name: restricted-netbind
priority: 1
readOnlyRootFilesystem: false
requiredDropCapabilities:
- KILL
- MKNOD
- SETUID
- SETGID
runAsUser:
  type: RunAsAny
seLinuxContext:
  type: MustRunAs
supplementalGroups:
  type: RunAsAny
users: []
volumes:
- configMap
- downwardAPI
- emptyDir
- persistentVolumeClaim
- projected
- secret
allowHostDirVolumePlugin: false
allowHostIPC: false
allowHostNetwork: false
allowHostPID: false
allowHostPorts: false
allowPrivilegeEscalation: true
allowPrivilegedContainer: false
allowedCapabilities:
- NET_BIND_SERVICE
apiVersion: security.openshift.io/v1
defaultAddCapabilities: null
fsGroup:
  type: MustRunAs
groups: []
EOF
-
We give the SA testuser access to this new SCC restricted-netbind:
oc -n ${NAMESPACE} adm policy add-scc-to-user restricted-netbind system:serviceaccount:${NAMESPACE}:testuser
-
On top of the SCC caps we can drop/add (add will depend on the SCC settings) capabilities on a given pod:
NOTE: The example below drops the NET_BIND_SERVICE capability.
cat <<EOF | oc -n ${NAMESPACE} create --as=system:serviceaccount:${NAMESPACE}:testuser -f -
apiVersion: v1
kind: Pod
metadata:
  name: reversewords-app-captest
spec:
  serviceAccountName: testuser
  containers:
  - image: quay.io/mavazque/reversewords:latest
    name: reversewords
    securityContext:
      capabilities:
        drop:
        - NET_BIND_SERVICE
  dnsPolicy: ClusterFirst
  restartPolicy: Never
status: {}
EOF
-
Since we're not binding to a privileged port, the application will start with no issues. On top of that, the pod started with the restricted SCC since it didn't need any extra config provided by our new SCC. Now, let's see what happens if we create the pod with a binding to port 80:
cat <<EOF | oc -n ${NAMESPACE} create --as=system:serviceaccount:${NAMESPACE}:testuser -f -
apiVersion: v1
kind: Pod
metadata:
  name: reversewords-app-captest-80
spec:
  serviceAccountName: testuser
  containers:
  - image: quay.io/mavazque/reversewords:latest
    name: reversewords
    env:
    - name: APP_PORT
      value: "80"
    securityContext:
      capabilities:
        drop:
        - NET_BIND_SERVICE
  dnsPolicy: ClusterFirst
  restartPolicy: Never
status: {}
EOF
-
The pod failed to run. In the logs we can see:
oc -n ${NAMESPACE} logs reversewords-app-captest-80
2021/01/19 17:12:39 Starting Reverse Api v0.0.17 Release: NotSet
2021/01/19 17:12:39 Listening on port 80
2021/01/19 17:12:39 listen tcp :80: bind: permission denied
-
-
If we drop all capabilities and add NET_BIND_SERVICE to the list of capabilities, we will see that the pod now runs properly:
cat <<EOF | oc -n ${NAMESPACE} create --as=system:serviceaccount:${NAMESPACE}:testuser -f -
apiVersion: v1
kind: Pod
metadata:
  name: reversewords-app-captest-80-2
spec:
  serviceAccountName: testuser
  containers:
  - image: quay.io/mavazque/reversewords:latest
    name: reversewords
    env:
    - name: APP_PORT
      value: "80"
    securityContext:
      runAsUser: 0
      capabilities:
        drop:
        - ALL
        add:
        - NET_BIND_SERVICE
  dnsPolicy: ClusterFirst
  restartPolicy: Never
status: {}
EOF
oc -n ${NAMESPACE} logs reversewords-app-captest-80-2
NOTE: In this case the container image for the reversewords application was built with UID 0 as the default user. If your application was built to run with a UID other than 0, you must force UID 0 by adding runAsUser: 0 to your configuration under securityContext. Otherwise, the added capabilities won't be effective.
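To confirm what a running pod actually ended up with, we can read the capability masks of PID 1 from inside the container, just as we did with Podman. This is a sketch with a hypothetical helper name (pod_caps), and it assumes the container image ships a cat binary, which may not be true for minimal images.

```shell
#!/bin/sh
# pod_caps NAMESPACE POD: print the Cap* lines of PID 1 inside the pod.
# The filtering with grep happens locally, so only cat is needed in the image.
pod_caps() {
  oc -n "$1" exec "$2" -- cat /proc/1/status | grep '^Cap'
}
```

For instance, pod_caps ${NAMESPACE} reversewords-app-captest-80-2 should show the masks for the pod above; with UID 0, all capabilities dropped and NET_BIND_SERVICE added, we would expect CapEff to contain only that single bit.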
In this demo we need an SCC so we can run a pod that changes the ownership of the /etc/resolv.conf file to the nobody user. The information we have is that the CHOWN capability is required for chown to work.
-
Let's try to run the pod with the SCC we have already in place:
cat <<EOF | oc -n ${NAMESPACE} create --as=system:serviceaccount:${NAMESPACE}:testuser -f -
apiVersion: v1
kind: Pod
metadata:
  name: chown-test
spec:
  serviceAccountName: testuser
  containers:
  - image: quay.io/fedora/fedora:36
    command: ["chown", "-v", "nobody", "/etc/resolv.conf"]
    name: centos
    securityContext:
      capabilities:
        drop:
        - CHOWN
        - DAC_OVERRIDE
        - FSETID
        - FOWNER
        - SETGID
        - SETUID
        - SETPCAP
        - KILL
        - NET_BIND_SERVICE
  dnsPolicy: ClusterFirst
  restartPolicy: Never
status: {}
EOF
-
The pod failed:
oc -n ${NAMESPACE} logs chown-test
chown: changing ownership of '/etc/resolv.conf': Operation not permitted
failed to change ownership of '/etc/resolv.conf' from root to nobody
-
We can patch the SCC we created in the previous demo and allow the use of CHOWN capability as well:
oc patch scc restricted-netbind -p '{"allowedCapabilities":["NET_BIND_SERVICE","CHOWN"]}' --type=merge
-
Let's try to create the pod again, but now request the capability we just added to the SCC:
cat <<EOF | oc -n ${NAMESPACE} create --as=system:serviceaccount:${NAMESPACE}:testuser -f -
apiVersion: v1
kind: Pod
metadata:
  name: chown-test-2
spec:
  serviceAccountName: testuser
  containers:
  - image: registry.centos.org/centos:8
    command: ["chown", "-v", "nobody", "/etc/resolv.conf"]
    name: centos
    securityContext:
      runAsUser: 0
      capabilities:
        drop:
        - DAC_OVERRIDE
        - FSETID
        - FOWNER
        - SETGID
        - SETUID
        - SETPCAP
        - KILL
        - NET_BIND_SERVICE
        add:
        - CHOWN
  dnsPolicy: ClusterFirst
  restartPolicy: Never
status: {}
EOF
-
Now we can see that it works as expected:
oc -n ${NAMESPACE} logs chown-test-2
changed ownership of '/etc/resolv.conf' from root to nobody
In this demo we are going to deploy the application from Demo 1, but this time using a deployment. We will see how we assign an SCC to a workload in a more realistic scenario. Remember: users do not usually create pods manually.
-
We will create the deployment for our app:
NOTE: As you can see, we're using the default SA for running our deployment.
cat <<EOF | oc -n ${NAMESPACE} create --as=system:serviceaccount:${NAMESPACE}:testuser -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: reversewords-app-captest
  name: reversewords-app-captest
spec:
  replicas: 1
  selector:
    matchLabels:
      app: reversewords-app-captest
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: reversewords-app-captest
    spec:
      serviceAccountName: default
      containers:
      - image: quay.io/mavazque/reversewords:latest
        name: reversewords
        resources: {}
        env:
        - name: APP_PORT
          value: "80"
        securityContext:
          runAsUser: 0
          capabilities:
            drop:
            - ALL
            add:
            - NET_BIND_SERVICE
status: {}
EOF
-
No pods were created, but why? Let's check the deployment status:
oc -n ${NAMESPACE} get deployment reversewords-app-captest -o jsonpath='{.status.conditions[*]}' | jq
-
The deployment couldn't use the SCC restricted-netbind because the ServiceAccount default used by the deployment doesn't have access to it.
{
  "lastTransitionTime": "2022-08-29T13:47:13Z",
  "lastUpdateTime": "2022-08-29T13:47:13Z",
  "message": "pods \"reversewords-app-captest-6b89bbc766-\" is forbidden: unable to validate against any security context constraint: [provider \"anyuid\": Forbidden: not usable by user or serviceaccount, spec.containers[0].securityContext.runAsUser: Invalid value: 0: must be in the ranges: [1000800000, 1000809999], provider \"restricted-v2-seccomp\": Forbidden: not usable by user or serviceaccount, provider \"restricted\": Forbidden: not usable by user or serviceaccount, provider \"nonroot-v2\": Forbidden: not usable by user or serviceaccount, provider \"nonroot\": Forbidden: not usable by user or serviceaccount, provider \"restricted-netbind\": Forbidden: not usable by user or serviceaccount, provider \"hostmount-anyuid\": Forbidden: not usable by user or serviceaccount, provider \"machine-api-termination-handler\": Forbidden: not usable by user or serviceaccount, provider \"hostnetwork-v2\": Forbidden: not usable by user or serviceaccount, provider \"hostnetwork\": Forbidden: not usable by user or serviceaccount, provider \"hostaccess\": Forbidden: not usable by user or serviceaccount, provider \"node-exporter\": Forbidden: not usable by user or serviceaccount, provider \"privileged\": Forbidden: not usable by user or serviceaccount]",
  "reason": "FailedCreate",
  "status": "True",
  "type": "ReplicaFailure"
}
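When only the failure reason matters, a jsonpath filter can select the ReplicaFailure condition and print just its message, without needing jq. The helper name replica_failure_message is ours; this is a sketch, not the only way to query it.

```shell
#!/bin/sh
# replica_failure_message NAMESPACE DEPLOYMENT: print only the message of the
# ReplicaFailure condition, if the deployment currently has one.
replica_failure_message() {
  oc -n "$1" get deployment "$2" \
    -o jsonpath='{.status.conditions[?(@.type=="ReplicaFailure")].message}{"\n"}'
}
```

Calling replica_failure_message ${NAMESPACE} reversewords-app-captest should print the "unable to validate against any security context constraint" text shown above.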
-
At this point you might think that giving the default SA access to the SCC restricted-netbind will solve the issue. And that's right, but there is a better way.
-
-
Let's create a new SA for running our application:
oc -n ${NAMESPACE} create sa reverse-words-app
-
Now, it's time to give it access to the restricted-netbind SCC:
oc -n ${NAMESPACE} adm policy add-scc-to-user restricted-netbind system:serviceaccount:${NAMESPACE}:reverse-words-app
-
Finally, patch the deployment so it uses the new SA we created:
oc -n ${NAMESPACE} patch deployment reversewords-app-captest -p '{"spec":{"template":{"spec":{"serviceAccountName":"reverse-words-app"}}}}' --type merge
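The three steps above (create the SA, grant it the SCC, patch the deployment) can be sketched as one reusable function. The function name run_deployment_with_scc is hypothetical; the oc commands inside are the same ones used in the steps.

```shell
#!/bin/sh
# run_deployment_with_scc NAMESPACE DEPLOYMENT SA SCC
# Creates a dedicated SA, grants it the SCC and points the deployment at it.
# Stops at the first step that fails.
run_deployment_with_scc() {
  ns=$1 deploy=$2 sa=$3 scc=$4
  oc -n "$ns" create sa "$sa" &&
  oc -n "$ns" adm policy add-scc-to-user "$scc" "system:serviceaccount:${ns}:${sa}" &&
  oc -n "$ns" patch deployment "$deploy" --type merge \
    -p "{\"spec\":{\"template\":{\"spec\":{\"serviceAccountName\":\"${sa}\"}}}}"
}
```

With the names from this demo it would be called as run_deployment_with_scc ${NAMESPACE} reversewords-app-captest reverse-words-app restricted-netbind. Using a dedicated SA keeps the extra privileges away from every other workload that runs with the default SA.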
-
The deployment will run our container now:
oc -n ${NAMESPACE} get deployment reversewords-app-captest
NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
reversewords-app-captest   1/1     1           1           3m
In this demo, we are going to use file capabilities to see how we can grant capabilities to our binaries without having to run them with UID 0. We will use the reverse-words-app as the base.
-
Build the following Dockerfile:
cat <<EOF > /tmp/reversewords.dockerfile
FROM registry.access.redhat.com/ubi8:latest
ENV GOPATH=/go
RUN mkdir -p /go
RUN dnf install golang git -y
RUN go get github.com/gorilla/mux && go get github.com/prometheus/client_golang/prometheus/promhttp && go get github.com/mvazquezc/reverse-words
WORKDIR /go/src/github.com/mvazquezc/reverse-words/
RUN CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -o /usr/bin/reverse-words .
RUN rm -rf /go && dnf clean all
# Add the CAP_NET_BIND_SERVICE capability to our binary
RUN setcap 'cap_net_bind_service+ep' /usr/bin/reverse-words
EXPOSE 80
CMD ["/usr/bin/reverse-words"]
EOF
QUAY_USER=<your_user>
podman build -f /tmp/reversewords.dockerfile -t quay.io/${QUAY_USER}/reversewords-captest:latest
-
Once the image is built, push it to your favorite registry. In this example we're using quay.io but you can use one of your choice.
podman push quay.io/${QUAY_USER}/reversewords-captest:latest
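Before deploying, it can be worth checking that the file capability survived the build. A minimal sketch, assuming the image contains the getcap tool (it normally comes from the same libcap package as the setcap command used in the Dockerfile); image_file_caps is a hypothetical helper name.

```shell
#!/bin/sh
# image_file_caps IMAGE PATH: run getcap inside IMAGE to print the file
# capabilities set on PATH. The image entrypoint is overridden with getcap.
image_file_caps() {
  podman run --rm --entrypoint getcap "$1" "$2"
}
```

For this demo: image_file_caps quay.io/${QUAY_USER}/reversewords-captest:latest /usr/bin/reverse-words. An empty result would mean the binary carries no file capabilities.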
-
Let's create a deployment for the application:
NOTE: As you can see, we're using the reverse-words-app SA for running our deployment, we are not running as UID 0, and we're dropping all capabilities but NET_BIND_SERVICE.
APP_IMAGE=quay.io/${QUAY_USER}/reversewords-captest:latest
cat <<EOF | oc -n ${NAMESPACE} create --as=system:serviceaccount:${NAMESPACE}:testuser -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: reversewords-app-filecaptest
  name: reversewords-app-filecaptest
spec:
  replicas: 1
  selector:
    matchLabels:
      app: reversewords-app-filecaptest
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: reversewords-app-filecaptest
    spec:
      serviceAccountName: reverse-words-app
      containers:
      - image: ${APP_IMAGE}
        name: reversewords
        resources: {}
        env:
        - name: APP_PORT
          value: "80"
        securityContext:
          capabilities:
            add:
            - NET_BIND_SERVICE
            drop:
            - ALL
status: {}
EOF
-
The pod is running and it bound to port 80 even though it's running under the restricted-netbind SCC and with a non-root UID:
oc -n ${NAMESPACE} get pod -l app=reversewords-app-filecaptest -o yaml | grep scc
openshift.io/scc: restricted-netbind
oc -n ${NAMESPACE} logs -l app=reversewords-app-filecaptest
2021/01/29 16:58:28 Starting Reverse Api v0.0.18 Release: NotSet
2021/01/29 16:58:28 Listening on port 80
NOTE: Since the SCC selected was restricted-netbind, the user assigned to this pod is not UID 0; it belongs to the range allowed by the namespace instead. So basically, you were able to bind the application to a privileged port in the pod's namespace without being root.
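As an alternative to grepping the full YAML, the SCC annotation can be read directly with jsonpath. A sketch under the assumption that the pods carry the openshift.io/scc annotation shown above; pod_scc is a hypothetical helper name.

```shell
#!/bin/sh
# pod_scc NAMESPACE SELECTOR: print "<pod name><TAB><SCC>" for every pod
# matching the label selector. Dots in the annotation key are escaped.
pod_scc() {
  oc -n "$1" get pod -l "$2" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.openshift\.io/scc}{"\n"}{end}'
}
```

For example: pod_scc ${NAMESPACE} app=reversewords-app-filecaptest.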