Rootless Podman using Linux Capabilities (on the host) #7866

Closed
stephengaito opened this issue Oct 1, 2020 · 11 comments
Labels: kind/feature, locked - please file new issue/PR

@stephengaito

/kind feature

Description

(This is a more general and disciplined solution to, for example,
"5) rootless containers cannot ping hosts".)

I would like to make use of one or more linux capabilities on the host
in rootless-podman mode.

For example, I would like to make use of both CAP_NET_BIND_SERVICE and
CAP_NET_RAW.

While I could use sysctl net.ipv4.ip_unprivileged_port_start=0 instead
of CAP_NET_BIND_SERVICE, there is no equivalent for CAP_NET_RAW.

More importantly, sysctl net.ipv4.ip_unprivileged_port_start=0 is a
blunt tool: once it is set, every user and every executable can run a
service on a "privileged port".

Similarly, setcap cap_net_bind_service+iep /usr/bin/podman would
grant CAP_NET_BIND_SERVICE to every container run by podman.

It is more secure to have a simple executable which grants a limited
set of Linux capabilities to an internally named, execl'ed executable
(such as podman pod start somePod).

The following example code is based upon Adrian Mouat's set_ambient.c
(see also: Linux Capabilities In Practice).

/*
 * Simple program to start a specific podman pod with 
 *     CAP_NET_BIND_SERVICE
 * in the ambient capabilities. 
 *
 * Based on Adrian Mouat's set_ambient.c program.
 * (https://github.com/ContainerSolutions/capabilities-blog/blob/master/set_ambient.c)
 *
 * (C) 2015 Christoph Lameter <cl@linux.com>
 * (C) 2019 Adrian Mouat <adrian.mouat@container-solutions.com>
 * (C) 2020 Stephen Gaito <stephen@perceptisys.co.uk>
 *
 * Released under: GPL v3 or later.
 *
 * Compile using:
 *
 *      gcc ./startPod.c -o startPod -lcap-ng
 *
 * (requires the cap-ng headers, which are in libcap-ng-dev on Debian)
 *
 * This program must have the
 *     CAP_NET_BIND_SERVICE
 * capability in its permitted set to run properly.
 *
 * This can be set on the file with:
 *
 *	sudo setcap cap_net_bind_service+p startPod
 *
 * An example pod might be created by the following commands:
 *
 * ```
  podman pod create \
    --name somePod \
    --publish 0.0.0.0:80:80
 
  podman container create \
    --name somePodWebfs \
    --restart always \
    --pod somePod \
    jonashaag/webfsd
 * ```
 *
 */

#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <unistd.h>	/* execl() */
#include <cap-ng.h>
#include <sys/prctl.h>
#include <linux/capability.h>

static void set_ambient_cap(int cap)
{
	/* Load the current process capabilities into libcap-ng's state. */
	if (capng_get_caps_process()) {
		fprintf(stderr, "Cannot read process capabilities\n");
		exit(2);
	}
	/* A capability must be permitted and inheritable before it can be made ambient. */
	if (capng_update(CAPNG_ADD, CAPNG_INHERITABLE, cap)) {
		fprintf(stderr, "Cannot add inheritable cap\n");
		exit(2);
	}
	if (capng_apply(CAPNG_SELECT_CAPS)) {
		fprintf(stderr, "Cannot apply capabilities\n");
		exit(2);
	}

	/* Note the two 0s at the end. Kernel checks for these */
	if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, cap, 0, 0)) {
		perror("Cannot set cap");
		exit(1);
	}
}

int main(void)
{
	set_ambient_cap(CAP_NET_BIND_SERVICE);

	printf("Starting podman pod somePod with CAP_NET_BIND_SERVICE in ambient\n");
	/* execl() replaces the process image without flushing stdio buffers. */
	fflush(stdout);
//	execl("/usr/bin/echo", "echo", "pod", "start", "somePod", (char*)NULL);
//	execl("/usr/sbin/capsh", "capsh", "--print", (char*)NULL);
	execl("/usr/bin/podman", "podman", "pod", "start", "somePod", (char*)NULL);

	/* execl() only returns on failure. */
	perror("Cannot exec [/usr/bin/podman pod start somePod]");
	return 1;
}

This code is short, sweet and easily audited, and most importantly does
one thing in a known enhanced security environment.

After creating the podman pod somePod as suggested in the above code,
the command ./startPod should result in a running podman pod somePod
(which is able to serve http requests on the host's port 80).

Steps to reproduce the issue:

  1. compile the above example code (as outlined in the code above).

  2. use setcap to grant the startPod command the required linux
    capabilities (as outlined in the code above).

  3. create the podman pod somePod (as outlined in the above code).

  4. run the command ./startPod

Describe the results you received:

Starting podman pod somePod with CAP_NET_BIND_SERVICE in ambient
error starting container 
a23870f01ca443d830882e4e507d660dd207191197e31a4e9868e967ce75fabf: failed to expose 
ports via rootlessport: "cannot expose privileged port 80, you might need to add 
\"net.ipv4.ip_unprivileged_port_start=0\" (currently 1024) to /etc/sysctl.conf, or choose a larger 
port number (>= 1024): listen tcp 0.0.0.0:80: bind: permission denied\n"

Error: error starting container 
e107b93b546fc39199bbadfc551e4c4abe4bf1a41e77e52eb181775e6096b96d: a dependency of 
container e107b93b546fc39199bbadfc551e4c4abe4bf1a41e77e52eb181775e6096b96d failed to 
start: container state improper

Describe the results you expected:

Should result in a running pod somePod for which
the command:

lynx http://10.42.0.211:80

would produce:

                                   listing: /
     _____________________________________________________________________

access      user      group     date             size  name

     _____________________________________________________________________

   webfs/1.21   01/Oct/2020 10:41:57 GMT

Additional information you deem important (e.g. issue happens only occasionally):

Output of podman version:

Version:      2.1.1
API Version:  2.0.0
Go Version:   go1.15.2
Built:        Thu Jan  1 00:00:00 1970
OS/Arch:      linux/amd64

Output of podman info --debug:

host:
  arch: amd64
  buildahVersion: 1.16.1
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: 'conmon: /usr/libexec/podman/conmon'
    path: /usr/libexec/podman/conmon
    version: 'conmon version 2.0.20, commit: '
  cpus: 2
  distribution:
    distribution: ubuntu
    version: "20.04"
  eventLogger: journald
  hostname: sDev
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 4242
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 4242
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 5.4.0-48-generic
  linkmode: dynamic
  memFree: 1629048832
  memTotal: 2084036608
  ociRuntime:
    name: crun
    package: 'crun: /usr/bin/crun'
    path: /usr/bin/crun
    version: |-
      crun version UNKNOWN
      commit: 3e46dd849fdf6bfa68127786e073318184641f05
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  os: linux
  remoteSocket:
    path: /run/user/4242/podman/podman.sock
  rootless: true
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: 'slirp4netns: /usr/bin/slirp4netns'
    version: |-
      slirp4netns version 1.1.4
      commit: unknown
      libslirp: 4.3.1-git
      SLIRP_CONFIG_VERSION_MAX: 3
  swapFree: 0
  swapTotal: 0
  uptime: 45.68s
registries:
  search:
  - docker.io
  - quay.io
store:
  configFile: /home/dev/.config/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: vfs
  graphOptions: {}
  graphRoot: /home/dev/.local/share/containers/storage
  graphStatus: {}
  imageStore:
    number: 10
  runRoot: /run/user/4242/containers
  volumePath: /home/dev/.local/share/containers/storage/volumes
version:
  APIVersion: 2.0.0
  Built: 0
  BuiltTime: Thu Jan  1 00:00:00 1970
  GitCommit: ""
  GoVersion: go1.15.2
  OsArch: linux/amd64
  Version: 2.1.1

Package info (e.g. output of rpm -q podman or apt list podman):

Listing... Done
podman/unknown,now 2.1.1~1 amd64 [installed]
podman/unknown 2.1.1~1 arm64
podman/unknown 2.1.1~1 armhf
podman/unknown 2.1.1~1 s390x

Installed on Ubuntu using an /etc/apt/sources.list.d/devel:kubic:libcontainers:stable.list file whose content is:

deb http://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable/xUbuntu_20.04/ 

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide?

Yes

Additional environment details (AWS, VirtualBox, physical, etc.):

I am working inside a recent (two days old) QEMU-KVM VM based upon Ubuntu's
Focal CloudImage:

http://cloud-images.ubuntu.com/releases/focal/release/ubuntu-20.04-server-cloudimg-amd64.img

I have the additional kernel parameter set:

GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=1"
@mheon
Member

mheon commented Oct 1, 2020 via email

@rhatdan
Member

rhatdan commented Oct 1, 2020

Why not just run the container through sudo with the user set back to the current user?

sudo podman run --user $(id -u) --cap-add cap_net_raw ...

@stephengaito
Author

@mheon and @rhatdan

Many thanks for both an excellent set of tools as well as your swift replies.


A rather flippant response to @rhatdan's comment:

 Why not just run the container through sudo 
 with the user set back to the current user?

is that, well, it is not rootless.

At first sight it grants podman more capabilities than should be needed for the problem at hand. On second sight, it also means that I, as a systems admin, have to trust that podman does in fact drop privileges correctly. Auditing the whole of podman is much more difficult (not least because podman is evolving rather quickly) than auditing my example code above.

However, the subtext of @rhatdan's comment is that maybe, in my horror at docker's security model, I might be "throwing the baby out with the bath water"; that is, maybe it is secure (enough) to use sudo podman run --user somebody ...


@mheon I thought that it was simply a matter of some exec/fork call dropping privileges too ruthlessly.

I see that I need to read more carefully the (few) documents about both capabilities and namespaces.... (some bed time reading).

For later reference I have found:

Are there other resources that I should consult?


Given how useful it would be to the podman community to ensure that security escalation only happens in ultra simple and easily audited code, please humour me by leaving this feature request open for a while, while I try to understand (and document?) why my request won't work.

If there is no response from me after a month, please feel free to close this request.

@stephengaito
Author

So I have just investigated @rhatdan's suggestion:

Why not just run the container through sudo 
with the user set back to the current user?

TL;DR: This runs a process inside a container whose users map directly to the host. So a privilege escalation inside the container is actually a privilege escalation on the host itself. This is definitely not what I mean by "rootless-podman".


Rootful-Podman: Using the "User and group ID mappings: uid_map and gid_map" section of user_namespaces(7), I read the following output as saying that uids inside the container map directly to uids on the host. That is, root inside the container is root on the host. So while podman run --user guest might be switching to the guest user inside the container, there is no privilege-escalation separation between the container and the host.

dev@sDev ~> sudo podman run --user guest -it alpine cat /proc/1/uid_map
         0          0 4294967295
dev@sDev ~> sudo podman run --user guest -it alpine whoami
guest

Rootless-Podman: Similarly, I read the following output to mean that the root user inside the container maps to the 4242 (dev) user on the host. All other users inside the container map into the dev user's subordinate uid range defined in /etc/subuid. So, when run this way, the user inside the container is root, but the corresponding user on the host is merely dev (to whom I can choose to grant no extra privileges, such as sudo). This means that a privilege escalation inside the container only yields the privileges of my dev user on the host (which can be limited to as few privileges as needed).

dev@sDev ~> podman run -it alpine cat /proc/1/uid_map
         0       4242          1
         1     100000      65536
dev@sDev ~> podman run -it alpine whoami
root

So for the medium term, I would rather use the

sudo sysctl net.ipv4.ip_unprivileged_port_start=0

solution even if it grants slightly more permissions than needed.

@rhatdan
Member

rhatdan commented Oct 2, 2020

I would use something like:

sudo sysctl net.ipv4.ip_unprivileged_port_start=23

Which would prevent a process by your user impersonating the sshd daemon.

@rhatdan
Member

rhatdan commented Oct 21, 2020

Since this is impossible and there are other workarounds, I am closing.

@ruckc

ruckc commented Dec 1, 2022

@rhatdan - I don't think it's entirely impossible. I needed IPC_LOCK for running HashiCorp Vault rootless, and this seems to provide it. Yes... it does involve jumping through root, but the end processes are being run by the rootless user, with the rootless user's existing containers/images, without requiring double storage.

sudo capsh --user=$USER --caps=cap_ipc_lock+epi -- -c "HOME=${HOME} podman run ..."

or the ugly/cleaner option
sudo capsh --user=$USER --caps=cap_ipc_lock+epi -- -c "sudo su - $USER -c \"podman run ...\""

@dravenk

dravenk commented Feb 4, 2023

I would use something like:

sudo sysctl net.ipv4.ip_unprivileged_port_start=23

Which would prevent a process by your user impersonating the sshd daemon.

This setting is lost when the computer reboots. To make it persist across reboots, add it to a system configuration file:

sudo touch /etc/sysctl.d/local.conf

Put the following parameter into local.conf, for example to allow port 80 and above (see /etc/sysctl.d/README.sysctl for more information):

net.ipv4.ip_unprivileged_port_start=80

@amn

amn commented May 24, 2023

I am grappling with pretty much this issue: wanting Podman to publish a container port on a privileged host port, so that the contained service can be accessed as if it were running directly on the host. If I use a dedicated network namespace for the container (the default), rootlessport isn't permitted to bind to privileged port(s) when --publish is used with podman run, as it must be. If I use --network=host instead, Linux also refuses to permit Podman to bind to the privileged port(s).

From what I understand, since the following works as expected (nc is able to listen on port 80):

sudo capsh --inh=cap_net_bind_service --user=$USER --addamb=cap_net_bind_service -- -c 'nc -l 80'

...yet using podman run ... or unshare -U ... instead of nc ... fails, I deduced that this had something to do with user namespaces specifically, as the capabilities otherwise seem in order (cap_net_bind_service is in all relevant sets for the shell launched by capsh).

As much seems to be hinted at by man user_namespaces:

a process that creates a new user namespace using unshare(2) or joins an existing user namespace using setns(2) gains a full set of capabilities in that namespace. On the other hand, that process has no capabilities in the parent (in the case of clone(2)) or previous (in the case of unshare(2) and setns(2)) user namespace, even if the new namespace is created or joined by the root user (i.e., a process with user ID 0 in the root namespace).

Would the above mean that Podman creating a user namespace for the container, will cause the processes in the container to have no capabilities in the previous [user] namespace? That's certainly supported by evidence here, although it may be a harmless correlation.

Otherwise, I don't understand what's stopping the kernel from allowing the contained processes run by Podman, from binding to privileged ports?

Explanation by @mheon seems to be basically alluding to the same thing as above, although I wouldn't have been able to tell unless I had read the man-pages of user_namespaces in the first place. Matt's shorter elaboration is a bit vague, is all I am saying:

The kernel provides new, namespaced capabilities within the user namespace - they're limited in what they can do for security reasons, and also for security reasons the kernel will not allow actual capabilities into the user namespace.

I mean, although possibly correct the explanation seems more along the lines of "we do advanced things so it won't work".

@mheon
Member

mheon commented May 24, 2023

You have it right - the kernel is dropping all capabilities in the root user namespace when Podman creates its own user namespace for use with rootless containers. And given that we require the user namespace for some things (mounting/unmounting filesystems, getting access to more than 1 UID/GID) there is no way to avoid the creation of a user namespace and dropping of "real" capabilities, even if granted explicitly by capsh or similar.

@mheon
Member

mheon commented May 24, 2023

For your specific case (privileged ports) we generally recommend just making the ports unprivileged (via the net.ipv4.ip_unprivileged_port_start sysctl) - though this obviously has more implications than just granting a single container additional caps to bind a privileged port would.

@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Aug 23, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023