Rootless Podman using Linux Capabilities (on the host) #7866
Comments
I don't believe this is possible. Rootless Podman runs in a user namespace
to enable some necessary functionality (mounting some filesystems, mapping
in more than one user, etc). The kernel provides new, namespaced
capabilities within the user namespace. They are limited in what they can
do for security reasons, and, also for security reasons, the kernel will not
allow actual host capabilities into the user namespace.
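One way to inspect what a process actually holds is to decode the `CapEff` or `CapAmb` line of `/proc/self/status`. The following is a minimal sketch, not part of Podman or of the reporter's code; the helper names are hypothetical, and the capability numbers are the ones defined in `<linux/capability.h>`. Note that the mask alone does not say which user namespace the capabilities are relative to, which is exactly the limitation described above.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Capability numbers as defined in <linux/capability.h>. */
enum { SKETCH_CAP_NET_BIND_SERVICE = 10, SKETCH_CAP_NET_RAW = 13 };

/* Parse a hex mask such as "0000000000002400"; returns 1 on success. */
int parse_cap_mask(const char *hex, unsigned long long *mask)
{
    char *end = NULL;

    if (hex == NULL || *hex == '\0')
        return 0;
    *mask = strtoull(hex, &end, 16);
    return *end == '\0' || *end == '\n';
}

/* Returns 1 if bit `cap` is set in the mask, 0 otherwise. */
int cap_is_set(unsigned long long mask, int cap)
{
    assert(cap >= 0 && cap < 64);
    return (int)((mask >> cap) & 1ULL);
}

/*
 * Read one capability line of /proc/self/status for the calling process.
 * `field` is a name such as "CapEff" or "CapAmb". Returns 1 on success.
 */
int read_own_cap_mask(const char *field, unsigned long long *mask)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    size_t n = strlen(field);
    int found = 0;

    if (f == NULL)
        return 0;
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, field, n) == 0 && line[n] == ':') {
            const char *p = line + n + 1;
            while (*p == ' ' || *p == '\t')
                p++;
            found = parse_cap_mask(p, mask);
            break;
        }
    }
    fclose(f);
    return found;
}
```

Calling `read_own_cap_mask("CapAmb", &m)` in a wrapper before the `execl`, and again from a process inside the container, would show whether the ambient capability was raised and what the namespaced process reports.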
…On Thu, Oct 1, 2020, 07:36 stephengaito ***@***.***> wrote:
/kind feature
*Description*
(This is a more general and disciplined solution to, for example,
"5) rootless containers cannot ping hosts"
<https://github.com/containers/podman/blob/master/troubleshooting.md#5-rootless-containers-cannot-ping-hosts>.)
I would like to make use of one or more linux capabilities on the *host*
in *rootless-podman* mode.
For example, I would like to make use of *both* CAP_NET_BIND_SERVICE and
CAP_NET_RAW.
While I could use sysctl net.ipv4.ip_unprivileged_port_start=0 instead
of CAP_NET_BIND_SERVICE, there is no equivalent for CAP_NET_RAW.
More importantly, using sysctl net.ipv4.ip_unprivileged_port_start=0 is a
blunt tool: once permitted, every user and every executable can run a
service on a "privileged port".
Similarly, using setcap cap_net_bind_service+iep /usr/bin/podman would
grant CAP_NET_BIND_SERVICE to *all* containers run by podman.
It is more secure to have a simple executable which grants a limited
range of linux capabilities to an internally named execl'ed executable
(such as podman pod start somePod).
The following example code is based upon Adrian Mouat's
set_ambient.c
<https://github.com/ContainerSolutions/capabilities-blog/blob/master/set_ambient.c>
(See also: Linux Capabilities In
Practice
<https://blog.container-solutions.com/linux-capabilities-in-practice>)
/*
* Simple program to start a specific podman pod with
* CAP_NET_BIND_SERVICE
* in the ambient capabilities.
*
* Based on Adrian Mouat's set_ambient.c program.
* (https://github.com/ContainerSolutions/capabilities-blog/blob/master/set_ambient.c)
*
* (C) 2015 Christoph Lameter ***@***.***>
* (C) 2019 Adrian Mouat ***@***.***>
* (C) 2020 Stephen Gaito ***@***.***>
*
* Released under: GPL v3 or later.
*
* Compile using:
*
* gcc ./startPod.c -o startPod -lcap-ng
*
* (requires cap-ng headers, which is in libcap-ng-dev in debian)
*
* This program must have the
* CAP_NET_BIND_SERVICE
* capabilities in the permitted set to run properly.
*
* This can be set on the file with:
*
* sudo setcap cap_net_bind_service+p startPod
*
* An example pod might be created by the following commands:
*
* ```
podman pod create \
--name somePod \
--publish 0.0.0.0:80:80
podman container create \
--name somePodWebfs \
--restart always \
--pod somePod \
jonashaag/webfsd
* ```
*
*/
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <unistd.h>   /* execl */
#include <cap-ng.h>
#include <sys/prctl.h>
#include <linux/capability.h>

static void set_ambient_cap(int cap)
{
    int rc;

    /* Load the current capability sets of this process */
    capng_get_caps_process();
    rc = capng_update(CAPNG_ADD, CAPNG_INHERITABLE, cap);
    if (rc) {
        printf("Cannot add inheritable cap\n");
        exit(2);
    }
    capng_apply(CAPNG_SELECT_CAPS);

    /* Note the two 0s at the end. Kernel checks for these */
    if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, cap, 0, 0)) {
        perror("Cannot set cap");
        exit(1);
    }
}

int main(int argc, char **argv)
{
    set_ambient_cap(CAP_NET_BIND_SERVICE);
    printf("Starting podman pod somePod with CAP_NET_BIND_SERVICE in ambient\n");
    // if (execl("/usr/bin/echo", "echo", "pod", "start", "somePod", (char*)NULL)) {
    // if (execl("/usr/sbin/capsh", "capsh", "--print", (char*)NULL)) {
    /* execl() only returns if the exec itself failed */
    if (execl("/usr/bin/podman", "podman", "pod", "start", "somePod", (char*)NULL)) {
        perror("Cannot exec [/usr/bin/podman pod start somePod]");
        return -1;
    }
    return 0;
}
This code is short, sweet and easily audited, and most importantly does
one thing in a known enhanced security environment.
After creating the podman pod somePod as suggested in the above code,
the command ./startPod should result in a running podman pod somePod
(which is able to serve http requests on the host's port 80).
*Steps to reproduce the issue:*
1. compile the above example code (as outlined in the code above).
2. use setcap to grant the startPod command the required linux
   capabilities (as outlined in the code above).
3. create the podman pod somePod (as outlined in the above code).
4. run the command ./startPod
*Describe the results you received:*
Starting podman pod somePod with CAP_NET_BIND_SERVICE in ambient
error starting container
a23870f01ca443d830882e4e507d660dd207191197e31a4e9868e967ce75fabf: failed to expose
ports via rootlessport: "cannot expose privileged port 80, you might need to add
\"net.ipv4.ip_unprivileged_port_start=0\" (currently 1024) to /etc/sysctl.conf, or choose a larger
port number (>= 1024): listen tcp 0.0.0.0:80: bind: permission denied\n"
Error: error starting container
e107b93b546fc39199bbadfc551e4c4abe4bf1a41e77e52eb181775e6096b96d: a dependency of
container e107b93b546fc39199bbadfc551e4c4abe4bf1a41e77e52eb181775e6096b96d failed to
start: container state improper
*Describe the results you expected:*
Should result in a running pod somePod for which
the command:
lynx http://10.42.0.211:80
would produce:
listing: /
_____________________________________________________________________
access user group date size name
_____________________________________________________________________
webfs/1.21 01/Oct/2020 10:41:57 GMT
*Additional information you deem important (e.g. issue happens only
occasionally):*
*Output of podman version:*
Version: 2.1.1
API Version: 2.0.0
Go Version: go1.15.2
Built: Thu Jan 1 00:00:00 1970
OS/Arch: linux/amd64
*Output of podman info --debug:*
host:
  arch: amd64
  buildahVersion: 1.16.1
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: 'conmon: /usr/libexec/podman/conmon'
    path: /usr/libexec/podman/conmon
    version: 'conmon version 2.0.20, commit: '
  cpus: 2
  distribution:
    distribution: ubuntu
    version: "20.04"
  eventLogger: journald
  hostname: sDev
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 4242
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 4242
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 5.4.0-48-generic
  linkmode: dynamic
  memFree: 1629048832
  memTotal: 2084036608
  ociRuntime:
    name: crun
    package: 'crun: /usr/bin/crun'
    path: /usr/bin/crun
    version: |-
      crun version UNKNOWN
      commit: 3e46dd849fdf6bfa68127786e073318184641f05
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  os: linux
  remoteSocket:
    path: /run/user/4242/podman/podman.sock
  rootless: true
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: 'slirp4netns: /usr/bin/slirp4netns'
    version: |-
      slirp4netns version 1.1.4
      commit: unknown
      libslirp: 4.3.1-git
      SLIRP_CONFIG_VERSION_MAX: 3
  swapFree: 0
  swapTotal: 0
  uptime: 45.68s
registries:
  search:
  - docker.io
  - quay.io
store:
  configFile: /home/dev/.config/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: vfs
  graphOptions: {}
  graphRoot: /home/dev/.local/share/containers/storage
  graphStatus: {}
  imageStore:
    number: 10
  runRoot: /run/user/4242/containers
  volumePath: /home/dev/.local/share/containers/storage/volumes
version:
  APIVersion: 2.0.0
  Built: 0
  BuiltTime: Thu Jan 1 00:00:00 1970
  GitCommit: ""
  GoVersion: go1.15.2
  OsArch: linux/amd64
  Version: 2.1.1
*Package info (e.g. output of rpm -q podman or apt list podman):*
Listing... Done
podman/unknown,now 2.1.1~1 amd64 [installed]
podman/unknown 2.1.1~1 arm64
podman/unknown 2.1.1~1 armhf
podman/unknown 2.1.1~1 s390x
Installed on Ubuntu using an
/etc/apt/sources.list.d/devel:kubic:libcontainers:stable.list file whose
content is:
deb http://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable/xUbuntu_20.04/
*Have you tested with the latest version of Podman and have you checked
the Podman Troubleshooting Guide?*
Yes
*Additional environment details (AWS, VirtualBox, physical, etc.):*
I am working inside a recent (two days old) QEMU-KVM VM based upon Ubuntu's
Focal CloudImage:
http://cloud-images.ubuntu.com/releases/focal/release/ubuntu-20.04-server-cloudimg-amd64.img
I have the additional kernel parameter set:
GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=1"
Why not just run the container through sudo with the user set back to the current user?
sudo podman run --user $(id -u) --cap-add cap_net_raw ...
Many thanks for both an excellent set of tools as well as your swift replies.
A rather flippant response to @rhatdan's comment is that, well, it is not rootless. At first sight it grants podman more capabilities than should be needed for the problem at hand. On second sight, it also means that I, as a systems admin, have to trust that podman does in fact drop privileges correctly. Auditing the whole of podman is much more difficult (not least because podman is evolving rather quickly) than auditing my example code above.
However, the subtext of @rhatdan's comment is that, maybe, in my horror at docker's security model, I might be "throwing the baby out with the bath water". That is, maybe it is secure (enough) to use that approach.
@mheon I thought that it was simply a matter of some exec/fork call dropping privileges too ruthlessly. I see that I need to read more carefully the (few) documents about both capabilities and namespaces (some bedtime reading). For later reference I have found:
Are there other resources that I should consult?
Given the usefulness to the podman community of being able to ensure that security escalation only happens in ultra simple and easily audited code, please humour me by leaving this feature request open for a while, while I try to understand (and document?) why my request won't work. After a month of no response (on my part), please feel free to close this request. |
So I have just investigated @rhatdan's suggestion:
TL;DR: This runs a process inside a container whose users map directly to the host. So a privilege escalation inside the container is actually a privilege escalation on the host itself. This is definitely not what I mean by "rootless-podman".
Rootful-Podman: Using the "User and group ID mappings: uid_map and gid_map" section of user_namespaces(7), I read the following output as saying that uids inside the container map directly to uids on the host. That is
Rootless-Podman: Similarly, I read the following output to mean that the
So for the medium term, I would rather use the
solution even if it grants slightly more permissions than needed. |
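The uid_map reading described in this comment can be sketched in C. This is an illustration only, with hypothetical helper names; it assumes the three-column line format documented in user_namespaces(7) (ID inside the namespace, ID outside the namespace, length of the range) and uses the `uidmap` values from the `podman info --debug` output in the issue body.

```c
#include <assert.h>
#include <stdio.h>

/* One line of /proc/<pid>/uid_map: inside ID, outside ID, range length. */
struct id_map {
    unsigned long inside;
    unsigned long outside;
    unsigned long count;
};

/* Parse a single uid_map line; returns 1 on success, 0 otherwise. */
int parse_id_map_line(const char *line, struct id_map *m)
{
    return sscanf(line, "%lu %lu %lu", &m->inside, &m->outside, &m->count) == 3;
}

/*
 * Translate an ID seen inside the namespace to the ID seen outside it.
 * Returns 1 and stores the result in *host, or 0 if the ID is unmapped.
 */
int map_to_host(const struct id_map *maps, size_t n,
                unsigned long id, unsigned long *host)
{
    assert(host != NULL);
    for (size_t i = 0; i < n; i++) {
        if (id >= maps[i].inside && id - maps[i].inside < maps[i].count) {
            *host = maps[i].outside + (id - maps[i].inside);
            return 1;
        }
    }
    return 0;
}
```

With the rootless mapping reported above (`0 4242 1` and `1 100000 65536`), root in the container is the unprivileged host user 4242 and every other container uid lands in the subordinate range starting at 100000. With an identity mapping such as `0 0 4294967295`, the translation returns the same uid, so root in the container is root on the host.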
I would use something like:
sudo sysctl net.ipv4.ip_unprivileged_port_start=23
which would prevent a process run by your user from impersonating the sshd daemon. |
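The effect of that threshold can be sketched as follows. This is a minimal illustration with hypothetical helper names, assuming the documented meaning of the sysctl: ports below `ip_unprivileged_port_start` need CAP_NET_BIND_SERVICE, ports at or above it do not.

```c
#include <assert.h>
#include <stdio.h>

/* Returns 1 if binding `port` needs CAP_NET_BIND_SERVICE for the given threshold. */
int port_is_privileged(int port, int unprivileged_start)
{
    assert(port > 0 && port <= 65535);
    return port < unprivileged_start;
}

/* Reads the current threshold from procfs; returns -1 if it is unavailable. */
int read_unprivileged_port_start(void)
{
    FILE *f = fopen("/proc/sys/net/ipv4/ip_unprivileged_port_start", "r");
    int start = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &start) != 1)
        start = -1;
    fclose(f);
    return start;
}
```

With the threshold set to 23, port 22 stays privileged while port 80 becomes available to unprivileged users, including rootless Podman.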
Since this is impossible and there are other workarounds, I am closing. |
@rhatdan - I don't think it's entirely impossible. I needed
or the ugly/cleaner option |
This configuration will be cleared when the computer reboots. It needs to be added to the system configuration file if you want it to take effect after every reboot.
Paste the following parameters into your
|
I am grappling pretty much with this issue -- wanting Podman to publish a container port on a privileged host port for accessing the contained service as if it were running directly on the host. If I use dedicated network namespace for the container (default), From what I understand, since the following works as expected (
...yet using As much seems to be hinted at by
Would the above mean that Podman creating a user namespace for the container will cause the processes in the container to have no capabilities in the previous [user] namespace? That's certainly supported by the evidence here, although it may be a harmless correlation. Otherwise, I don't understand what's stopping the kernel from allowing the contained processes run by Podman to bind to privileged ports. The explanation by @mheon seems to be basically alluding to the same thing as above, although I wouldn't have been able to tell unless I had read the man-pages of
I mean, although possibly correct the explanation seems more along the lines of "we do advanced things so it won't work". |
You have it right - the kernel is dropping all capabilities in the root user namespace when Podman creates its own user namespace for use with rootless containers. And given that we require the user namespace for some things (mounting/unmounting filesystems, getting access to more than 1 UID/GID) there is no way to avoid the creation of a user namespace and dropping of "real" capabilities, even if granted explicitly by |
For your specific case (privileged ports) we generally recommend just making the ports unprivileged (via the |