Agent K8S workload attestor support for cgroups v2 #4004

Closed
tihhoni opened this issue Mar 21, 2023 · 12 comments · Fixed by #5076
Labels: help wanted (Issues with this label are ready to start work but are in need of someone to do it), priority/urgent (Issue is approved and must be completed in the assigned milestone)

Comments

@tihhoni

tihhoni commented Mar 21, 2023

  • Version: 1.5.5+
  • Platform: K8S with amd64 linux kernel using cgroups v2
  • Subsystem: Agent k8s workload attestor

Currently the k8s workload attestor extracts the podUID and containerUID from /proc/<pid>/cgroup:
https://github.com/spiffe/spire/blob/main/pkg/agent/plugin/workloadattestor/k8s/k8s.go#L206
https://github.com/spiffe/spire/blob/main/pkg/agent/common/cgroups/cgroups.go#L31

This works for cgroups v1, but it does not work with cgroups v2.
For v2, /proc/<pid>/cgroup will always be:
0::/
which results in no containerUID being extracted, and thus no k8s workload attestation is possible.
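
For reference, under cgroups v1 the same file typically carries both identifiers in its path. An illustrative example (assuming containerd with the systemd cgroup driver; the exact layout varies by runtime and cgroup driver):

4:memory:/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod02b5aa9b_9013_49b5_9755_efeb06552752.slice/cri-containerd-222a5371beed8c0ad58e23d1cd39e13bc113915a385d3fea4510e1cd404868c2.scope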

Are you planning to migrate it to support cgroups v2? E.g. looking at some solutions, it seems it's possible to extract the containerUID from /proc/self/mountinfo.
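
A minimal sketch of what a mountinfo-based lookup might look like (this is just an illustration, not SPIRE code; the cgroup/cgroup2 filesystem check and the 64-hex-character container ID segment are assumptions based on common docker/containerd layouts):

package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"strings"
)

// containerIDRE matches the 64-hex-character container IDs used by docker
// and containerd (assumption: other runtimes may use different formats).
var containerIDRE = regexp.MustCompile(`[0-9a-f]{64}`)

// containerIDFromMountinfo scans a mountinfo file for cgroup/cgroup2 mounts
// and tries to pull a container ID out of the mount root field.
func containerIDFromMountinfo(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// mountinfo fields: id, parent, major:minor, root, mount point, ...,
		// then a "-" separator followed by fstype, source, super options.
		fields := strings.Fields(scanner.Text())
		if len(fields) < 10 {
			continue
		}
		sep := -1
		for i, fld := range fields {
			if fld == "-" {
				sep = i
				break
			}
		}
		if sep < 0 || sep+1 >= len(fields) {
			continue
		}
		if fstype := fields[sep+1]; fstype != "cgroup" && fstype != "cgroup2" {
			continue
		}
		// fields[3] is the root of the mount within the cgroup hierarchy,
		// which is where the container ID tends to show up.
		if id := containerIDRE.FindString(fields[3]); id != "" {
			return id, nil
		}
	}
	return "", scanner.Err()
}

func main() {
	id, err := containerIDFromMountinfo("/proc/self/mountinfo")
	fmt.Println(id, err)
}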

Background:

@MarcosDY added the triage/needed label Mar 21, 2023
@rturner3 added the triage/in-progress label and removed the triage/needed label Mar 21, 2023
@evan2645
Member

Thank you for opening this, @tihhoni, and for sharing the links - they're very helpful.

I see that kubelet will use cgroups v2 by default any time that it is enabled. Do you have a sense for which or how many distributions are shipping with cgroups v2 enabled currently? Did you bump into this issue directly?

@guilhermocc
Contributor

Here's some info that I was able to find regarding which distros already ship with cgroups v2 enabled by default:

  • Container Optimized OS (since M97)
  • Ubuntu (since 21.10)
  • Debian GNU/Linux (since Debian 11 Bullseye)
  • Fedora (since 31)
  • Arch Linux (since April 2021)
  • RHEL and RHEL-like distributions (since 9)

Ref: https://kubernetes.io/blog/2022/08/31/cgroupv2-ga-1-25/#how-do-you-use-cgroup-v2

@tihhoni
Author

tihhoni commented Mar 22, 2023

Hi @evan2645

From various sources I've got the feeling that cgroups v1 will be deprecated by EOY 2023, so presumably new kernel/OS updates will come exclusively with cgroups v2 support.

We've run into this issue in a staging cluster, where an OS upgrade automatically rolled out a new Linux distribution and tests started to fail.

I am still running tests to understand what impact that will cause. What I know so far:

So it seems that the PID's cgroup might have container/pod UID info in some cases; I just did not figure out what that condition is.

If I get more results I will surely post them here.

--
Nikolai

@tihhoni
Author

tihhoni commented Mar 23, 2023

After some testing, it seems the issue is not that dire:

spire-agent run PID = 302967
busybox sleep PID = 329596

Both run as pods with:

hostPID: true
hostNetwork: true
dnsPolicy: ClusterFirstWithHostNet
k exec -it busybox -- /bin/sh
/ # cat /proc/302967/cgroup
0::/../../kubepods-besteffort-pod02b5aa9b_9013_49b5_9755_efeb06552752.slice/cri-containerd-222a5371beed8c0ad58e23d1cd39e13bc113915a385d3fea4510e1cd404868c2.scope
/ # cat /proc/329596/cgroup
0::/
k exec -it ds/spire-agent -c spire-agent -- /bin/sh
/opt/spire # cat /proc/302967/cgroup
0::/
/opt/spire # cat /proc/329596/cgroup
0::/../../kubepods-besteffort-podb2dce23b_8b3a_4356_b903_56152d488916.slice/cri-containerd-cf6d9de603ab1f665fc84f88c22f761678fcaae86b98a3936e0b9717740146aa.scope

So from within the container, if you try to read your own cgroup, it's the root without a container ID. But if you read another process's cgroup, it's fine.

As a result, the spire agent can attest other k8s PIDs but not itself.

This might be different depending on the cluster setup.

@evan2645
Member

Thank you very much @tihhoni for looking into this further ❤️

Can you tell if the problem only affects "self", or if it affects e.g. all pods running as root? I'm marking this as priority/backlog for now, but if it affects all root pods, we may consider upgrading it to urgent. Thanks again!

@evan2645 added the help wanted and priority/backlog labels and removed the triage/in-progress label Mar 28, 2023
@zmt
Contributor

zmt commented Mar 28, 2023

We've noticed similar changes will be needed for the dockerd-workloadattestor to support cgroups v2. We have a project that will require cgroups v2 support in dockerd-workloadattestor beginning in May 2023. After a brief discussion in the contributor sync today, we reached a consensus that it makes sense to add this support across several PRs:

  • introduce logic to support cgroups v2
  • migrate k8s workload attestor to use the cgroups v2 logic
  • migrate dockerd workload attestor to use the cgroups v2 logic

@tihhoni
Author

tihhoni commented Mar 31, 2023

@evan2645 my impression was that only "self" is the problem. I tested a full ingress/egress envoy setup and attestation works fine (the spire server can successfully get the container IDs for the envoy PIDs).

Forgot to mention: we run things on containerd 1.6.* and k8s 1.25.*; I did not check other runtimes.

@kfox1111
Contributor

migrate k8s workload attestor to use the cgroups v2 logic

cgroups v1 will also need to be supported for some time. This is probably what you meant, but I want to make sure.

@evan2645 removed their assignment Jun 1, 2023
@nikotih
Contributor

nikotih commented Jan 18, 2024

I am again running into this issue, but now from a different perspective. Just wanted to share (this is more related to cgroupns, but in all cases running on cgroups v2).

It seems that tools are slowly moving away from cgroups v1 and are also using cgroupns=private (e.g. the latest docker daemon and kind releases):
https://docs.docker.com/engine/reference/commandline/dockerd/#daemon
https://github.com/kubernetes-sigs/kind/releases/tag/v0.20.0

This creates a problem for running some integration tests where cgroupns is hardcoded:
https://github.com/spiffe/spire/blob/main/test/integration/suites/nested-rotation/00-setup#L32

If we run tests with dind in k8s (cgroupns=private), the docker cgroups that the agent workload attestor sees look like /../<id> (--cgroup-parent=/actions_job on the docker daemon does nothing here).

If we run tests with dind cgroupns=host + --cgroup-parent=/actions_job, they work fine, but then we face issues with kind cluster tests that mess up the main k8s kubelet cgroup.

We can of course patch CGROUP_MATCHERS for our tests (maybe it's good to allow users to set it too; the default matcher /docker/<id> does not work in this case either).

Just a heads up for now, in case your GitHub Actions VM switches the docker daemon to cgroupns=private on cgroups v2.

@evan2645
Member

@nikotih thank you very much for reporting this ❤️ Would you mind filing it as a new issue specifically around cgroupns=private? I think that warrants separate attention, and it will be good to triage it independently (but still in relation to this one).

@nikotih
Contributor

nikotih commented Jan 19, 2024

Sure thing, I can. However, I think cgroups v2 and private cgroup namespaces go hand in hand.

A cgroups v2-enabled kubelet and docker daemon will automatically start with a private cgroup namespace (unless explicitly changed).

For the original issue that I opened here, the actual root cause is also the private namespace:

docker info
Client:
 Version:    24.0.7
 Context:    desktop-linux
...
Server:
 Cgroup Driver: cgroupfs
 Cgroup Version: 2
...

Disclaimer: non-systemd environment

docker run --rm alpine cat /proc/1/cgroup
0::/

docker run --rm --cgroupns=private alpine cat /proc/1/cgroup
0::/

docker run --rm --cgroupns=host alpine cat /proc/1/cgroup
0::/docker/46f093fa43676f4cfda0fe1fbf49a6c959f56e22808e2592e1d1e7bd0366ed0c

I.e., the spire agent on cgroups v2 with a private cgroup namespace cannot deduce its own container.

Starting 2 containers with:

docker run -it --rm --cgroupns=private --pid=host alpine /bin/sh

PIDs:

2809 root      0:00 /usr/bin/containerd-shim-runc-v2 -namespace moby -id 3c217b7f3a40352850d757d3f5f0e263731425b0350
2835 root      0:00 /bin/sh

2977 root      0:00 /usr/bin/containerd-shim-runc-v2 -namespace moby -id 67b44b40c304559915db0b688f60e83555c40f26e1d
3003 root      0:00 /bin/sh

From the first container:

/ # cat /proc/2835/cgroup
0::/
/ # cat /proc/3003/cgroup
0::/../67b44b40c304559915db0b688f60e83555c40f26e1d4712f0b7dc91dafe3ddd4

From the second container:

/ # cat /proc/2835/cgroup
0::/../3c217b7f3a40352850d757d3f5f0e263731425b03505dd73420d9825ec76d9b2
/ # cat /proc/3003/cgroup
0::/

Same thing:

  • it sees its own container as root
  • it sees other containers not as /docker/<id> but as /../<id>

@evan2645
Member

Ok, I appreciate this, @nikotih. I don't have the time to dig in right now, but I'll move this issue back to triage so we can reassess the impact.

@evan2645 added the triage/in-progress label and removed the help wanted and priority/backlog labels Jan 19, 2024
@rturner3 added the help wanted and priority/urgent labels and removed the triage/in-progress label Feb 1, 2024
@rturner3 added this to the 1.9.1 milestone Feb 1, 2024
@MarcosDY modified the milestones: 1.9.2, 1.9.3 Mar 14, 2024
@amartinezfayo modified the milestones: 1.9.3, 1.9.4, 1.9.5 Apr 4, 2024
azdagron added a commit to azdagron/spire that referenced this issue Apr 18, 2024
The docker and k8s workload attestors work backwards from pid to
container by inspecting the proc filesystem. Today, this happens by
inspecting the cgroup file. Identifying the container ID (and pod UID)
from the cgroup file has been a continual arms race. The k8s and docker
workload attestors grew different mechanisms for trying to deal with the
large variety in the output.

Further, with cgroups v2 and private namespaces, the cgroup file might
not have the container ID or pod UID information within it.

This PR unifies the container ID (and pod UID) extraction for both the
docker and k8s workload attestors. The new implementation searches the
mountinfo file first for cgroups mounts. If not found, it will fall back
to the cgroup file (typically necessary only when the workload is
running in the same container as the agent).

The extraction algorithm is the same for both mountinfo and cgroup
entries, and is as follows:
1. Iterate over each entry in the file being searched, extracting
   either the cgroup mount root (mountinfo) or the cgroup group
   path (cgroup) as the source path.
2. Walk backwards through the segments in the source path looking for
   the 64-hex-digit container ID.
3. If looking for the pod UID (K8s only), then walk backwards through
   the segments in the path looking for the pod UID pattern used by
   kubelet. Start with the segment the container ID was found in
   (truncated to remove the container ID portion).
4. If there are pod UID/container ID conflicts after searching these
   files then log and abort. Entries that have a pod UID override those
   that don't.

The container ID is very often contained in the last segment in the path
but there are situations where it isn't.

This new functionality is NOT enabled by default, but opted in using the
`use_new_container_locator` configurable in each plugin. In 1.10, we can
consider enabling it by default.

The testing for the new code is spread out a little bit. The cgroups
fallback functionality is mostly tested by the existing tests in the
k8s and docker plugin tests. The mountinfo tests are only in the new
containerinfo package.

In the long term, I'd like to see all of the container info extraction
related tests moved solely to the containerinfo package and removed from
the individual plugins.

Resolves spiffe#4004, resolves spiffe#4682, resolves spiffe#4917.

Signed-off-by: Andrew Harding <azdagron@gmail.com>
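
To make steps 2-3 of the walk above concrete, here is a minimal sketch of the idea (a simplification, not the actual containerinfo implementation; the 64-hex-character container ID and the kubelet pod UID naming are assumptions based on the layouts shown earlier in this thread):

package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// 64 hex characters, possibly embedded in a runtime-specific segment
	// such as "cri-containerd-<id>.scope" (assumption).
	containerIDRE = regexp.MustCompile(`[0-9a-f]{64}`)
	// Pod UID as kubelet writes it into slice names, with dashes replaced
	// by underscores, e.g. "kubepods-besteffort-pod<uid>.slice" (assumption).
	podUIDRE = regexp.MustCompile(`pod([0-9a-f]{8}_[0-9a-f]{4}_[0-9a-f]{4}_[0-9a-f]{4}_[0-9a-f]{12})`)
)

// extractIDs walks the path segments backwards: first it looks for the
// container ID, then, starting at the segment that held it, it keeps
// walking backwards looking for the pod UID.
func extractIDs(path string) (podUID, containerID string) {
	segments := strings.Split(path, "/")
	for i := len(segments) - 1; i >= 0; i-- {
		containerID = containerIDRE.FindString(segments[i])
		if containerID == "" {
			continue
		}
		for j := i; j >= 0; j-- {
			if m := podUIDRE.FindStringSubmatch(segments[j]); m != nil {
				podUID = strings.ReplaceAll(m[1], "_", "-")
				break
			}
		}
		break
	}
	return podUID, containerID
}

func main() {
	// Path taken from the cgroups v2 example earlier in this issue.
	fmt.Println(extractIDs("/../../kubepods-besteffort-pod02b5aa9b_9013_49b5_9755_efeb06552752.slice/cri-containerd-222a5371beed8c0ad58e23d1cd39e13bc113915a385d3fea4510e1cd404868c2.scope"))
}
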
azdagron added a commit that referenced this issue Apr 24, 2024
* New container locator for docker/k8s on linux

Resolves #4004, resolves #4682, resolves #4917.

Signed-off-by: Andrew Harding <azdagron@gmail.com>

* missing new arg

Signed-off-by: Andrew Harding <azdagron@gmail.com>

* fix windows tests

Signed-off-by: Andrew Harding <azdagron@gmail.com>

* fix windows tests and lint

Signed-off-by: Andrew Harding <azdagron@gmail.com>

* address pr comments

Signed-off-by: Andrew Harding <azdagron@gmail.com>

* markdown lint

Signed-off-by: Andrew Harding <azdagron@gmail.com>

* add agent full conf

Signed-off-by: Andrew Harding <azdagron@gmail.com>

* fix labels

Signed-off-by: Andrew Harding <azdagron@gmail.com>

* change log to warn

Signed-off-by: Andrew Harding <azdagron@gmail.com>

* use new locator in it

Signed-off-by: Andrew Harding <azdagron@gmail.com>

---------

Signed-off-by: Andrew Harding <azdagron@gmail.com>
@amartinezfayo modified the milestones: 1.9.5, 1.9.6 May 8, 2024