Skip to content

Commit

Permalink
tree-wide: use /proc/thread-self for thread-local state
Browse files Browse the repository at this point in the history
With the idmap work, we will have a tainted Go thread in our
thread-group that has a different mount namespace to the other threads.
It seems that (due to some bad luck) the Go scheduler tends to make this
thread the thread-group leader in our tests, which results in very
baffling failures where /proc/self/mountinfo produces gibberish results.

In order to avoid this, switch to using /proc/thread-self for everything
that is thread-local. This primarily includes switching all file
descriptor paths (CLONE_FS), all of the places that check the current
cgroup (technically we never will run a single runc thread in a separate
cgroup, but better to be safe than sorry), and the aforementioned
mountinfo code. We don't need to do anything for the following because
the results we need aren't thread-local:

 * Checks that certain namespaces are supported by stat(2)ing
   /proc/self/ns/...

 * /proc/self/exe and /proc/self/cmdline are not thread-local.

 * While threads can be in different cgroups, we do not do this for the
   runc binary (or libcontainer) and thus we do not need to switch to
   the thread-local version of /proc/self/cgroups.

 * All of the CLONE_NEWUSER files are not thread-local because you
   cannot set the usernamespace of a single thread (setns(CLONE_NEWUSER)
   is blocked for multi-threaded programs).

Note that we have to use runtime.LockOSThread when we have an open
handle to a tid-specific procfs file that we are operating on multiple
times. Go can reschedule us such that we are running on a different
thread and then kill the original thread (causing -ENOENT or similarly
confusing errors). This is not strictly necessary for most usages of
/proc/thread-self (such as using /proc/thread-self/fd/$n directly) since
only operating on the actual inodes associated with the tid requires
this locking, but because of the pre-3.17 fallback for CentOS, we have
to do this in most cases.

In addition, CentOS's kernel is too old for /proc/thread-self, which
requires us to emulate it -- however in rootfs_linux.go, we are in the
container pid namespace but /proc is the host's procfs. This leads to
the incredibly frustrating situation where there is no way (on pre-4.1
Linux) to figure out which /proc/self/task/... entry refers to the
current tid. We can just use /proc/self in this case.

Yes this is all pretty ugly. I also wish it wasn't necessary.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
  • Loading branch information
cyphar committed Dec 14, 2023
1 parent a04d88e commit 8e8b136
Show file tree
Hide file tree
Showing 18 changed files with 319 additions and 114 deletions.
15 changes: 10 additions & 5 deletions libcontainer/apparmor/apparmor_linux.go
Original file line number Diff line number Diff line change
Expand Up @@ -26,14 +26,19 @@ func isEnabled() bool {
}

func setProcAttr(attr, value string) error {
// Under AppArmor you can only change your own attr, so use /proc/self/
// instead of /proc/<tid>/ like libapparmor does
attrPath := "/proc/self/attr/apparmor/" + attr
if _, err := os.Stat(attrPath); errors.Is(err, os.ErrNotExist) {
attr = utils.CleanPath(attr)
attrSubPath := "attr/apparmor/" + attr
if _, err := os.Stat("/proc/self/" + attrSubPath); errors.Is(err, os.ErrNotExist) {
// fall back to the old convention
attrPath = "/proc/self/attr/" + attr
attrSubPath = "attr/" + attr
}

// Under AppArmor you can only change your own attr, so there's no reason
// to not use /proc/thread-self/ (instead of /proc/<tid>/, like libapparmor
// does).
attrPath, closer := utils.ProcThreadSelf(attrSubPath)
defer closer()

f, err := os.OpenFile(attrPath, os.O_WRONLY, 0)
if err != nil {
return err
Expand Down
3 changes: 3 additions & 0 deletions libcontainer/cgroups/cgroups_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,9 @@ import (
)

func TestParseCgroups(t *testing.T) {
// We don't need to use /proc/thread-self here because runc always runs
// with every thread in the same cgroup. This lets us avoid having to do
// runtime.LockOSThread.
cgroups, err := ParseCgroupFile("/proc/self/cgroup")
if err != nil {
t.Fatal(err)
Expand Down
9 changes: 5 additions & 4 deletions libcontainer/cgroups/file.go
Original file line number Diff line number Diff line change
Expand Up @@ -136,13 +136,14 @@ func openFile(dir, file string, flags int) (*os.File, error) {
//
// TODO: if such usage will ever be common, amend this
// to reopen cgroupFd and retry openat2.
fdStr := strconv.Itoa(cgroupFd)
fdDest, _ := os.Readlink("/proc/self/fd/" + fdStr)
fdPath, closer := utils.ProcThreadSelf("fd/" + strconv.Itoa(cgroupFd))
defer closer()
fdDest, _ := os.Readlink(fdPath)
if fdDest != cgroupfsDir {
// Wrap the error so it is clear that cgroupFd
// is opened to an unexpected/wrong directory.
err = fmt.Errorf("cgroupFd %s unexpectedly opened to %s != %s: %w",
fdStr, fdDest, cgroupfsDir, err)
err = fmt.Errorf("cgroupFd %d unexpectedly opened to %s != %s: %w",
cgroupFd, fdDest, cgroupfsDir, err)
}
return nil, err
}
Expand Down
3 changes: 3 additions & 0 deletions libcontainer/cgroups/fs2/defaultpath.go
Original file line number Diff line number Diff line change
Expand Up @@ -55,6 +55,9 @@ func _defaultDirPath(root, cgPath, cgParent, cgName string) (string, error) {
return filepath.Join(root, innerPath), nil
}

// we don't need to use /proc/thread-self here because runc always runs
// with every thread in the same cgroup. This lets us avoid having to do
// runtime.LockOSThread.
ownCgroup, err := parseCgroupFile("/proc/self/cgroup")
if err != nil {
return "", err
Expand Down
10 changes: 9 additions & 1 deletion libcontainer/cgroups/v1_utils.go
Original file line number Diff line number Diff line change
Expand Up @@ -99,11 +99,12 @@ func tryDefaultPath(cgroupPath, subsystem string) string {
// expensive), so it is assumed that cgroup mounts are not being changed.
func readCgroupMountinfo() ([]*mountinfo.Info, error) {
readMountinfoOnce.Do(func() {
// mountinfo.GetMounts uses /proc/thread-self, so we can use it without
// issues.
cgroupMountinfo, readMountinfoErr = mountinfo.GetMounts(
mountinfo.FSTypeFilter("cgroup"),
)
})

return cgroupMountinfo, readMountinfoErr
}

Expand Down Expand Up @@ -196,6 +197,9 @@ func getCgroupMountsV1(all bool) ([]Mount, error) {
return nil, err
}

// We don't need to use /proc/thread-self here because runc always runs
// with every thread in the same cgroup. This lets us avoid having to do
// runtime.LockOSThread.
allSubsystems, err := ParseCgroupFile("/proc/self/cgroup")
if err != nil {
return nil, err
Expand All @@ -214,6 +218,10 @@ func GetOwnCgroup(subsystem string) (string, error) {
if IsCgroup2UnifiedMode() {
return "", errUnified
}

// We don't need to use /proc/thread-self here because runc always runs
// with every thread in the same cgroup. This lets us avoid having to do
// runtime.LockOSThread.
cgroups, err := ParseCgroupFile("/proc/self/cgroup")
if err != nil {
return "", err
Expand Down
3 changes: 3 additions & 0 deletions libcontainer/configs/namespaces_linux.go
Original file line number Diff line number Diff line change
Expand Up @@ -59,6 +59,9 @@ func IsNamespaceSupported(ns NamespaceType) bool {
if nsFile == "" {
return false
}
// We don't need to use /proc/thread-self here because the list of
// namespace types is unrelated to the thread. This lets us avoid having to
// do runtime.LockOSThread.
_, err := os.Stat("/proc/self/ns/" + nsFile)
// a namespace is supported if it exists and we have permissions to read it
supported = err == nil
Expand Down
10 changes: 8 additions & 2 deletions libcontainer/container_linux.go
Original file line number Diff line number Diff line change
Expand Up @@ -509,14 +509,20 @@ func (c *Container) newParentProcess(p *Process) (parentProcess, error) {
if dmz.IsSelfExeCloned() {
// /proc/self/exe is already a cloned binary -- no need to do anything
logrus.Debug("skipping binary cloning -- /proc/self/exe is already cloned!")
// We don't need to use /proc/thread-self here because the exe mm of a
// thread-group is guaranteed to be the same for all threads by
// definition. This lets us avoid having to do runtime.LockOSThread.
exePath = "/proc/self/exe"
} else {
var err error
if isDmzBinarySafe(c.config) {
dmzExe, err = dmz.Binary(c.stateDir)
if err == nil {
// We can use our own executable without cloning if we are using
// runc-dmz.
// We can use our own executable without cloning if we are
// using runc-dmz. We don't need to use /proc/thread-self here
// because the exe mm of a thread-group is guaranteed to be the
// same for all threads by definition. This lets us avoid
// having to do runtime.LockOSThread.
exePath = "/proc/self/exe"
p.clonedExes = append(p.clonedExes, dmzExe)
logrus.Debug("runc-dmz: using runc-dmz") // used for tests
Expand Down
5 changes: 3 additions & 2 deletions libcontainer/criu_linux.go
Original file line number Diff line number Diff line change
Expand Up @@ -526,6 +526,7 @@ func (c *Container) restoreNetwork(req *criurpc.CriuReq, criuOpts *CriuOpts) {
// restore using CRIU. This function is inspired from the code in
// rootfs_linux.go.
func (c *Container) makeCriuRestoreMountpoints(m *configs.Mount) error {
me := mountEntry{Mount: m}
switch m.Device {
case "cgroup":
// No mount point(s) need to be created:
Expand All @@ -540,7 +541,7 @@ func (c *Container) makeCriuRestoreMountpoints(m *configs.Mount) error {
// The prepareBindMount() function checks if source
// exists. So it cannot be used for other filesystem types.
// TODO: pass srcFD? Not sure if criu is impacted by issue #2484.
if err := prepareBindMount(mountEntry{Mount: m}, c.config.Rootfs); err != nil {
if err := prepareBindMount(me, c.config.Rootfs); err != nil {
return err
}
default:
Expand All @@ -549,7 +550,7 @@ func (c *Container) makeCriuRestoreMountpoints(m *configs.Mount) error {
if err != nil {
return err
}
if err := checkProcMount(c.config.Rootfs, dest, ""); err != nil {
if err := checkProcMount(c.config.Rootfs, dest, me); err != nil {
return err
}
if err := os.MkdirAll(dest, 0o755); err != nil {
Expand Down
3 changes: 3 additions & 0 deletions libcontainer/init_linux.go
Original file line number Diff line number Diff line change
Expand Up @@ -480,6 +480,9 @@ func setupUser(config *initConfig) error {
return err
}

// We don't need to use /proc/thread-self here because setgroups is a
// per-userns file and thus is global to all threads in a thread-group.
// This lets us avoid having to do runtime.LockOSThread.
setgroups, err := os.ReadFile("/proc/self/setgroups")
if err != nil && !os.IsNotExist(err) {
return err
Expand Down
36 changes: 23 additions & 13 deletions libcontainer/integration/exec_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@ import (
"github.com/opencontainers/runc/libcontainer/cgroups/systemd"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/runc/libcontainer/userns"
"github.com/opencontainers/runc/libcontainer/utils"
"github.com/opencontainers/runtime-spec/specs-go"

"golang.org/x/sys/unix"
Expand Down Expand Up @@ -1691,6 +1692,20 @@ func TestFdLeaksSystemd(t *testing.T) {
testFdLeaks(t, true)
}

func fdList(t *testing.T) []string {
procSelfFd, closer := utils.ProcThreadSelf("fd")
defer closer()

fdDir, err := os.Open(procSelfFd)
ok(t, err)
defer fdDir.Close()

fds, err := fdDir.Readdirnames(-1)
ok(t, err)

return fds
}

func testFdLeaks(t *testing.T, systemd bool) {
if testing.Short() {
return
Expand All @@ -1705,21 +1720,12 @@ func testFdLeaks(t *testing.T, systemd bool) {
// - /sys/fs/cgroup dirfd opened by prepareOpenat2 in libct/cgroups;
// - dbus connection opened by getConnection in libct/cgroups/systemd.
_ = runContainerOk(t, config, "true")

pfd, err := os.Open("/proc/self/fd")
ok(t, err)
defer pfd.Close()
fds0, err := pfd.Readdirnames(0)
ok(t, err)
_, err = pfd.Seek(0, 0)
ok(t, err)
fds0 := fdList(t)

_ = runContainerOk(t, config, "true")
fds1 := fdList(t)

fds1, err := pfd.Readdirnames(0)
ok(t, err)

if len(fds1) == len(fds0) {
if reflect.DeepEqual(fds0, fds1) {
return
}
// Show the extra opened files.
Expand All @@ -1729,14 +1735,18 @@ func testFdLeaks(t *testing.T, systemd bool) {
}

count := 0

procSelfFd, closer := utils.ProcThreadSelf("fd/")
defer closer()

next_fd:
for _, fd1 := range fds1 {
for _, fd0 := range fds0 {
if fd0 == fd1 {
continue next_fd
}
}
dst, _ := os.Readlink("/proc/self/fd/" + fd1)
dst, _ := os.Readlink(filepath.Join(procSelfFd, fd1))
for _, ex := range excludedPaths {
if ex == dst {
continue next_fd
Expand Down
22 changes: 17 additions & 5 deletions libcontainer/mount_linux.go
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@ import (

"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/runc/libcontainer/userns"
"github.com/opencontainers/runc/libcontainer/utils"
)

// mountSourceType indicates what type of file descriptor is being returned. It
Expand All @@ -23,7 +24,7 @@ const (
// An open_tree(2)-style file descriptor that needs to be installed using
// move_mount(2) to install.
mountSourceOpenTree mountSourceType = "open_tree"
// A plain file descriptor that can be mounted through /proc/self/fd.
// A plain file descriptor that can be mounted through /proc/thread-self/fd.
mountSourcePlain mountSourceType = "plain-open"
)

Expand Down Expand Up @@ -90,7 +91,7 @@ func mount(source, target, fstype string, flags uintptr, data string) error {
// will mount it according to the mountSourceType of the file descriptor.
//
// The dstFd argument, if non-empty, is expected to be in the form of a path to
// an opened file descriptor on procfs (i.e. "/proc/self/fd/NN").
// an opened file descriptor on procfs (i.e. "/proc/thread-self/fd/NN").
//
// If a file descriptor is used instead of a source or a target path, the
// corresponding path is only used to add context to an error in case the mount
Expand All @@ -101,19 +102,30 @@ func mountViaFds(source string, srcFile *mountSource, target, dstFd, fstype stri
logrus.Debugf("mount source passed along with MS_REMOUNT -- ignoring srcFile")
srcFile = nil
}

dst := target
if dstFd != "" {
dst = dstFd
}
src := source
isMoveMount := srcFile != nil && srcFile.Type == mountSourceOpenTree
if srcFile != nil {
src = "/proc/self/fd/" + strconv.Itoa(int(srcFile.file.Fd()))
// If we're going to use the /proc/thread-self/... path for classic
// mount(2), we need to get a safe handle to /proc/thread-self. This
// isn't needed for move_mount(2) because in that case the path is just
// a dummy string used for error info.
fdStr := strconv.Itoa(int(srcFile.file.Fd()))
if isMoveMount {
src = "/proc/self/fd/" + fdStr
} else {
var closer utils.ProcThreadSelfCloser
src, closer = utils.ProcThreadSelf("fd/" + fdStr)
defer closer()
}
}

var op string
var err error
if srcFile != nil && srcFile.Type == mountSourceOpenTree {
if isMoveMount {
op = "move_mount"
err = unix.MoveMount(int(srcFile.file.Fd()), "",
unix.AT_FDCWD, dstFd,
Expand Down
Loading

0 comments on commit 8e8b136

Please sign in to comment.