Unlimit UNIX remove_dir_all() implementation #93160
Conversation
```rust
struct DirComponent {
    name: CString,
    ino: u64,
}
```
We could also save the device id here as well, but an open directory cannot be moved across filesystems on all known implementations, and doing so would imply an additional `fstatat(..., AT_SYMLINK_NOFOLLOW)` call before going down a directory. On modern UNIXes that call is not needed because we usually get the inode for free in `DirEntry`.
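For illustration, a minimal sketch of what that extra call would look like. This is not the PR's code: `child_dev_ino` is a made-up helper name, and the `libc` crate is assumed.

```rust
use std::ffi::CStr;
use std::io;

// Hypothetical helper: fetch the dev/inode pair of a child entry before
// descending, without following symlinks.
fn child_dev_ino(dirfd: libc::c_int, name: &CStr) -> io::Result<(libc::dev_t, libc::ino_t)> {
    let mut st: libc::stat = unsafe { std::mem::zeroed() };
    let rc = unsafe { libc::fstatat(dirfd, name.as_ptr(), &mut st, libc::AT_SYMLINK_NOFOLLOW) };
    if rc == 0 { Ok((st.st_dev, st.st_ino)) } else { Err(io::Error::last_os_error()) }
}
```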
What about `mount --move`? If that can be done while a descriptor is open, then `..` might point to a different parent. Haven't tested it though.
Not sure what happens to open fds. If they are preserved, we are traversing into a mount point, and that mount point is moved to another location, then going back up we would fail due to a parent-of-mount-point inode mismatch. Even if the inode were the same by chance, the next thing we do is try to delete the mount point, which would fail with `EBUSY`. I don't think we have a problem there.
The mount point could have been moved to a different name, and something else under the target directory could take the name of the old mount point. We would then continue traversing into that other directory. Combined with a sticky directory, that might be exploitable?
Even without `mount`, someone could rename a directory while we're iterating within it, and yes, I suppose they could put a new directory with the same name in its place, which would affect how we resume. But as long as that's still underneath the root parent that we're trying to delete, it seems fine.

I'm not sure how being sticky affects this? The attacker can still only create things with their own permissions, and if they're permitted to create a new directory there, so be it.
> As long as the number of cached file descriptors is constant, the number of fs operations required is O(depth^2) for the adversarial case. With the implementation in this PR we can delete a million nested subdirs in ~16 seconds (just like `rm -r`, which does something similar according to strace).

Exponential spacing implies log(depth) + C ancestors being kept rather than a constant amount.

> Using the device id + inode for comparison and hashing to avoid loops could be done as additional safeguards.

Yes. I was thinking of another possibility involving FUSE doing inode recycling, which could be used to defeat the `..` check, make it ascend out of the starting point, and wreak havoc. Neither FUSE nor bind mounts can fake device ids, so that should make things much safer.
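To make the "exponential spacing" idea concrete, here is a minimal sketch, not from the PR (the `AncestorFds` type and `offer` method are invented names): ancestor fds are retained only at power-of-two depths, so roughly log2(depth) descriptors stay open, and any evicted ancestor can later be recovered by walking `..` from a deeper cached one.

```rust
use std::os::fd::OwnedFd;

// Illustrative only: retain ancestor directory fds at exponentially
// spaced depths (0, 1, 2, 4, 8, ...).
struct AncestorFds {
    kept: Vec<(usize, OwnedFd)>, // (depth, directory fd)
}

impl AncestorFds {
    fn offer(&mut self, depth: usize, fd: OwnedFd) {
        if depth == 0 || depth.is_power_of_two() {
            self.kept.push((depth, fd));
        }
        // Otherwise `fd` is dropped (closed) here and must be reacquired
        // later via ".." from a deeper cached ancestor.
    }
}
```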
> Exponential spacing implies log(depth) + C ancestors being kept rather than a constant amount.

Oh, I thought you just meant spacing out a constant number of cached parents differently. That makes more sense.
The inode reuse case could also be exploited within the same filesystem. But lots of things would have to line up for that to work IMHO:
- While `remove_dir_all("/home/attacker/foo")` is proceeding with elevated privileges in the depths, say in `/home/attacker/foo/bar/baz`...
- The attacker moves `/home/attacker/foo/bar` to a temporary place, say `/home/attacker/my_tmp/bar`, and deletes the now-empty parent directory `/home/attacker/foo`, freeing its inode.
- The inode is reused (very common on Linux, at least with ext4) by a privileged process for a directory with the sticky bit set, e.g. for `/newtmp`.
- The attacker moves `/home/attacker/my_tmp/bar` to `/newtmp/bar`.
- `remove_dir_all` ascends into the sticky dir, the inode check succeeds, and it proceeds to delete other entries in it.
That is the best I have come up with so far.
Interesting! Waiting until a different privileged process (not necessarily the one doing the recursive deleting) does something adds extra power.

> But lots of things would have to line up for that to work IMHO

An attacker may be able to stall the privileged process indefinitely by using multiple threads or io_uring to keep filling the tree faster than it's being deleted. It can then use that time to arrange whatever constellation of inodes it needs. Perhaps it can even arrange for a suid process to be spawned with a lower CPU/IO priority, or even in a cgroup controlled by the attacker that allows it to be frozen. This way it could wait until an opportunity arises, and it wouldn't even show up in the system load.
I have changed the implementation so that whenever the cache of file descriptors is empty while going up, the code fills it with ancestor directory fds and compares the dev/inode pairs. The chance that an attacker can arrange privileged-process inode reuse for nested directories seems negligible. The slowdown is not bad, as there is at most one additional `openat(dir_fd, "..")` and one additional `fstat()` syscall per traversed directory, amortized.

Edit: There is still an obvious hole, will fix soon.
This should be ok now. When going up with `openat(dir_fd, b"..")` we compare the dev/inode and also open the grandparent, checking its dev/inode. That way we can be reasonably sure to detect if the parent has been sneakily substituted and happens to have the same inode.

An attacker would need to get a privileged process to race to create a dir/subdir pair, with subdir being sticky and dir/subdir having the expected inodes, in order for this to be exploitable. And even if that were possible, it could only force the victim to delete stuff that has already been created in there. That can't be much, since it is created in a race.

To be honest I am not even sure that additional precaution is necessary, because without that check we would have the same security as the gnulib fts used in coreutils rm, and the damage would still be contained because it only works with directory inode reuse. That naturally limits the damage to files created in the time between the attacker deleting the directory and the superuser creating a new (sticky) dir with the same inode and filling it with sensitive content.
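A rough sketch of this ascend-and-verify step, under stated assumptions: `verified_parent`, the recorded `(dev, ino)` pair, and the direct `libc` calls are illustrative, not the PR's actual code. The grandparent check described above would repeat the same procedure one level further up.

```rust
use std::io;
use std::os::fd::{AsRawFd, FromRawFd, OwnedFd};

// Hypothetical: open ".." and verify it is still the directory we
// descended from before continuing the upward walk.
fn verified_parent(dir: &OwnedFd, expected: (libc::dev_t, libc::ino_t)) -> io::Result<OwnedFd> {
    let fd = unsafe {
        libc::openat(
            dir.as_raw_fd(),
            b"..\0".as_ptr().cast(),
            libc::O_RDONLY | libc::O_DIRECTORY | libc::O_NOFOLLOW | libc::O_CLOEXEC,
        )
    };
    if fd < 0 {
        return Err(io::Error::last_os_error());
    }
    let parent = unsafe { OwnedFd::from_raw_fd(fd) };
    let mut st: libc::stat = unsafe { std::mem::zeroed() };
    if unsafe { libc::fstat(parent.as_raw_fd(), &mut st) } != 0 {
        return Err(io::Error::last_os_error());
    }
    if (st.st_dev, st.st_ino) != expected {
        // The parent was substituted while we were below it: abort the walk.
        return Err(io::Error::new(io::ErrorKind::Other, "parent directory changed"));
    }
    Ok(parent)
}
```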
```rust
let (child_dir_component, child_readdir) =
    readdir_open_child(&current_readdir, &child)?;
parent_dir_components.push(current_dir_component);
```
We could do `try_reserve` here and return `ENOMEM` when it fails.
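A minimal sketch of that suggestion; `push_checked` is a hypothetical helper, and mapping allocation failure to `ENOMEM` is one possible choice:

```rust
use std::io;

// Hypothetical: push onto the component stack, surfacing allocation
// failure as an ENOMEM io::Error instead of aborting on OOM.
fn push_checked<T>(stack: &mut Vec<T>, value: T) -> io::Result<()> {
    stack
        .try_reserve(1)
        .map_err(|_| io::Error::from_raw_os_error(libc::ENOMEM))?;
    stack.push(value);
    Ok(())
}
```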
The `CString` allocation for the directory name could fail as well; I haven't checked whether it has something like `try_reserve()`.
Indirectly, via `from_vec_unchecked`.
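`CString` itself has no `try_reserve`, but the same effect can be sketched by reserving the backing `Vec` fallibly first. This is a hypothetical helper, assuming `bytes` contains no interior NUL:

```rust
use std::ffi::CString;
use std::io;

// Hypothetical: build a CString with a fallible allocation.
fn try_cstring(bytes: &[u8]) -> io::Result<CString> {
    let mut buf: Vec<u8> = Vec::new();
    buf.try_reserve(bytes.len() + 1) // +1 for the trailing NUL
        .map_err(|_| io::Error::from_raw_os_error(libc::ENOMEM))?;
    buf.extend_from_slice(bytes);
    // Safety: the caller guarantees `bytes` has no interior NUL byte.
    Ok(unsafe { CString::from_vec_unchecked(buf) })
}
```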
library/std/src/sys/unix/fs.rs (Outdated)

```diff
@@ -1576,15 +1576,34 @@ mod remove_dir_impl {
 #[cfg(not(any(all(target_os = "macos", target_arch = "x86_64"), target_os = "redox")))]
 mod remove_dir_impl {
```
I think it makes sense to move everything that uses `-at` syscalls to a separate module; `fs.rs` is already very bloated. I called it `dir_fd` in #88731.
At a high level, I wonder if it would be better to separate these concerns of stack and fd exhaustion? We can address the stack overflow relatively easily with an iterative heap change. It is also a real problem to run out of file descriptors, but it seems we have some trickier edge cases to worry about there, and at least you'll get a controlled error in that case.
@cuviper Fine with me, as that is a clear improvement on the version currently in master. AFAICS the main thing needed is to adapt 09467984547f43aeab14235083fba09954de5632 for macOS x86-64.
@hkratz are you willing to continue on that? I personally have no access to macOS, and only limited Windows.
library/std/src/sys/unix/fs.rs (Outdated)

```rust
let parent_dir_fd = openat_nofollow_dironly(Some(dir.raw_fd), unsafe {
    CStr::from_bytes_with_nul_unchecked(b"..\0")
})?;
```
I'm not sure about the semantics, but opening the parent like this seems very risky...

A foolproof but slow solution is keeping a file descriptor for the first directory; whenever you run out of file descriptors and can't go back to the parent, rewind that first directory stream and start again.
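A sketch of that fallback under assumptions (the `libc` crate, a kept `OwnedFd` for the root; `rewind_root` is a made-up name): duplicate the kept fd, reopen it as a directory stream, and rewind it so traversal can restart from the top.

```rust
use std::io;
use std::os::fd::{AsRawFd, OwnedFd};

// Hypothetical: restart traversal from the kept root directory fd.
fn rewind_root(root: &OwnedFd) -> io::Result<*mut libc::DIR> {
    // fdopendir() takes ownership of the fd, so hand it a duplicate.
    let dup = unsafe { libc::fcntl(root.as_raw_fd(), libc::F_DUPFD_CLOEXEC, 0) };
    if dup < 0 {
        return Err(io::Error::last_os_error());
    }
    let dirp = unsafe { libc::fdopendir(dup) };
    if dirp.is_null() {
        unsafe { libc::close(dup) };
        return Err(io::Error::last_os_error());
    }
    // Rewind in case the duplicated fd's shared offset is not at the start.
    unsafe { libc::rewinddir(dirp) };
    Ok(dirp)
}
```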
Making this safe is what the long discussion above is about.
Ok, I tried reading through it; I will have to try again.

Given that this function has once again had trouble with untrusted directories, I would advocate for the simplest solution (keep the parent fd, rewind, and start from the beginning again) as a first step, even if it's absurdly slower...

Studying coreutils code from multiple OSes might also be a good bet (instead of only the strace). GNU Coreutils, for example, uses the FTS interface to go through the directories recursively.
> GNU Coreutils, for example, uses the FTS interface to go through the directories recursively.

If we explore this route, we would have to use `FTS_NOCHDIR` at the very least. We can't change directories for library use, while `rm` is free to do so.
I know that coreutils rm uses fts under the hood. But fts is not universal (e.g. musl does not have it), not standardized via POSIX, and horribly broken on some platforms, e.g. on macOS, where `rm -r dir` on a deep directory hierarchy results in `rm: fts_read: File name too long` errors, while the implementation from this PR works fine. I haven't tested BSD. I think rolling our own on top of `openat()`, `fdopendir()`, ... is the best choice. I have an implementation using iterated `openat()` from the top dir lying around, which we can use if we deem going up via `..` too dangerous. I also think looking into @the8472's suggestion of caching log(depth) file descriptors is a worthwhile improvement for that implementation.
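For reference, the "iterated `openat()` from the top dir" approach could look roughly like this; a sketch with assumed names, not the actual implementation mentioned above:

```rust
use std::ffi::CStr;
use std::io;
use std::os::fd::{AsRawFd, FromRawFd, OwnedFd};

// Hypothetical: re-descend from the root using the saved component names,
// one openat() per level, instead of trusting "..".
fn descend(mut dir: OwnedFd, components: &[&CStr]) -> io::Result<OwnedFd> {
    for name in components {
        let fd = unsafe {
            libc::openat(
                dir.as_raw_fd(),
                name.as_ptr(),
                libc::O_RDONLY | libc::O_DIRECTORY | libc::O_NOFOLLOW | libc::O_CLOEXEC,
            )
        };
        if fd < 0 {
            return Err(io::Error::last_os_error());
        }
        dir = unsafe { OwnedFd::from_raw_fd(fd) };
    }
    Ok(dir)
}
```

The cost is one `openat()` per level of depth on every re-descent, which is where the O(depth^2) worst case discussed above comes from.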
Looking into this more: coreutils uses gnulib fts in its proprietary `FTS_CWDFD` mode instead of the (g)libc fts. The check for the parent dir dev/inode is here.
@rustbot label -T-compiler +T-libs
☔ The latest upstream changes (presumably #94634) made this pull request unmergeable. Please resolve the merge conflicts.
Closing in favor of #95925.
The current recursive implementation runs out of file descriptors when traversing deep directory hierarchies. This implementation improves on it in multiple ways:

- A cache of (`Readdir`, raw_dirfd) pairs is used to optimize traversal.
- Going back up is done via `openat(dirfd, "..", O_NOFOLLOW)`, and the inode is compared to the expected inode.

Open questions:
TODOs:

- ~~Use `dirfd()`~~
- ~~Use `O_PATH` on Linux~~ -> Impossible because `readdir()` fails later
- Wait until `remove_dir_all(): Try recursing first on the slow path` (#94446) lands.

cc #93129
It is inspired by #88731 from @the8472 and uses an initial recursive -> iterative conversion from @cuviper.
Not ready for review yet, thus:
r? @ghost
@rustbot label S-waiting-on-author