Unlimit UNIX remove_dir_all() implementation #93160
Conversation
```rust
struct DirComponent {
    name: CString,
    ino: u64,
}
```
We could also save the device id here as well, but an open directory cannot be moved across filesystems on all known implementations, and doing so would imply an additional `fstatat(..., AT_SYMLINK_NOFOLLOW)` call before going down a directory. On modern UNIXes that call is not needed because we usually get the inode for free in `DirEntry`.
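For illustration, a minimal sketch of what that extra call would look like. This is not the PR's code: `child_dev_ino` is a made-up helper name, and the `libc` crate is assumed.

```rust
use std::ffi::CStr;
use std::io;

// Hypothetical helper: fetch the dev/inode pair of a child entry before
// descending, without following symlinks.
fn child_dev_ino(dirfd: libc::c_int, name: &CStr) -> io::Result<(libc::dev_t, libc::ino_t)> {
    let mut st: libc::stat = unsafe { std::mem::zeroed() };
    let rc = unsafe { libc::fstatat(dirfd, name.as_ptr(), &mut st, libc::AT_SYMLINK_NOFOLLOW) };
    if rc == 0 { Ok((st.st_dev, st.st_ino)) } else { Err(io::Error::last_os_error()) }
}
```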
What about `mount --move`? If that can be done while a descriptor is open, then `..` might point to a different parent. Haven't tested it though.
Not sure what happens to open fds. If they are preserved, we are traversing into a mount point, and that mount point is moved to another location, then going back up we would fail due to a parent-of-mount-point inode mismatch. Even if the inode were the same by chance, the next thing we do is try to delete the mount point, which would fail with `EBUSY`. I don't think we have a problem there.
The mount point could have been moved to a different name, and something else under the target directory could take the name of the old mount point. We would then continue traversing into that other directory. Combined with a sticky directory, that might be exploitable?
Even without `mount`, someone could rename a directory while we're iterating within it, and yes, I suppose they could put a new directory with the same name in its place, which would affect how we resume. But as long as that's still underneath the root parent that we're trying to delete, it seems fine.

I'm not sure how being sticky affects this? The attacker can still only create things with their own permissions, and if they're permitted to create a new directory there, so be it.
> As long as the number of cached file descriptors is constant, the number of fs operations required is O(depth^2) for the adversarial case. With the implementation in this PR we can delete a million nested subdirs in ~16 seconds (just like `rm -r`, which does something similar according to strace).

Exponential spacing implies log(depth) + C ancestors being kept rather than a constant amount.

> Using the device id + inode for comparison and hashing to avoid loops could be done as additional safeguards.

Yes. I was thinking of another possibility involving FUSE doing inode recycling, which could be used to defeat the `..` check, make it ascend out of the starting point, and wreak havoc. Neither FUSE nor bind mounts can fake device ids, so that should make things much safer.
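To make the "exponential spacing" idea concrete, here is a minimal sketch, not from the PR (the `AncestorFds` type and `offer` method are invented names): ancestor fds are retained only at power-of-two depths, so roughly log2(depth) descriptors stay open, and any evicted ancestor can later be recovered by walking `..` from a deeper cached one.

```rust
use std::os::fd::OwnedFd;

// Illustrative only: retain ancestor directory fds at exponentially
// spaced depths (0, 1, 2, 4, 8, ...).
struct AncestorFds {
    kept: Vec<(usize, OwnedFd)>, // (depth, directory fd)
}

impl AncestorFds {
    fn offer(&mut self, depth: usize, fd: OwnedFd) {
        if depth == 0 || depth.is_power_of_two() {
            self.kept.push((depth, fd));
        }
        // Otherwise `fd` is dropped (closed) here and must be reacquired
        // later via ".." from a deeper cached ancestor.
    }
}
```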
> Exponential spacing implies log(depth) + C ancestors being kept rather than a constant amount.

Oh, I thought you just meant spacing out a constant number of cached parents differently. That makes more sense.
The inode reuse case could also be exploited within the same filesystem. But lots of things would have to line up for that to work IMHO:
- While `remove_dir_all("/home/attacker/foo")` is proceeding with elevated privileges in the depths, say in `/home/attacker/foo/bar/baz`...
- The attacker moves `/home/attacker/foo/bar` to a temporary place, say `/home/attacker/my_tmp/bar`, and deletes the now-empty parent directory `/home/attacker/foo`, freeing its inode.
- The inode is reused (very common on Linux, at least with ext4) by a privileged process for a directory with the sticky bit set, e.g. for `/newtmp`.
- The attacker moves `/home/attacker/my_tmp/bar` to `/newtmp/bar`.
- `remove_dir_all` ascends into the sticky dir, the inode check succeeds, and it proceeds to delete other entries in it.
That is the best I have come up with so far.
Interesting! Waiting until a different privileged process (not necessarily the one doing the recursive deleting) does something adds extra power.

> But lots of things would have to line up for that to work IMHO

An attacker may be able to stall the privileged process indefinitely by using multiple threads or io_uring to keep filling the tree faster than it's being deleted. It can then use that time to arrange whatever constellation of inodes it needs. Perhaps it can even arrange for a suid process to be spawned with a lower CPU/IO priority, or even in a cgroup controlled by the attacker that allows it to be frozen. This way it could wait until an opportunity arises, and it wouldn't even show up in the system load.
I have changed the implementation so that whenever the cache of file descriptors is empty while going up, the code fills it with ancestor directory fds and compares the dev/inode pairs. The chance that an attacker can arrange privileged-process inode reuse for nested directories seems negligible. The slowdown is not bad, as there is at most one additional `openat(dir_fd, "..")` and one additional `fstat()` syscall per traversed directory, amortized.

Edit: There is still an obvious hole, will fix soon.
This should be ok now. When going up with `openat(dir_fd, b"..")` we compare the dev/inode and also open the grandparent, checking its dev/inode. That way we can be reasonably sure to detect if the parent has been sneakily substituted and happens to have the same inode.

An attacker would need to get a privileged process to race to create a dir/subdir pair, with subdir being sticky and dir/subdir having the expected inodes, in order for this to be exploitable. And even if that were possible, it could only force the victim to delete stuff that has already been created in there. That can't be much, since it is created in a race.

To be honest I am not even sure that additional precaution is necessary, because without that check we would have the same security as the gnulib fts used in coreutils rm, and the damage would still be contained because it only works with directory inode reuse. That naturally limits the damage to files created in the time between the attacker deleting the directory and the superuser creating a new (sticky) dir with the same inode and filling it with sensitive content.
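A rough sketch of this ascend-and-verify step, under stated assumptions: `verified_parent`, the recorded `(dev, ino)` pair, and the direct `libc` calls are illustrative, not the PR's actual code. The grandparent check described above would repeat the same procedure one level further up.

```rust
use std::io;
use std::os::fd::{AsRawFd, FromRawFd, OwnedFd};

// Hypothetical: open ".." and verify it is still the directory we
// descended from before continuing the upward walk.
fn verified_parent(dir: &OwnedFd, expected: (libc::dev_t, libc::ino_t)) -> io::Result<OwnedFd> {
    let fd = unsafe {
        libc::openat(
            dir.as_raw_fd(),
            b"..\0".as_ptr().cast(),
            libc::O_RDONLY | libc::O_DIRECTORY | libc::O_NOFOLLOW | libc::O_CLOEXEC,
        )
    };
    if fd < 0 {
        return Err(io::Error::last_os_error());
    }
    let parent = unsafe { OwnedFd::from_raw_fd(fd) };
    let mut st: libc::stat = unsafe { std::mem::zeroed() };
    if unsafe { libc::fstat(parent.as_raw_fd(), &mut st) } != 0 {
        return Err(io::Error::last_os_error());
    }
    if (st.st_dev, st.st_ino) != expected {
        // The parent was substituted while we were below it: abort the walk.
        return Err(io::Error::new(io::ErrorKind::Other, "parent directory changed"));
    }
    Ok(parent)
}
```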
```rust
let (child_dir_component, child_readdir) =
    readdir_open_child(&current_readdir, &child)?;
parent_dir_components.push(current_dir_component);
```
We could do `try_reserve` here and return `ENOMEM` when it fails.
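A minimal sketch of that suggestion; `push_checked` is a hypothetical helper, and mapping allocation failure to `ENOMEM` is one possible choice:

```rust
use std::io;

// Hypothetical: push onto the component stack, surfacing allocation
// failure as an ENOMEM io::Error instead of aborting on OOM.
fn push_checked<T>(stack: &mut Vec<T>, value: T) -> io::Result<()> {
    stack
        .try_reserve(1)
        .map_err(|_| io::Error::from_raw_os_error(libc::ENOMEM))?;
    stack.push(value);
    Ok(())
}
```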
The `CString` allocation for the directory name could fail as well; I haven't checked whether it has something like `try_reserve()`.
Indirectly, via `from_vec_unchecked`.
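`CString` itself has no `try_reserve`, but the same effect can be sketched by reserving the backing `Vec` fallibly first. This is a hypothetical helper, assuming `bytes` contains no interior NUL:

```rust
use std::ffi::CString;
use std::io;

// Hypothetical: build a CString with a fallible allocation.
fn try_cstring(bytes: &[u8]) -> io::Result<CString> {
    let mut buf: Vec<u8> = Vec::new();
    buf.try_reserve(bytes.len() + 1) // +1 for the trailing NUL
        .map_err(|_| io::Error::from_raw_os_error(libc::ENOMEM))?;
    buf.extend_from_slice(bytes);
    // Safety: the caller guarantees `bytes` has no interior NUL byte.
    Ok(unsafe { CString::from_vec_unchecked(buf) })
}
```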
library/std/src/sys/unix/fs.rs (Outdated)

```diff
@@ -1576,15 +1576,34 @@ mod remove_dir_impl {
 #[cfg(not(any(all(target_os = "macos", target_arch = "x86_64"), target_os = "redox")))]
 mod remove_dir_impl {
```
I think it makes sense to move everything that uses `-at` syscalls to a separate module; `fs.rs` is already very bloated. I called it `dir_fd` in #88731.
At a high level, I wonder if it would be better to separate these concerns of stack and fd exhaustion? We can address the stack overflow relatively easily with an iterative heap change. It is also a real problem to run out of file descriptors, but it seems we have some trickier edge cases to worry about there, and at least you'll get a controlled error in that case.
@cuviper Fine with me, as that is a clear improvement on the version currently in master. AFAICS the main thing needed is to adapt 09467984547f43aeab14235083fba09954de5632 for macOS x86-64.
@hkratz are you willing to continue on that? I personally have no access to macOS, and only limited Windows.
library/std/src/sys/unix/fs.rs (Outdated)

```rust
let parent_dir_fd = openat_nofollow_dironly(Some(dir.raw_fd), unsafe {
    CStr::from_bytes_with_nul_unchecked(b"..\0")
})?;
```
I'm not sure about the semantics, but opening the parent like this seems very risky...

A foolproof but slow solution is keeping a file descriptor for the first directory; whenever you run out of file descriptors and can't go back to the parent, rewind that first directory stream and start again.
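A sketch of that fallback under assumptions (the `libc` crate, a kept `OwnedFd` for the root; `rewind_root` is a made-up name): duplicate the kept fd, reopen it as a directory stream, and rewind it so traversal can restart from the top.

```rust
use std::io;
use std::os::fd::{AsRawFd, OwnedFd};

// Hypothetical: restart traversal from the kept root directory fd.
fn rewind_root(root: &OwnedFd) -> io::Result<*mut libc::DIR> {
    // fdopendir() takes ownership of the fd, so hand it a duplicate.
    let dup = unsafe { libc::fcntl(root.as_raw_fd(), libc::F_DUPFD_CLOEXEC, 0) };
    if dup < 0 {
        return Err(io::Error::last_os_error());
    }
    let dirp = unsafe { libc::fdopendir(dup) };
    if dirp.is_null() {
        unsafe { libc::close(dup) };
        return Err(io::Error::last_os_error());
    }
    // Rewind in case the duplicated fd's shared offset is not at the start.
    unsafe { libc::rewinddir(dirp) };
    Ok(dirp)
}
```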
Making this safe is what the long discussion above is about.
Ok, I tried reading through it; I will have to try again.

Given that this function has once again had trouble with untrusted directories, I would advocate for the simplest solution (keep the parent fd, rewind, and start from the beginning again) as a first step, even if it's absurdly slower...

Studying coreutils code from multiple OSes might also be a good bet (instead of only the strace). GNU Coreutils, for example, uses the FTS interface to go through the directories recursively.
> GNU Coreutils, for example, uses the FTS interface to go through the directories recursively.

If we explore this route, we would have to use `FTS_NOCHDIR` at the very least. We can't change directories for library use, while `rm` is free to do so.
I know that coreutils rm uses fts under the hood. But fts is not universal (e.g. musl does not have it), not standardized via POSIX, and horribly broken on some platforms, e.g. on macOS, where `rm -r dir` on a deep directory hierarchy results in `rm: fts_read: File name too long` errors, while the implementation from this PR works fine. I haven't tested BSD. I think rolling our own on top of `openat()`, `fdopendir()`, ... is the best choice. I have an implementation using iterated `openat()` from the top dir lying around, which we can use if we deem going up via `..` too dangerous. I also think looking into @the8472's suggestion of caching log(depth) file descriptors is a worthwhile improvement for that implementation.
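For reference, the "iterated `openat()` from the top dir" approach could look roughly like this; a sketch with assumed names, not the actual implementation mentioned above:

```rust
use std::ffi::CStr;
use std::io;
use std::os::fd::{AsRawFd, FromRawFd, OwnedFd};

// Hypothetical: re-descend from the root using the saved component names,
// one openat() per level, instead of trusting "..".
fn descend(mut dir: OwnedFd, components: &[&CStr]) -> io::Result<OwnedFd> {
    for name in components {
        let fd = unsafe {
            libc::openat(
                dir.as_raw_fd(),
                name.as_ptr(),
                libc::O_RDONLY | libc::O_DIRECTORY | libc::O_NOFOLLOW | libc::O_CLOEXEC,
            )
        };
        if fd < 0 {
            return Err(io::Error::last_os_error());
        }
        dir = unsafe { OwnedFd::from_raw_fd(fd) };
    }
    Ok(dir)
}
```

The cost is one `openat()` per level of depth on every re-descent, which is where the O(depth^2) worst case discussed above comes from.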
Looking into this more: coreutils uses gnulib fts in its proprietary `FTS_CWDFD` mode instead of the (g)libc fts. The check for the parent dir dev/inode is here.
@rustbot label -T-compiler +T-libs
☔ The latest upstream changes (presumably #94634) made this pull request unmergeable. Please resolve the merge conflicts.
Closing in favor of #95925.
The current recursive implementation runs out of file descriptors when traversing deep directory hierarchies. This implementation improves on it in multiple ways:

- A cache of (`Readdir`, raw_dirfd) pairs is used to optimize traversal.
- Going back up is done via `openat(dirfd, "..", O_NOFOLLOW)`, and the inode is compared to the expected inode.

Open questions:
TODOs:

- ~~Use `dirfd()`~~
- ~~Use `O_PATH` on Linux~~ -> Impossible because `readdir()` fails later
- Wait until `remove_dir_all(): Try recursing first on the slow path` (#94446) lands.

cc #93129
It is inspired by #88731 from @the8472 and uses an initial recursive -> iterative conversion from @cuviper.
Not ready for review yet, thus:
r? @ghost
@rustbot label S-waiting-on-author