
Unlimit UNIX remove_dir_all() implementation #93160

Closed
wants to merge 18 commits

Conversation

@hkratz (Contributor) commented Jan 21, 2022

The current recursive implementation runs out of file descriptors traversing deep directory hierarchies. This implementation improves on it in multiple ways:

  • An on-heap stack of (name, inode) pairs is used instead of recursion to avoid stack overflows.
  • A small cache of (Readdir, raw_dirfd) pairs is used to optimize traversal.
  • If an element is shifted out of the cache, on the way back up it is reopened with openat(dirfd, "..", O_NOFOLLOW) and the inode is compared to the expected inode.
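The first bullet (an explicit on-heap stack instead of recursion) can be illustrated with a minimal, path-based sketch. This is not the PR's implementation: the PR works on raw file descriptors with openat()/fdreaddir() for race resistance, which std's path-based API cannot express, and `remove_dir_all_iterative` is a hypothetical name. It only shows that arbitrarily deep trees can be removed with bounded call-stack depth.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// Simplified sketch of the "explicit stack instead of recursion" idea.
// Unlike the PR, this walks by path with std::fs instead of openat(),
// so it is NOT race-free against concurrent renames; it only illustrates
// that deep trees can be deleted without growing the call stack.
fn remove_dir_all_iterative(root: &Path) -> io::Result<()> {
    // On-heap stack of directories whose contents still need removing.
    let mut stack: Vec<PathBuf> = vec![root.to_path_buf()];
    while let Some(dir) = stack.last().cloned() {
        let mut has_subdir = false;
        for entry in fs::read_dir(&dir)? {
            let entry = entry?;
            // DirEntry::file_type() does not follow symlinks, so a symlink
            // to a directory is removed as a file (the link itself).
            if entry.file_type()?.is_dir() {
                stack.push(entry.path());
                has_subdir = true;
            } else {
                fs::remove_file(entry.path())?;
            }
        }
        if !has_subdir {
            // Directory is now empty: remove it and pop back to its parent.
            fs::remove_dir(&dir)?;
            stack.pop();
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // Build a deeply nested tree, then remove it iteratively.
    let root = std::env::temp_dir().join("rm_iter_demo");
    let mut p = root.clone();
    for _ in 0..200 {
        p.push("d");
    }
    fs::create_dir_all(&p)?;
    remove_dir_all_iterative(&root)?;
    assert!(!root.exists());
    println!("removed");
    Ok(())
}
```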

Open questions:

  • Inodes are reused. Explore all the ways the inode check on the way back up could be subverted maliciously.
  • Is storing the device id with the inode in the on-heap stack necessary? See review comment. -> Implemented.
  • Is loop protection with a (device, inode) hashset necessary? No scenario known to me but we might want to err on the side of caution.

TODOs:

cc #93129

It is inspired by #88731 from @the8472 and uses an initial recursive -> iterative conversion from @cuviper.

Not ready for review yet, thus:
r? @ghost
@rustbot label S-waiting-on-author


struct DirComponent {
    name: CString,
    ino: u64,
}
@hkratz (Contributor, Author) commented Jan 21, 2022

We could also save the device id here as well, but an open directory cannot be moved across filesystems on all known implementations and doing so would imply an additional fstatat(..., AT_SYMLINK_NOFOLLOW) call before going down a directory. On modern UNIXes that call is not needed because we usually get the inode for free in DirEntry.

Member

What about mount --move? If that can be done while a descriptor is open then .. might point to a different parent. Haven't tested it though

Contributor (Author)

Not sure what happens to open fds. If they were preserved while we are traversing into a mount point, and that mount point is moved to another location, then going back up we would fail due to a parent-of-mount-point inode mismatch. Even if the inode were the same by chance, the next thing we do is try to delete the mount point, which would fail with EBUSY. I don't think we have a problem there.

Member

The mount point could have been moved to a different name and something else under the target directory would have the name of the old mountpoint. We would then try to continue to traverse the other directory. When combined with a sticky directory that might be exploitable?

Member

Even without mount, someone could rename a directory while we're iterating within, and yes I suppose they could put a new directory with the same name that would affect how we resume. But as long as that's still underneath the root parent that we're trying to delete, it seems fine.

I'm not sure how being sticky affects this? The attacker can still only create things with their own permissions, and if they're permitted to create a new directory there, so be it.

Member

As long as the number of cached file descriptors is constant, the number of fs operations required is O(depth^2) for the adversarial case. With the implementation in this PR we can delete a million nested subdirs in ~16 seconds (just like rm -r, which does something similar according to strace).

Exponential spacing implies log(depth) + C ancestors being kept rather than a constant amount.

Using the device id + inode for comparison, and hashing to avoid loops, could be done as additional safeguards.

Yes. I was thinking of another possibility involving FUSE doing inode recycling which could be used to defeat the .. check and make it ascend out of the starting point and wreak havoc.
Neither FUSE nor bind mounts can fake device ids, so that should make things much safer.
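The exponential-spacing idea could be sketched as the following hypothetical policy (keep_ancestor_fd is not from the PR): retain a cached ancestor fd only at depths that are powers of two, so roughly log2(depth) descriptors are kept instead of a constant number.

```rust
// Hypothetical policy sketch for the "exponential spacing" idea discussed
// above: keep a cached ancestor file descriptor only at depths that are
// powers of two. About log2(depth) ancestors are retained, and the walk back
// up to the nearest cached ancestor re-opens at most depth/2 directories.
fn keep_ancestor_fd(depth: usize) -> bool {
    depth.is_power_of_two()
}

fn main() {
    let depth = 1_000_000;
    let kept = (1..=depth).filter(|&d| keep_ancestor_fd(d)).count();
    // Depths 1, 2, 4, ..., 524288 (= 2^19) qualify, i.e. 20 retained fds.
    assert_eq!(kept, 20);
    println!("{kept} ancestor fds retained for depth {depth}");
}
```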

@hkratz (Contributor, Author) commented Jan 23, 2022

> Exponential spacing implies log(depth) + C ancestors being kept rather than a constant amount.

Oh I thought you just meant spacing out a constant number of cached parents differently. That makes more sense.

The inode reuse case could also be exploited within the same filesystem. But lots of things would have to line up for that to work IMHO:

  1. While remove_dir_all("/home/attacker/foo") is proceeding with elevated privileges in the depths, say in /home/attacker/foo/bar/baz....
  2. The attacker moves /home/attacker/foo/bar to a temporary place, say /home/attacker/my_tmp/bar and deletes the now empty parent directory /home/attacker/foo freeing its inode.
  3. The inode is reused (very common on Linux at least with ext4) by a privileged process for a directory with the sticky bit set, e.g. for /newtmp.
  4. The attacker moves /home/attacker/my_tmp/bar to /newtmp/bar.
  5. remove_dir_all ascends into the sticky dir, the inode check succeeds, and it proceeds to delete other entries in it.

That is the best I have come up with so far.

@the8472 (Member) commented Jan 23, 2022

Interesting! Waiting until a different privileged process (not necessarily the one that does the recursive deleting) does something adds extra power.

> But lots of things would have to line up for that to work IMHO

An attacker may be able to stall the privileged process indefinitely by using multiple threads or io_uring to keep filling the tree faster than it's being deleted. It can then use that time to arrange whatever constellation of inodes it needs. Perhaps it can even arrange a suid process to be spawned with a lower CPU/IO priority or even in a cgroup controlled by the attacker that'll allow it to be frozen. This way it could wait until an opportunity arises. Then it wouldn't even show up in the system load.

@hkratz (Contributor, Author) commented Feb 2, 2022

I have changed the implementation so that whenever the cache of file descriptors is empty going up, the code fills it with ancestor directory fds and compares the dev/inode pairs. The chance that an attacker can arrange privileged process inode reuse for nested directories seems negligible. The slowdown is not bad as there is at most one additional openat(dir_fd, "..") and one additional fstat() syscall per traversed directory amortized.
Edit: There is still an obvious hole, will fix soon.

@hkratz (Contributor, Author) commented Feb 13, 2022

This should be ok now. When going up with openat(dir_fd, b"..") we compare the dev/inode and also open the grandparent, checking its dev/inode. That way we can be reasonably sure to detect if the parent has been sneakily substituted and happens to have the same inode.

An attacker would need to get a privileged process to race to create a dir/subdir pair, with subdir being sticky and dir/subdir having the expected inodes, for this to be exploitable. And even then it could only force the victim to delete things that have already been created in there. That can't be much, since they are created during the race.

To be honest I am not even sure that this additional precaution is necessary, because without that check we would have the same security as the gnulib fts used in coreutils rm, and the damage would still be contained because the attack only works with directory inode reuse. That naturally limits the amount of damage to files created in the time between the attacker deleting the directory and the superuser creating a new (sticky) dir with the same inode and filling it with sensitive content.
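For illustration only, the dev/inode comparison on ascent can be sketched in a path-based form using std (`verify_parent` is a hypothetical helper; the PR performs the equivalent on raw fds with openat(dir_fd, b"..") plus fstat(), which is what makes it race-resistant; this path-based version is not).

```rust
use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt;
use std::path::Path;

// Sketch of the dev/inode verification done when ascending via "..".
// The PR does this with fd-relative syscalls; std has no fd-relative API,
// so this only illustrates the comparison itself, not the race-free lookup.
fn verify_parent(child: &Path, expected_dev: u64, expected_ino: u64) -> io::Result<()> {
    let meta = fs::metadata(child.join(".."))?;
    if (meta.dev(), meta.ino()) != (expected_dev, expected_ino) {
        // The parent we would ascend into is not the directory we came from.
        return Err(io::Error::new(
            io::ErrorKind::NotFound,
            "parent dev/inode mismatch: directory tree changed underneath us",
        ));
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let parent = std::env::temp_dir().join("ascend_demo");
    let child = parent.join("child");
    fs::create_dir_all(&child)?;
    // Record the parent's identity before descending; verify it on ascent.
    let pm = fs::metadata(&parent)?;
    verify_parent(&child, pm.dev(), pm.ino())?;
    println!("parent identity verified");
    fs::remove_dir_all(&parent)?;
    Ok(())
}
```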

@rustbot rustbot added the S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. label Jan 21, 2022
@hkratz hkratz changed the title Optimize UNIX remove_dir_all() implementation Unlimit UNIX remove_dir_all() implementation Jan 21, 2022

let (child_dir_component, child_readdir) =
    readdir_open_child(&current_readdir, &child)?;
parent_dir_components.push(current_dir_component);
Member

We could do try_reserve here and return ENOMEM when it fails.

Contributor (Author)

The CString allocation for the directory name could fail as well; I haven't checked if that has something like try_reserve().

Member

Indirectly via from_vec_unchecked
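A sketch of what the try_reserve suggestion might look like (`push_component` is a hypothetical helper, not code from the PR): reserve before pushing and surface allocation failure as an io::Error rather than aborting.

```rust
use std::ffi::CString;
use std::io;

// Hypothetical sketch of the try_reserve suggestion: Vec::try_reserve turns
// allocation failure into a Result, which we map to io::ErrorKind::OutOfMemory
// (reported as ENOMEM-style failure) instead of aborting the process.
fn push_component(stack: &mut Vec<(CString, u64)>, name: CString, ino: u64) -> io::Result<()> {
    stack
        .try_reserve(1)
        .map_err(|_| io::Error::new(io::ErrorKind::OutOfMemory, "out of memory"))?;
    stack.push((name, ino));
    Ok(())
}

fn main() -> io::Result<()> {
    let mut stack = Vec::new();
    push_component(&mut stack, CString::new("subdir").unwrap(), 42)?;
    assert_eq!(stack.len(), 1);
    println!("pushed {} component(s)", stack.len());
    Ok(())
}
```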

@@ -1576,15 +1576,34 @@ mod remove_dir_impl {
#[cfg(not(any(all(target_os = "macos", target_arch = "x86_64"), target_os = "redox")))]
mod remove_dir_impl {
Member

I think it makes sense to move everything that uses -at syscalls to a separate module. fs.rs already is very bloated.
I called it dir_fd in #88731

@cuviper (Member) commented Jan 21, 2022

At a high level, I wonder if it would be better to separate these concerns of stack and fd exhaustion? We can address the stack overflow relatively easily with an iterative heap change. It is also a real problem to run out of file descriptors, but it seems we have some trickier edge cases to worry about there, and at least you'll get a controlled EMFILE error back if we don't address it yet.

@hkratz (Contributor, Author) commented Jan 21, 2022

@cuviper Fine with me, as that is a clear improvement on the version currently in master. AFAICS the main thing needed is to adapt 09467984547f43aeab14235083fba09954de5632 for macOS x86_64.

@cuviper (Member) commented Jan 21, 2022

@hkratz are you willing to continue on that? I personally have no access to macOS, and only limited Windows.

Comment on lines 1733 to 1681
let parent_dir_fd = openat_nofollow_dironly(Some(dir.raw_fd), unsafe {
    CStr::from_bytes_with_nul_unchecked(b"..\0")
})?;
Contributor

I'm not sure about the semantics, but opening the parent like this seems very risky...

A foolproof but slow solution is keeping a file descriptor for the first directory, and whenever you run out of file descriptors and can't go back to the parent, rewind that first directory stream and go again.

Member

Making this safe is what the long discussion above is about.

Contributor

Ok, I tried reading through it, will have to try again.

Given that this function has once again had trouble with untrusted directories, I would advocate for the simplest solution (keep the parent fd, rewind back, and start from the beginning again) as a first step, even if it's absurdly slower...

Studying the coreutils code from multiple OSes might also be a good bet (instead of only the strace output).

GNU Coreutils, for example, uses the FTS interface to go through the directories recursively.

Member

> GNU Coreutils, for example, uses the FTS interface to go through the directories recursively.

If we explore this route, we would have to use FTS_NOCHDIR at the very least. We can't change the current directory in library code, while rm is free to do so.

@hkratz (Contributor, Author) commented Jan 23, 2022

I know that coreutils rm uses fts under the hood. But fts is not universal (e.g. musl does not have it), is not standardized in POSIX, and is horribly broken on some platforms, e.g. on macOS, where rm -r dir on a deep directory hierarchy fails with rm: fts_read: File name too long errors while the implementation from this PR works fine. I haven't tested BSD. I think rolling our own on top of openat(), fdreaddir(), ... is the best choice. I have an implementation using iterated openat() from the top dir lying around which we can use if we deem going up via .. too dangerous. I also think looking into @the8472's suggestion of caching log(depth) file descriptors is a worthwhile improvement for that implementation.

@hkratz (Contributor, Author) commented Jan 25, 2022

Looking into this more, coreutils uses gnulib's fts in its proprietary FTS_CWDFD mode instead of the (g)libc fts. The check for the parent dir dev/inode is here.

@rustbot rustbot added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Jan 31, 2022
@hkratz (Contributor, Author) commented Jan 31, 2022

@rustbot label -T-compiler +T-libs

@rustbot rustbot added T-libs Relevant to the library team, which will review and decide on the PR/issue. and removed T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jan 31, 2022
@rustbot rustbot added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Jan 31, 2022
@bors (Contributor) commented Mar 5, 2022

☔ The latest upstream changes (presumably #94634) made this pull request unmergeable. Please resolve the merge conflicts.

@the8472 the8472 removed the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Apr 2, 2022
@hkratz (Contributor, Author) commented Apr 11, 2022

Closing in favor of #95925.

@hkratz hkratz closed this Apr 11, 2022
Labels
S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-libs Relevant to the library team, which will review and decide on the PR/issue.

6 participants