Commit 2a5cdaf

open: add close_range()
This adds the close_range() syscall. It allows efficiently closing a range
of file descriptors, up to all file descriptors of the calling task.

The syscall came up in a recent discussion around the new mount API and
making new file descriptor types cloexec by default. During this
discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
syscall of this kind has been requested by various people over time.

First, it helps to close all file descriptors of an exec()ing task. This
can be done safely via (quoting Al's example from [1] verbatim):

	/* that exec is sensitive */
	unshare(CLONE_FILES);
	/* we don't want anything past stderr here */
	close_range(3, ~0U);
	execve(....);

The code snippet above is one way of working around the problem that file
descriptors are not cloexec by default. This is aggravated by the fact that
we can't just switch them over without massively regressing userspace. For
a whole class of programs having an in-kernel method of closing all file
descriptors is very helpful (e.g. daemons, service managers, programming
language standard libraries, container managers etc.).
(Please note, unshare(CLONE_FILES) should only be needed if the calling
task is multi-threaded and shares the file descriptor table with another
thread, in which case two threads could race with one thread allocating
file descriptors and the other one closing them via close_range(). For the
general case close_range() before the execve() is sufficient.)

Second, it allows userspace to avoid implementing closing all file
descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
file descriptor. From looking at various large(ish) userspace code bases
this or similar patterns are very common in:
- service managers (cf. [4])
- libcs (cf. [6])
- container runtimes (cf. [5])
- programming language runtimes/standard libraries
  - Python (cf. [2])
  - Rust (cf. [7], [8])
As Dmitry pointed out there's even a long-standing glibc bug about missing
kernel support for this task (cf. [3]).
In addition, the syscall will also work for tasks that do not have procfs
mounted and on kernels that do not have procfs support compiled in. In such
situations the only way to make sure that all file descriptors are closed
is to call close() on each file descriptor up to UINT_MAX or to resort to
RLIMIT_NOFILE/OPEN_MAX trickery (cf. comment [8] on Rust).

The performance is striking. For good measure, compare the following
simple close_all_fds() userspace implementation, which is essentially just
glibc's version in [6]:

	static int close_all_fds(void)
	{
		int dir_fd;
		DIR *dir;
		struct dirent *direntp;

		dir = opendir("/proc/self/fd");
		if (!dir)
			return -1;
		dir_fd = dirfd(dir);
		while ((direntp = readdir(dir))) {
			int fd;
			if (strcmp(direntp->d_name, ".") == 0)
				continue;
			if (strcmp(direntp->d_name, "..") == 0)
				continue;
			fd = atoi(direntp->d_name);
			if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
				continue;
			close(fd);
		}
		closedir(dir);
		return 0;
	}

Comparing it to close_range() yields:
1. closing 4 open files:
   - close_all_fds(): ~280 us
   - close_range():    ~24 us

2. closing 1000 open files:
   - close_all_fds(): ~5000 us
   - close_range():   ~800 us

close_range() is designed to allow for some flexibility. Specifically, it
does not simply always close all open file descriptors of a task. Instead,
callers can specify an upper bound.
This is e.g. useful for scenarios where specific file descriptors are
created with well-known numbers that are supposed to be excluded from
getting closed.
For extra paranoia close_range() comes with a flags argument. This can e.g.
be used to implement extensions. One can imagine userspace wanting to stop
at the first error instead of ignoring errors under certain circumstances.
There might be other valid ideas in the future. In any case, a flag
argument doesn't hurt and keeps us on the safe side.

From an implementation side this is kept rather dumb. It saw some input
from David and Jann but all nonsense is obviously my own!
- Errors closing file descriptors are currently ignored. (Could be changed
  by setting a flag in the future if needed.)
- __close_range() is a rather simplistic wrapper around __close_fd().
  My reasoning behind this is based on the nature of how __close_fd() needs
  to release an fd. But maybe I misunderstood specifics:
  We take the file_lock and rcu-dereference the fdtable of the calling
  task, we find the entry in the fdtable, get the file and need to release
  file_lock before calling filp_close().
  In the meantime the fdtable might have been altered so we can't just
  retake the spinlock and keep the old rcu-reference of the fdtable
  around. Instead we need to grab a fresh reference to the fdtable.
  If my reasoning is correct then there's really no point in fancifying
  __close_range(): We just need to rcu-dereference the fdtable of the
  calling task once to cap the max_fd value correctly and then go on
  calling __close_fd() in a loop.

/* References */
[1]: https://lore.kernel.org/lkml/20190516165021.GD17978@ZenIV.linux.org.uk/
[2]: https://github.com/python/cpython/blob/9e4f2f3a6b8ee995c365e86d976937c141d867f8/Modules/_posixsubprocess.c#L220
[3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7
[4]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa3b338/src/basic/fd-util.c#L217
[5]: https://github.com/lxc/lxc/blob/ddf4b77e11a4d08f09b7b9cd13e593f8c047edc5/src/lxc/start.c#L236
[6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17
     Note that this is an internal implementation that is not exported.
     Currently, libc seems to not provide an exported version of this
     because of missing kernel support to do this.
[7]: rust-lang/rust#12148
[8]: https://github.com/rust-lang/rust/blob/5f47c0613ed4eb46fca3633c1297364c09e5e451/src/libstd/sys/unix/process2.rs#L303-L308
     Rust's solution is slightly different but is equally unperformant.
     Rust calls getdtablesize() which is a glibc library function that
     simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then
     goes on to call close() on each fd. That's obviously overkill for most
     tasks. Tasks, especially non-daemons, rarely hit RLIMIT_NOFILE or
     OPEN_MAX.
     Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set
     to 1024. Even in this case, there's a very high chance that in the
     common case Rust is calling the close() syscall 1021 times pointlessly
     if the task just has 0, 1, and 2 open.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dmitry V. Levin <ldv@altlinux.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
---
v1:
- Linus Torvalds <torvalds@linux-foundation.org>:
  - add cond_resched() to yield cpu when closing a lot of file descriptors
- Al Viro <viro@zeniv.linux.org.uk>:
  - add cond_resched() to yield cpu when closing a lot of file descriptors
v2:
- Oleg Nesterov <oleg@redhat.com>:
  - make use of already existing helpers that allow to better implement
    close_range()
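For readers who want to try the interface described in the commit message from userspace, the following is a minimal sketch, not part of the patch. The helper name close_fd_range() is hypothetical. It assumes the build headers may or may not define SYS_close_range, and it falls back to the brute-force close() loop (capped at RLIMIT_NOFILE) when the syscall is missing. Treating EPERM like ENOSYS is also an assumption, made because some sandboxes report unknown syscalls that way.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <sys/resource.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the patch): close every descriptor in
 * [first, last]. Uses the close_range() syscall when SYS_close_range is
 * known at build time, otherwise (or if the running kernel lacks it)
 * calls close() on each number in the range.
 */
static int close_fd_range(unsigned int first, unsigned int last)
{
	struct rlimit rl;
	unsigned int fd, limit;

	if (first > last) {
		errno = EINVAL;
		return -1;
	}

#ifdef SYS_close_range
	if (syscall(SYS_close_range, first, last, 0U) == 0)
		return 0;
	/* Assumption: EPERM may come from a sandbox hiding the syscall. */
	if (errno != ENOSYS && errno != EPERM)
		return -1;
#endif

	/* Descriptors are ints, so nothing above INT_MAX can be open. */
	limit = last > (unsigned int)INT_MAX ? (unsigned int)INT_MAX : last;

	/*
	 * Cap at the soft RLIMIT_NOFILE. Caveat: descriptors opened before
	 * the limit was lowered can sit above it and would be missed here.
	 */
	if (getrlimit(RLIMIT_NOFILE, &rl) == 0) {
		if (rl.rlim_cur == 0)
			return 0;
		if (rl.rlim_cur - 1 < limit)
			limit = (unsigned int)(rl.rlim_cur - 1);
	}
	assert(limit <= (unsigned int)INT_MAX);

	for (fd = first; fd <= limit; fd++)
		close((int)fd);	/* errors such as EBADF are ignored */

	return 0;
}
```

A caller about to exec would typically invoke close_fd_range(3, ~0U). A bounded call such as close_fd_range(100, 199) leaves descriptors outside that window untouched, which is the upper-bound flexibility the commit message describes.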
1 parent a188339 commit 2a5cdaf

File tree

22 files changed

+114
-1
lines changed

arch/alpha/kernel/syscalls/syscall.tbl

+1
@@ -473,3 +473,4 @@
 541 common fsconfig sys_fsconfig
 542 common fsmount sys_fsmount
 543 common fspick sys_fspick
+545 common close_range sys_close_range

arch/arm/tools/syscall.tbl

+1
@@ -447,3 +447,4 @@
 431 common fsconfig sys_fsconfig
 432 common fsmount sys_fsmount
 433 common fspick sys_fspick
+435 common close_range sys_close_range

arch/arm64/include/asm/unistd32.h

+2
@@ -886,6 +886,8 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
 __SYSCALL(__NR_fsmount, sys_fsmount)
 #define __NR_fspick 433
 __SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_close_range 435
+__SYSCALL(__NR_close_range, sys_close_range)
 
 /*
  * Please add new compat syscalls above this comment and update

arch/ia64/kernel/syscalls/syscall.tbl

+1
@@ -354,3 +354,4 @@
 431 common fsconfig sys_fsconfig
 432 common fsmount sys_fsmount
 433 common fspick sys_fspick
+435 common close_range sys_close_range

arch/m68k/kernel/syscalls/syscall.tbl

+1
@@ -433,3 +433,4 @@
 431 common fsconfig sys_fsconfig
 432 common fsmount sys_fsmount
 433 common fspick sys_fspick
+435 common close_range sys_close_range

arch/microblaze/kernel/syscalls/syscall.tbl

+1
@@ -439,3 +439,4 @@
 431 common fsconfig sys_fsconfig
 432 common fsmount sys_fsmount
 433 common fspick sys_fspick
+435 common close_range sys_close_range

arch/mips/kernel/syscalls/syscall_n32.tbl

+1
@@ -372,3 +372,4 @@
 431 n32 fsconfig sys_fsconfig
 432 n32 fsmount sys_fsmount
 433 n32 fspick sys_fspick
+435 n32 close_range sys_close_range

arch/mips/kernel/syscalls/syscall_n64.tbl

+1
@@ -348,3 +348,4 @@
 431 n64 fsconfig sys_fsconfig
 432 n64 fsmount sys_fsmount
 433 n64 fspick sys_fspick
+435 n64 close_range sys_close_range

arch/mips/kernel/syscalls/syscall_o32.tbl

+1
@@ -421,3 +421,4 @@
 431 o32 fsconfig sys_fsconfig
 432 o32 fsmount sys_fsmount
 433 o32 fspick sys_fspick
+435 o32 close_range sys_close_range

arch/parisc/kernel/syscalls/syscall.tbl

+1
@@ -430,3 +430,4 @@
 431 common fsconfig sys_fsconfig
 432 common fsmount sys_fsmount
 433 common fspick sys_fspick
+435 common close_range sys_close_range

arch/powerpc/kernel/syscalls/syscall.tbl

+1
@@ -515,3 +515,4 @@
 431 common fsconfig sys_fsconfig
 432 common fsmount sys_fsmount
 433 common fspick sys_fspick
+435 common close_range sys_close_range

arch/s390/kernel/syscalls/syscall.tbl

+1
@@ -436,3 +436,4 @@
 431 common fsconfig sys_fsconfig sys_fsconfig
 432 common fsmount sys_fsmount sys_fsmount
 433 common fspick sys_fspick sys_fspick
+435 common close_range sys_close_range sys_close_range

arch/sh/kernel/syscalls/syscall.tbl

+1
@@ -436,3 +436,4 @@
 431 common fsconfig sys_fsconfig
 432 common fsmount sys_fsmount
 433 common fspick sys_fspick
+435 common close_range sys_close_range

arch/sparc/kernel/syscalls/syscall.tbl

+1
@@ -479,3 +479,4 @@
 431 common fsconfig sys_fsconfig
 432 common fsmount sys_fsmount
 433 common fspick sys_fspick
+435 common close_range sys_close_range

arch/x86/entry/syscalls/syscall_32.tbl

+1
@@ -438,3 +438,4 @@
 431 i386 fsconfig sys_fsconfig __ia32_sys_fsconfig
 432 i386 fsmount sys_fsmount __ia32_sys_fsmount
 433 i386 fspick sys_fspick __ia32_sys_fspick
+435 i386 close_range sys_close_range __ia32_sys_close_range

arch/x86/entry/syscalls/syscall_64.tbl

+1
@@ -355,6 +355,7 @@
 431 common fsconfig __x64_sys_fsconfig
 432 common fsmount __x64_sys_fsmount
 433 common fspick __x64_sys_fspick
+435 common close_range __x64_sys_close_range
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact

arch/xtensa/kernel/syscalls/syscall.tbl

+1
@@ -404,3 +404,4 @@
 431 common fsconfig sys_fsconfig
 432 common fsmount sys_fsmount
 433 common fspick sys_fspick
+435 common close_range sys_close_range

fs/file.c

+69
@@ -641,6 +641,75 @@ int __close_fd(struct files_struct *files, unsigned fd)
 }
 EXPORT_SYMBOL(__close_fd); /* for ksys_close() */
 
+/**
+ * __close_next_open_fd() - Close the nearest open fd.
+ *
+ * @curfd: lowest file descriptor to consider
+ * @maxfd: highest file descriptor to consider
+ *
+ * This function will close the nearest open fd, i.e. it will either
+ * close @curfd if it is open or the closest open file descriptor
+ * greater than @curfd that is smaller or equal to maxfd.
+ * If the function found a file descriptor to close it will return 0 and
+ * place the file descriptor it closed in @curfd. If it did not find a
+ * file descriptor to close it will return -EBADF.
+ */
+static int __close_next_open_fd(struct files_struct *files, unsigned *curfd,
+				unsigned maxfd)
+{
+	struct file *file = NULL;
+	unsigned fd;
+	struct fdtable *fdt;
+
+	spin_lock(&files->file_lock);
+	fdt = files_fdtable(files);
+	fd = find_next_fd(fdt, *curfd);
+	if (fd >= fdt->max_fds || fd > maxfd)
+		goto out_unlock;
+
+	file = fdt->fd[fd];
+	rcu_assign_pointer(fdt->fd[fd], NULL);
+	__put_unused_fd(files, fd);
+
+out_unlock:
+	spin_unlock(&files->file_lock);
+
+	if (!file)
+		return -EBADF;
+
+	*curfd = fd;
+	filp_close(file, files);
+	return 0;
+}
+
+/**
+ * __close_range() - Close all file descriptors in a given range.
+ *
+ * @startfd: lowest file descriptor to close
+ * @maxfd: highest file descriptor to close
+ *
+ * This closes a range of file descriptors. All file descriptors
+ * from @startfd up to and including @maxfd are closed.
+ */
+int __close_range(struct files_struct *files, unsigned startfd, unsigned maxfd)
+{
+	unsigned curfd;
+
+	if (startfd > maxfd)
+		return -EINVAL;
+
+	curfd = startfd;
+	while (curfd <= maxfd) {
+		if (__close_next_open_fd(files, &curfd, maxfd))
+			break;
+
+		cond_resched();
+		curfd++;
+	}
+
+	return 0;
+}
+
 /*
  * variant of __close_fd that gets a ref on the file for later fput
  */
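The loop documented in the kerneldoc comments above can be modelled in plain userspace C. The sketch below is only an illustration of the documented contract (close the nearest open descriptor at or above the cursor, store it, advance, stop when none is left in range). All names are hypothetical, the table is a fixed-size boolean array instead of the kernel's fdtable, and the search is a simple linear scan, not the kernel helper used in the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_FDS 64U	/* hypothetical table size for this model */

/* Toy stand-in for an fd table: open_fds[i] is true while fd i is open. */
struct model_fdtable {
	bool open_fds[MODEL_MAX_FDS];
	unsigned int closes;	/* how many descriptors were actually closed */
};

/*
 * Close the lowest open descriptor in [*curfd, maxfd]. On success, store
 * the closed descriptor in *curfd and return 0. Return -1 if no open
 * descriptor exists in that window.
 */
static int model_close_next_open_fd(struct model_fdtable *t,
				    unsigned int *curfd, unsigned int maxfd)
{
	unsigned int fd;

	for (fd = *curfd; fd < MODEL_MAX_FDS && fd <= maxfd; fd++) {
		if (t->open_fds[fd]) {
			t->open_fds[fd] = false;
			t->closes++;
			*curfd = fd;
			return 0;
		}
	}
	return -1;
}

/* Same shape as the documented range loop: repeat until nothing is left. */
static int model_close_range(struct model_fdtable *t,
			     unsigned int startfd, unsigned int maxfd)
{
	unsigned int curfd = startfd;

	if (startfd > maxfd)
		return -1;	/* the kernel version returns -EINVAL here */

	while (curfd <= maxfd) {
		if (model_close_next_open_fd(t, &curfd, maxfd))
			break;
		curfd++;	/* continue just past the fd that was closed */
	}
	return 0;
}
```

Because the cursor jumps straight to each open descriptor, the number of close operations tracks the number of open descriptors in the range and not the width of the range, which is why a call with maxfd set to ~0U stays cheap in this model.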

fs/open.c

+20
@@ -1174,6 +1174,26 @@ SYSCALL_DEFINE1(close, unsigned int, fd)
 	return retval;
 }
 
+/**
+ * close_range() - Close all file descriptors in a given range.
+ *
+ * @fd: starting file descriptor to close
+ * @max_fd: last file descriptor to close
+ * @flags: reserved for future extensions
+ *
+ * This closes a range of file descriptors. All file descriptors
+ * from @fd up to and including @max_fd are closed.
+ * Currently, errors to close a given file descriptor are ignored.
+ */
+SYSCALL_DEFINE3(close_range, unsigned int, fd, unsigned int, max_fd,
+		unsigned int, flags)
+{
+	if (flags)
+		return -EINVAL;
+
+	return __close_range(current->files, fd, max_fd);
+}
+
 /*
  * This routine simulates a hangup on the tty, to arrange that users
  * are given clean terminals at login time.

include/linux/fdtable.h

+2
@@ -121,6 +121,8 @@ extern void __fd_install(struct files_struct *files,
 		unsigned int fd, struct file *file);
 extern int __close_fd(struct files_struct *files,
 		unsigned int fd);
+extern int __close_range(struct files_struct *files, unsigned int fd,
+			 unsigned int max_fd);
 extern int __close_fd_get_file(unsigned int fd, struct file **res);
 
 extern struct kmem_cache *files_cachep;

include/linux/syscalls.h

+2
@@ -441,6 +441,8 @@ asmlinkage long sys_fchown(unsigned int fd, uid_t user, gid_t group);
 asmlinkage long sys_openat(int dfd, const char __user *filename, int flags,
 			   umode_t mode);
 asmlinkage long sys_close(unsigned int fd);
+asmlinkage long sys_close_range(unsigned int fd, unsigned int max_fd,
+				unsigned int flags);
 asmlinkage long sys_vhangup(void);
 
 /* fs/pipe.c */

include/uapi/asm-generic/unistd.h

+3-1
@@ -844,9 +844,11 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
 __SYSCALL(__NR_fsmount, sys_fsmount)
 #define __NR_fspick 433
 __SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_close_range 435
+__SYSCALL(__NR_close_range, sys_close_range)
 
 #undef __NR_syscalls
-#define __NR_syscalls 434
+#define __NR_syscalls 436
 
 /*
  * 32 bit systems traditionally used different
