Improve the robustness of closing open file descriptors in jailer #3543

gorbak25 · 2023-03-20T05:42:33Z

Changes

Closes all open FDs in a single syscall. In case close_range is not available then fallback to the old approach. In case the fallback fails then discover open FDs by reading from /proc/self/fds.

Reason

Fixes #3542 - a bug where firecracker would end up spending minutes starting which arises when running on systems with OPEN_MAX set to a very high number.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following
Developer Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

If a specific issue led to this PR, this PR closes the issue.
The description of changes is clear and encompassing.
Any required documentation changes (code and docs) are included in this PR.
API changes follow the Runbook for Firecracker API changes.
User-facing changes are mentioned in CHANGELOG.md.
All added/changed functionality is tested.
New TODOs link to an issue.
Commits meet contribution quality standards.

This functionality cannot be added in rust-vmm.

JonathanWoollett-Light · 2023-03-20T07:46:53Z

Firecracker currently has full supported for 4.14 (kernel-policy).

I believe the close_range syscall was introduced in 5.9 in https://github.com/torvalds/linux/blob/master/include/uapi/linux/close_range.h.

close_range() first appeared in Linux 5.9. Library support was added in glibc in version 2.34.

Please correct me if I'm wrong, but otherwise I think this blocks this.

(I recall this was what prevented us using this approach previously)

gorbak25 · 2023-03-20T15:09:40Z

@JonathanWoollett-Light that's why I introduced fallbacks for older kernels. Essentially this patch on:
kernel >= 5.9 - uses close_range
kernel < 5.9 and OPEN_MAX <= 2**20 - iterates up to 1 mln FDs and tries to close them
kernel < 5.9 and OPEN_MAX > 2**20 - discovers open FDs by reading /proc/self/fd and closing them

Please provide feedback if having those fall-backs is feasible in firecracker.

gorbak25 · 2023-03-20T20:23:10Z

@JonathanWoollett-Light I just pushed a V2 of this patch. Changes compared to V1:

Each of the fall backs has a separate test
If OPEN_MAX is bellow 1024 then I iterate to 1024 anyway, I decreased the maximum threshold from 2**20 to 2**16
I replaced std::fs::read_dir with nix::dir::Dir::open in order to not close the dirfd descriptor twice - otherwise the rust code would panic. Is using the nix crate appropriate here? I added it cause it is a peer dependency anyway.

gorbak25 · 2023-03-21T11:39:39Z

@JonathanWoollett-Light renamed the patch to a more appropriate name

JonathanWoollett-Light · 2023-03-21T14:42:48Z

I think we can do this check at compile time.

If we include a build.rs in the jailer with:

fn main() {
    let process = std::process::Command::new("uname").arg("-r").output().unwrap();
    assert!(process.status.success());
    let version = std::str::from_utf8(&process.stdout).unwrap();
    let mut iter = version.split(".");
    let major = iter.next().unwrap().parse::<u32>().unwrap();
    let minor = iter.next().unwrap().parse::<u32>().unwrap();

    // Firecracker and the jailer officaily supports 2 kernel versions 5.10 and 4.14.
    //
    // Some improved functionality is supported in the 5.10 version that is not supported in 4.14,
    // to accmodate this we set conditionl compilation flags for these versions.
    //
    // In developing Firecracker and the jailer a newer kernel may be used, to support this we don't
    // match the versions exactly.
    match (major,minor) {
        (5.., 10..) => println!("cargo:rustc-cfg=kernel=\"5.10\""),
        (4, 14..) => println!("cargo:rustc-cfg=kernel=\"4.14\""),
        _ => panic!("unsupported kernel version")
    }
}

and then change sanitize_process to:

fn sanitize_process() {
    // First thing to do is make sure we don't keep any inherited FDs
    // other that IN, OUT and ERR.
    #[cfg(kernel="5.10")]
    {
        // This syscall was introduced in 5.9 and is not supported in 4.14.
        // SAFETY: The values are constant and valid.
        unsafe {
            libc::syscall(
                libc::SYS_close_range,
                3,
                libc::c_uint::MAX,
                libc::CLOSE_RANGE_UNSHARE,
            )
        }
    }
    #[cfg(kernel="4.14")]
    {
        // SAFETY: Always safe.
        let fd_limit = i32::try_from(unsafe { libc::sysconf(libc::_SC_OPEN_MAX) }).unwrap();
        // Close all file descriptors excluding 0 (STDIN), 1 (STDOUT) and 2 (STDERR).
        for fd in 3..fd_limit {
            // SAFETY: Safe because close() cannot fail when passed a valid parameter.
            unsafe {
                libc::close(fd);
            }
        }
    }

    // Cleanup environment variables
    clean_env_vars();
}

it should fix this.

Does this solution appear good to you?

This can be altered but I think doing this at compile time is preferable.

gorbak25 · 2023-03-21T16:08:49Z

@JonathanWoollett-Light
From my perspective:

I don't see a separate release for the 4.14 and 5.10 kernel in https://github.com/firecracker-microvm/firecracker/releases I assume those releases should work both on 4.14 and 5.10 kernels. If separate releases would be introduced, then it will increase the load on the build infrastructure and increase maintenance costs.
From my experience, build.rs usually increases the build time of a project and I like to avoid it as much as possible
The code path for 4.14 still needs to be changed, as the original issue won't be fixed. When running firecracker on the 4.14 kernel, some systems might have libc::sysconf(libc::_SC_OPEN_MAX) set to a very high number. This PR suggests reading /proc in those cases using the nix crate, which is already a peer dependency of firecracker.
Calling close_range on the 4.14 kernel is harmless as the kernel returns ENOSYS in that case, i can add a test to validate that this assumption is true.

This PR suggest having 3 methods for doing the same thing and in case of errors fall backing to the other methods in runtime until one of the approaches works. If having the same functionality written in 3 different ways is a problem then only reading /proc using the nix crate would be OK and solve the initial issue, the drawback to this is that it won't work on systems without /proc mounted. The jailer setup is a one time cost and one might have a pool of jailer instances ready ahead of time.

Please correct me if I'm wrong, but I think adding one additional syscall during startup on the 4.14 kernel when jailer was already doing thousands of them to close all possible fds won't change the performance much.

Introduces multiple methods for closing all open FDs in jailer. The method is choosen based on the environment. On host kernel >= 5.9 uses the close_range syscall On host kernel < 5.9 when OPEN_MAX <= 2**16 iterates over all FDs On host kernel < 5.9 when OPEN_MAX > 2**16 discovers open FDs by reading from /proc/self/fds using the nix crate Fixes a bug where firecracker would end up spending minutes starting which arises when running on systems with OPEN_MAX set to a very high number. Signed-off-by: Grzegorz Uriasz <gorbak25@gmail.com>

gorbak25 · 2023-03-22T22:32:58Z

@JonathanWoollett-Light any feedback on what approach this PR should take?

JonathanWoollett-Light · 2023-03-23T11:53:38Z

I think what you suggest is right.

A 1st commit could implement the fix on >5.9, this can look like:

fn sanitize_process() {
    // First thing to do is make sure we don't keep any inherited FDs
    // other that IN, OUT and ERR.

    // Try using the close_range syscall to close all open FDs in the range of 4..UINT_MAX
    // SAFETY: The parameters are valid constants.
    let res = unsafe { libc::syscall(
        libc::SYS_close_range,
        3,
        libc::c_uint::MAX,
        libc::CLOSE_RANGE_UNSHARE,
    ) };

    // If `SYS_close_range` fails as it is not supported (as is the case on a pre 5.9 kernel), we
    // instead close by iterating through all possible file descriptors.
    match res {
        0 => (),
        // The syscall failed due to not being supported.
        // SAFETY: Always safe.
        -1 => if unsafe { *libc::__errno_location } == libc::ENOSYS {
                // SAFETY: Always safe.
                let fd_limit = i32::try_from(unsafe { libc::sysconf(libc::_SC_OPEN_MAX) }).unwrap();
                // Close all file descriptors excluding 0 (STDIN), 1 (STDOUT) and 2 (STDERR).
                for fd in 3..fd_limit {
                    // SAFETY: Safe because close() cannot fail when passed a valid parameter.
                    unsafe {
                        libc::close(fd);
                    }
                }
            },
            else { panic!("Failed to close inherited FDs."); }
        },
        _ => unreachable!()
    }

    // Cleanup environment variables
    clean_env_vars();
}

I am unsure about pulling in Nix, if this is required for the fallback, this will need to be discussed. If we want to get the 1st commit merged quickly I think using Nix to iterate through /proc should be a separate PR.

gorbak25 · 2023-03-26T13:49:37Z

@JonathanWoollett-Light will come back to this PR on wednesday, I'm on a trip without my laptop right now.

dianpopa · 2023-04-06T15:58:31Z

hello @gorbak25,

Thanks a lot for the proposed fix. The changes look reasonable to me in general, there are a couple of things I would like to adjust:

firstly, I would like to drop the check that looks at the _SC_OPEN_MAX variable and only have the one that calls into close_range and the one using proc/self/fd as a fallback.
- I tested a bit the performance of each method, and for a system with _SC_OPEN_MAX 65535 I obtained around 47ms for the option of iterating 3..._SC_OPEN_MAX. It is too slow to keep around.
- For kernels < 5.9, the behavior would be exactly the same as it was before the regression (iterating /proc/self/fd).
also I would like to improve a bit on the error propagation of the code

Since this issue introduces a customer-facing regression, I would like to speed up the merging process so I opened a separate PR that starts from your commit and contains the proposed adjustments. Your contribution is mentioned there and kept you as the commit author! Please take a look at this alternate PR and review.

Thanks a lot again for looking into this!

dianpopa · 2023-04-14T10:16:29Z

Closing in favor of #3595. The fix will be present in v1.4.0 (our next release).
Thanks @gorbak25 again for the contribution!

gorbak25 force-pushed the fix-fd-close-iteration branch from 678ef96 to 108c764 Compare March 20, 2023 05:54

gorbak25 force-pushed the fix-fd-close-iteration branch from 108c764 to 7020692 Compare March 20, 2023 20:16

gorbak25 force-pushed the fix-fd-close-iteration branch 2 times, most recently from 2286426 to 813f471 Compare March 21, 2023 11:38

gorbak25 changed the title ~~Use the close_range syscall for sanitizing the jailer process~~ Improve the robustness of closing open file descriptors in jailer Mar 21, 2023

gorbak25 force-pushed the fix-fd-close-iteration branch from 813f471 to 447fae1 Compare March 22, 2023 22:24

JonathanWoollett-Light self-assigned this Mar 23, 2023

JonathanWoollett-Light added Type: Fix Indicates a fix to existing code Priority: Medium Indicates than an issue or pull request should be resolved ahead of issues or pull requests labelled Status: Awaiting author Indicates that an issue or pull request requires author action labels Mar 24, 2023

dianpopa closed this Apr 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve the robustness of closing open file descriptors in jailer #3543

Improve the robustness of closing open file descriptors in jailer #3543

gorbak25 commented Mar 20, 2023

JonathanWoollett-Light commented Mar 20, 2023 •

edited

Loading

gorbak25 commented Mar 20, 2023 •

edited

Loading

gorbak25 commented Mar 20, 2023 •

edited

Loading

gorbak25 commented Mar 21, 2023

JonathanWoollett-Light commented Mar 21, 2023 •

edited

Loading

gorbak25 commented Mar 21, 2023

gorbak25 commented Mar 22, 2023

JonathanWoollett-Light commented Mar 23, 2023 •

edited

Loading

gorbak25 commented Mar 26, 2023

dianpopa commented Apr 6, 2023 •

edited

Loading

dianpopa commented Apr 14, 2023

Improve the robustness of closing open file descriptors in jailer #3543

Improve the robustness of closing open file descriptors in jailer #3543

Conversation

gorbak25 commented Mar 20, 2023

Changes

Reason

License Acceptance

PR Checklist

JonathanWoollett-Light commented Mar 20, 2023 • edited Loading

gorbak25 commented Mar 20, 2023 • edited Loading

gorbak25 commented Mar 20, 2023 • edited Loading

gorbak25 commented Mar 21, 2023

JonathanWoollett-Light commented Mar 21, 2023 • edited Loading

gorbak25 commented Mar 21, 2023

gorbak25 commented Mar 22, 2023

JonathanWoollett-Light commented Mar 23, 2023 • edited Loading

gorbak25 commented Mar 26, 2023

dianpopa commented Apr 6, 2023 • edited Loading

dianpopa commented Apr 14, 2023

JonathanWoollett-Light commented Mar 20, 2023 •

edited

Loading

gorbak25 commented Mar 20, 2023 •

edited

Loading

gorbak25 commented Mar 20, 2023 •

edited

Loading

JonathanWoollett-Light commented Mar 21, 2023 •

edited

Loading

JonathanWoollett-Light commented Mar 23, 2023 •

edited

Loading

dianpopa commented Apr 6, 2023 •

edited

Loading