Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the robustness of closing open file descriptors in jailer #3543

Closed

Conversation

gorbak25
Copy link
Contributor

Changes

Closes all open FDs in a single syscall. In case close_range is not available then fallback to the old approach. In case the fallback fails then discover open FDs by reading from /proc/self/fds.

Reason

Fixes #3542 - a bug where firecracker would end up spending minutes starting which arises when running on systems with OPEN_MAX set to a very high number.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following
Developer Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • If a specific issue led to this PR, this PR closes the issue.
  • The description of changes is clear and encompassing.
  • Any required documentation changes (code and docs) are included in this PR.
  • API changes follow the Runbook for Firecracker API changes.
  • User-facing changes are mentioned in CHANGELOG.md.
  • All added/changed functionality is tested.
  • New TODOs link to an issue.
  • Commits meet contribution quality standards.

  • This functionality cannot be added in rust-vmm.

@JonathanWoollett-Light
Copy link
Contributor

JonathanWoollett-Light commented Mar 20, 2023

Firecracker currently has full supported for 4.14 (kernel-policy).

I believe the close_range syscall was introduced in 5.9 in https://github.com/torvalds/linux/blob/master/include/uapi/linux/close_range.h.

close_range() first appeared in Linux 5.9. Library support was added in glibc in version 2.34.

Please correct me if I'm wrong, but otherwise I think this blocks this.

(I recall this was what prevented us using this approach previously)

@gorbak25
Copy link
Contributor Author

gorbak25 commented Mar 20, 2023

@JonathanWoollett-Light that's why I introduced fallbacks for older kernels. Essentially this patch on:
kernel >= 5.9 - uses close_range
kernel < 5.9 and OPEN_MAX <= 2**20 - iterates up to 1 mln FDs and tries to close them
kernel < 5.9 and OPEN_MAX > 2**20 - discovers open FDs by reading /proc/self/fd and closing them

Please provide feedback if having those fall-backs is feasible in firecracker.

@gorbak25
Copy link
Contributor Author

gorbak25 commented Mar 20, 2023

@JonathanWoollett-Light I just pushed a V2 of this patch. Changes compared to V1:

  • Each of the fall backs has a separate test
  • If OPEN_MAX is bellow 1024 then I iterate to 1024 anyway, I decreased the maximum threshold from 2**20 to 2**16
  • I replaced std::fs::read_dir with nix::dir::Dir::open in order to not close the dirfd descriptor twice - otherwise the rust code would panic. Is using the nix crate appropriate here? I added it cause it is a peer dependency anyway.

@gorbak25 gorbak25 force-pushed the fix-fd-close-iteration branch 2 times, most recently from 2286426 to 813f471 Compare March 21, 2023 11:38
@gorbak25 gorbak25 changed the title Use the close_range syscall for sanitizing the jailer process Improve the robustness of closing open file descriptors in jailer Mar 21, 2023
@gorbak25
Copy link
Contributor Author

@JonathanWoollett-Light renamed the patch to a more appropriate name

@JonathanWoollett-Light
Copy link
Contributor

JonathanWoollett-Light commented Mar 21, 2023

I think we can do this check at compile time.

If we include a build.rs in the jailer with:

fn main() {
    let process = std::process::Command::new("uname").arg("-r").output().unwrap();
    assert!(process.status.success());
    let version = std::str::from_utf8(&process.stdout).unwrap();
    let mut iter = version.split(".");
    let major = iter.next().unwrap().parse::<u32>().unwrap();
    let minor = iter.next().unwrap().parse::<u32>().unwrap();

    // Firecracker and the jailer officaily supports 2 kernel versions 5.10 and 4.14.
    //
    // Some improved functionality is supported in the 5.10 version that is not supported in 4.14,
    // to accmodate this we set conditionl compilation flags for these versions.
    //
    // In developing Firecracker and the jailer a newer kernel may be used, to support this we don't
    // match the versions exactly.
    match (major,minor) {
        (5.., 10..) => println!("cargo:rustc-cfg=kernel=\"5.10\""),
        (4, 14..) => println!("cargo:rustc-cfg=kernel=\"4.14\""),
        _ => panic!("unsupported kernel version")
    }
}

and then change sanitize_process to:

fn sanitize_process() {
    // First thing to do is make sure we don't keep any inherited FDs
    // other that IN, OUT and ERR.
    #[cfg(kernel="5.10")]
    {
        // This syscall was introduced in 5.9 and is not supported in 4.14.
        // SAFETY: The values are constant and valid.
        unsafe {
            libc::syscall(
                libc::SYS_close_range,
                3,
                libc::c_uint::MAX,
                libc::CLOSE_RANGE_UNSHARE,
            )
        }
    }
    #[cfg(kernel="4.14")]
    {
        // SAFETY: Always safe.
        let fd_limit = i32::try_from(unsafe { libc::sysconf(libc::_SC_OPEN_MAX) }).unwrap();
        // Close all file descriptors excluding 0 (STDIN), 1 (STDOUT) and 2 (STDERR).
        for fd in 3..fd_limit {
            // SAFETY: Safe because close() cannot fail when passed a valid parameter.
            unsafe {
                libc::close(fd);
            }
        }
    }

    // Cleanup environment variables
    clean_env_vars();
}

it should fix this.

Does this solution appear good to you?

This can be altered but I think doing this at compile time is preferable.

@gorbak25
Copy link
Contributor Author

@JonathanWoollett-Light
From my perspective:

  • I don't see a separate release for the 4.14 and 5.10 kernel in https://github.com/firecracker-microvm/firecracker/releases I assume those releases should work both on 4.14 and 5.10 kernels. If separate releases would be introduced, then it will increase the load on the build infrastructure and increase maintenance costs.
  • From my experience, build.rs usually increases the build time of a project and I like to avoid it as much as possible
  • The code path for 4.14 still needs to be changed, as the original issue won't be fixed. When running firecracker on the 4.14 kernel, some systems might have libc::sysconf(libc::_SC_OPEN_MAX) set to a very high number. This PR suggests reading /proc in those cases using the nix crate, which is already a peer dependency of firecracker.
  • Calling close_range on the 4.14 kernel is harmless as the kernel returns ENOSYS in that case, i can add a test to validate that this assumption is true.

This PR suggest having 3 methods for doing the same thing and in case of errors fall backing to the other methods in runtime until one of the approaches works. If having the same functionality written in 3 different ways is a problem then only reading /proc using the nix crate would be OK and solve the initial issue, the drawback to this is that it won't work on systems without /proc mounted. The jailer setup is a one time cost and one might have a pool of jailer instances ready ahead of time.

Please correct me if I'm wrong, but I think adding one additional syscall during startup on the 4.14 kernel when jailer was already doing thousands of them to close all possible fds won't change the performance much.

Introduces multiple methods for closing all open FDs in jailer.
The method is choosen based on the environment.
On host kernel >= 5.9 uses the close_range syscall
On host kernel < 5.9 when OPEN_MAX <= 2**16 iterates over all FDs
On host kernel < 5.9 when OPEN_MAX > 2**16 discovers open FDs by
reading from /proc/self/fds using the nix crate

Fixes a bug where firecracker would end up spending minutes starting
which arises when running on systems with OPEN_MAX set to a very high
number.

Signed-off-by: Grzegorz Uriasz <gorbak25@gmail.com>
@gorbak25
Copy link
Contributor Author

@JonathanWoollett-Light any feedback on what approach this PR should take?

@JonathanWoollett-Light
Copy link
Contributor

JonathanWoollett-Light commented Mar 23, 2023

I think what you suggest is right.

A 1st commit could implement the fix on >5.9, this can look like:

fn sanitize_process() {
    // First thing to do is make sure we don't keep any inherited FDs
    // other that IN, OUT and ERR.

    // Try using the close_range syscall to close all open FDs in the range of 4..UINT_MAX
    // SAFETY: The parameters are valid constants.
    let res = unsafe { libc::syscall(
        libc::SYS_close_range,
        3,
        libc::c_uint::MAX,
        libc::CLOSE_RANGE_UNSHARE,
    ) };

    // If `SYS_close_range` fails as it is not supported (as is the case on a pre 5.9 kernel), we
    // instead close by iterating through all possible file descriptors.
    match res {
        0 => (),
        // The syscall failed due to not being supported.
        // SAFETY: Always safe.
        -1 => if unsafe { *libc::__errno_location } == libc::ENOSYS {
                // SAFETY: Always safe.
                let fd_limit = i32::try_from(unsafe { libc::sysconf(libc::_SC_OPEN_MAX) }).unwrap();
                // Close all file descriptors excluding 0 (STDIN), 1 (STDOUT) and 2 (STDERR).
                for fd in 3..fd_limit {
                    // SAFETY: Safe because close() cannot fail when passed a valid parameter.
                    unsafe {
                        libc::close(fd);
                    }
                }
            },
            else { panic!("Failed to close inherited FDs."); }
        },
        _ => unreachable!()
    }

    // Cleanup environment variables
    clean_env_vars();
}

I am unsure about pulling in Nix, if this is required for the fallback, this will need to be discussed. If we want to get the 1st commit merged quickly I think using Nix to iterate through /proc should be a separate PR.

@JonathanWoollett-Light JonathanWoollett-Light added Type: Fix Indicates a fix to existing code Priority: Medium Indicates than an issue or pull request should be resolved ahead of issues or pull requests labelled Status: Awaiting author Indicates that an issue or pull request requires author action labels Mar 24, 2023
@gorbak25
Copy link
Contributor Author

@JonathanWoollett-Light will come back to this PR on wednesday, I'm on a trip without my laptop right now.

@dianpopa
Copy link
Contributor

dianpopa commented Apr 6, 2023

hello @gorbak25,

Thanks a lot for the proposed fix. The changes look reasonable to me in general, there are a couple of things I would like to adjust:

  • firstly, I would like to drop the check that looks at the _SC_OPEN_MAX variable and only have the one that calls into close_range and the one using proc/self/fd as a fallback.
    • I tested a bit the performance of each method, and for a system with _SC_OPEN_MAX 65535 I obtained around 47ms for the option of iterating 3..._SC_OPEN_MAX. It is too slow to keep around.
    • For kernels < 5.9, the behavior would be exactly the same as it was before the regression (iterating /proc/self/fd).
  • also I would like to improve a bit on the error propagation of the code

Since this issue introduces a customer-facing regression, I would like to speed up the merging process so I opened a separate PR that starts from your commit and contains the proposed adjustments. Your contribution is mentioned there and kept you as the commit author! Please take a look at this alternate PR and review.

Thanks a lot again for looking into this!

@dianpopa
Copy link
Contributor

Closing in favor of #3595. The fix will be present in v1.4.0 (our next release).
Thanks @gorbak25 again for the contribution!

@dianpopa dianpopa closed this Apr 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Priority: Medium Indicates than an issue or pull request should be resolved ahead of issues or pull requests labelled Status: Awaiting author Indicates that an issue or pull request requires author action Type: Fix Indicates a fix to existing code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] Firecracker uses up 100% CPU after an upgrade from v1.1.2 to v1.3.1
3 participants