Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix possible deadlock if max_open_files set too low #908

Merged
merged 1 commit into from
May 12, 2024

Conversation

allada
Copy link
Member

@allada allada commented May 10, 2024

We now close the file permit if the write future takes too long.

fixes #907


This change is Reviewable

@CLAassistant
Copy link

CLAassistant commented May 10, 2024

CLA assistant check
All committers have signed the CLA.

@allada allada force-pushed the fix-file-upload-deadlock branch from 0dfef3f to 0b3ab20 Compare May 10, 2024 21:18
Copy link
Member Author

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+@adam-singer

Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Bazel Dev / ubuntu-22.04, Vercel, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), pre-commit-checks, ubuntu-20.04 / stable, vale, and 1 discussions need to be resolved (waiting on @adam)


nativelink-util/src/lib.rs line 39 at r1 (raw file):

/// Initialize tracing.
pub fn init_tracing() -> Result<(), nativelink_error::Error> {

I did this to help debug tests, so figured I'd sneak it in.

Copy link
Contributor

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 14 of 15 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: 1 of 1 LGTMs obtained, and 1 discussions need to be resolved


nativelink-macro/src/lib.rs line 40 at r2 (raw file):

            let _ = nativelink_util::init_tracing();
            // If already set it's ok.
            let _ = nativelink_util::fs::set_idle_file_descriptor_timeout(std::time::Duration::from_millis(100));

nit: Seem like something that should be a parameter defaulted to 100ms but configurable, 150-300ms hopefully should be max budget to complete a request in an ideal scenario


nativelink-store/src/filesystem_store.rs line 805 at r2 (raw file):

            }),
        );
        drop(file);

nit: lets comment on why the drop since we are explicit about releasing the reference within a function


nativelink-store/tests/filesystem_store_test.rs line 1421 at r2 (raw file):

    async fn update_with_whole_file_closes_file() -> Result<(), Error> {
        let mut permits = vec![];
        // Grab al permits to ensure only 1 permit is available.

nit: "Grab al" -> "Grab all"


nativelink-store/tests/filesystem_store_test.rs line 1473 at r2 (raw file):

    async fn update_with_whole_file_slow_path_when_low_file_descriptors() -> Result<(), Error> {
        let mut permits = vec![];
        // Grab al permits to ensure only 1 permit is available.

nit: "Grab al" -> "Grab all"


nativelink-store/tests/filesystem_store_test.rs line 1492 at r2 (raw file):

        let store = Box::pin(FastSlowStore::new(
            // Note: The config is not needed for this test, so use dummy data.
            &nativelink_config::stores::FastSlowStore {

nit: alias or import the struct (applied to same below, if existing code is like that I guess leave)


nativelink-util/src/fs.rs line 372 at r2 (raw file):

        Ok((
            permit,
            std::fs::File::options()

this code makes me happy and sad at the same time.


nativelink-util/src/lib.rs line 42 at r2 (raw file):

    use tracing_subscriber::prelude::*;

    static LOGGING_INITIALIZED: std::sync::Mutex<bool> = std::sync::Mutex::new(false);

nit: use std::sync::Mutex

Copy link
Contributor

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 1 LGTMs obtained, and 1 discussions need to be resolved

We now close the file permit if the write future takes
too long.

fixes TraceMachina#907
@allada allada force-pushed the fix-file-upload-deadlock branch from 0b3ab20 to a6c91ce Compare May 12, 2024 18:03
Copy link
Member Author

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 1 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Local / ubuntu-22.04, Publish image, Vercel, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), pre-commit-checks, ubuntu-20.04 / stable, ubuntu-22.04


nativelink-macro/src/lib.rs line 40 at r2 (raw file):

Previously, adam-singer (Adam Singer) wrote…

nit: Seem like something that should be a parameter defaulted to 100ms but configurable, 150-300ms hopefully should be max budget to complete a request in an ideal scenario

I do agree, but also don't want to spend time dealing with macros on this. For now i'll leave it this way. When we need it parameterized lets do it then.

Keep in mind this is test-only code.


nativelink-store/src/filesystem_store.rs line 805 at r2 (raw file):

Previously, adam-singer (Adam Singer) wrote…

nit: lets comment on why the drop since we are explicit about releasing the reference within a function

Done.


nativelink-store/tests/filesystem_store_test.rs line 1421 at r2 (raw file):

Previously, adam-singer (Adam Singer) wrote…

nit: "Grab al" -> "Grab all"

Done.


nativelink-store/tests/filesystem_store_test.rs line 1473 at r2 (raw file):

Previously, adam-singer (Adam Singer) wrote…

nit: "Grab al" -> "Grab all"

Done.


nativelink-store/tests/filesystem_store_test.rs line 1492 at r2 (raw file):

Previously, adam-singer (Adam Singer) wrote…

nit: alias or import the struct (applied to same below, if existing code is like that I guess leave)

Yeah to reduce the risk of name collision, we give full paths for all config variables. This is common in this entire file and most of the rest of the tests.


nativelink-util/src/fs.rs line 372 at r2 (raw file):

Previously, adam-singer (Adam Singer) wrote…

this code makes me happy and sad at the same time.

Yeah. I found a case that we are using the same descriptor to write then read, and std::fs::File::create() opens in write-only mode.


nativelink-util/src/lib.rs line 42 at r2 (raw file):

Previously, adam-singer (Adam Singer) wrote…

nit: use std::sync::Mutex

I think you meant parking_lot. Done.

@allada allada merged commit e0a7bb9 into TraceMachina:main May 12, 2024
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Possible deadlock if max_open_files set too low
3 participants