Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolve upload deadlock #816

Merged
merged 1 commit into from
Apr 3, 2024

Conversation

chrisstaite
Copy link
Contributor

@chrisstaite chrisstaite commented Mar 29, 2024

Description

We get the metadata for a file after opening it, which causes two file descriptors to be used rather than one. In order to ensure that every future requires exactly one file descriptor at a time and therefore not cause a deadlock in the OPEN FILE Semaphore, we simply get the metadata before we open the file.

Fixes #808

Type of change

Please delete options that aren't relevant.

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Please also list any relevant details for your test configuration

Checklist

  • Updated documentation if needed
  • Tests added/amended
  • bazel test //... passes locally
  • PR is contained in a single commit, using git amend see some docs

This change is Reviewable

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 2 files at r1, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), and 4 discussions need to be resolved


nativelink-worker/src/running_actions_manager.rs line 385 at r1 (raw file):

            // futures with all the work we are wanting to do then execute it. It allows us to
            // close the directory iterator file descriptor, then open the child files/folders.
            while let Some(entry) = dir_stream.next().await {

nit: entry_result


nativelink-worker/src/running_actions_manager.rs line 432 at r1 (raw file):

                } else if file_type.is_file() {
                    file_futures.push(async move {
                        let metadata = fs::metadata(&full_path).await?;

nit: .err_tip()


nativelink-worker/tests/running_actions_manager_test.rs line 2938 at r1 (raw file):

    // Be default this test is ignored because it *must* be run single threaded... to run this
    // test execute:
    // cargo test -p nativelink-worker --test running_actions_manager_test -- --test-threads=1 --ignored

Idea, what if we use tokio_fork to fork the test, then have the child call process.exit directly (to prevent other tests from running), then have the parent just wait on the exit code?

This should prevent threads from stepping on each other.


nativelink-worker/tests/running_actions_manager_test.rs line 2954 at r1 (raw file):

        // Take all but one FD permit away.
        let _permits = futures::stream::iter(1..fs::OPEN_FILE_SEMAPHORE.available_permits())

nit: Just use fs::set_open_file_limit(1).

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: 1 of 1 LGTMs obtained, and pending CI: docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), and 4 discussions need to be resolved

Copy link
Contributor Author

@chrisstaite chrisstaite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 1 LGTMs obtained, and pending CI: docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), and 4 discussions need to be resolved


nativelink-worker/tests/running_actions_manager_test.rs line 2954 at r1 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: Just use fs::set_open_file_limit(1).

I read the code and am pretty sure it doesn't support reducing the number

Copy link
Contributor Author

@chrisstaite chrisstaite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 1 LGTMs obtained, and pending CI: docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), and 2 discussions need to be resolved


nativelink-worker/tests/running_actions_manager_test.rs line 2938 at r1 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

Idea, what if we use tokio_fork to fork the test, then have the child call process.exit directly (to prevent other tests from running), then have the parent just wait on the exit code?

This should prevent threads from stepping on each other.

I'm not sure that tokio_fork works after a runtime has already been set up?

Copy link
Contributor Author

@chrisstaite chrisstaite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 1 LGTMs obtained, and pending CI: Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Publish image, Publish nativelink-worker-lre-cc, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, windows-2022 / stable, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04, and 2 discussions need to be resolved


nativelink-worker/tests/running_actions_manager_test.rs line 2954 at r1 (raw file):

Previously, chrisstaite (Chris) wrote…

I read the code and am pretty sure it doesn't support reducing the number

fetch_add with a negative value in a usize... does that work? Then calling add_permits with a negative value... All seems like a bad idea.

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 1 LGTMs obtained, and 2 discussions need to be resolved


nativelink-worker/tests/running_actions_manager_test.rs line 2938 at r1 (raw file):

Previously, chrisstaite (Chris) wrote…

I'm not sure that tokio_fork works after a runtime has already been set up?

What about something like this?
https://gist.github.com/allada/23f6d9f11854c830d48828a1f00ae8a5


nativelink-worker/tests/running_actions_manager_test.rs line 2954 at r1 (raw file):

Previously, chrisstaite (Chris) wrote…

fetch_add with a negative value in a usize... does that work? Then calling add_permits with a negative value... All seems like a bad idea.

Yeah, we'd need to rewrite set_open_file_limit to call forget_permits and remove the check.

Copy link
Contributor

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed all commit messages.
Reviewable status: 2 of 1 LGTMs obtained, and 2 discussions need to be resolved

Copy link
Contributor

@zbirenbaum zbirenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: 3 of 1 LGTMs obtained, and 2 discussions need to be resolved

@chrisstaite
Copy link
Contributor Author

nativelink-worker/tests/running_actions_manager_test.rs line 2938 at r1 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

What about something like this?
https://gist.github.com/allada/23f6d9f11854c830d48828a1f00ae8a5

Nope I get:

Cannot start a runtime from within a runtime. This happens because a function (like `block_on`) attempted to block the current thread while the thread is being used to drive asynchronous tasks.

@chrisstaite
Copy link
Contributor Author

nativelink-worker/tests/running_actions_manager_test.rs line 2938 at r1 (raw file):

Previously, chrisstaite (Chris) wrote…

Nope I get:

Cannot start a runtime from within a runtime. This happens because a function (like `block_on`) attempted to block the current thread while the thread is being used to drive asynchronous tasks.

Figured it out, can spawn a new thread after forking and then just block before exiting.

@chrisstaite
Copy link
Contributor Author

nativelink-worker/tests/running_actions_manager_test.rs line 2938 at r1 (raw file):

Previously, chrisstaite (Chris) wrote…

Figured it out, can spawn a new thread after forking and then just block before exiting.

Nope, it's not at all stable... Gets very upset.

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 3 of 1 LGTMs obtained


nativelink-worker/tests/running_actions_manager_test.rs line 2938 at r1 (raw file):

Previously, chrisstaite (Chris) wrote…

Nope, it's not at all stable... Gets very upset.

Hmmm, ok :-(

Copy link
Collaborator

@chrisstaite-menlo chrisstaite-menlo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: :shipit: complete! 3 of 1 LGTMs obtained

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 2 files at r1, 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! 3 of 1 LGTMs obtained

We get the metadata for a file after opening it, which causes two file descriptors
to be used rather than one.  In order to ensure that every future requires exactly
one file descriptor at a time and therefore not cause a deadlock in the OPEN FILE
Semaphore, we simply get the metadata before we open the file.
Copy link
Contributor

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r4, all commit messages.
Reviewable status: 3 of 1 LGTMs obtained, and pending CI: integration-tests (20.04), integration-tests (22.04)

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aaronmondal , was docker-compose removed from our images?
ref: https://github.com/TraceMachina/nativelink/actions/runs/8529184265/job/23364409815

Reviewable status: 3 of 1 LGTMs obtained, and pending CI: integration-tests (20.04), integration-tests (22.04)

@chrisstaite-menlo chrisstaite-menlo enabled auto-merge (squash) April 3, 2024 08:21
@chrisstaite-menlo chrisstaite-menlo merged commit b61142d into TraceMachina:main Apr 3, 2024
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Action that produces a zero length file fails to upload
5 participants