-
Notifications
You must be signed in to change notification settings - Fork 124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Resolve upload deadlock #816
Resolve upload deadlock #816
Conversation
eb95967
to
b8cc286
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 2 files at r1, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), and 4 discussions need to be resolved
nativelink-worker/src/running_actions_manager.rs
line 385 at r1 (raw file):
// futures with all the work we are wanting to do then execute it. It allows us to // close the directory iterator file descriptor, then open the child files/folders. while let Some(entry) = dir_stream.next().await {
nit: entry_result
nativelink-worker/src/running_actions_manager.rs
line 432 at r1 (raw file):
} else if file_type.is_file() { file_futures.push(async move { let metadata = fs::metadata(&full_path).await?;
nit: .err_tip()
nativelink-worker/tests/running_actions_manager_test.rs
line 2938 at r1 (raw file):
// Be default this test is ignored because it *must* be run single threaded... to run this // test execute: // cargo test -p nativelink-worker --test running_actions_manager_test -- --test-threads=1 --ignored
Idea, what if we use tokio_fork
to fork the test, then have the child call process.exit directly (to prevent other tests from running), then have the parent just wait on the exit code?
This should prevent threads from stepping on each other.
nativelink-worker/tests/running_actions_manager_test.rs
line 2954 at r1 (raw file):
// Take all but one FD permit away. let _permits = futures::stream::iter(1..fs::OPEN_FILE_SEMAPHORE.available_permits())
nit: Just use fs::set_open_file_limit(1)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 of 1 LGTMs obtained, and pending CI: docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), and 4 discussions need to be resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 of 1 LGTMs obtained, and pending CI: docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), and 4 discussions need to be resolved
nativelink-worker/tests/running_actions_manager_test.rs
line 2954 at r1 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
nit: Just use
fs::set_open_file_limit(1)
.
I read the code and am pretty sure it doesn't support reducing the number
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 of 1 LGTMs obtained, and pending CI: docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), and 2 discussions need to be resolved
nativelink-worker/tests/running_actions_manager_test.rs
line 2938 at r1 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
Idea, what if we use
tokio_fork
to fork the test, then have the child call process.exit directly (to prevent other tests from running), then have the parent just wait on the exit code?This should prevent threads from stepping on each other.
I'm not sure that tokio_fork
works after a runtime has already been set up?
b8cc286
to
82dd0af
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 of 1 LGTMs obtained, and pending CI: Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Publish image, Publish nativelink-worker-lre-cc, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, windows-2022 / stable, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04, and 2 discussions need to be resolved
nativelink-worker/tests/running_actions_manager_test.rs
line 2954 at r1 (raw file):
Previously, chrisstaite (Chris) wrote…
I read the code and am pretty sure it doesn't support reducing the number
fetch_add
with a negative value in a usize
... does that work? Then calling add_permits
with a negative value... All seems like a bad idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 of 1 LGTMs obtained, and 2 discussions need to be resolved
nativelink-worker/tests/running_actions_manager_test.rs
line 2938 at r1 (raw file):
Previously, chrisstaite (Chris) wrote…
I'm not sure that
tokio_fork
works after a runtime has already been set up?
What about something like this?
https://gist.github.com/allada/23f6d9f11854c830d48828a1f00ae8a5
nativelink-worker/tests/running_actions_manager_test.rs
line 2954 at r1 (raw file):
Previously, chrisstaite (Chris) wrote…
fetch_add
with a negative value in ausize
... does that work? Then callingadd_permits
with a negative value... All seems like a bad idea.
Yeah, we'd need to rewrite set_open_file_limit
to call forget_permits
and remove the check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed all commit messages.
Reviewable status: 2 of 1 LGTMs obtained, and 2 discussions need to be resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 3 of 1 LGTMs obtained, and 2 discussions need to be resolved
Previously, allada (Nathan (Blaise) Bruer) wrote…
Nope I get:
|
Previously, chrisstaite (Chris) wrote…
Figured it out, can spawn a new thread after forking and then just block before exiting. |
82dd0af
to
a8143d4
Compare
Previously, chrisstaite (Chris) wrote…
Nope, it's not at all stable... Gets very upset. |
a8143d4
to
70f9508
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 3 of 1 LGTMs obtained
nativelink-worker/tests/running_actions_manager_test.rs
line 2938 at r1 (raw file):
Previously, chrisstaite (Chris) wrote…
Nope, it's not at all stable... Gets very upset.
Hmmm, ok :-(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed all commit messages.
Reviewable status: complete! 3 of 1 LGTMs obtained
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 2 files at r1, 1 of 1 files at r3, all commit messages.
Reviewable status: complete! 3 of 1 LGTMs obtained
We get the metadata for a file after opening it, which causes two file descriptors to be used rather than one. In order to ensure that every future requires exactly one file descriptor at a time and therefore not cause a deadlock in the OPEN FILE Semaphore, we simply get the metadata before we open the file.
70f9508
to
d721c09
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r4, all commit messages.
Reviewable status: 3 of 1 LGTMs obtained, and pending CI: integration-tests (20.04), integration-tests (22.04)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aaronmondal , was docker-compose removed from our images?
ref: https://github.com/TraceMachina/nativelink/actions/runs/8529184265/job/23364409815
Reviewable status: 3 of 1 LGTMs obtained, and pending CI: integration-tests (20.04), integration-tests (22.04)
Description
We get the metadata for a file after opening it, which causes two file descriptors to be used rather than one. In order to ensure that every future requires exactly one file descriptor at a time and therefore not cause a deadlock in the OPEN FILE Semaphore, we simply get the metadata before we open the file.
Fixes #808
Type of change
Please delete options that aren't relevant.
How Has This Been Tested?
Please also list any relevant details for your test configuration
Checklist
bazel test //...
passes locallygit amend
see some docsThis change is