-
Notifications
You must be signed in to change notification settings - Fork 124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Kill worker after action execution limit is reached #825
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 4 files at r1, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained, and 5 discussions need to be resolved
nativelink-config/src/cas_server.rs
line 611 at r1 (raw file):
/// If you would like for each individual action to spin up a kubernetes /// pod and then exit on completion, you would set this value to 1. pub execution_limit: Option<u64>,
tioli: Thoughts on other names to convey the idea of fixed number of execution before worker terminates, workload_limit
, action_limit_before_termination
, job_completion_limit
, execution_termination_limit
. No strong thoughts here, do wonder if we can somehow encode the idea of termination of worker in naming
nativelink-config/src/cas_server.rs
line 611 at r1 (raw file):
/// If you would like for each individual action to spin up a kubernetes /// pod and then exit on completion, you would set this value to 1. pub execution_limit: Option<u64>,
Where should we assert that this value is > 1
, if someone supplies a 0
I think we should fail to launch on bad config, maybe the serialization library has some primitive value constraints?
nativelink-worker/src/local_worker.rs
line 82 at r1 (raw file):
// on by the scheduler. actions_in_transit: Arc<AtomicU64>, // The number of actions a worker has completed.
This triplet makes me think of something I've used in the past of triplets for async counter/counting patterns request/success/failure or started/executing/completed. I wonder if other parts of the code could use that kind of information exposed as metrics.
I think we should have metrics here that align with the concept of actions "started/executing/completed", current metrics afaict don't cover that, just covers start_actions_received
nativelink-worker/src/local_worker.rs
line 346 at r1 (raw file):
// Futures will never be empty due to keep alive if futures.len() == 1 && assigned == limit && completed == assigned { std::process::exit(0);
We should debug!
or info!
a message about exiting due to worker reaching termination limit, probably debug!
as lots of executions that exist could be noisy for logs.
nativelink-worker/tests/local_worker_test.rs
line 643 at r1 (raw file):
#[tokio::test] async fn worker_action_limit_test() -> Result<(), Box<dyn std::error::Error>> {
To be pedantic we should add tests that assert a 0
, 1
, 3
as execution_limit
values. Also a test where multiple requests happen and a previous request takes longer to complete than the others. For example request1 takes T0:T4 time, request2 T1:T2, request3 T1:T3 or something of that nature to show exit conditions are not bound by last request received
3cd731e
to
28bdf91
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Analyze (python), Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Local / ubuntu-22.04, Publish image, Publish nativelink-worker-lre-cc, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, pre-commit-checks, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, vale, windows-2022 / stable, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04, and 5 discussions need to be resolved
nativelink-config/src/cas_server.rs
line 611 at r1 (raw file):
Previously, adam-singer (Adam Singer) wrote…
tioli: Thoughts on other names to convey the idea of fixed number of execution before worker terminates,
workload_limit
,action_limit_before_termination
,job_completion_limit
,execution_termination_limit
. No strong thoughts here, do wonder if we can somehow encode the idea of termination of worker in naming
I updated it to be actions_before_termination
which I think is pretty clear
nativelink-config/src/cas_server.rs
line 611 at r1 (raw file):
Previously, adam-singer (Adam Singer) wrote…
Where should we assert that this value is
> 1
, if someone supplies a0
I think we should fail to launch on bad config, maybe the serialization library has some primitive value constraints?
I added a check inside LocalWorkerImpl::new()
that throws an error stating it has to be greater than 0. This should catch during initialization like our other validation and default value checks
nativelink-worker/src/local_worker.rs
line 82 at r1 (raw file):
Previously, adam-singer (Adam Singer) wrote…
This triplet makes me think of something I've used in the past of triplets for async counter/counting patterns request/success/failure or started/executing/completed. I wonder if other parts of the code could use that kind of information exposed as metrics.
I think we should have metrics here that align with the concept of actions "started/executing/completed", current metrics afaict don't cover that, just covers
start_actions_received
I was thinking about this too. The metrics stuff is going to undergo a full rewrite soon though, so having these vars and maybe adding a way to expose them later should be sufficient.
nativelink-worker/src/local_worker.rs
line 346 at r1 (raw file):
Previously, adam-singer (Adam Singer) wrote…
We should
debug!
orinfo!
a message about exiting due to worker reaching termination limit, probablydebug!
as lots of executions that exist could be noisy for logs.
Good idea, added
28bdf91
to
a2e7a40
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 4 files at r1, 1 of 3 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained, and 1 discussions need to be resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed all commit messages.
Reviewable status: 0 of 1 LGTMs obtained, and 3 discussions need to be resolved
nativelink-config/src/cas_server.rs
line 606 at r3 (raw file):
/// The maximum number of actions a worker can executions . /// After this limit is reached, the nativelink binary will exit. /// A value of None means no limit.
nit: Put this on the last line of the comment by itself and say Default:
(see other examples in the config)
nativelink-worker/src/local_worker.rs
line 139 at r3 (raw file):
grpc_client, worker_id, actions_completed: Arc::new(AtomicU64::new(0)),
I don't think these needs to be Atomic's or Arc's. Looking at how they are used, it looks like all the places have mutable access to Self
.
Allows users to set the maximum number of action executions a worker is allowed to complete. Upon reaching this limit, the worker will no longer accept new jobs, and will exit upon completing all assigned ones. closes: TraceMachina#815
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 LGTMs obtained, and 2 discussions need to be resolved
nativelink-config/src/cas_server.rs
line 606 at r3 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
nit: Put this on the last line of the comment by itself and say
Default:
(see other examples in the config)
Done.
nativelink-worker/src/local_worker.rs
line 139 at r3 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
I don't think these needs to be Atomic's or Arc's. Looking at how they are used, it looks like all the places have mutable access to
Self
.
I tried making them normal u64 but both run into borrow checker issues I couldn't figure out how to fix due to the spawns.
a2e7a40
to
1f61e53
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r4, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: pre-commit-checks, and 2 discussions need to be resolved
Zach Birenbaum seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it. |
Description
Allows users to set the maximum number of action executions a worker is allowed to complete. Upon reaching this limit, the worker will no longer accept new jobs, and will exit upon completing all assigned ones.
Fixes #815
Type of change
How Has This Been Tested?
A test has been added to ensure the going_away signal is properly sent and received
Checklist
bazel test //...
passes locallygit amend
see some docsThis change is