-
Notifications
You must be signed in to change notification settings - Fork 131
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implementation of WorkerStateManager #993
Implementation of WorkerStateManager #993
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Installation / macos-13, Publish image, Publish nativelink-worker-lre-cc, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), macos-13, ubuntu-22.04, windows-2022 / stable, and 1 discussions need to be resolved (waiting on @adam-singer)
nativelink-scheduler/src/operation_state_manager.rs
line 122 at r1 (raw file):
/// the operation from being considered stale and being rescheduled. async fn update_operation( &mut self,
Lets find a way to work around whatever is making this mut necessary by locking inside if necessary.
I'm guessing that the retry logic requiring mutable access is what is causing this to be necessary? IIRC that unfortunately goes pretty deep but it may be time to address it before this is landed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Installation / macos-13, Publish image, Publish nativelink-worker-lre-cc, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), macos-13, ubuntu-22.04, windows-2022 / stable, and 2 discussions need to be resolved (waiting on @adam-singer)
nativelink-scheduler/src/scheduler_state/state_manager.rs
line 222 at r1 (raw file):
} Arc::make_mut(&mut running_action.current_state).stage = action_stage;
I know this line came from the original code but Arc::make_mut
is pretty dangerous since it clones the data and invalidates all weak references to it. I think the changes necessary to allow update_action
(and ideally retry_action
) to not require a mutable reference to self would also allow us to modify this in a safer way.
aa3e1bf
to
11fc588
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 6 files at r1, 1 of 1 files at r2.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: pre-commit-checks, and 8 discussions need to be resolved (waiting on @adam-singer)
-- commits
line 2 at r2:
nit: Extend this commit message.
nativelink-scheduler/src/operation_state_manager.rs
line 122 at r1 (raw file):
Previously, zbirenbaum (Zach Birenbaum) wrote…
Lets find a way to work around whatever is making this mut necessary by locking inside if necessary.
I'm guessing that the retry logic requiring mutable access is what is causing this to be necessary? IIRC that unfortunately goes pretty deep but it may be time to address it before this is landed.
Adam and I talked about this. We decided to make it mutable for now and change it later because of some other issues he was having.
nativelink-scheduler/src/simple_scheduler.rs
line 468 at r2 (raw file):
&mut self, worker_id: &WorkerId, action_info_hash_key: &ActionInfoHashKey,
nit: Can we change this to take an owned version of this now?
It's generally a good idea to allow the caller to decide if it wants to copy it or move it.
nativelink-scheduler/src/scheduler_state/state_manager.rs
line 222 at r1 (raw file):
Previously, zbirenbaum (Zach Birenbaum) wrote…
I know this line came from the original code but
Arc::make_mut
is pretty dangerous since it clones the data and invalidates all weak references to it. I think the changes necessary to allowupdate_action
(and ideallyretry_action
) to not require a mutable reference to self would also allow us to modify this in a safer way.
I do agree, but It's actually safe to do because we are going to re-send the data back out. The only reason it's an Arc
is to reduce the number of copies. It's not really used to make it read-only. Doing it this way makes it so all down-stream components get the same pointer. If an update happens while another thread is operating on the data, it's ok, since the other thread is going to get another update with the new version anyway.
It might be worth hiding this logic behind a function and documenting it though.
nativelink-scheduler/src/scheduler_state/state_manager.rs
line 89 at r2 (raw file):
// We don't care if we fail to send message to worker, this is only a best attempt. let _ = worker.notify_update(WorkerUpdate::Disconnect); // We create a temporary Vec to avoid doubt about a possible code
nit: This looks like a stale comment.
nativelink-scheduler/src/scheduler_state/state_manager.rs
line 98 at r2 (raw file):
// TODO(adams): call this after eviction. // Note: Calling this many time is very cheap, it'll only trigger `do_try_match` once. //self.tasks_or_workers_change_notify.notify_one();
nit: don't forget.
nativelink-scheduler/src/scheduler_state/state_manager.rs
line 169 at r2 (raw file):
action_stage: Result<ActionStage, Error>, ) -> Result<(), Error> { let action_stage = action_stage?;
This looks weird, we take a result, but require it to be Ok
?
nativelink-scheduler/src/scheduler_state/state_manager.rs
line 175 at r2 (raw file):
event!( Level::ERROR, ?action_info_hash_key, /*?action_info_hash_key,*/
nit: cleanup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: pre-commit-checks, and 8 discussions need to be resolved
nativelink-scheduler/src/operation_state_manager.rs
line 122 at r1 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
Adam and I talked about this. We decided to make it mutable for now and change it later because of some other issues he was having.
This came around to locking, right now locking is in place, moving the lock inside will be easier once of the stateful code is out of the simple scheduler.
nativelink-scheduler/src/simple_scheduler.rs
line 468 at r2 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
nit: Can we change this to take an owned version of this now?
It's generally a good idea to allow the caller to decide if it wants to copy it or move it.
Owned in this case is we want to pass action_info_hash_key: ActionInfoHashKey
instead of action_info_hash_key: &ActionInfoHashKey
?
nativelink-scheduler/src/scheduler_state/state_manager.rs
line 169 at r2 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
This looks weird, we take a result, but require it to be
Ok
?
If it isn't Ok
what should we do given the function returns ()
or Error
, I assumed by doing action_stage?
that would propagate the failure back out. If we don't have an action_stage
what else would we do? I forget do we have a different code path in the worker we need to support in this definition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: pre-commit-checks, and 7 discussions need to be resolved (waiting on @adam-singer)
nativelink-scheduler/src/operation_state_manager.rs
line 122 at r1 (raw file):
Previously, adam-singer (Adam Singer) wrote…
This came around to locking, right now locking is in place, moving the lock inside will be easier once of the stateful code is out of the simple scheduler.
Got it
nativelink-scheduler/src/scheduler_state/state_manager.rs
line 222 at r1 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
I do agree, but It's actually safe to do because we are going to re-send the data back out. The only reason it's an
Arc
is to reduce the number of copies. It's not really used to make it read-only. Doing it this way makes it so all down-stream components get the same pointer. If an update happens while another thread is operating on the data, it's ok, since the other thread is going to get another update with the new version anyway.It might be worth hiding this logic behind a function and documenting it though.
I agree with hiding it behind a documented function being a good course of action. That can probably wait for a followup though.
11fc588
to
065aa70
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: pre-commit-checks, and 1 discussions need to be resolved (waiting on @adam-singer)
Previously, allada (Nathan (Blaise) Bruer) wrote…
nit: Extend this commit message.
Done.
nativelink-scheduler/src/simple_scheduler.rs
line 468 at r2 (raw file):
Previously, adam-singer (Adam Singer) wrote…
Owned in this case is we want to pass
action_info_hash_key: ActionInfoHashKey
instead ofaction_info_hash_key: &ActionInfoHashKey
?
Done.
nativelink-scheduler/src/scheduler_state/state_manager.rs
line 222 at r1 (raw file):
Previously, zbirenbaum (Zach Birenbaum) wrote…
I agree with hiding it behind a documented function being a good course of action. That can probably wait for a followup though.
Done.
nativelink-scheduler/src/scheduler_state/state_manager.rs
line 89 at r2 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
nit: This looks like a stale comment.
Done.
nativelink-scheduler/src/scheduler_state/state_manager.rs
line 98 at r2 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
nit: don't forget.
Done.
nativelink-scheduler/src/scheduler_state/state_manager.rs
line 169 at r2 (raw file):
Previously, adam-singer (Adam Singer) wrote…
If it isn't
Ok
what should we do given the function returns()
orError
, I assumed by doingaction_stage?
that would propagate the failure back out. If we don't have anaction_stage
what else would we do? I forget do we have a different code path in the worker we need to support in this definition?
cc: @allada
nativelink-scheduler/src/scheduler_state/state_manager.rs
line 175 at r2 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
nit: cleanup.
Done
065aa70
to
f945142
Compare
f945142
to
dd3c137
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 9 of 9 files at r3, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), windows-2022 / stable, and 2 discussions need to be resolved (waiting on @adam-singer)
nativelink-scheduler/src/scheduler_state/state_manager.rs
line 169 at r2 (raw file):
Previously, adam-singer (Adam Singer) wrote…
cc: @allada
Shouldn't the operation be updated to reflect the Err case? The way it's done here it does not propagate the error to the client.
nativelink-scheduler/src/scheduler_state/state_manager.rs
line 133 at r3 (raw file):
/// fn mutate_stage(awaited_action: &mut AwaitedAction, action_stage: ActionStage) { Arc::make_mut(&mut awaited_action.current_state).stage = action_stage;
If we mutate the stage, we probably should also do the notification to any listeners here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), windows-2022 / stable, and 2 discussions need to be resolved
nativelink-scheduler/src/scheduler_state/state_manager.rs
line 169 at r2 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
Shouldn't the operation be updated to reflect the Err case? The way it's done here it does not propagate the error to the client.
I'm confused by this statement, the error case here is action_stage
came in as an error, using action_stage?
afaict means it would propagate that same error back out
nativelink-scheduler/src/scheduler_state/state_manager.rs
line 133 at r3 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
If we mutate the stage, we probably should also do the notification to any listeners here too?
When StateManager::mutate_stage
is called the following code has handled doing notify_channel.send()
. Guess you are saying thats related and we should fold that code into this method
dd3c137
to
2c3eb4f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Analyze (python), Bazel Dev / ubuntu-22.04, Installation / macos-13, Installation / macos-14, Installation / ubuntu-22.04, Local / ubuntu-22.04, Publish image, Publish nativelink-worker-init, Publish nativelink-worker-lre-cc, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), macos-13, pre-commit-checks, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, vale, windows-2022 / stable, and 2 discussions need to be resolved
nativelink-scheduler/src/scheduler_state/state_manager.rs
line 133 at r3 (raw file):
Previously, adam-singer (Adam Singer) wrote…
When
StateManager::mutate_stage
is called the following code has handled doingnotify_channel.send()
. Guess you are saying thats related and we should fold that code into this method
Done.
Implementation of `WorkerStateManager` where the beginnings of being able to tuck mutations behind a trait for `StateManager`. `StateManager` now wraps the `StateManagerImpl` for inner state structure `StateManagerImpl`. All mutations should be done on a lock on `inner` in the future. Moving implementation code of `update_action` into `WorkerStateManager` trait.
2c3eb4f
to
57154a3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Analyze (python), Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Installation / macos-13, Installation / macos-14, Installation / ubuntu-22.04, Local / ubuntu-22.04, Publish image, Publish nativelink-worker-init, Publish nativelink-worker-lre-cc, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, pre-commit-checks, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, vale, windows-2022 / stable, and 2 discussions need to be resolved (waiting on @adam-singer)
nativelink-scheduler/src/scheduler_state/state_manager.rs
line 169 at r2 (raw file):
Previously, adam-singer (Adam Singer) wrote…
I'm confused by this statement, the error case here is
action_stage
came in as an error, usingaction_stage?
afaict means it would propagate that same error back out
Taken note in NL-320, will follow up on that refactor after this lands
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 8 of 9 files at r3, 1 of 1 files at r5, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained, and 2 discussions need to be resolved (waiting on @adam-singer)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed all commit messages.
Reviewable status: 0 of 1 LGTMs obtained (waiting on @adam-singer)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 1 of 1 LGTMs obtained
Description
Implement WorkerStateManager for simple scheduler
Implementation of
WorkerStateManager
where the beginnings of being able totuck mutations behind a trait for
StateManager
.StateManager
now wrapsthe
StateManagerImpl
for inner state structureStateManagerImpl
.All mutations should be done on a lock on
inner
in the future.Moving implementation code of
update_action
intoWorkerStateManager
trait.Fixes #994
Type of change
Please delete options that aren't relevant.
How Has This Been Tested?
Unit tests
Checklist
bazel test //...
passes locallygit amend
see some docsThis change is![Reviewable](https://camo.githubusercontent.com/1541c4039185914e83657d3683ec25920c672c6c5c7ab4240ee7bff601adec0b/68747470733a2f2f72657669657761626c652e696f2f7265766965775f627574746f6e2e737667)