Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementation of WorkerStateManager #993

Merged

Conversation

adam-singer
Copy link
Contributor

@adam-singer adam-singer commented Jun 12, 2024

Description

Implement WorkerStateManager for simple scheduler

Implementation of WorkerStateManager where the beginnings of being able to
tuck mutations behind a trait for StateManager. StateManager now wraps
the StateManagerImpl for inner state structure StateManagerImpl.
All mutations should be done on a lock on inner in the future.
Moving implementation code of update_action into WorkerStateManager trait.

Fixes #994

Type of change

Please delete options that aren't relevant.

  • Refactor

How Has This Been Tested?

Unit tests

Checklist

  • bazel test //... passes locally
  • PR is contained in a single commit, using git amend see some docs

This change is Reviewable

@adam-singer adam-singer requested review from allada and zbirenbaum June 12, 2024 20:56
@adam-singer adam-singer self-assigned this Jun 12, 2024
Copy link
Contributor

@zbirenbaum zbirenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Installation / macos-13, Publish image, Publish nativelink-worker-lre-cc, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), macos-13, ubuntu-22.04, windows-2022 / stable, and 1 discussions need to be resolved (waiting on @adam-singer)


nativelink-scheduler/src/operation_state_manager.rs line 122 at r1 (raw file):

    /// the operation from being considered stale and being rescheduled.
    async fn update_operation(
        &mut self,

Lets find a way to work around whatever is making this mut necessary by locking inside if necessary.

I'm guessing that the retry logic requiring mutable access is what is causing this to be necessary? IIRC that unfortunately goes pretty deep but it may be time to address it before this is landed.

Copy link
Contributor

@zbirenbaum zbirenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Installation / macos-13, Publish image, Publish nativelink-worker-lre-cc, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), macos-13, ubuntu-22.04, windows-2022 / stable, and 2 discussions need to be resolved (waiting on @adam-singer)


nativelink-scheduler/src/scheduler_state/state_manager.rs line 222 at r1 (raw file):

        }

        Arc::make_mut(&mut running_action.current_state).stage = action_stage;

I know this line came from the original code but Arc::make_mut is pretty dangerous since it clones the data and invalidates all weak references to it. I think the changes necessary to allow update_action (and ideally retry_action) to not require a mutable reference to self would also allow us to modify this in a safer way.

@adam-singer adam-singer force-pushed the adams/impl-worker-state-manager branch from aa3e1bf to 11fc588 Compare June 12, 2024 22:36
Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 6 files at r1, 1 of 1 files at r2.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: pre-commit-checks, and 8 discussions need to be resolved (waiting on @adam-singer)


-- commits line 2 at r2:
nit: Extend this commit message.


nativelink-scheduler/src/operation_state_manager.rs line 122 at r1 (raw file):

Previously, zbirenbaum (Zach Birenbaum) wrote…

Lets find a way to work around whatever is making this mut necessary by locking inside if necessary.

I'm guessing that the retry logic requiring mutable access is what is causing this to be necessary? IIRC that unfortunately goes pretty deep but it may be time to address it before this is landed.

Adam and I talked about this. We decided to make it mutable for now and change it later because of some other issues he was having.


nativelink-scheduler/src/simple_scheduler.rs line 468 at r2 (raw file):

        &mut self,
        worker_id: &WorkerId,
        action_info_hash_key: &ActionInfoHashKey,

nit: Can we change this to take an owned version of this now?

It's generally a good idea to allow the caller to decide if it wants to copy it or move it.


nativelink-scheduler/src/scheduler_state/state_manager.rs line 222 at r1 (raw file):

Previously, zbirenbaum (Zach Birenbaum) wrote…

I know this line came from the original code but Arc::make_mut is pretty dangerous since it clones the data and invalidates all weak references to it. I think the changes necessary to allow update_action (and ideally retry_action) to not require a mutable reference to self would also allow us to modify this in a safer way.

I do agree, but It's actually safe to do because we are going to re-send the data back out. The only reason it's an Arc is to reduce the number of copies. It's not really used to make it read-only. Doing it this way makes it so all down-stream components get the same pointer. If an update happens while another thread is operating on the data, it's ok, since the other thread is going to get another update with the new version anyway.

It might be worth hiding this logic behind a function and documenting it though.


nativelink-scheduler/src/scheduler_state/state_manager.rs line 89 at r2 (raw file):

            // We don't care if we fail to send message to worker, this is only a best attempt.
            let _ = worker.notify_update(WorkerUpdate::Disconnect);
            // We create a temporary Vec to avoid doubt about a possible code

nit: This looks like a stale comment.


nativelink-scheduler/src/scheduler_state/state_manager.rs line 98 at r2 (raw file):

        // TODO(adams): call this after eviction.
        // Note: Calling this many time is very cheap, it'll only trigger `do_try_match` once.
        //self.tasks_or_workers_change_notify.notify_one();

nit: don't forget.


nativelink-scheduler/src/scheduler_state/state_manager.rs line 169 at r2 (raw file):

        action_stage: Result<ActionStage, Error>,
    ) -> Result<(), Error> {
        let action_stage = action_stage?;

This looks weird, we take a result, but require it to be Ok?


nativelink-scheduler/src/scheduler_state/state_manager.rs line 175 at r2 (raw file):

            event!(
                Level::ERROR,
                ?action_info_hash_key, /*?action_info_hash_key,*/

nit: cleanup.

Copy link
Contributor Author

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 LGTMs obtained, and pending CI: pre-commit-checks, and 8 discussions need to be resolved


nativelink-scheduler/src/operation_state_manager.rs line 122 at r1 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

Adam and I talked about this. We decided to make it mutable for now and change it later because of some other issues he was having.

This came around to locking, right now locking is in place, moving the lock inside will be easier once of the stateful code is out of the simple scheduler.


nativelink-scheduler/src/simple_scheduler.rs line 468 at r2 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: Can we change this to take an owned version of this now?

It's generally a good idea to allow the caller to decide if it wants to copy it or move it.

Owned in this case is we want to pass action_info_hash_key: ActionInfoHashKey instead of action_info_hash_key: &ActionInfoHashKey ?


nativelink-scheduler/src/scheduler_state/state_manager.rs line 169 at r2 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

This looks weird, we take a result, but require it to be Ok?

If it isn't Ok what should we do given the function returns () or Error, I assumed by doing action_stage? that would propagate the failure back out. If we don't have an action_stage what else would we do? I forget do we have a different code path in the worker we need to support in this definition?

Copy link
Contributor

@zbirenbaum zbirenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 LGTMs obtained, and pending CI: pre-commit-checks, and 7 discussions need to be resolved (waiting on @adam-singer)


nativelink-scheduler/src/operation_state_manager.rs line 122 at r1 (raw file):

Previously, adam-singer (Adam Singer) wrote…

This came around to locking, right now locking is in place, moving the lock inside will be easier once of the stateful code is out of the simple scheduler.

Got it


nativelink-scheduler/src/scheduler_state/state_manager.rs line 222 at r1 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

I do agree, but It's actually safe to do because we are going to re-send the data back out. The only reason it's an Arc is to reduce the number of copies. It's not really used to make it read-only. Doing it this way makes it so all down-stream components get the same pointer. If an update happens while another thread is operating on the data, it's ok, since the other thread is going to get another update with the new version anyway.

It might be worth hiding this logic behind a function and documenting it though.

I agree with hiding it behind a documented function being a good course of action. That can probably wait for a followup though.

@adam-singer adam-singer force-pushed the adams/impl-worker-state-manager branch from 11fc588 to 065aa70 Compare June 18, 2024 04:44
Copy link
Contributor Author

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 LGTMs obtained, and pending CI: pre-commit-checks, and 1 discussions need to be resolved (waiting on @adam-singer)


-- commits line 2 at r2:

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: Extend this commit message.

Done.


nativelink-scheduler/src/simple_scheduler.rs line 468 at r2 (raw file):

Previously, adam-singer (Adam Singer) wrote…

Owned in this case is we want to pass action_info_hash_key: ActionInfoHashKey instead of action_info_hash_key: &ActionInfoHashKey ?

Done.


nativelink-scheduler/src/scheduler_state/state_manager.rs line 222 at r1 (raw file):

Previously, zbirenbaum (Zach Birenbaum) wrote…

I agree with hiding it behind a documented function being a good course of action. That can probably wait for a followup though.

Done.


nativelink-scheduler/src/scheduler_state/state_manager.rs line 89 at r2 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: This looks like a stale comment.

Done.


nativelink-scheduler/src/scheduler_state/state_manager.rs line 98 at r2 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: don't forget.

Done.


nativelink-scheduler/src/scheduler_state/state_manager.rs line 169 at r2 (raw file):

Previously, adam-singer (Adam Singer) wrote…

If it isn't Ok what should we do given the function returns () or Error, I assumed by doing action_stage? that would propagate the failure back out. If we don't have an action_stage what else would we do? I forget do we have a different code path in the worker we need to support in this definition?

cc: @allada


nativelink-scheduler/src/scheduler_state/state_manager.rs line 175 at r2 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: cleanup.

Done

@adam-singer adam-singer marked this pull request as ready for review June 18, 2024 04:44
@adam-singer adam-singer force-pushed the adams/impl-worker-state-manager branch from 065aa70 to f945142 Compare June 18, 2024 04:55
@adam-singer adam-singer changed the title Implement worker state manager Implementation of WorkerStateManager Jun 18, 2024
@adam-singer adam-singer requested a review from allada June 20, 2024 15:58
@adam-singer adam-singer force-pushed the adams/impl-worker-state-manager branch from f945142 to dd3c137 Compare June 20, 2024 21:57
Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 9 of 9 files at r3, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), windows-2022 / stable, and 2 discussions need to be resolved (waiting on @adam-singer)


nativelink-scheduler/src/scheduler_state/state_manager.rs line 169 at r2 (raw file):

Previously, adam-singer (Adam Singer) wrote…

cc: @allada

Shouldn't the operation be updated to reflect the Err case? The way it's done here it does not propagate the error to the client.


nativelink-scheduler/src/scheduler_state/state_manager.rs line 133 at r3 (raw file):

    ///
    fn mutate_stage(awaited_action: &mut AwaitedAction, action_stage: ActionStage) {
        Arc::make_mut(&mut awaited_action.current_state).stage = action_stage;

If we mutate the stage, we probably should also do the notification to any listeners here too?

Copy link
Contributor Author

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 LGTMs obtained, and pending CI: docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), windows-2022 / stable, and 2 discussions need to be resolved


nativelink-scheduler/src/scheduler_state/state_manager.rs line 169 at r2 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

Shouldn't the operation be updated to reflect the Err case? The way it's done here it does not propagate the error to the client.

I'm confused by this statement, the error case here is action_stage came in as an error, using action_stage? afaict means it would propagate that same error back out


nativelink-scheduler/src/scheduler_state/state_manager.rs line 133 at r3 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

If we mutate the stage, we probably should also do the notification to any listeners here too?

When StateManager::mutate_stage is called the following code has handled doing notify_channel.send(). Guess you are saying thats related and we should fold that code into this method

@adam-singer adam-singer force-pushed the adams/impl-worker-state-manager branch from dd3c137 to 2c3eb4f Compare June 24, 2024 05:08
Copy link
Contributor Author

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Analyze (python), Bazel Dev / ubuntu-22.04, Installation / macos-13, Installation / macos-14, Installation / ubuntu-22.04, Local / ubuntu-22.04, Publish image, Publish nativelink-worker-init, Publish nativelink-worker-lre-cc, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), macos-13, pre-commit-checks, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, vale, windows-2022 / stable, and 2 discussions need to be resolved


nativelink-scheduler/src/scheduler_state/state_manager.rs line 133 at r3 (raw file):

Previously, adam-singer (Adam Singer) wrote…

When StateManager::mutate_stage is called the following code has handled doing notify_channel.send(). Guess you are saying thats related and we should fold that code into this method

Done.

Implementation of `WorkerStateManager` where the beginnings of being able to
tuck mutations behind a trait for `StateManager`. `StateManager` now wraps
the `StateManagerImpl` for inner state structure `StateManagerImpl`.
All mutations should be done on a lock on `inner` in the future.
Moving implementation code of `update_action` into `WorkerStateManager` trait.
@adam-singer adam-singer force-pushed the adams/impl-worker-state-manager branch from 2c3eb4f to 57154a3 Compare June 24, 2024 17:57
Copy link
Contributor Author

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Analyze (python), Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Installation / macos-13, Installation / macos-14, Installation / ubuntu-22.04, Local / ubuntu-22.04, Publish image, Publish nativelink-worker-init, Publish nativelink-worker-lre-cc, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, pre-commit-checks, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, vale, windows-2022 / stable, and 2 discussions need to be resolved (waiting on @adam-singer)


nativelink-scheduler/src/scheduler_state/state_manager.rs line 169 at r2 (raw file):

Previously, adam-singer (Adam Singer) wrote…

I'm confused by this statement, the error case here is action_stage came in as an error, using action_stage? afaict means it would propagate that same error back out

Taken note in NL-320, will follow up on that refactor after this lands

Copy link
Contributor Author

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 8 of 9 files at r3, 1 of 1 files at r5, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained, and 2 discussions need to be resolved (waiting on @adam-singer)

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: 0 of 1 LGTMs obtained (waiting on @adam-singer)

Copy link
Contributor Author

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 1 LGTMs obtained

@adam-singer adam-singer merged commit 1359513 into TraceMachina:main Jun 24, 2024
29 checks passed
@adam-singer adam-singer deleted the adams/impl-worker-state-manager branch June 24, 2024 19:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implementation of WorkerStateManager
3 participants