Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement worker api for killing running actions #840

Merged
merged 1 commit into from
Apr 8, 2024

Conversation

zbirenbaum
Copy link
Contributor

@zbirenbaum zbirenbaum commented Apr 5, 2024

Description

Implements worker api for requesting a currently running action to be killed. Allows for action cancellation requests to be sent from the scheduler during scenarios such as client disconnection.

towards: #338

Type of change

Please delete options that aren't relevant.

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

Added a new test case to confirm that the request to kill an action is received and handled accordingly.

Checklist

  • Updated documentation if needed
  • Tests added/amended
  • bazel test //... passes locally
  • PR is contained in a single commit, using git amend see some docs

This change is Reviewable

@zbirenbaum zbirenbaum force-pushed the kill-worker-action branch 2 times, most recently from b91f05d to 95edf4b Compare April 5, 2024 22:20
Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 7 of 7 files at r1.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Cargo Dev / macos-13, and 7 discussions need to be resolved


nativelink-proto/com/github/trace_machina/nativelink/remote_execution/worker_api.proto line 136 at r1 (raw file):

/// Request to kill a running action sent from the scheduler to a worker.
message KillActionRequest {
    /// The the hash of the unique qualifier for the action to be killed.

nit: Don't make it the hash, make it the hex.


nativelink-proto/com/github/trace_machina/nativelink/remote_execution/worker_api.proto line 138 at r1 (raw file):

    /// The the hash of the unique qualifier for the action to be killed.
    string action_id = 1;
    reserved 2;

nit: Add // NextId.


nativelink-worker/src/local_worker.rs line 215 at r1 (raw file):

                        Update::KillActionRequest(kill_action_request) => {
                            let kill_action_result = self.running_actions_manager
                                .kill_action(kill_action_request.clone())

nit: Pass just the action_id by ref. If we need more later we can change it, but keep it simple for now.


nativelink-worker/src/local_worker.rs line 217 at r1 (raw file):

                                .kill_action(kill_action_request.clone())
                                .await;
                            if let Err(e) = kill_action_result {

Shouldn't we propagate this error back up the grpc stream? See how Update::ConnectionResult does it.


nativelink-worker/src/running_actions_manager.rs line 1792 at r1 (raw file):

        assert_eq!(
            hex::decode_to_slice(kill_action_request.action_id, &mut action_id as &mut [u8]),

nit: Make the caller do this conversion and pass in an ActionId.


nativelink-worker/src/running_actions_manager.rs line 1799 at r1 (raw file):

        let running_actions = self.running_actions.lock();
        running_actions.get(&action_id).map_or(
            Err(make_err!(

nit: use make_input_err!()


nativelink-worker/src/running_actions_manager.rs line 1805 at r1 (raw file):

            )),
            |action| {
                action.upgrade().map(Self::kill_action);

If upgrade fails, we should return an error.

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Cargo Dev / macos-13, and 7 discussions need to be resolved

@zbirenbaum zbirenbaum force-pushed the kill-worker-action branch from 95edf4b to b9caf1e Compare April 6, 2024 02:19
Copy link
Contributor Author

@zbirenbaum zbirenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 LGTMs obtained, and 2 discussions need to be resolved


nativelink-proto/com/github/trace_machina/nativelink/remote_execution/worker_api.proto line 136 at r1 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: Don't make it the hash, make it the hex.

Done.


nativelink-proto/com/github/trace_machina/nativelink/remote_execution/worker_api.proto line 138 at r1 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: Add // NextId.

Done.


nativelink-worker/src/local_worker.rs line 215 at r1 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: Pass just the action_id by ref. If we need more later we can change it, but keep it simple for now.

Done.


nativelink-worker/src/local_worker.rs line 217 at r1 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

Shouldn't we propagate this error back up the grpc stream? See how Update::ConnectionResult does it.

Done.


nativelink-worker/src/running_actions_manager.rs line 1792 at r1 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: Make the caller do this conversion and pass in an ActionId.

Done.


nativelink-worker/src/running_actions_manager.rs line 1799 at r1 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: use make_input_err!()

Done.


nativelink-worker/src/running_actions_manager.rs line 1805 at r1 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

If upgrade fails, we should return an error.

Done.

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 4 files at r2, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Bazel Dev / ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), macos-13, ubuntu-22.04, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04, and 3 discussions need to be resolved


nativelink-worker/src/local_worker.rs line 215 at r2 (raw file):

                        Update::KillActionRequest(kill_action_request) => {
                            let mut action_id = [0u8; 32];
                            let decode_result = hex::decode_to_slice(kill_action_request.action_id, &mut action_id as &mut [u8]);

nit:

hex::decode_to_slice(kill_action_request.action_id, &mut action_id as &mut [u8])
    .map_err(|e| Err(make_input_err!(
      "KillActionRequest failed to decode ActionId hex with error {}",
      e
    ))?;

nativelink-worker/src/local_worker.rs line 223 at r2 (raw file):

                                ));
                            }
                            let kill_action_result = self.running_actions_manager

nit:

self.running_actions_manager
  .kill_action(action_id)
  .await
  .err_tip(|| format!("Kill action {} failed with error - {}", hex::encode(action_id), e))

nativelink-worker/src/running_actions_manager.rs line 1790 at r2 (raw file):

    async fn kill_action(&self, action_id: ActionId) -> Result<(), Error> {

        let upgrade_or_err: Result<Arc<RunningActionImpl>, Error> = {

nit:

async fn kill_action(&self, action_id: ActionId) -> Result<(), Error> {
    let running_action = {
        let running_actions = self.running_actions.lock();
        running_actions
            .get(&action_id)
            .and_then(|action| action.upgrade())
            .ok_or_else(|| make_input_err!(
                "Failed to get running action {}",
                hex::encode(action_id)
            ))?
    };
    Self::kill_action(running_action).await;
    Ok(())
}

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: 1 of 1 LGTMs obtained, and pending CI: Bazel Dev / ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), macos-13, ubuntu-22.04, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04, and 3 discussions need to be resolved

Copy link
Contributor Author

@zbirenbaum zbirenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 1 LGTMs obtained, and pending CI: Bazel Dev / ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), macos-13, ubuntu-22.04, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04, and 2 discussions need to be resolved


nativelink-worker/src/local_worker.rs line 215 at r2 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit:

hex::decode_to_slice(kill_action_request.action_id, &mut action_id as &mut [u8])
    .map_err(|e| Err(make_input_err!(
      "KillActionRequest failed to decode ActionId hex with error {}",
      e
    ))?;

Thanks, the error handling got really messy here. The compiler didn't like the Err part but

hex::decode_to_slice(kill_action_request.action_id, &mut action_id as &mut [u8])
    .map_err(|e| make_input_err!(
        "KillActionRequest failed to decode ActionId hex with error {}",
         e
     ))?;

seems to work. I'm not certain of what the make_input_err! macro does (assume it expands to an error with some custom code) so want to check that it's actually functionally equivalent to your previous code.


nativelink-worker/src/local_worker.rs line 223 at r2 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit:

self.running_actions_manager
  .kill_action(action_id)
  .await
  .err_tip(|| format!("Kill action {} failed with error - {}", hex::encode(action_id), e))

Done.

Copy link
Contributor Author

@zbirenbaum zbirenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 1 LGTMs obtained, and pending CI: Bazel Dev / ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), macos-13, ubuntu-22.04, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04, and 1 discussions need to be resolved


nativelink-worker/src/running_actions_manager.rs line 1790 at r2 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit:

async fn kill_action(&self, action_id: ActionId) -> Result<(), Error> {
    let running_action = {
        let running_actions = self.running_actions.lock();
        running_actions
            .get(&action_id)
            .and_then(|action| action.upgrade())
            .ok_or_else(|| make_input_err!(
                "Failed to get running action {}",
                hex::encode(action_id)
            ))?
    };
    Self::kill_action(running_action).await;
    Ok(())
}

Done.

@zbirenbaum zbirenbaum force-pushed the kill-worker-action branch 2 times, most recently from 9b72cad to faca603 Compare April 8, 2024 01:23
Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r3, all commit messages.
Reviewable status: 1 of 1 LGTMs obtained, and pending CI: Bazel Dev / ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), macos-13, ubuntu-22.04, windows-2022 / stable, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04


nativelink-worker/src/local_worker.rs line 215 at r2 (raw file):

Previously, zbirenbaum (Zach Birenbaum) wrote…

Thanks, the error handling got really messy here. The compiler didn't like the Err part but

hex::decode_to_slice(kill_action_request.action_id, &mut action_id as &mut [u8])
    .map_err(|e| make_input_err!(
        "KillActionRequest failed to decode ActionId hex with error {}",
         e
     ))?;

seems to work. I'm not certain of what the make_input_err! macro does (assume it expands to an error with some custom code) so want to check that it's actually functionally equivalent to your previous code.

Yeah it's the same thing.

make_input_err! is just short for:

Error::new(
  Code::InvalidArgument,
  format!("{}", format_args!($($arg)+)),
)

Implements worker api for requesting a currently running action to be
killed. Allows for action cancellation requests to be sent from the
scheduler during scenarios such as client disconnection.

towards TraceMachina#338
@zbirenbaum zbirenbaum force-pushed the kill-worker-action branch from faca603 to c55983d Compare April 8, 2024 01:54
@zbirenbaum zbirenbaum merged commit abf12e8 into TraceMachina:main Apr 8, 2024
26 checks passed
@zbirenbaum zbirenbaum deleted the kill-worker-action branch April 8, 2024 07:27
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants