chore: doc drain & kill timeouts (#646)
NathanFlurry committed Apr 18, 2024
1 parent 659f8a1 commit 332f88c
Showing 2 changed files with 38 additions and 2 deletions.
33 changes: 33 additions & 0 deletions docs/packages/job/JOB_DRAINING_AND_KILL_TIMEOUTS.md
@@ -0,0 +1,33 @@
# Job draining & kill timeouts

## Relevant Code

| Name | Timeout | Reason | Location |
| ------------------------------ | ---------------------------------- | --------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------- |
| Nomad client config | Something really high | Always be higher than anything passed to `datacenter.drain_timeout` | `svc/pkg/cluster/worker/src/workers/server_install/install_scripts/files/nomad_configure.sh` (`max_kill_timeout`) |
| Drain nomad job | `datacenter.drain_timeout` | How long the Nomad jobs have to stop | `svc/pkg/mm/worker/src/workers/lobby_create/nomad_job.rs` (`nodes_api::update_node_drain`) |
| Nomad job kill timeout | Something really high | Always be higher than anything passed to `datacenter.drain_timeout`. We'll manually send `SIGKILL`. | `svc/pkg/mm/worker/src/workers/lobby_create/nomad_job.rs` (`kill_timeout`) |
| job-run-stop delete Nomad job | Nomad job kill timeout (see above) | This causes Nomad to send a `SIGTERM` | `svc/pkg/mm/worker/src/workers/lobby_create/nomad_job.rs` (`kill_timeout`) |
| job-run-stop manually kill job | `util_job::JOB_STOP_TIMEOUT` (30s) | This lets us configure a lower kill timeout when manually stopping a job | `svc/pkg/job-run/worker/src/workers/stop.rs` (`allocations_api::signal_allocation`) |
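
The ordering rules in the table can be expressed as a small consistency check. This is a minimal sketch, not code from the repository: the `Timeouts` struct, its field names, and the `check` function are hypothetical, and only the three relationships stated in the table are encoded.

```rust
use std::time::Duration;

/// Hypothetical container for the timeouts listed in the table.
#[derive(Debug, Clone, Copy)]
pub struct Timeouts {
    /// Nomad client `max_kill_timeout`.
    pub client_max_kill_timeout: Duration,
    /// `datacenter.drain_timeout`.
    pub drain_timeout: Duration,
    /// `kill_timeout` set on the Nomad job.
    pub job_kill_timeout: Duration,
    /// `util_job::JOB_STOP_TIMEOUT`.
    pub job_stop_timeout: Duration,
}

/// Returns the violated ordering rules (empty when the values are consistent).
pub fn check(t: &Timeouts) -> Vec<&'static str> {
    let mut errors = Vec::new();
    if t.client_max_kill_timeout <= t.drain_timeout {
        errors.push("client max_kill_timeout must exceed drain_timeout");
    }
    if t.job_kill_timeout <= t.drain_timeout {
        errors.push("job kill_timeout must exceed drain_timeout");
    }
    if t.job_stop_timeout >= t.job_kill_timeout {
        errors.push("JOB_STOP_TIMEOUT must be lower than job kill_timeout");
    }
    errors
}
```

A check like this could run at startup or in a unit test so that a change to one timeout does not silently break the assumptions of the others.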

## Signals 101

- `SIGTERM` = request to stop; jobs should handle this signal and shut down gracefully
- `SIGKILL` = hard stop; cannot be caught or handled by the process

## Node draining vs manually stopping a job

### Node draining

1. Call `nodes_api::update_node_drain`
2. Nomad sends `SIGTERM` to the jobs
   - PROBLEM: jobs are only given 60s to shut down because of their `kill_timeout`
3. Nomad waits until the drain timeout (`datacenter.drain_timeout`)
4. Nomad sends `SIGKILL` to any remaining jobs
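
The problem noted in step 2 can be illustrated with a simplified model of how long a job actually gets between `SIGTERM` and `SIGKILL` during a drain. This is a sketch under stated assumptions, not Nomad's implementation: the function name `effective_drain_grace` is hypothetical, and the model assumes only that the job's `kill_timeout` is capped by the client's `max_kill_timeout` and that whichever of the kill timeout or drain timeout is shorter wins.

```rust
use std::time::Duration;

/// Simplified model (an assumption, not Nomad's code) of the grace period
/// a job gets between `SIGTERM` and `SIGKILL` while its node drains.
pub fn effective_drain_grace(
    drain_timeout: Duration,
    job_kill_timeout: Duration,
    client_max_kill_timeout: Duration,
) -> Duration {
    // The client's `max_kill_timeout` caps the job's own `kill_timeout`.
    let kill_timeout = job_kill_timeout.min(client_max_kill_timeout);
    // The job is killed as soon as the shorter of the two timeouts elapses.
    drain_timeout.min(kill_timeout)
}
```

Under this model, a 60s `kill_timeout` limits the job to 60s no matter how long the drain timeout is, which is why both kill timeouts in the table are kept higher than `datacenter.drain_timeout`.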

### Manually stopping a job

1. Call `allocations_api::delete_job`, which causes Nomad to send `SIGTERM`
2. Manually send `SIGKILL` after `util_job::JOB_STOP_TIMEOUT` if the allocation is still running
   - This is less than the job's kill timeout
   - If the worker crashes, job-gc will clean up the job later
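
The manual stop flow can be sketched as follows. This is an illustration only: the `JobBackend` trait and `stop_job` function are hypothetical stand-ins for the Nomad API calls, and the sketch blocks with `std::thread::sleep` for simplicity, whereas the worker in `stop.rs` sleeps inside a spawned async task.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for the Nomad calls; not the real client API.
pub trait JobBackend {
    /// Deletes the job, which causes Nomad to send `SIGTERM`.
    fn delete_job(&mut self);
    /// Returns true while the allocation is still running.
    fn alloc_running(&self) -> bool;
    /// Sends `SIGKILL` to the allocation.
    fn kill_alloc(&mut self);
}

/// Stops a job gracefully, escalating to `SIGKILL` after `stop_timeout`.
/// Returns true if `SIGKILL` had to be sent.
pub fn stop_job<B: JobBackend>(backend: &mut B, stop_timeout: Duration) -> bool {
    // Step 1: delete the job so that Nomad sends `SIGTERM`.
    backend.delete_job();
    // Step 2: give the job `stop_timeout` to exit on its own.
    thread::sleep(stop_timeout);
    if backend.alloc_running() {
        // The job ignored `SIGTERM` or is still shutting down: force it.
        backend.kill_alloc();
        true
    } else {
        false
    }
}
```

Because the wait happens before the kill, whatever runs this flow needs a timeout longer than `util_job::JOB_STOP_TIMEOUT`; the commit raises the `job-run-stop` worker timeout to 90 for that reason.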
7 changes: 5 additions & 2 deletions svc/pkg/job-run/worker/src/workers/stop.rs
@@ -22,7 +22,8 @@ lazy_static::lazy_static! {
         nomad_util::config_from_env().unwrap();
 }
 
-#[worker(name = "job-run-stop")]
+// Update timeout to give time for the timeout in `kill_allocation`
+#[worker(name = "job-run-stop", timeout = 90)]
 async fn worker(ctx: &OperationContext<job_run::msg::stop::Message>) -> GlobalResult<()> {
     // NOTE: Idempotent
 
@@ -167,7 +168,9 @@ async fn update_db(
     Ok(Some((run_row, run_meta_nomad_row)))
 }
 
-// Kills the allocation after 30 seconds
+/// Kills the allocation after 30 seconds
+///
+/// See `docs/packages/job/JOB_DRAINING_AND_KILL_TIMEOUTS.md`
 fn kill_allocation(nomad_region: String, alloc_id: String) {
     task::spawn(async move {
         tokio::time::sleep(util_job::JOB_STOP_TIMEOUT).await;
