feat: cancel jobs that do not start in time #11

lucaspin · 2024-01-24T16:13:51Z

Motivation

Currently, if the Kubernetes job created by the controller never starts, for whatever reason, it is only deleted once the activeDeadlineSeconds set by the controller for that job is reached.

The controller currently uses an active deadline of 24h – that's how long a Semaphore job can run – but it would be useful to handle jobs that do not start at all differently.

Solution

A new JOB_START_TIMEOUT parameter is added. If that timeout is reached and the Kubernetes job has not started, the Kubernetes job is deleted.

How to know if a Kubernetes job started properly?

It depends on the Kubernetes version:

For version 1.24+, we can use the JobReadyPods feature gate, avoiding trips to the Kubernetes API.
For version <1.24, we directly check if there's a properly running pod for the Kubernetes job.
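The version-dependent check described above could look something like the following sketch. It uses only the standard library; `jobStatus`, `supportsReadyPods` and `jobStarted` are hypothetical stand-ins, not the contents of pkg/checks/checks.go. It assumes the server version arrives as separate major and minor strings, where some distributions append "+" to the minor (for example "24+"), and it treats a pod in the "Running" phase as "properly running", which may be looser than the actual check.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// supportsReadyPods reports whether the cluster is Kubernetes 1.24 or later,
// i.e. whether the job status can be trusted to report ready pods.
func supportsReadyPods(major, minor string) (bool, error) {
	majorNum, err := strconv.Atoi(strings.TrimSuffix(major, "+"))
	if err != nil {
		return false, fmt.Errorf("invalid major version %q: %w", major, err)
	}
	minorNum, err := strconv.Atoi(strings.TrimSuffix(minor, "+"))
	if err != nil {
		return false, fmt.Errorf("invalid minor version %q: %w", minor, err)
	}
	if majorNum != 1 {
		return majorNum > 1, nil
	}
	return minorNum >= 24, nil
}

// jobStatus is a hypothetical stand-in for the part of the Kubernetes job
// status used here. Ready is nil when the cluster does not report it.
type jobStatus struct {
	Ready *int32
}

// jobStarted decides whether a job has started. With ready-pod support it
// relies on the job status alone; otherwise it inspects the phases of the
// job's pods, which the caller would have to fetch from the API.
func jobStarted(readySupported bool, status jobStatus, podPhases []string) bool {
	if readySupported {
		return status.Ready != nil && *status.Ready > 0
	}
	for _, phase := range podPhases {
		if phase == "Running" {
			return true
		}
	}
	return false
}

func main() {
	supported, err := supportsReadyPods("1", "24+")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	ready := int32(1)
	fmt.Println("started:", jobStarted(supported, jobStatus{Ready: &ready}, nil))
}
```

Comparing the major version first keeps the check correct if the major number ever changes, rather than relying on the minor alone.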

lucaspin added 7 commits January 23, 2024 15:01

[wip] job start timeout

0f2a838

Merge branch 'main' into job-start-timeout

700d30b

make it configurable and handle job starting in k8s <1.23

839e4b8

use kubernetes version to know which way to check for jobs that are running

9ae7545

use fake discovery client

ef3c8d2

move code to separate package

32fe9a4

more tests

277c041

lucaspin requested review from DamjanBecirovic, radwo and VeljkoMaksimovic January 24, 2024 16:13

self-review

2061b40

radwo previously approved these changes Jan 24, 2024

Review comment on pkg/checks/checks.go (outdated, resolved)

handle major version as well

3651a53

lucaspin dismissed radwo’s stale review via 3651a53 January 24, 2024 17:37

lucaspin requested a review from radwo January 24, 2024 17:39

radwo approved these changes Jan 24, 2024

lucaspin merged commit 46e2e12 into main Jan 24, 2024
1 check passed

lucaspin deleted the job-start-timeout branch January 24, 2024 23:58
