-
Notifications
You must be signed in to change notification settings - Fork 275
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[KEP] Maximum Execution Time #3133
Conversation
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
|
/cc @tenzen-y |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very initial pass to discuss the main points.
/assign @mwielgus |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a few questions I have.
-
Do we want some kind of declarative API that users who don't use kueuectl can still set this?
a) Is it okay to set the label on the Workload objects and this would be respected? -
I am a bit unfamilar with workloads in Kueue. Will setting the active field to false cause a suspend or a deletion of all workloads?
-
What would be the suggestions for existing activeDeadlines with this feature? If a JobSet user wants an active deadline outside of Kueue, what should we implement? Same for Kubeflow, Ray, etc.
Yes, a label
The job is suspended.
The proposal is only relevant while working with Kueue. Other activeDeadlines (that maybe marks the job a Failed) could be used but Kueue will not be aware of them. |
I read that a user needs to set I read that this is a bit of a hack to get a standard for representing workload time. I would like to see a future where we can add some kind of general API that works with Kueue or without Kueue to enforce a deadline for any workload. At the moment, I worry that users would not know to use activeDeadlineSeconds (at Job or Pod) or as a person working on JobSet, I do not know what kind of API to implement in JobSet to set up a future with Kueue. Plus we have a few feature requests for the Job API to have a smarter deadline ie kubernetes/kubernetes#111948. It seems like you are basically saying that if Kueue label is specfiied, then we should not be using any of the workload timelimits as they will race/conflict with each other. |
I think you are referring to
Nothing should be done for the specific jobs (JobSet, RayJob ...) the only changes needed are in Kueue and mostly in the generic implementation (jobframework).
Unfortunately for specific job types this could happen, however, as follow-ups, Kueue can try to detect conflicting configuration at the job specific webhooks level and alert the user when necessary. |
Otherwise it cannot ignored in SSA.
I will come back here after the next release. /hold |
/cc |
I have a draft implementation, and only need to realign it with the KEP. |
Thank you for leading this feature. |
I still have concerns on the impact of this with existing deadline features with other workloads. It seems that Kueue will have its own deadline syntax for all workloads. Does this mean that workloads using their own API should be discouraged from using those APIs while using Kueue? If a pod has an active deadline will we do any validation to make sure that the Kueue timeout and this are meaningful together or would we ignore the pod one? Same for Job with ActiveDeadlines. And yes, any future workload can implement their own deadline but I think Kueue integration with these would be important so I am still not sure the path forward for JobSet or training operator. |
As mentioned in the issue we would like to have slurm-like -t option, supported across all Jobs to use for time-based scheduling reliably. We also want to be able to set the new timeout from kjobctl for any Job CRD. Actually, take for example the ActiveDeadlineSeconds timeout from core k8s - we have it two different levels: Pod and Job, because the use-cases are different. I think it is also the case here, that the use-case of Kueue is different, and the semantics of the timeout differ. |
It is still on the list of nice-to-haves, and I'm happy to move it forward, especially if @alculquicondor and @tenzen-y can carry the reviews as I'm currently focused primarly on completing TAS. |
Sure, I can lead the review for this. |
Indeed, I asked similar questions at #3133 (comment). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
/lgtm
/approve
LGTM label has been added. Git tree hash: 776213cfe52cb0d7a5901cd6ba599a27a8a6b455
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alculquicondor, tenzen-y, trasc The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/hold cancel |
* [KEP] Maximum Execution Time * Review Remarks * Review Remarks * Review Remarks * Make the accumulated time field pointer. Otherwise it cannot ignored in SSA. * Review remarks * Review Remarks * Review Remarks
* [KEP] Maximum Execution Time * Review Remarks * Review Remarks * Review Remarks * Make the accumulated time field pointer. Otherwise it cannot ignored in SSA. * Review remarks * Review Remarks * Review Remarks
What type of PR is this?
/kind documentation
What this PR does / why we need it:
Proposed KEP fo Maximum Execution Time feature.
Which issue(s) this PR fixes:
Relates to #3125
Special notes for your reviewer:
Does this PR introduce a user-facing change?